Prev | Next |
Longest Common Subsequence
Given two strings S[1..m] and T[1..n], find the longest subsequence that occurs both in S and T.
Example
Let S = a b c d e f and T = a x c e d
(1) Brute-force solution
The brute-force way to solve this problem is to enumerate all possible subsequences of S and T and find the longest one that occurs in both.
Subsequences(S) = {ε, a, b, c d, e, f, ab, ac, ae, ae, af, bc, bd, be, bf, cd, ce, cf, de, df, ef, abc, abd, ...(all 3-length subsequences)..., abcd, ...(4-length subsequences)..., etc. till abcdef}
Subsequences(T) = {ε, a, x, c, e, d, ax, ac, ae, ad, xc, xe, xd, ... etc. till axced}
There exists 2m subsequences for S and 2n subsequences for T. Now compare equal length subsequences of the first set with that of the second set pairwise and pick the longest of them. Clearly this approach takes exponential time.
The brute-force solution can be implemented using recursion. Let LCS(m,n) denote the length of the longest common subsequence of S[1..m] and T[1..n].
// Longest Common Subsequence - Recursive solution char S[1..m]; char T[1..n]; LCS(int m, int n) { if (m == 0 || n == 0) return 0; if (S[m] == T[n]) return 1 + LCS(m-1, n-1); else return max( LCS(m, n-1), LCS(m-1,n) ); } }
The tree below depicts the recursive formulation.
L(S,T,m,n) | | S[m] = T[n] | S[m] ≠ T[n] --------------------------------------------------------- | | | -----------------MAX----------------- | | | 1+L(m-1,n-1) L(m,n-1) L(m-1,n) | | | | | | | | | S[m-1] = T[n-1] | S[m-1] ≠ T[n-1] S[m] = T[n-1] | S[m] ≠ T[n-1] S[m-1] = T[n] | S[m-1] ≠ T[n] ----------------------- ------------------- -------------------- | | | | | | | -----MAX----- | ----MAX---- | ----MAX---- | | | | | | | | | 1+L(m-2,n-2) L(m-1,n-2) L(m-2,n-1) 1+L(m-1,n-2) L(m,n-2) L(m-1,n-1) 1+L(m-2,n-1) L(m-1,n-1) L(m-2,n) : : : : : : : : : : : : : : : : : : : : : : : : : : :
Time complexity
Since there are 2m and 2n subsequences for S and T respectively, the time complexity is O(2m + 2n).
(2) Greedy Solution - Does it exist?
To apply greedy strategy we need to first check if the problem exhibits (i) optimal substructure property and (ii) greedy choice property.
Optimal substructure: The recursive fomulation above reveals the optimal substructure. The problem of computing LCS(m, n) contains within itself the problem of computing LCS(m-1, n-1), LCS(m, n-1) and LCS(m-1, n).
Greedy choice property: In the above recursive solution, we reduced the problem of size m+n to 3 subproblems of sizes m+n-2, m+n-1 and m-1+n respectively. This is done at every step of the recursion. Is there a way to determine which would lead to optimal solution and eliminate the other 2 subproblems? In other words, can we make a sequence of locally optimal choices that leads to globally optimal solution?
Not really. Looking into the recursive formulation we observe 2 cases at each recursive call.
Case 1: L(i,j) reduces to 1 + L(i-1,j-1). This happens ith char of S matches with the jth char of T. i.e. A larger problem reduces to one smaller problem. The recurrence equation is
T(n) = T(n-1) + 1 ............(takes polynomial time)
Case 2: L(i,j) reduces to either L(i,j-1) or L(i-1,j). This happens when ith char of S does not match with the jth char of T. A larger problem reduces to 2 smaller problems. The recurrence is
T(n) = 2T(n-1) + 1 ............(takes exponential time)
Case 1 happens only if S = T. Extremely rare. In practical scenarios, it is going to be Case 2 most of the times. And there is no easy formula to determine which of the subproblems, L(i,j-1) or L(i-1,j) will lead to the longest common subsequence. Hence, greedy choice property does not hold.
(3) Dynamic Programming Solution
To apply dynamic programming, we need to check if the problem exhibits (i) optimal substructure and (ii) overlapping subproblems.
(i) We already saw above that optimal substructure property holds. i.e. optimal solution to the LCS problem contains within itself optimal solution to the subproblems.
(ii) The recursive formulation also reveals the existence of overlapping subproblems. It can be observed that LCS(m-1,n-1) repeats itself 4 times even in partial recursion tree. In a complete tree, it is easy to infer that a number of overlaps occur. For large values of m and n, the number of overlpapping subproblems can turn out to be far more.
The dynamic programming works in a bottom-up fashion. Subproblems are solved and their solutions are used to find solutions to larger problems. Hence, we work as follows.
Let LCS[i,j] denote the length of the longest common subsequence for S[1..i] and T[1..j]. We note the following base conditions.
LCS[0,j] = 0 for 0 <= j <= n. i.e. the length of the longest common subsequence with no part of S with every prefix of T is 0.
LCS[i,0] = 0 for 0 <= i <= m. i.e. the length of the longest common subsequence with no part of T with every prefix of S is 0.
For the given example the table looks as follows after initialization.
S[ i ] \ T[ j ] |
ε j = 0 |
a j = 1 |
x j = 2 |
c j = 3 |
e j = 4 |
d j = 5 |
ε i = 0 |
LCS[0,0] = 0 |
LCS[0,1] = 0 |
LCS[0,2] = 0 |
LCS[0,3] = 0 |
LCS[0,4] = 0 |
LCS[0,5] = 0 |
a i = 1 |
LCS[1,0] = 0 |
|||||
b i = 2 |
LCS[2,0] = 0 |
|||||
c i = 3 |
LCS[3,0] = 0 |
|||||
d i = 4 |
LCS[4,0] = 0 |
|||||
e i = 5 |
LCS[5,0] = 0 |
|||||
f i = 6 |
LCS[6,0] = 0 |
The rest of the computation proceeds as follows.
- In the first iteration, we find the LCS for S[1] with T[1], ... T[1..n] incrementally.
i.e. LCS[1,1], LCS[1,2], ..., LCS[1,n]. (n computations) - In the second iteration, we find the LCS for S[1..2] with T[1],... T[1..n] incrementally.
i.e. LCS[2,1], LCS[2,2], ..., LCS[2,n]. (n computations) - ....
- ....
- In the last (mth) iteration, we find the LCS for S[1..m] with T[1], ..., T[1..n] incrementally.
i.e. LCS[m,1], LCS[m,2], ..., LCS[m,n]. (n computations)
LCS[m,n] gives the length of the longest common subsequence. The entire computation worked out above can be accomplished using the following nested loop.
for (i = 1; i <= m; i++) {
for (j = 1; j <= n; j++) {
// Compute LCS[i,j]
.............
.............
}
}
LCS[i,j] for each i and j are computed based on the formula derived in recursive solution above.
LCS[i , j] = 1 + LCS[i-1,j-1] if S[i] = T[j]
= max{ LCS[i,j-1], LCS[i-1,j] } if S[i] ≠ T[j]
Let us now work out for the given example. m = 6 and n = 5.
Iteration 1: i = 1; j = 1 to 5;
S[1] = a = a = T[1]. LCS[1,1] = LCS[0,0] + 1 = 1
S[1] = a ≠ x = T[2]. LCS[1,2] = max{ LCS[0,2], LCS[1,1] } max{0,1} = 1
S[1] = a ≠ c = T[3]. LCS[1,3] = max{ LCS[0,3], LCS[1,2] } max{0,1} = 1
S[1] = a ≠ e = T[4]. LCS[1,4] = max{ LCS[0,4], LCS[1,3] } max{0,1} = 1
S[1] = a ≠ d = T[5]. LCS[1,5] = max{ LCS[0,5], LCS[1,4] } max{0,1} = 1
S[ i ] \ T[ j ] |
ε j = 0 |
a j = 1 |
x j = 2 |
c j = 3 |
e j = 4 |
d j = 5 |
ε i = 0 |
LCS[0,0] = 0 |
LCS[0,1] = 0 |
LCS[0,2] = 0 |
LCS[0,3] = 0 |
LCS[0,4] = 0 |
LCS[0,5] = 0 |
a i = 1 |
LCS[1,0] = 0 |
LCS[1,1] = 1 |
LCS[1,2] = 1 |
LCS[1,3] = 1 |
LCS[1,4] = 1 |
LCS[1,5] = 1 |
b i = 2 |
LCS[2,0] = 0 |
|||||
c i = 3 |
LCS[3,0] = 0 |
|||||
d i = 4 |
LCS[4,0] = 0 |
|||||
e i = 5 |
LCS[5,0] = 0 |
|||||
f i = 6 |
LCS[6,0] = 0 |
Iteration 2: i = 2; j = 1 to 5;
S[2] = b ≠ a = T[1]. LCS[2,1] = max{ LCS[1,1], LCS[2,0] } = max{1,0} = 1
S[2] = b ≠ x = T[2]. LCS[2,2] = max{ LCS[1,2], LCS[2,1] } = max{1,1} = 1
S[2] = b ≠ c = T[3]. LCS[2,3] = max{ LCS[1,3], LCS[2,2] } = max{1,1} = 1
S[2] = b ≠ e = T[4]. LCS[2,4] = max{ LCS[1,4], LCS[2,3] } = max{1,1} = 1
S[2] = b ≠ d = T[5]. LCS[2,5] = max{ LCS[1,5], LCS[2,4] } = max{1,1} = 1
S[ i ] \ T[ j ] |
ε j = 0 |
a j = 1 |
x j = 2 |
c j = 3 |
e j = 4 |
d j = 5 |
ε i = 0 |
LCS[0,0] = 0 |
LCS[0,1] = 0 |
LCS[0,2] = 0 |
LCS[0,3] = 0 |
LCS[0,4] = 0 |
LCS[0,5] = 0 |
a i = 1 |
LCS[1,0] = 0 |
LCS[1,1] = 1 |
LCS[1,2] = 1 |
LCS[1,3] = 1 |
LCS[1,4] = 1 |
LCS[1,5] = 1 |
b i = 2 |
LCS[2,0] = 0 |
LCS[2,1] = 1 |
LCS[2,2] = 1 |
LCS[2,3] = 1 |
LCS[2,4] = 1 |
LCS[2,5] = 1 |
c i = 3 |
LCS[3,0] = 0 |
|||||
d i = 4 |
LCS[4,0] = 0 |
|||||
e i = 5 |
LCS[5,0] = 0 |
|||||
f i = 6 |
LCS[6,0] = 0 |
Iteration 3: i = 3; j = 1 to 5;
S[3] = c ≠ a = T[1]. LCS[3,1] = max{ LCS[2,1], LCS[3,0] } = max{1,0} = 1
S[3] = c ≠ x = T[2]. LCS[3,2] = max{ LCS[2,2], LCS[3,1] } = max{1,1} = 1
S[3] = c = c = T[3]. LCS[3,3] = LCS[2,2] + 1 = 2
S[3] = c ≠ e = T[4]. LCS[3,4] = max{ LCS[2,4], LCS[3,3] } = max{1,2} = 2
S[3] = c ≠ d = T[5]. LCS[3,5] = max{ LCS[2,5], LCS[3,4] } = max{1,2} = 2
S[ i ] \ T[ j ] |
ε j = 0 |
a j = 1 |
x j = 2 |
c j = 3 |
e j = 4 |
d j = 5 |
ε i = 0 |
LCS[0,0] = 0 |
LCS[0,1] = 0 |
LCS[0,2] = 0 |
LCS[0,3] = 0 |
LCS[0,4] = 0 |
LCS[0,5] = 0 |
a i = 1 |
LCS[1,0] = 0 |
LCS[1,1] = 1 |
LCS[1,2] = 1 |
LCS[1,3] = 1 |
LCS[1,4] = 1 |
LCS[1,5] = 1 |
b i = 2 |
LCS[2,0] = 0 |
LCS[2,1] = 1 |
LCS[2,2] = 1 |
LCS[2,3] = 1 |
LCS[2,4] = 1 |
LCS[2,5] = 1 |
c i = 3 |
LCS[3,0] = 0 |
LCS[3,1] = 1 |
LCS[3,2] = 1 |
LCS[3,3] = 2 |
LCS[3,4] = 2 |
LCS[3,5] = 2 |
d i = 4 |
LCS[4,0] = 0 |
|||||
e i = 5 |
LCS[5,0] = 0 |
|||||
f i = 6 |
LCS[6,0] = 0 |
Iteration 4: i = 4; j = 1 to 5;
S[4] = d ≠ a = T[1]. LCS[4,1] = max{ LCS[3,1], LCS[4,0] } = max{1,0} = 1
S[4] = d ≠ x = T[2]. LCS[4,2] = max{ LCS[3,2], LCS[4,1] } = max{1,1} = 1
S[4] = d ≠ c = T[3]. LCS[4,3] = max{ LCS[3,3], LCS[4,2] } = max{2,1} = 2
S[4] = d ≠ e = T[4]. LCS[4,4] = max{ LCS[3,4], LCS[4,3] } = max{2,2} = 2
S[4] = d = d = T[5]. LCS[4,5] = LCS[3,4] + 1 = 2 + 1 = 3
S[ i ] \ T[ j ] |
ε j = 0 |
a j = 1 |
x j = 2 |
c j = 3 |
e j = 4 |
d j = 5 |
ε i = 0 |
LCS[0,0] = 0 |
LCS[0,1] = 0 |
LCS[0,2] = 0 |
LCS[0,3] = 0 |
LCS[0,4] = 0 |
LCS[0,5] = 0 |
a i = 1 |
LCS[1,0] = 0 |
LCS[1,1] = 1 |
LCS[1,2] = 1 |
LCS[1,3] = 1 |
LCS[1,4] = 1 |
LCS[1,5] = 1 |
b i = 2 |
LCS[2,0] = 0 |
LCS[2,1] = 1 |
LCS[2,2] = 1 |
LCS[2,3] = 1 |
LCS[2,4] = 1 |
LCS[2,5] = 1 |
c i = 3 |
LCS[3,0] = 0 |
LCS[3,1] = 1 |
LCS[3,2] = 1 |
LCS[3,3] = 2 |
LCS[3,4] = 2 |
LCS[3,5] = 2 |
d i = 4 |
LCS[4,0] = 0 |
LCS[4,1] = 1 |
LCS[4,2] = 1 |
LCS[4,3] = 2 |
LCS[4,4] = 2 |
LCS[4,5] = 3 |
e i = 5 |
LCS[5,0] = 0 |
|||||
f i = 6 |
LCS[6,0] = 0 |
Iteration 5: i = 5; j = 1 to 5;
S[5] = e ≠ a = T[1]. LCS[5,1] = max{ LCS[4,1], LCS[5,0] } = max{1,0} = 1
S[5] = e ≠ x = T[2]. LCS[5,2] = max{ LCS[4,2], LCS[5,1] } = max{1,1} = 1
S[5] = e ≠ c = T[3]. LCS[5,3] = max{ LCS[4,3], LCS[5,2] } = max{2,1} = 2
S[5] = e = T[4]. LCS[4,5] = LCS[3,4] + 1 = 2 + 1 = 3
S[5] = e ≠ d = T[5]. LCS[5,5] = max{ LCS[4,5], LCS[5,4] } = max{3,3} = 3
S[ i ] \ T[ j ] |
ε j = 0 |
a j = 1 |
x j = 2 |
c j = 3 |
e j = 4 |
d j = 5 |
ε i = 0 |
LCS[0,0] = 0 |
LCS[0,1] = 0 |
LCS[0,2] = 0 |
LCS[0,3] = 0 |
LCS[0,4] = 0 |
LCS[0,5] = 0 |
a i = 1 |
LCS[1,0] = 0 |
LCS[1,1] = 1 |
LCS[1,2] = 1 |
LCS[1,3] = 1 |
LCS[1,4] = 1 |
LCS[1,5] = 1 |
b i = 2 |
LCS[2,0] = 0 |
LCS[2,1] = 1 |
LCS[2,2] = 1 |
LCS[2,3] = 1 |
LCS[2,4] = 1 |
LCS[2,5] = 1 |
c i = 3 |
LCS[3,0] = 0 |
LCS[3,1] = 1 |
LCS[3,2] = 1 |
LCS[3,3] = 2 |
LCS[3,4] = 2 |
LCS[3,5] = 2 |
d i = 4 |
LCS[4,0] = 0 |
LCS[4,1] = 1 |
LCS[4,2] = 1 |
LCS[4,3] = 2 |
LCS[4,4] = 2 |
LCS[4,5] = 3 |
e i = 5 |
LCS[5,0] = 0 |
LCS[5,1] = 1 |
LCS[5,2] = 1 |
LCS[5,3] = 2 |
LCS[5,4] = 3 |
LCS[5,5] = 3 |
f i = 6 |
LCS[6,0] = 0 |
Iteration 6: i = 6; j = 1 to 5;
S[6] = f ≠ a = T[1]. LCS[6,1] = max{ LCS[5,1], LCS[6,0] } = max{1,0} = 1
S[6] = f ≠ x = T[2]. LCS[6,2] = max{ LCS[5,2], LCS[6,1] } = max{1,1} = 1
S[6] = f ≠ c = T[3]. LCS[6,3] = max{ LCS[5,3], LCS[5,2] } = max{2,1} = 2
S[6] = f ≠ e = T[4]. LCS[6,5] = max{ LCS[5,4], LCS[5,3] } = max{3,2} = 3
S[6] = f ≠ d = T[5]. LCS[6,5] = max{ LCS[5,5], LCS[5,4] } = max{3,3} = 3
S[ i ] \ T[ j ] |
ε j = 0 |
a j = 1 |
x j = 2 |
c j = 3 |
e j = 4 |
d j = 5 |
ε i = 0 |
LCS[0,0] = 0 |
LCS[0,1] = 0 |
LCS[0,2] = 0 |
LCS[0,3] = 0 |
LCS[0,4] = 0 |
LCS[0,5] = 0 |
a i = 1 |
LCS[1,0] = 0 |
LCS[1,1] = 1 |
LCS[1,2] = 1 |
LCS[1,3] = 1 |
LCS[1,4] = 1 |
LCS[1,5] = 1 |
b i = 2 |
LCS[2,0] = 0 |
LCS[2,1] = 1 |
LCS[2,2] = 1 |
LCS[2,3] = 1 |
LCS[2,4] = 1 |
LCS[2,5] = 1 |
c i = 3 |
LCS[3,0] = 0 |
LCS[3,1] = 1 |
LCS[3,2] = 1 |
LCS[3,3] = 2 |
LCS[3,4] = 2 |
LCS[3,5] = 2 |
d i = 4 |
LCS[4,0] = 0 |
LCS[4,1] = 1 |
LCS[4,2] = 1 |
LCS[4,3] = 2 |
LCS[4,4] = 2 |
LCS[4,5] = 3 |
e i = 5 |
LCS[5,0] = 0 |
LCS[5,1] = 1 |
LCS[5,2] = 1 |
LCS[5,3] = 2 |
LCS[5,4] = 3 |
LCS[5,5] = 3 |
f i = 6 |
LCS[6,0] = 0 |
LCS[6,1] = 1 |
LCS[6,2] = 1 |
LCS[6,3] = 2 |
LCS[6,4] = 3 |
LCS[6,5] = 3 |
Hence, the length of the longest common subsequence is 3. The actual subsequence can be determined by starting at LCS[6,5] (in general case LCS[m,n]), traversing backwards, taking diagonal direction or left/up direction as appropriate. That is, we decrement either m or n or both at each step. Whenever diagonal direction is taken, S[i] is printed before proceeding.
For the above example, there are two possible paths.
LCS[6,5] -> LCS[5,5] -> LCS[4,5] (print d) -> LCS[3,4] -> LCS[3,3] (print c) -> LCS[2,2] -> LCS[1,2] -> LCS[1,1] (print a) -> LCS[0,0]
LCS[6,5] -> LCS[5,5] -> LCS[5,4] (print e) -> LCS[4,3] -> LCS[3,3] (print c) -> LCS[2,2] -> LCS[1,2] -> LCS[1,1] (print a) -> LCS[0,0]
Algorithm
The dynamic programming algorithm for task scheduling can be given as follows.
int LCS[m][n]; DPLCS(S[1..m], T[1..n]) { for (i = 0; i <= m; i++) LCS[i][0] = 0; for (j = 0; j <= n; j++) LCS[0][j] = 0; for (i = 1; i <= m; i++) { for (j = 1; j <= n; j++) { if (S[i] == T[j]) LCS[i][j] = LCS[i-1][j-1] + 1 else LCS[i][j] = max(LCS[i-1][j], LCS[i][j-1]); } } return LCS[m][n]; }
Time complexity
There are 2 loops. The outerloop runs m times. The inner loop runs n times. Thus, the time complexity is O(mn).