用后缀数组构造LCP数组的kasai算法-yiteyi-C++库

背景后缀数组： 后缀数组是给定字符串的所有后缀的排序数组。让给定的字符串为“香蕉”。

null

0 banana                          5 a1 anana     Sort the Suffixes     3 ana2 nana      ---------------->     1 anana  3 ana        alphabetically       0 banana  4 na                              4 na   5 a                               2 nana

“香蕉”的后缀数组：后缀[]={5,3,1,0,4,2} 我们讨论过后缀数组及其O（nLogn）构造 . 一旦建立了后缀数组，我们就可以使用它高效地搜索文本中的模式。例如，我们可以使用二进制搜索来查找模式（本文讨论了相同模式的完整代码）在这里 )

LCP阵列 讨论了基于二进制搜索的解决方案在这里需要O（m*Logn）时间，其中m是要搜索的模式的长度，n是文本的长度。借助LCP阵列，我们可以在O（m+logn）时间内搜索一个模式。例如，如果我们的任务是在“香蕉”中搜索“安娜”，那么m=3，n=5。 LCP数组是大小为n的数组（与后缀数组类似）。值lcp[i]表示由后缀[i]和后缀[i+1]索引的后缀的最长公共前缀的长度。 后缀[n-1]没有定义，因为它后面没有后缀。

txt[0..n-1] = "banana"suffix[]  = {5, 3, 1, 0, 4, 2| lcp[]     = {1, 3, 0, 0, 2, 0}Suffixes represented by suffix array in order are:{"a", "ana", "anana", "banana", "na", "nana"}lcp[0] = Longest Common Prefix of "a" and "ana"     = 1lcp[1] = Longest Common Prefix of "ana" and "anana" = 3lcp[2] = Longest Common Prefix of "anana" and "banana" = 0lcp[3] = Longest Common Prefix of "banana" and "na" = 0lcp[4] = Longest Common Prefix of "na" and "nana" = 2lcp[5] = Longest Common Prefix of "nana" and None = 0

如何构造LCP阵列？ LCP阵列构造有两种方式： 1）计算LCP数组作为后缀数组的副产品（Manber&Myers算法） 2）使用已构造的后缀数组来计算LCP值。（Kasai算法）。有一些算法可以在O（n）时间内构造后缀数组，因此我们总是可以在O（n）时间内构造LCP数组。但在下面的实现中，讨论了一个O（n logn）算法。

kasai算法 本文讨论了kasai的算法。该算法利用后缀数组构造LCP数组，并在O（n）时间内输入文本。这个想法基于以下事实：设以txt[i]开头的后缀的lcp为k。如果k大于0，则以txt[i+1]开头的后缀的lcp至少为k-1。原因是，字符的相对顺序保持不变。如果我们从两个后缀中删除第一个字符，我们知道至少有k个字符会匹配。例如，对于子字符串“ana”，lcp为3，因此对于字符串“na”，lcp将至少为2。参考这为了证明。

下面是Kasai算法的C++实现。

C++

                         // C++ program for building LCP array for given text                       
                         #include <bits/stdc++.h>                       
                         using                                     namespace                                     std;                       
           
                         // Structure to store information of a suffix                       
                         struct                                     suffix                       
                         {                       
                                                 int                                     index;                                     // To store original index                       
                                                 int                                     rank[2];                                     // To store ranks and next rank pair                       
                         };                       
           
                         // A comparison function used by sort() to compare two suffixes                       
                         // Compares two pairs, returns 1 if first pair is smaller                       
                         int                                     cmp(                                     struct                                     suffix a,                                     struct                                     suffix b)                       
                         {                       
                                                 return                                     (a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0):                       
                                                 (a.rank[0] < b.rank[0] ?1: 0);                       
                         }                       
           
                         // This is the main function that takes a string 'txt' of size n as an                       
                         // argument, builds and return the suffix array for the given string                       
                         vector<                                     int                                     > buildSuffixArray(string txt,                                     int                                     n)                       
                         {                       
                                                 // A structure to store suffixes and their indexes                       
                                                 struct                                     suffix suffixes[n];                       
           
                                                 // Store suffixes and their indexes in an array of structures.                       
                                                 // The structure is needed to sort the suffixes alphabetically                       
                                                 // and maintain their old indexes while sorting                       
                                                 for                                     (                                     int                                     i = 0; i < n; i++)                       
                                                 {                       
                                                 suffixes[i].index = i;                       
                                                 suffixes[i].rank[0] = txt[i] -                                     'a'                                     ;                       
                                                 suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] -                                     'a'                                     ): -1;                       
                                                 }                       
           
                                                 // Sort the suffixes using the comparison function                       
                                                 // defined above.                       
                                                 sort(suffixes, suffixes+n, cmp);                       
           
                                                 // At his point, all suffixes are sorted according to first                       
                                                 // 2 characters.  Let us sort suffixes according to first 4                       
                                                 // characters, then first 8 and so on                       
                                                 int                                     ind[n];                                     // This array is needed to get the index in suffixes[]                       
                                                 // from original index.  This mapping is needed to get                       
                                                 // next suffix.                       
                                                 for                                     (                                     int                                     k = 4; k < 2*n; k = k*2)                       
                                                 {                       
                                                 // Assigning rank and index values to first suffix                       
                                                 int                                     rank = 0;                       
                                                 int                                     prev_rank = suffixes[0].rank[0];                       
                                                 suffixes[0].rank[0] = rank;                       
                                                 ind[suffixes[0].index] = 0;                       
           
                                                 // Assigning rank to suffixes                       
                                                 for                                     (                                     int                                     i = 1; i < n; i++)                       
                                                 {                       
                                                 // If first rank and next ranks are same as that of previous                       
                                                 // suffix in array, assign the same new rank to this suffix                       
                                                 if                                     (suffixes[i].rank[0] == prev_rank &&                       
                                                 suffixes[i].rank[1] == suffixes[i-1].rank[1])                       
                                                 {                       
                                                 prev_rank = suffixes[i].rank[0];                       
                                                 suffixes[i].rank[0] = rank;                       
                                                 }                       
                                                 else                                     // Otherwise increment rank and assign                       
                                                 {                       
                                                 prev_rank = suffixes[i].rank[0];                       
                                                 suffixes[i].rank[0] = ++rank;                       
                                                 }                       
                                                 ind[suffixes[i].index] = i;                       
                                                 }                       
           
                                                 // Assign next rank to every suffix                       
                                                 for                                     (                                     int                                     i = 0; i < n; i++)                       
                                                 {                       
                                                 int                                     nextindex = suffixes[i].index + k/2;                       
                                                 suffixes[i].rank[1] = (nextindex < n)?                       
                                                 suffixes[ind[nextindex]].rank[0]: -1;                       
                                                 }                       
           
                                                 // Sort the suffixes according to first k characters                       
                                                 sort(suffixes, suffixes+n, cmp);                       
                                                 }                       
           
                                                 // Store indexes of all sorted suffixes in the suffix array                       
                                                 vector<                                     int                                     >suffixArr;                       
                                                 for                                     (                                     int                                     i = 0; i < n; i++)                       
                                                 suffixArr.push_back(suffixes[i].index);                       
           
                                                 // Return the suffix array                       
                                                 return                                     suffixArr;                       
                         }                       
           
                         /* To construct and return LCP */                       
                         vector<                                     int                                     > kasai(string txt, vector<                                     int                                     > suffixArr)                       
                         {                       
                                                 int                                     n = suffixArr.size();                       
           
                                                 // To store LCP array                       
                                                 vector<                                     int                                     > lcp(n, 0);                       
           
                                                 // An auxiliary array to store inverse of suffix array                       
                                                 // elements. For example if suffixArr[0] is 5, the                       
                                                 // invSuff[5] would store 0.  This is used to get next                       
                                                 // suffix string from suffix array.                       
                                                 vector<                                     int                                     > invSuff(n, 0);                       
           
                                                 // Fill values in invSuff[]                       
                                                 for                                     (                                     int                                     i=0; i < n; i++)                       
                                                 invSuff[suffixArr[i]] = i;                       
           
                                                 // Initialize length of previous LCP                       
                                                 int                                     k = 0;                       
           
                                                 // Process all suffixes one by one starting from                       
                                                 // first suffix in txt[]                       
                                                 for                                     (                                     int                                     i=0; i<n; i++)                       
                                                 {                       
                                                 /* If the current suffix is at n-1, then we don’t                       
                                                 have next substring to consider. So lcp is not                       
                                                 defined for this substring, we put zero. */                       
                                                 if                                     (invSuff[i] == n-1)                       
                                                 {                       
                                                 k = 0;                       
                                                 continue                                     ;                       
                                                 }                       
           
                                                 /* j contains index of the next substring to                       
                                                 be considered  to compare with the present                       
                                                 substring, i.e., next string in suffix array */                       
                                                 int                                     j = suffixArr[invSuff[i]+1];                       
           
                                                 // Directly start matching from k'th index as                       
                                                 // at-least k-1 characters will match                       
                                                 while                                     (i+k<n && j+k<n && txt[i+k]==txt[j+k])                       
                                                 k++;                       
           
                                                 lcp[invSuff[i]] = k;                                     // lcp for the present suffix.                       
           
                                                 // Deleting the starting character from the string.                       
                                                 if                                     (k>0)                       
                                                 k--;                       
                                                 }                       
           
                                                 // return the constructed lcp array                       
                                                 return                                     lcp;                       
                         }                       
           
                         // Utility function to print an array                       
                         void                                     printArr(vector<                                     int                                     >arr,                                     int                                     n)                       
                         {                       
                                                 for                                     (                                     int                                     i = 0; i < n; i++)                       
                                                 cout << arr[i] <<                                     " "                                     ;                       
                                                 cout << endl;                       
                         }                       
           
                         // Driver program                       
                         int                                     main()                       
                         {                       
                                                 string str =                                     "banana"                                     ;                       
           
                                                 vector<                                     int                                     >suffixArr = buildSuffixArray(str, str.length());                       
                                                 int                                     n = suffixArr.size();                       
           
                                                 cout <<                                     "Suffix Array : "                                     ;                       
                                                 printArr(suffixArr, n);                       
           
                                                 vector<                                     int                                     >lcp = kasai(str, suffixArr);                       
           
                                                 cout <<                                     "LCP Array : "                                     ;                       
                                                 printArr(lcp, n);                       
                                                 return                                     0;                       
                         }                       

输出：

Suffix Array : 5 3 1 0 4 2 LCP Array : 1 3 0 0 2 0

插图：

txt[]     = "banana",  suffix[]  = {5, 3, 1, 0, 4, 2| Suffix array represents{"a", "ana", "anana", "banana", "na", "nana"}Inverse Suffix Array would be invSuff[] = {3, 2, 5, 1, 4, 0}

LCP值按以下顺序进行评估我们首先计算文本中第一个后缀的LCP，即“ 香蕉 “。我们需要后缀数组中的下一个后缀来计算LCP（记住LCP[i]被定义为后缀[i]和后缀[i+1]的最长公共前缀）。 为了在suffixArr[]中找到下一个后缀，我们使用SuffInv[] .下一个后缀是“na”。由于“banana”和“na”之间没有公共前缀，“banana”的LCP值为0，位于后缀数组的索引3处，因此我们填充 lcp[3] 作为0。接下来我们计算第二个后缀的LCP，它是“ 安娜娜 “。后缀数组中“anana”的下一个后缀是“banana”。由于没有公共前缀，“anana”的LCP值为0，位于后缀数组中的索引2处，因此我们填充 lcp[2] 作为0。接下来我们计算第三个后缀的LCP，它是“ 娜娜 “。由于没有下一个后缀，“nana”的LCP值未定义。我们填满 lcp[5] 作为0。文本中的下一个后缀是“ana”。下一个后缀是“ 安娜后缀数组中的是“anana”。由于有一个长度为3的通用前缀，“ana”的LCP值为3。我们填满 lcp[1] 3。现在我们用lcp表示文本中的下一个后缀，即“ 不 “。这是Kasai算法使用的技巧，LCP值必须至少为2，因为之前的LCP值为3。由于“na”后没有字符，LCP的最终值为2。我们填满 lcp[4] 2。文本中的下一个后缀是“ A. “.LCP值必须至少为1，因为之前的值为2。由于“a”后没有字符，LCP的最终值为1。”。我们填满 lcp[0] 作为1。我们将很快讨论在LCP数组的帮助下实现搜索，以及LCP数组如何帮助将时间复杂度降低到O（m+logn）。

参考资料： http://web.stanford.edu/class/cs97si/suffix-array.pdf http://www.mi.fu-berlin.de/wiki/pub/ABI/RnaSeqP4/suffix-array.pdf http://codeforces.com/blog/entry/12796

本文由 普拉哈尔·阿格拉瓦尔 。如果您发现任何不正确的地方，或者您想分享有关上述主题的更多信息，请发表评论

文章版权归作者所有，未经允许请勿转载。

THE END

技术文章