后缀树应用程序6–最长回文子串

给定一个字符串,找到最长的回文子字符串。 我们已经讨论过天真[O(n)] 3. )],二次[O(n 2. )]线性[O(n)]在 第一组 , 第二组 马纳彻算法 . 在本文中,我们将讨论另一种基于后缀树的线性时间方法。 如果给定字符串为S,则方法如下:

null
  • 反转字符串S(假设反转的字符串是R)
  • 收到 最长公共子串 S和R的 鉴于S和R中的LCS必须来自S中的相同位置

你明白我们为什么这么说吗 R和S中的LCS必须来自S中的同一位置 ? 让我们看看下面的例子:

  • 对于S= 沙巴巴伊兹 和R= 齐阿巴巴克斯 ,LCS和LP均为ababa(相同)
  • 对于S= abacdfgdcaba 和R= abacdfdcaba ,LCS是 阿巴德 LPS是 阿巴 (不同)
  • 对于S= pqrqpabcdfgdcba 和R= abcdgfdcbapqp ,LCS和LPS都是 pqrqp (相同)
  • 对于S= pqqpabcdfghfdcba 和R= abcdfhgfdcbapqqp ,LCS是 abcdf LPS是 pqqp (不同)

我们可以看到,LCS和LPS并不总是相同的。当他们不一样的时候? 当S中有一个非回文子串的反向拷贝,其长度与S中的LPS相同或更长,则LCS和LPS将不同 . 在2 以上示例= abacdfgdcaba ),用于子字符串 阿巴德 ,存在反向副本 dcaba 在S中,其长度比LPS长 阿巴 所以LPS和LCS在这里是不同的。4中的情况也是如此 th 实例 为了处理这种情况,我们说S中的LPS与S和R中的LCS相同 考虑到R和S中的LCS必须来自S中的相同位置 . 如果我们看2 再举一个例子,子字符串 阿巴 在R中,它和子串在S中的位置完全相同 阿巴 在S中,它是零(0 th 这就是LPS。 位置约束:

Suffix Tree Application

我们将字符串的索引称为正向索引 )字符串R索引作为反向索引(R ). 根据上图,长度为N的字符串S中索引为i(正向索引)的字符将位于其反向字符串R中的索引N-1-i(反向索引)。 如果我们在字符串S中取长度为L的子串,起始索引为i,结束索引为j(j=i+L-1),那么在它的反向字符串R中,相同长度的反向子串将从索引N-1-j开始,并在索引N-1-i结束。 如果在索引S处有一个长度为L的公共子串 (远期指数)和R (反向索引)在S和R中,则它们将来自S中的相同位置,如果 R =(N-1)–(S) +L–1) 其中N是字符串长度。 所以为了找到字符串S的LPS,我们找到了S和R的最长公共字符串,其中两个子字符串都满足上述约束,也就是说,如果S中的子字符串在索引S处 ,则相同的子字符串应位于索引(N–1)–(S)处的R中 +L–1)。如果不是这样,那么这个子字符串就不是LPS候选字符串。 天真的[O(N*M] 2. )]和动态规划[O(N*M)]方法来寻找两个字符串的LCS已经讨论过 在这里 它可以扩展为添加位置约束,以给出给定字符串的LPS。 现在我们将讨论后缀树方法,它只是 后缀树LCS方法 我们将在其中添加位置约束。 在查找两个字符串X和Y的LCS时,我们只需将标记为XY的最深节点(即具有两个字符串后缀的节点作为其子节点)。 在查找字符串S的LPS时,我们将再次查找S和R的LCS,条件是公共子字符串应满足位置约束(公共子字符串应来自S中的相同位置)。为了验证位置约束,我们需要知道每个内部节点上的所有正向和反向索引(即内部节点下所有叶子树的后缀索引)。 在里面 广义后缀树 属于 S#R$ ,如果内部节点具有来自字符串S和R的后缀,则从根到内部节点的路径上的子字符串是公共子字符串。通过查看各自叶节点的后缀索引,可以找到S和R中公共子字符串的索引。 如果字符串 S# 长度为N,那么:

  • 如果一片叶子的后缀索引小于N,则该后缀属于S,相同的后缀索引将成为所有祖先节点的前向索引
  • 如果一片叶子的后缀索引大于N,那么该后缀属于R,所有祖先节点的反向索引都将为 N–后缀索引

让我们拿绳子= 卡巴布 .下图为 广义后缀树 对于 卡巴布#巴巴布$ 其中,我们展示了所有内部节点(根节点除外)上所有子后缀的正向和反向索引。 正向索引在括号()中,反向索引在方括号[]中。

Suffix Tree Application

在上图中,所有叶节点都有一个正向或反向索引,这取决于它们所属的字符串(S或R)。然后,孩子们的正向或反向索引传播给父母。 看看这个图,了解一个给定后缀索引的叶子上的正向或反向索引是什么。图的底部显示,后缀索引为0到8的叶子将获得与S中的正向索引相同的值(0到8),后缀索引为9到17的叶子将获得R中从0到8的反向索引。 例如,高亮显示的内部节点有两个子节点,后缀索引为2和9。后缀索引为2的叶子来自S中的位置2,所以它的正向索引为2,如()所示。后缀索引为9的叶子来自R中的位置0,所以它的反向索引为0,如[]所示。这些指数传播到亲本,亲本有一片叶子,后缀指数为14,反向指数为4。在这个父节点上,正向索引是(2),反向索引是[0,4]。同样,我们应该能够理解如何在所有节点上计算正向和反向索引。 在上图中,所有内部节点都有来自字符串S和R的后缀,也就是说,它们都代表从根到它们自己的路径上的一个公共子字符串。现在我们需要找到满足位置约束的最深节点。为此,我们需要检查是否存在正向指数S 在一个节点上,则必须有一个反向索引R 具有值(N–2)–(S) +L–1)其中N是字符串的长度 S# L是节点深度(或子串长度)。如果是,则将此节点视为LPS候选,否则忽略它。在上图中,最深的节点突出显示,它将LPS表示为bbaabb。 我们在图中没有显示根节点上的正向和反向索引。因为根节点本身并不代表任何公共子字符串(在代码实现中,也不会在根节点上计算正向和反向索引) 如何实施这种方法来找到LPS?以下是我们需要的东西:

  • 我们需要知道每个节点上的正向和反向索引。
  • 对于给定的远期指数S 在内部节点上,我们需要知道是否反向索引 R =(N-2)–(S) +L–1) 也存在于同一节点上。
  • 跟踪满足上述条件的最深内部节点。

实现上述目标的一种方法是: 虽然DFS位于后缀树上,但我们可以以某种方式在每个节点上存储正向和反向索引(当我们需要知道节点上的正向和反向索引时,存储将有助于避免在树上重复遍历)。稍后,我们可以执行另一个DFS来查找满足位置约束的节点。对于位置约束检查,我们需要在索引列表中搜索。 什么样的数据结构适合以最快的方式完成所有这些工作?

  • 如果我们将索引存储在数组中,则需要线性搜索,这将使整体方法在时间上非线性。
  • 如果我们将索引存储在树中(在C++中设置,java中的树集),我们可以使用二进制搜索,但是总体的方法会在时间上是非线性的。
  • 如果我们将散列函数的索引存储在基于哈希函数的集合(C++中的无序排序集,java中的HashSet),它将提供一个平均的常数搜索,这将使得总体方法在时间上是线性的。 基于哈希函数的集合可能会占用更多空间,具体取决于存储的值。

在我们的实现中,我们将使用两个无序的_集(一个用于正向索引,另一个来自反向索引),它们作为成员变量添加到SuffExtreNode结构中。

C

// A C++ program to implement Ukkonen's Suffix Tree Construction
// Here we build generalized suffix tree for given string S
// and it's reverse R, then we find
// longest palindromic substring of given string S
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <iostream>
#include <unordered_set>
#define MAX_CHAR 256
using namespace std;
struct SuffixTreeNode {
struct SuffixTreeNode *children[MAX_CHAR];
//pointer to other node via suffix link
struct SuffixTreeNode *suffixLink;
/*(start, end) interval specifies the edge, by which the
node is connected to its parent node. Each edge will
connect two nodes,  one parent and one child, and
(start, end) interval of a given edge  will be stored
in the child node. Lets say there are two nods A and B
connected by an edge with indices (5, 8) then this
indices (5, 8) will be stored in node B. */
int start;
int *end;
/*for leaf nodes, it stores the index of suffix for
the path  from root to leaf*/
int suffixIndex;
//To store indices of children suffixes in given string
unordered_set< int > *forwardIndices;
//To store indices of children suffixes in reversed string
unordered_set< int > *reverseIndices;
};
typedef struct SuffixTreeNode Node;
char text[100]; //Input string
Node *root = NULL; //Pointer to root node
/*lastNewNode will point to newly created internal node,
waiting for it's suffix link to be set, which might get
a new suffix link (other than root) in next extension of
same phase. lastNewNode will be set to NULL when last
newly created internal node (if there is any) got it's
suffix link reset to new internal node created in next
extension of same phase. */
Node *lastNewNode = NULL;
Node *activeNode = NULL;
/*activeEdge is represented as input string character
index (not the character itself)*/
int activeEdge = -1;
int activeLength = 0;
// remainingSuffixCount tells how many suffixes yet to
// be added in tree
int remainingSuffixCount = 0;
int leafEnd = -1;
int *rootEnd = NULL;
int *splitEnd = NULL;
int size = -1; //Length of input string
int size1 = 0; //Size of 1st string
int reverseIndex; //Index of a suffix in reversed string
unordered_set< int >::iterator forwardIndex;
Node *newNode( int start, int *end)
{
Node *node =(Node*) malloc ( sizeof (Node));
int i;
for (i = 0; i < MAX_CHAR; i++)
node->children[i] = NULL;
/*For root node, suffixLink will be set to NULL
For internal nodes, suffixLink will be set to root
by default in  current extension and may change in
next extension*/
node->suffixLink = root;
node->start = start;
node->end = end;
/*suffixIndex will be set to -1 by default and
actual suffix index will be set later for leaves
at the end of all phases*/
node->suffixIndex = -1;
node->forwardIndices = new unordered_set< int >;
node->reverseIndices = new unordered_set< int >;
return node;
}
int edgeLength(Node *n) {
if (n == root)
return 0;
return *(n->end) - (n->start) + 1;
}
int walkDown(Node *currNode)
{
/*activePoint change for walk down (APCFWD) using
Skip/Count Trick  (Trick 1). If activeLength is greater
than current edge length, set next  internal node as
activeNode and adjust activeEdge and activeLength
accordingly to represent same activePoint*/
if (activeLength >= edgeLength(currNode))
{
activeEdge += edgeLength(currNode);
activeLength -= edgeLength(currNode);
activeNode = currNode;
return 1;
}
return 0;
}
void extendSuffixTree( int pos)
{
/*Extension Rule 1, this takes care of extending all
leaves created so far in tree*/
leafEnd = pos;
/*Increment remainingSuffixCount indicating that a
new suffix added to the list of suffixes yet to be
added in tree*/
remainingSuffixCount++;
/*set lastNewNode to NULL while starting a new phase,
indicating there is no internal node waiting for
it's suffix link reset in current phase*/
lastNewNode = NULL;
//Add all suffixes (yet to be added) one by one in tree
while (remainingSuffixCount > 0) {
if (activeLength == 0)
activeEdge = pos; //APCFALZ
// There is no outgoing edge starting with
// activeEdge from activeNode
if (activeNode->children]  == NULL)
{
//Extension Rule 2 (A new leaf edge gets created)
activeNode->children]  =
newNode(pos, &leafEnd);
/*A new leaf edge is created in above line starting
from  an existing node (the current activeNode), and
if there is any internal node waiting for it's suffix
link get reset, point the suffix link from that last
internal node to current activeNode. Then set lastNewNode
to NULL indicating no more node waiting for suffix link
reset.*/
if (lastNewNode != NULL)
{
lastNewNode->suffixLink = activeNode;
lastNewNode = NULL;
}
}
// There is an outgoing edge starting with activeEdge
// from activeNode
else
{
// Get the next node at the end of edge starting
// with activeEdge
Node *next = activeNode->children] ;
if (walkDown(next)) //Do walkdown
{
//Start from next node (the new activeNode)
continue ;
}
/*Extension Rule 3 (current character being processed
is already on the edge)*/
if (text[next->start + activeLength] == text[pos])
{
//APCFER3
activeLength++;
/*STOP all further processing in this phase
and move on to next phase*/
break ;
}
/*We will be here when activePoint is in middle of
the edge being traversed and current character
being processed is not  on the edge (we fall off
the tree). In this case, we add a new internal node
and a new leaf edge going out of that new node. This
is Extension Rule 2, where a new leaf edge and a new
internal node get created*/
splitEnd = ( int *) malloc ( sizeof ( int ));
*splitEnd = next->start + activeLength - 1;
//New internal node
Node *split = newNode(next->start, splitEnd);
activeNode->children]  = split;
//New leaf coming out of new internal node
split->children] = newNode(pos, &leafEnd);
next->start += activeLength;
split->children]  = next;
/*We got a new internal node here. If there is any
internal node created in last extensions of same
phase which is still waiting for it's suffix link
reset, do it now.*/
if (lastNewNode != NULL)
{
/*suffixLink of lastNewNode points to current newly
created internal node*/
lastNewNode->suffixLink = split;
}
/*Make the current newly created internal node waiting
for it's suffix link reset (which is pointing to root
at present). If we come across any other internal node
(existing or newly created) in next extension of same
phase, when a new leaf edge gets added (i.e. when
Extension Rule 2 applies is any of the next extension
of same phase) at that point, suffixLink of this node
will point to that internal node.*/
lastNewNode = split;
}
/* One suffix got added in tree, decrement the count of
suffixes yet to be added.*/
remainingSuffixCount--;
if (activeNode == root && activeLength > 0) //APCFER2C1
{
activeLength--;
activeEdge = pos - remainingSuffixCount + 1;
}
else if (activeNode != root) //APCFER2C2
{
activeNode = activeNode->suffixLink;
}
}
}
void print( int i, int j)
{
int k;
for (k=i; k<=j && text[k] != '#' ; k++)
printf ( "%c" , text[k]);
if (k<=j)
printf ( "#" );
}
//Print the suffix tree as well along with setting suffix index
//So tree will be printed in DFS manner
//Each edge along with it's suffix index will be printed
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
if (n == NULL) return ;
if (n->start != -1) //A non-root node
{
//Print the label on edge from parent to current node
//Uncomment below line to print suffix tree
//print(n->start, *(n->end));
}
int leaf = 1;
int i;
for (i = 0; i < MAX_CHAR; i++)
{
if (n->children[i] != NULL)
{
//Uncomment below two lines to print suffix index
//   if (leaf == 1 && n->start != -1)
//     printf(" [%d]", n->suffixIndex);
//Current node is not a leaf as it has outgoing
//edges from it.
leaf = 0;
setSuffixIndexByDFS(n->children[i], labelHeight +
edgeLength(n->children[i]));
if (n != root)
{
//Add children's suffix indices in parent
n->forwardIndices->insert(
n->children[i]->forwardIndices->begin(),
n->children[i]->forwardIndices->end());
n->reverseIndices->insert(
n->children[i]->reverseIndices->begin(),
n->children[i]->reverseIndices->end());
}
}
}
if (leaf == 1)
{
for (i= n->start; i<= *(n->end); i++)
{
if (text[i] == '#' )
{
n->end = ( int *) malloc ( sizeof ( int ));
*(n->end) = i;
}
}
n->suffixIndex = size - labelHeight;
if (n->suffixIndex < size1) //Suffix of Given String
n->forwardIndices->insert(n->suffixIndex);
else //Suffix of Reversed String
n->reverseIndices->insert(n->suffixIndex - size1);
//Uncomment below line to print suffix index
// printf(" [%d]", n->suffixIndex);
}
}
void freeSuffixTreeByPostOrder(Node *n)
{
if (n == NULL)
return ;
int i;
for (i = 0; i < MAX_CHAR; i++)
{
if (n->children[i] != NULL)
{
freeSuffixTreeByPostOrder(n->children[i]);
}
}
if (n->suffixIndex == -1)
free (n->end);
free (n);
}
/*Build the suffix tree and print the edge labels along with
suffixIndex. suffixIndex for leaf edges will be >= 0 and
for non-leaf edges will be -1*/
void buildSuffixTree()
{
size = strlen (text);
int i;
rootEnd = ( int *) malloc ( sizeof ( int ));
*rootEnd = - 1;
/*Root is a special node with start and end indices as -1,
as it has no parent from where an edge comes to root*/
root = newNode(-1, rootEnd);
activeNode = root; //First activeNode will be root
for (i=0; i<size; i++)
extendSuffixTree(i);
int labelHeight = 0;
setSuffixIndexByDFS(root, labelHeight);
}
void doTraversal(Node *n, int labelHeight, int * maxHeight,
int * substringStartIndex)
{
if (n == NULL)
{
return ;
}
int i=0;
int ret = -1;
if (n->suffixIndex < 0) //If it is internal node
{
for (i = 0; i < MAX_CHAR; i++)
{
if (n->children[i] != NULL)
{
doTraversal(n->children[i], labelHeight +
edgeLength(n->children[i]),
maxHeight, substringStartIndex);
if (*maxHeight < labelHeight
&& n->forwardIndices->size() > 0 &&
n->reverseIndices->size() > 0)
{
for (forwardIndex=n->forwardIndices->begin();
forwardIndex!=n->forwardIndices->end();
++forwardIndex)
{
reverseIndex = (size1 - 2) -
(*forwardIndex + labelHeight - 1);
//If reverse suffix comes from
//SAME position in given string
//Keep track of deepest node
if (n->reverseIndices->find(reverseIndex) !=
n->reverseIndices->end())
{
*maxHeight = labelHeight;
*substringStartIndex = *(n->end) -
labelHeight + 1;
break ;
}
}
}
}
}
}
}
void getLongestPalindromicSubstring()
{
int maxHeight = 0;
int substringStartIndex = 0;
doTraversal(root, 0, &maxHeight, &substringStartIndex);
int k;
for (k=0; k<maxHeight; k++)
printf ( "%c" , text[k + substringStartIndex]);
if (k == 0)
printf ( "No palindromic substring" );
else
printf ( ", of length: %d" ,maxHeight);
printf ( "" );
}
// driver program to test above functions
int main( int argc, char *argv[])
{
size1 = 9;
printf ( "Longest Palindromic Substring in cabbaabb is: " );
strcpy (text, "cabbaabb#bbaabbac$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 17;
printf ( "Longest Palindromic Substring in forgeeksskeegfor is: " );
strcpy (text, "forgeeksskeegfor#rofgeeksskeegrof$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf ( "Longest Palindromic Substring in abcde is: " );
strcpy (text, "abcde#edcba$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 7;
printf ( "Longest Palindromic Substring in abcdae is: " );
strcpy (text, "abcdae#eadcba$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf ( "Longest Palindromic Substring in abacd is: " );
strcpy (text, "abacd#dcaba$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf ( "Longest Palindromic Substring in abcdc is: " );
strcpy (text, "abcdc#cdcba$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 13;
printf ( "Longest Palindromic Substring in abacdfgdcaba is: " );
strcpy (text, "abacdfgdcaba#abacdgfdcaba$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 15;
printf ( "Longest Palindromic Substring in xyabacdfgdcaba is: " );
strcpy (text, "xyabacdfgdcaba#abacdgfdcabayx$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 9;
printf ( "Longest Palindromic Substring in xababayz is: " );
strcpy (text, "xababayz#zyababax$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
size1 = 6;
printf ( "Longest Palindromic Substring in xabax is: " );
strcpy (text, "xabax#xabax$" ); buildSuffixTree();
getLongestPalindromicSubstring();
//Free the dynamically allocated memory
freeSuffixTreeByPostOrder(root);
return 0;
}


输出:

Longest Palindromic Substring in cabbaabb is: bbaabb, of length: 6Longest Palindromic Substring in forgeeksskeegfor is: geeksskeeg, of length: 10Longest Palindromic Substring in abcde is: a, of length: 1Longest Palindromic Substring in abcdae is: a, of length: 1Longest Palindromic Substring in abacd is: aba, of length: 3Longest Palindromic Substring in abcdc is: cdc, of length: 3Longest Palindromic Substring in abacdfgdcaba is: aba, of length: 3Longest Palindromic Substring in xyabacdfgdcaba is: aba, of length: 3Longest Palindromic Substring in xababayz is: ababa, of length: 5Longest Palindromic Substring in xabax is: xabax, of length: 5

后续行动: 检测给定字符串中的所有回文。 e、 g.对于字符串abcddcbefgf,所有可能的回文都是a、b、c、d、e、f、g、dd、fgf、cddc、bcddcb。 我们发表了以下更多关于后缀树应用程序的文章:

本文由 导演 。如果您发现任何不正确的地方,或者您想分享有关上述主题的更多信息,请发表评论

© 版权声明
THE END
喜欢就支持一下吧
点赞11 分享