def build_suffix_array_naive(word):
suffixes = []
n = len(word)
for i in range(n):
suffixes.append((i, word[i:]))
suffixes.sort(key=lambda k: k[1])
suffix_arr = []
for i, _ in suffixes:
suffix_arr.append(i)
print("Built Suffix Array: {}".format(suffix_arr))
return suffix_arr
def search(pat, suffix_arr, word):
n = len(suffix_arr)
l = 0; r = n-1
flag = False
while(l<=r):
mid = (l+r)//2
if(pat == word[suffix_arr[mid]:]):
flag = True
break
if(pat < word[suffix_arr[mid]:]):
r = mid-1
else:
l = mid+1
if(flag):
print("Pattern '{}' Found at index: {}.".format(pat, mid))
else:
print("Pattern '{}' NOT Found.".format(pat))
print("Suffix Array Naive Example:")
suffix_arr = build_suffix_array_naive("banana")
search("ana", suffix_arr, "banana")
search("nana", suffix_arr, "banana")
search("axy", suffix_arr, "banana")
output:
LCP Array is an array of size n (like Suffix Array).
A value lcp[i] indicates length of the longest common prefix of the suffixes indexed by suffix[i] and suffix[i+1].
suffix[n-1] is not defined as there is no suffix after it.
Two methods of LCP array construction:
The algorithm constructs LCP array from suffix array and input text in O(n) time.
Idea of Algorithm:
def create_lcp_array_kasai(text, suffix_arr):
n = len(suffix_arr)
# LCP Araay to store LCPs.
lcp_arr = [0]*n
# Create an inverse of suffix array, it is used to get next suffix string from suffix array.
# Example:- if suffix_arr[0] is 5, the inv_suffix_arr[5] would store 0.
inv_suffix_arr = [0]*n
for i in range(n):
inv_suffix_arr[suffix_arr[i]] = i
# Initialize length of previous LCP
k = 0
# Process all suffixes one by one starting from first suffix in text[].
for i in range(n):
# If current suffix is at n-1, then we don’t have next substring to consider.
# So lcp is not defined for this substring, we put zero.
if inv_suffix_arr[i] == n-1:
k = 0
continue
# j contains index of the next substring to be considered to compare with the present
# substring, i.e., next string in suffix array.
j = suffix_arr[inv_suffix_arr[i] + 1]
# Directly start matching from k'th index as at-least k-1 characters will match
while (i+k<n and j+k<n and text[i+k]==text[j+k]):
k += 1
# LCP for the present suffix.
lcp_arr[inv_suffix_arr[i]] = k
# Delete the starting character from the string.
if k>0: k -= 1
print("Inverse Suffix Array: {}".format(inv_suffix_arr))
print("LCP Array: {}".format(lcp_arr))
print("Kasai - LCP Array Construction Example:")
text = "banana"
suffixes = ["banana", "anana", "nana", "ana", "na", "a"]
sorted_suffixes = ["a", "ana", "anana", "banana", "na", "nana"]
suffix_arr = [5, 3, 1, 0, 4, 2]
print("Given Text: '{}'".format(text))
print("Suffixes: {}".format(suffixes))
print("Sorted Suffixes: {}".format(sorted_suffixes))
print("\nSuffix Array: {}".format(suffix_arr))
create_lcp_array_kasai("banana", [5, 3, 1, 0, 4, 2])
Output:
Illustration of Algorithm:
We first compute LCP of first suffix in text which is “banana”.
Next we compute LCP of second suffix in text which is “anana”.
Next we compute LCP of third suffix in text which is “nana”.
Next we compute LCP of third suffix in text which is “ana”.
For “ana” the next suffix is “anana”.
Since there is a common prefix of length 3, the value of LCP for “ana” is 3. We fill lcp[1] as 3.
Now we compute LCP for fourth suffix in text which is “na”.
For “na” the next suffix is “nana”.
This is where Kasai’s algorithm uses the trick that LCP value must be at least 2 because previous LCP value was 3.
Since there is no character after “na”, final value of LCP is 2. We fill lcp[4] as 2.
Now we compute LCP for last suffix in text which is “a”.