Suffix Array

What is Suffix Array ?
Why Suffix Array ?

Applications of Suffix Array

Build the Suffix Array:
Search a pattern using Suffix Array
Implementation of Naive Method:
def build_suffix_array_naive(word):
    suffixes = []
    n = len(word)
    for i in range(n):
        suffixes.append((i, word[i:]))
    
    suffixes.sort(key=lambda k: k[1])

    suffix_arr = []
    for i, _ in suffixes:
        suffix_arr.append(i)

    print("Built Suffix Array: {}".format(suffix_arr))
    return suffix_arr


def search(pat, suffix_arr, word):
    n = len(suffix_arr)
    l = 0; r = n-1
    flag = False
    while(l<=r):
        mid = (l+r)//2
        if(pat == word[suffix_arr[mid]:]):
            flag = True
            break
        if(pat < word[suffix_arr[mid]:]):
            r = mid-1
        else:
            l = mid+1
    
    if(flag):
        print("Pattern '{}' Found at index: {}.".format(pat, mid))
    else:
        print("Pattern '{}' NOT Found.".format(pat))



print("Suffix Array Naive Example:")
suffix_arr = build_suffix_array_naive("banana")
search("ana", suffix_arr, "banana")
search("nana", suffix_arr, "banana")
search("axy", suffix_arr, "banana")

output:

Time Complexity:

Suffix Array: nLogn Algorithm to Build***

Problem with Naive Method:
New Enhanced and Awared Approach:

LCP Array:

Two methods of LCP array construction:

  1. Compute the LCP array as a byproduct to the suffix array (Manber & Myers Algorithm).
  2. Use an already constructed suffix array in order to compute the LCP values. (Kasai Algorithm).

Kasai Algorithm

The algorithm constructs LCP array from suffix array and input text in O(n) time.

Idea of Algorithm:

Implementation:
def create_lcp_array_kasai(text, suffix_arr):
    n = len(suffix_arr)

    # LCP Araay to store LCPs.
    lcp_arr = [0]*n     

    # Create an inverse of suffix array, it is used to get next suffix string from suffix array.
    # Example:- if suffix_arr[0] is 5, the inv_suffix_arr[5] would store 0.
    inv_suffix_arr = [0]*n
    for i in range(n):
        inv_suffix_arr[suffix_arr[i]] = i
    
    # Initialize length of previous LCP
    k = 0

    # Process all suffixes one by one starting from first suffix in text[].
    for i in range(n):
        # If current suffix is at n-1, then we don’t have next substring to consider.
        # So lcp is not defined for this substring, we put zero.
        if inv_suffix_arr[i] == n-1:
            k = 0
            continue

        # j contains index of the next substring to be considered  to compare with the present 
        # substring, i.e., next string in suffix array.
        j = suffix_arr[inv_suffix_arr[i] + 1]

        # Directly start matching from k'th index as at-least k-1 characters will match
        while (i+k<n and j+k<n and text[i+k]==text[j+k]):
            k += 1
        
        # LCP for the present suffix.
        lcp_arr[inv_suffix_arr[i]] = k

        # Delete the starting character from the string. 
        if k>0: k -= 1
    
    print("Inverse Suffix Array: {}".format(inv_suffix_arr))
    print("LCP Array: {}".format(lcp_arr))

    

print("Kasai - LCP Array Construction Example:")
text = "banana"
suffixes = ["banana", "anana", "nana", "ana", "na", "a"]
sorted_suffixes = ["a", "ana", "anana", "banana", "na", "nana"]
suffix_arr = [5, 3, 1, 0, 4, 2]
print("Given Text: '{}'".format(text))
print("Suffixes: {}".format(suffixes))
print("Sorted Suffixes: {}".format(sorted_suffixes))
print("\nSuffix Array: {}".format(suffix_arr))
create_lcp_array_kasai("banana", [5, 3, 1, 0, 4, 2])

Output:

Complexity:

Illustration of Algorithm:

Notes:




← Previous: AVL Tree

Next: Suffix Tree →