# Suffix Array

###### What is Suffix Array ?
• Invented independently by Manber & Myers in 1990 and Gaston Gonnet in 1992.
• Li, Zhize, Li, Jian and Huo, Hongwei gave the first in-place O(n) time suffix array construction algorithm in 2016.
• A suffix array is a sorted array of all suffixes of a given string.
• It is a simple, space efficient alternative to suffix tree.
###### Why Suffix Array ?
• A suffix array can be constructed from Suffix tree by doing a DFS traversal of the suffix tree.
• In fact Suffix array and suffix tree both can be constructed from each other in linear time.
• Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms (Ukkonen’s algorithm) and improved cache locality. #### Applications of Suffix Array

• Suffix array is an extremely useful data structure, it can be used for a wide range of problems.
• Following are some famous problems where Suffix array can be used:
1. Pattern Searching
2. Finding the longest repeated substring
3. Finding the longest common substring
4. Finding the longest palindrome in a string

## Suffix Array: Naive Method to Build and Search***

###### Build the Suffix Array:
• A simple method to construct suffix array is to make an array of all suffixes and then sort the array.
• Time Complexity: O(n2logn)
• Sorting step itself takes O(n2Logn) time as every comparison is a comparison of two strings and the comparison takes O(n) time.
• There are many other efficient algorithms to build suffix array.
###### Search a pattern using Suffix Array
• Since we have a sorted array of all suffixes, we will use Binary Search for searching the pattern pat of length m.
• Time Complexity: O(mlogn)
• Logn to search in array and m time for comparing m length string.
• In fact there is a O(m) suffix array based algorithm to search a pattern.
###### Implementation of Naive Method:
``````def build_suffix_array_naive(word):
suffixes = []
n = len(word)
for i in range(n):
suffixes.append((i, word[i:]))

suffixes.sort(key=lambda k: k)

suffix_arr = []
for i, _ in suffixes:
suffix_arr.append(i)

print("Built Suffix Array: {}".format(suffix_arr))
return suffix_arr

def search(pat, suffix_arr, word):
n = len(suffix_arr)
l = 0; r = n-1
flag = False
while(l<=r):
mid = (l+r)//2
if(pat == word[suffix_arr[mid]:]):
flag = True
break
if(pat < word[suffix_arr[mid]:]):
r = mid-1
else:
l = mid+1

if(flag):
print("Pattern '{}' Found at index: {}.".format(pat, mid))
else:

print("Suffix Array Naive Example:")
suffix_arr = build_suffix_array_naive("banana")
search("ana", suffix_arr, "banana")
search("nana", suffix_arr, "banana")
search("axy", suffix_arr, "banana")
``````

output: ###### Time Complexity:
• Time Complexity: O(n2logn) - Building
• Time Complexity: O(mlogn) - Searching the built suffix array

## Suffix Array: nLogn Algorithm to Build***

###### Problem with Naive Method:
• The Naive algorithm is to consider all suffixes, sort them using a O(nLogn) sorting algorithm and while sorting, maintain original indexes.
• Time complexity: O(n2Logn) where n is the number of characters in the input string.
###### New Enhanced and Awared Approach:
• Let us first discuss a O(nLognLogn) algorithm for simplicity.
• The idea is to use the fact that strings that are to be sorted are suffixes of a single string.
• We first sort all suffixes according to first character, then according to first 2 characters, then first 4 characters and so on while the number of characters to be considered is smaller than 2n.
• The important point is, if we have sorted suffixes according to first 2i characters, then we can sort suffixes according to first 2n+1 characters in O(nLogn) time using a nLogn sorting algorithm like Merge Sort.
• This is possible as two suffixes can be compared in O(1) time (we need to compare only two values).
• The sort function is called O(Logn) times (Note that we increase number of characters to be considered in powers of 2).
• Therefore overall time complexity becomes O(nLognLogn).

## LCP Array Construction from Suffix Array to Enhance Search***

• In suffix array, the Binary Search used to search for pat in text takes O(m*Logn) time where m is length of the pattern to be searched and n is length of the text.
• With the help of LCP array, we can search a pattern in O(m + Log n) time.

#### LCP Array:

• LCP Array is an array of size n (like Suffix Array).

• A value lcp[i] indicates length of the longest common prefix of the suffixes indexed by suffix[i] and suffix[i+1].

• suffix[n-1] is not defined as there is no suffix after it. Two methods of LCP array construction:

1. Compute the LCP array as a byproduct to the suffix array (Manber & Myers Algorithm).
2. Use an already constructed suffix array in order to compute the LCP values. (Kasai Algorithm).

### Kasai Algorithm

The algorithm constructs LCP array from suffix array and input text in O(n) time.

Idea of Algorithm:

• Let lcp of suffix beginning at txt[i] be k.
• If k is greater than 0, then lcp for suffix beginning at txt[i+1] will be atleast k-1.
• The reason is, relative order of characters remain same.
• If we delete the first character from both suffixes, we know that at least k characters will match.
• Example: For substring “ana”, lcp is 3, so for string “na” lcp will be atleast 2.
###### Implementation:
``````def create_lcp_array_kasai(text, suffix_arr):
n = len(suffix_arr)

# LCP Araay to store LCPs.
lcp_arr = *n

# Create an inverse of suffix array, it is used to get next suffix string from suffix array.
# Example:- if suffix_arr is 5, the inv_suffix_arr would store 0.
inv_suffix_arr = *n
for i in range(n):
inv_suffix_arr[suffix_arr[i]] = i

# Initialize length of previous LCP
k = 0

# Process all suffixes one by one starting from first suffix in text[].
for i in range(n):
# If current suffix is at n-1, then we don’t have next substring to consider.
# So lcp is not defined for this substring, we put zero.
if inv_suffix_arr[i] == n-1:
k = 0
continue

# j contains index of the next substring to be considered  to compare with the present
# substring, i.e., next string in suffix array.
j = suffix_arr[inv_suffix_arr[i] + 1]

# Directly start matching from k'th index as at-least k-1 characters will match
while (i+k<n and j+k<n and text[i+k]==text[j+k]):
k += 1

# LCP for the present suffix.
lcp_arr[inv_suffix_arr[i]] = k

# Delete the starting character from the string.
if k>0: k -= 1

print("Inverse Suffix Array: {}".format(inv_suffix_arr))
print("LCP Array: {}".format(lcp_arr))

print("Kasai - LCP Array Construction Example:")
text = "banana"
suffixes = ["banana", "anana", "nana", "ana", "na", "a"]
sorted_suffixes = ["a", "ana", "anana", "banana", "na", "nana"]
suffix_arr = [5, 3, 1, 0, 4, 2]
print("Given Text: '{}'".format(text))
print("Suffixes: {}".format(suffixes))
print("Sorted Suffixes: {}".format(sorted_suffixes))
print("\nSuffix Array: {}".format(suffix_arr))
create_lcp_array_kasai("banana", [5, 3, 1, 0, 4, 2])
``````

Output: ###### Complexity:
• Time: O(n) If the suffix array is already created
• Space: O(n) Extra space for storing inverse suffix array.

Illustration of Algorithm:

• We first compute LCP of first suffix in text which is “banana”.

• We need next suffix in suffix_arr to compute LCP.
• Fact Remember: lcp[i] is defined as Longest Common Prefix of suffix[i] and suffix[i+1].
• To find the next suffix in suffix_arr[], we use inv_suffix_arr[], the next suffix is “na”.
• Since there is no common prefix between “banana” and “na”, the value of LCP for “banana” is 0.
• “banana” is at index 3 in suffix array, so we fill lcp as 0.
• Next we compute LCP of second suffix in text which is “anana”.

• For “anana” the next suffix is “banana”.
• Since there is no common prefix, the value of LCP for “anana” is 0 and it is at index 2 in suffix array, so we fill lcp as 0.
• Next we compute LCP of third suffix in text which is “nana”.

• Since there is no next suffix, the value of LCP for “nana” is not defined. We fill lcp as 0.
• Next we compute LCP of third suffix in text which is “ana”.

• For “ana” the next suffix is “anana”.

• Since there is a common prefix of length 3, the value of LCP for “ana” is 3. We fill lcp as 3.

• Now we compute LCP for fourth suffix in text which is “na”.

• For “na” the next suffix is “nana”.

• This is where Kasai’s algorithm uses the trick that LCP value must be at least 2 because previous LCP value was 3.

• Since there is no character after “na”, final value of LCP is 2. We fill lcp as 2.

• Now we compute LCP for last suffix in text which is “a”.

• For “a” the next suffix is “ana”.
• Using Kasai, LCP value must be at least 1 because previous value was 2.
• Since there is no character after “a”, final value of LCP is 1. We fill lcp as 1.
##### Notes:
• Using binary search with the suffix array still takes O(mlogn) time.
• But we can reduce the search time complexity to O(m + Log n) using the constructed LCP array.

# Suffix Tree

###### What is Suffix Tree ?
• First introduced by Weiner (1973), which Donald Knuth subsequently characterized as “Algorithm of the Year 1973”.
• The construction was greatly simplified by McCreight (1976) , and also by Ukkonen (1995).
• A Suffix Tree for a given text is a compressed trie for all suffixes of the given text.
###### Why Suffix Tree ?
• Helps to search pattern pat[0..m-1] in a text txt[0…n-1] where m < n.
• We have different algorithms like KMP Algorithm, Rabin-Karp Algorithm, Finite Automata based Algorithm and Boyer Moore Algorithm for doing the same task.
• All of the above algorithms preprocess the pattern to make the pattern searching faster.
• The best time complexity that we could get by preprocessing pattern is O(n) where n is length of the text.
• But in case of suffix trees, once the suffix tree is built after the preprocessing, we can search any pattern in O(m) time where m is length of the pattern.
• Preprocessing of text may become costly if the text changes frequently, so it is good for fixed text or less frequently changing text though. ###### Abstract Steps to Build a Suffix Tree?
1. Generate all suffixes of given text.
2. Consider all suffixes as individual words and build a compressed trie.
###### Abstract Steps to Search in Suffix Tree
1. Starting from the first character of the pattern and root of Suffix Tree, do following for every character.
• For the current character of pattern, if there is an edge from the current node of suffix tree, follow the edge.
• If there is no edge, print “pattern doesn’t exist in text” and return.
2. If all characters of pattern have been processed, i.e., there is a path from root for characters of the given pattern, then print “Pattern found”.

#### Applications of Suffix Tree

Suffix tree can be used for a wide range of problems. Some famous problems where it provides optimal time complexity solution are:

1. Pattern Searching
2. Finding the longest repeated substring
3. Finding the longest common substring
4. Finding the longest palindrome in a string