ELI5 (Explain Like I’m 5)
A suffix array is a sorted list of all suffixes of a string. For “banana”, the suffixes are “banana”, “anana”, “nana”, “ana”, “na”, “a” — sort these alphabetically and you get the suffix array. It’s a compact structure that lets you answer “does substring X appear in this string?” incredibly fast using binary search.
Explanation
- A suffix array is an array of integers representing the starting indices of all suffixes of a string, sorted in lexicographical order
- Combined with an LCP (Longest Common Prefix) array, it can answer most string problems efficiently
- Building: naive O(n² log n), SA-IS algorithm O(n) — in interviews, naive construction is usually acceptable
- Key use: binary search on the suffix array to find pattern occurrences in O(m log n)
Keyword trigger: “longest repeated substring”, “number of distinct substrings”, “longest common substring of two strings” → Suffix Array.
When to use?
- Finding all occurrences of a pattern in a text (alternative to KMP / Z-algorithm)
- Longest repeated substring
- Number of distinct substrings
- Longest common substring between two or more strings
- Competitive programming — rare in general LeetCode, but appears in string-heavy hard problems
Approach
Construction (O(n log² n) — acceptable for interviews)
def build_suffix_array(s: str) -> list[int]:
n = len(s)
# all suffixes as (suffix_string, start_index)
suffixes = sorted(range(n), key=lambda i: s[i:])
return suffixes
LCP Array
def build_lcp(s: str, sa: list[int]) -> list[int]:
n = len(s)
rank = [0] * n
for i, suffix_idx in enumerate(sa):
rank[suffix_idx] = i # inverse of suffix array
lcp = [0] * n
h = 0
for i in range(n):
if rank[i] > 0:
j = sa[rank[i] - 1]
while i + h < n and j + h < n and s[i + h] == s[j + h]:
h += 1
lcp[rank[i]] = h
if h > 0:
h -= 1
return lcp
Pattern Search
def search(s: str, sa: list[int], pattern: str) -> int:
# binary search on the suffix array
lo, hi = 0, len(sa)
while lo < hi:
mid = (lo + hi) // 2
if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
lo = mid + 1
else:
hi = mid
return lo # first occurrence index in suffix array
Notes
- Time Complexity: O(n log² n) for naive SA construction, O(n) for SA-IS
- Space Complexity: O(n)
- In Python, the naive
sorted(range(n), key=lambda i: s[i:])is O(n² log n) due to string comparisons — acceptable for n ≤ 10⁴ - For practical competitive programming in Python, use the O(n log² n) doubling algorithm
- The Z-algorithm or KMP are simpler alternatives for pattern matching — use suffix arrays when you need multiple queries or more complex string operations
Data structures
- Suffix array — sorted indices of suffixes
- LCP array —
lcp[i]= longest common prefix betweensa[i]andsa[i-1] - Rank array — inverse of suffix array,
rank[i]= position of suffix starting at i in sorted order
How to Master This
Step-by-step
- Understand what a suffix is — enumerate all suffixes of “banana” manually
- Implement naive suffix array construction and verify on small inputs
- Build the LCP array and understand what it tells you
- Solve #1044 (longest duplicate substring) — suffix array + LCP is the intended O(n log n) solution
- Solve #1062 (longest repeating substring)
Key Resources
- 📹 William Fiset — Suffix Array
- 📝 CP-Algorithms — Suffix Array
- 📝 Stanford ICPC — Suffix Array
- 🔁 Drill: #1044 → #1062
Sample LeetCode Problems
| # | Problem | Difficulty | Interview Frequency | Must-Do |
|---|---|---|---|---|
| 1044 | Longest Duplicate Substring | Hard | Low | 📖 |
| 1062 | Longest Repeating Substring | Medium | Low | 📖 |
| 718 | Maximum Length of Repeated Subarray | Medium | Medium | ⚡ |
Code Samples
- Longest Duplicate Substring — suffix array + LCP
def longestDupSubstring(s: str) -> str:
n = len(s)
# build suffix array (naive O(n^2 log n) — acceptable here)
sa = sorted(range(n), key=lambda i: s[i:])
# build LCP array using Kasai's algorithm
rank = [0] * n
for i, idx in enumerate(sa):
rank[idx] = i
lcp = [0] * n
h = 0
for i in range(n):
if rank[i] > 0:
j = sa[rank[i] - 1]
while i + h < n and j + h < n and s[i + h] == s[j + h]:
h += 1
lcp[rank[i]] = h
if h > 0:
h -= 1
# find maximum LCP and the corresponding substring
max_len = max(lcp)
if max_len == 0:
return ''
# find which suffix has this LCP length
for i, length in enumerate(lcp):
if length == max_len:
return s[sa[i]: sa[i] + max_len]
return ''
- Count Distinct Substrings — using suffix array + LCP
def countDistinctSubstrings(s: str) -> int:
"""
Total substrings = n*(n+1)/2
Subtract sum of LCP array (shared prefixes = duplicates)
"""
n = len(s)
sa = sorted(range(n), key=lambda i: s[i:])
# Kasai's LCP
rank = [0] * n
for i, idx in enumerate(sa):
rank[idx] = i
lcp = [0] * n
h = 0
for i in range(n):
if rank[i] > 0:
j = sa[rank[i] - 1]
while i + h < n and j + h < n and s[i + h] == s[j + h]:
h += 1
lcp[rank[i]] = h
if h > 0:
h -= 1
total_substrings = n * (n + 1) // 2
return total_substrings - sum(lcp)