Tuesday, July 15, 2008

Interval Longest Common Prefix Problem

Consider a string $S$ over the alphabet $\sigma$. It is well known that the longest common prefix of any two given substrings of $S$ can be found in constant time, after a linear time preprocessing. The preprocessing involves constructing the suffix array $SA$ and the $LCP$ array, calculating the inverse of the suffix array $\overline{SA}$ and preparing the $LCP$ array for constant time range minimum queries (RMQ). After this, we can find the longest common prefix of two substrings $S[i_1:j_1]$ and $S[i_2:j_2]$ by
$\displaystyle{rmq_{\mathrm{LCP}}(\overline{SA}[i_1],\overline{SA}[i_2])}$

Of course, if the found value is greater than the length of either substring, the length of the longest common prefix is equal to the length of the shorter string. Since the end points of the given substrings play insignificant role, defining the longest common prefix problem on suffixes of $S$ is as general. We denote by $S_k$ the suffix of $S$ which starts at position $k$. Let $lcp(i,j)$ denote the longest common prefix of $S_i$ and $S_j$. (Note that this is different from the $LCP[]$ array above.)

In a paper called Substring Compression Problems by Cormode and Muthukrishnan, the following problem is introduced. Preprocess a given string $S$, so that queries of the form $ilcp([i:j],k)$ can be answered quickly, where
$\mathit{ilcp}([i:j],k)=\max_{i\leq l \leq j}\mathit{lcp}(l, k)$
For reasons I will mention later in this post, $ilcp()$ query is often called with $k=j+1$. This may admit more efficient solutions than $ilcp$, so we define the restricted longest common prefix problem as:
$\mathit{rilcp}([i:j])= \mathit{ilcp}([i:j],j+1)$

Let us make few observations. Note if $\overline{SA}[a]\le\overline{SA}[b]\le\overline{SA}[c]$, then
$\mathit{lcp}(a,b) \geq \mathit{lcp}(a,c)$
and also
$\mathit{lcp}(b,c) \geq \mathit{lcp}(a,c)$
This means that to compute a $ilcp$ query, it is sufficient to perform two $lcp$ operations
  1. $\mathit{lcp}(k,l)$, where $S_l$ is the lexicographically the greatest suffix in $S_i,S_{i+1},\ldots,S_j$ which is lexicographically smaller than $S_k$
  2. $\mathit{lcp}(k,r)$, where $S_r$ is the lexicographically the smallest suffix in $S_i,S_{i+1},\ldots,S_j$ which is lexicographically greater than $S_k$

This translates the $ilcp$ problem to finding the greatest value smaller than $k$ in substring $[i:j]$ of a permutation $P$ (which, in $ilcp$ case, is the inverse suffix array). Further, in $rilcp$ problem we know that $k = P[j+1]$. Let us consider some naive solutions. We say that an algorithm has complexity $\langle f(n), g(n), h(n) \rangle$ if the preprocessing takes $f(n)$ time and $g(n)$ space, and a query takes $h(n)$ time. Then precomputing all possible queries results in a $\langle O(n^2), O(n^2), O(1) \rangle$ algorithm for the $rilcp$ problem. It is also possible to obtain a $\langle O(n^2), O(n^2), O(1) \rangle$ algorithm for the $ilcp$ problem, by computing n RMQ arrays upper bounded by $k$ for $k=1,2,\ldots,n$.

Yet another naive method came across my mind is to divide $P$ into $\sqrt{n \log n}$ size non-overlapping chunks and prepare a sorted copy of each chunk so that in a full chunk, we can search for the greatest value smaller than $k$, for any $k$ in $\log n$ time. Given an interval $[i:j]$ and a $k$, we need to search at most $\frac{n}{\sqrt{n\log n}}$ full chunks and two partial chunks. To search in a partial chunk we just go through all elements one by one. This results in a $\langle O(n), O(n), O(\sqrt{n\log n}) \rangle$ algorithm.

Apparently, neither $O(n^2)$ space nor $O(\sqrt{n\log n})$ query time is feasible, so we seek for some method which uses $n\cdot polylog(n)$ space, while having decent query times. There seems to be no obvious way to find directly the greatest element smaller than $k$ in some sublist, but there is a way to check if there exist an element in the sublist $[i:j]$ with value in the range $[s:t]$. This is a well studied problem, called orthogonal range searching. Hence, to solve the former problem we can ask if there is an element in $[i:j]$ with value in $[1:k]$ and proceed with the value range $[k/2:k]$ and so on to perform a binary seach. It is very easy to design a data structure which allows $\log^2 n$ time orthogonal range searches using $O(n\log n)$ space. This way, we can find the greatest element smaller than $k$ in the sublist $[i:j]$ in time $O(\log^3 n)$, leading to a $\langle O(n\log n), O(n\log n), O(\log^3 n) \rangle$ overall algorithm for $ilcp$ problem. It turns out that orthogonal search can be done in $\log\log n$ time using linear space when all dimensions are integers, which leads to a $\langle O(n\log n), O(n), O(\log n\log\log n)\rangle$ algorithm for the $ilcp$ problem. This is also one of the results given in Substring Compression Problems.

I mentioned above that $\mathit{ilcp}([i:j],k)$ is often called with $k=j+1$. Suppose you want to LZSS compress a substring $S[i:j]$. (in LZSS, the longest prefix of the unprocessed input which exists earlier in the processed input is referenced.) What we need to do is letting $j_0 = i$ and $j_{h+1}=j_{h}+\max\big\{\mathit{ilcp}([i:j_h],j_h+1),1\big\}$ for $h = 0,1,2,\ldots$ The smallest $h$ for which $j_h \geq j$ is the LZSS compressed length of $S[i:j]$.

So far everything seems good except that I somehow convinced myself that $\mathit{rilcp}$ can be solved much more efficiently and this problem is bugging me. There are only $n^2$ queries and queries that end at the same position are higly correlated. Also, the array we work on is not only an integer size array but also a permutation. These somehow suggest me that a $o(\log n)$ time solution should be possible. I will be happy to hear about this problem.

No comments: