J Comb Optim (2007) 13:217–221 DOI 10.1007/s10878-006-9024-6
A 2-approximation for the preceding-and-crossing structured 2-interval pattern problem
Minghui Jiang
Published online: 9 December 2006 © Springer Science + Business Media, LLC 2006
Abstract The 2-interval pattern problem over its various models and restrictions was proposed by Vialette (2004) for the application of RNA secondary structure prediction. We present an O(n^3 log n)-time 2-approximation algorithm for the problem of finding a largest {<, ≬}-structured subset of 2-intervals given an input 2-interval set of size n. This greatly improves the previous best approximation ratio of 6 by Crochemore et al. (2005).
Keywords 2-Interval · Approximation algorithms · RNA secondary structure prediction
M. Jiang, Department of Computer Science, Utah State University, Logan, Utah 84322-4205, USA. e-mail: [email protected]
1 Introduction
Vialette (2004) proposed a geometric representation of the RNA secondary structure as a set of 2-intervals. Given a single-stranded RNA molecule, a sequence of contiguous bases of the molecule can be represented as an interval on a single line, and a possible pairing between two disjoint sequences can be represented as a 2-interval, which is the union of two disjoint intervals. A maximum disjoint subset of a candidate set of 2-intervals restricted to certain prespecified geometrical constraints is a valid approximation for the RNA secondary structure.
We review some definitions (Vialette, 2004). A 2-interval D = (I, J) consists of two disjoint (closed) intervals I and J such that I < J, that is, I is completely to the left of J. Consider two 2-intervals D1 = (I1, J1) and D2 = (I2, J2). D1 and D2 are disjoint if the four intervals I1, J1, I2, and J2 are pairwise disjoint. Define three binary relations for disjoint pairs of 2-intervals:
Preceding: D1 < D2 ⇐⇒ I1 < J1 < I2 < J2.
Table 1 The complexities of the 2-interval pattern problem over its various models and restrictions
             Restriction on input
Model        Unlimited, Balanced, Unitary       Point
{<, ⊏, ≬}    NP-complete (Vialette, 2004)       O(n√n) (Micali and Vazirani, 1980; Vialette, 2004)
{⊏, ≬}       NP-complete (Vialette, 2004)       O(n log n + L) (Yuan et al., 2005)
{<, ≬}       NP-complete (Blin et al., 2004)    ?
{<, ⊏}       O(n log n + dn) (Yuan et al., 2005), all four restrictions
{≬}          O(n log n + L) (Yuan et al., 2005), all four restrictions
{⊏}          O(n log n) (Blin et al., 2004), all four restrictions
{<}          O(n log n) (Vialette, 2004), all four restrictions
Nesting: D1 ⊏ D2 ⇐⇒ I2 < I1 < J1 < J2.
Crossing: D1 ≬ D2 ⇐⇒ I1 < I2 < J1 < J2.
The two 2-intervals D1 and D2 are R-comparable for some R ∈ {<, ⊏, ≬} if either (D1, D2) ∈ R or (D2, D1) ∈ R. (For example, D1 and D2 are ⊏-comparable if either D1 ⊏ D2 or D2 ⊏ D1.) Note that the set of binary relations {<, ⊏, ≬} is complete in the sense that any two disjoint 2-intervals are R-comparable for some R ∈ {<, ⊏, ≬}.
Given a model ℛ, which is a non-empty subset of {<, ⊏, ≬} (there are 7 such subsets), a set D of 2-intervals is ℛ-structured if any two distinct 2-intervals in D are R-comparable for some R ∈ ℛ. Given a set D of 2-intervals and a model ℛ, the 2-interval pattern problem is to find a maximum-size ℛ-structured subset of 2-intervals in D.
Besides the various models ℛ, various restrictions can also be imposed on the input 2-interval set D for the 2-interval pattern problem. Define the support of a set D of 2-intervals as the set of intervals {I, J | (I, J) ∈ D}. There are four common types of restrictions:
Unlimited: No restrictions.
Balanced: Every 2-interval in D consists of two intervals of equal length.
Unitary: Every interval in the support of D has unit length.
Point: The intervals in the support of D are pairwise disjoint (therefore they can be considered as points).
The three types of restrictions, unlimited, unitary, and point, were originally introduced by Vialette (2004); the balanced restriction was later proposed by Crochemore et al. (2005) because it is natural in the biological setting of the 2-interval problems.
Since Vialette's pioneering work (Vialette, 2004), the 2-interval pattern problem has been extensively studied. We summarize the complexities of the problem over its various models and restrictions in Table 1. For the three cases, {⊏, ≬}-point, {<, ⊏}-unlimited, and {≬}-unlimited, the time complexities of the three algorithms by Yuan et al.
(2005) are parameterized: we always have L ≤ dn ≤ n^2, but it is possible that L = Θ(n^2) and d = Θ(n) and that both O(n log n + L) and O(n log n + dn) become Θ(n^2) in the worst case.
For the hardness results, we note that Bar-Yehuda et al. (2002) considered a more general problem of finding a maximum weight independent set (MWIS) in a t-interval graph. Their APX-hardness result for the MWIS problem on (2, 2)-union graphs (Bar-Yehuda et al., 2002, Theorem 2.1) implies that the weighted version of the 2-interval pattern problem in the {⊏, ≬}-unitary case is APX-hard, which further implies that the weighted 2-interval pattern problem is APX-hard over both the {⊏, ≬} and {<, ⊏, ≬} models and the unlimited, balanced, and unitary restrictions. The complexity for the {<, ≬}-point case is unknown.
The NP-completeness of the 2-interval pattern problem over the three models {<, ⊏, ≬}, {⊏, ≬}, and {<, ≬} is not surprising. These three models all include the ≬ relation, which is indispensable for the representation (as 2-intervals) of the RNA secondary structures with pseudoknots. The computational hardness for these models is consistent with our knowledge that the RNA secondary structures with pseudoknots are difficult to predict in practice.
Naturally, researchers have directed their attention to the design of efficient approximation algorithms. We refer to Table 2 for the best approximation ratios of polynomial-time approximation algorithms for the 2-interval pattern problem. Crochemore et al. (2005) designed the first non-trivial approximation algorithms for almost all the cases in Table 2 except the {<, ⊏, ≬}-unlimited case and the {<, ⊏, ≬}-unitary case, for which the approximations were derived from the results by Bar-Yehuda et al. (2002).

Table 2 The best approximation ratios for the 2-interval pattern problem. Our improvements are marked by "old → new"
             Restriction on input
Model        Unlimited   Balanced   Unitary   Point
{<, ⊏, ≬}    4           4          3         N/A
{⊏, ≬}       4           4          3         N/A
{<, ≬}       6 → 2       5 → 2      3 → 2     2

In this paper, we present a polynomial-time 2-approximation algorithm for the {<, ≬}-unlimited case of the 2-interval pattern problem. This improves the previous best approximation ratios of 6, 5, and 3, respectively, for the three cases {<, ≬}-unlimited, {<, ≬}-balanced, and {<, ≬}-unitary by Crochemore et al. (2005).
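The definitions of this section can be made concrete with a small sketch. It assumes a representation that is not part of the paper: a closed interval is a pair (lo, hi), and a 2-interval is a pair of such intervals. The function names `left_of`, `relation`, `comparable`, and `is_structured`, and the string labels "<", "nest", and "cross" for the three relations, are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def left_of(a, b):
    """Closed interval a = (lo, hi) lies completely to the left of b."""
    return a[1] < b[0]

def relation(d1, d2):
    """Return the relation that holds for the ordered pair (d1, d2), or None."""
    (i1, j1), (i2, j2) = d1, d2
    if left_of(i1, j1) and left_of(j1, i2) and left_of(i2, j2):
        return "<"        # preceding: I1 < J1 < I2 < J2
    if left_of(i2, i1) and left_of(i1, j1) and left_of(j1, j2):
        return "nest"     # nesting (d1 inside d2): I2 < I1 < J1 < J2
    if left_of(i1, i2) and left_of(i2, j1) and left_of(j1, j2):
        return "cross"    # crossing: I1 < I2 < J1 < J2
    return None           # not in any relation in this order

def comparable(d1, d2):
    """Relation under which d1 and d2 are comparable in either order, or None."""
    return relation(d1, d2) or relation(d2, d1)

def is_structured(ds, model):
    """True if every two 2-intervals in ds are comparable under a relation in model."""
    return all(comparable(x, y) in model for x, y in combinations(ds, 2))
```

For instance, under these assumptions, `((0, 1), (4, 5))` and `((2, 3), (6, 7))` cross, so together they form a set that is structured for the model `{"<", "cross"}` but not for `{"<", "nest"}`. Two 2-intervals that are not disjoint are comparable under no relation, so any set containing them fails the check for every model.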
2 The algorithm
For two 2-intervals D1 = (I1, J1) and D2 = (I2, J2), we define D1 ≺ D2 ⇐⇒ D1 < D2 or D1 ≬ D2. If D1 ≺ D2, then I1 < I2 and J1 < J2. Therefore the ≺ relation specifies a total order for any {<, ≬}-structured set of 2-intervals.
Let S be a {<, ≬}-structured set of 2-intervals. Consider S as a sequence ordered by the ≺ relation. Denote by S[i] the 2-interval with rank i in S. Denote by S[i, j] the subsequence S[i]S[i + 1] · · · S[j]. For each index i = 1, . . . , |S|, define next(i) as the smallest index j, 1 ≤ j ≤ |S|, such that S[i] < S[j]; if such an index j does not exist, define next(i) = |S| + 1. Define the backbone indices of S as the sequence of indices i_1, i_2, . . . , i_k such that i_1 = 1, i_s = next(i_{s−1}) for s = 2, . . . , k, and next(i_k) = |S| + 1. (For convenience, we also define i_{k+1} = |S| + 1, and imagine a 2-interval S[i_{k+1}] such that S[i] < S[i_{k+1}] for all 1 ≤ i ≤ |S|.) Define the backbone elements of S as the 2-intervals in S at the backbone indices.
For each backbone index i_s, 1 ≤ s ≤ k, define a stripe T(i_s) = S[i_s + 1, i_{s+1} − 1]. The stripe is odd if s is odd; it is even if s is even. For each 2-interval γ ∈ T(i_s), we observe that S[i_s] ≬ γ ≺ S[i_{s+1}]. If there were two 2-intervals α, β ∈ T(i_s) such that α < β, then it would be impossible that both S[i_s] ≬ α and S[i_s] ≬ β. Therefore, every stripe of S is {≬}-structured.
A {<, ≬}-structured sequence S is striped if either its odd stripes are all empty or its even stripes are all empty. Although S itself may not be striped, it always contains two striped subsequences
S[i_1] T(i_1) S[i_2] S[i_3] T(i_3) S[i_4] . . . , and
S[i_1] S[i_2] T(i_2) S[i_3] S[i_4] T(i_4) . . . .
These two subsequences together cover the sequence S: the 2-intervals at the backbone indices are covered twice, and each remaining 2-interval is covered once. It follows that one of the two striped subsequences has a length of at least |S|/2. This observation immediately suggests that a 2-approximation for a largest {<, ≬}-structured subset of D can be obtained by finding a longest striped sequence in D.
We design the following algorithm that finds a longest striped sequence in D by dynamic programming:
1. Make a dummy 2-interval ω such that γ < ω for all γ ∈ D. Set D̄ ← D ∪ {ω}.
2. For each pair of 2-intervals α and β in D̄ with α < β, find the subset of 2-intervals D̄_αβ = {γ | γ ∈ D̄, α ≬ γ ≺ β}, then compute C_αβ, a maximum-size {≬}-structured subset of D̄_αβ.
3. Process the 2-intervals in D̄ in an arbitrary order that conforms to the partial order specified by the < relation. For each 2-interval β, find the subset of 2-intervals D̄_β = {α | α ∈ D̄, α < β}. If D̄_β is empty, set A_β ← {β} and B_β ← {β}.
Otherwise, compute A_β and B_β as follows:
(a) Find α ∈ D̄_β such that |B_α| is maximum, and set A_β ← B_α ∪ {β};
(b) Find α ∈ D̄_β such that |A_α| + |C_αβ| is maximum, and set B_β ← A_α ∪ C_αβ ∪ {β}.
4. Let S be either A_ω or B_ω such that |S| is maximum. Return S\{ω}.
In the algorithm, we use A_β and B_β to represent the longest striped sequences in the two different alternating patterns, with β as both the last element and the last backbone element. The 2-interval α in steps 3(a) and (b) represents the second-to-last backbone element in A_β and B_β. The subset C_αβ represents the maximum-size stripe between the two backbone elements α and β.
The key idea of the algorithm is that, since a striped sequence has an alternating pattern, the 2-intervals in C_αβ do not "interfere" with the 2-intervals in the previous stripes. Suppose that A_α = B_σ ∪ {α} for some σ < α. Consider a 2-interval γ_b ∈ B_σ and a 2-interval γ_c ∈ C_αβ. We have σ < α ≬ γ_c, and either γ_b = σ or γ_b ≺ σ. In both cases the right interval of γ_b lies to the left of the left interval of α, which in turn lies to the left of the left interval of γ_c. Therefore, γ_b < γ_c. It follows by induction that A_β and B_β are {<, ≬}-structured for all β ∈ D̄.
Because of our choice of the dummy 2-interval ω, the longest striped sequence in D̄ must have ω as both the last element and the last backbone element. Therefore, the dynamic programming formulation guarantees that either A_ω or B_ω must be a longest striped sequence in D̄. To see that S\{ω} is a longest striped sequence in D, we note that, if D had a longer striped sequence S′, then S′ ∪ {ω} would be a longer striped sequence in D̄ than S, which contradicts the optimality of S.
The running time of the algorithm is dominated by step 2. For each of the O(n^2) subproblems, a maximum-size {≬}-structured subset of D̄_αβ is a maximum independent set in the corresponding trapezoid graph, which can be found in O(n log n) time (Felsner et al., 1997). We have the following theorem.
Theorem 1. Our algorithm approximates the {<, ≬}-structured 2-interval pattern problem with a ratio of 2 and runs in O(n^3 log n) time.
Acknowledgment The author thanks the two anonymous referees for helpful comments.
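As an illustration of the dynamic program of Section 2, the following Python sketch follows steps 1 to 4 under assumptions that are not part of the paper: a closed interval is a pair (lo, hi) of numbers, a 2-interval is a pair of intervals, and all function names are hypothetical. It also swaps in a simple quadratic chain computation for the O(n log n) trapezoid-graph algorithm of Felsner et al. (1997) in step 2, so this sketch runs in time on the order of n^4 rather than O(n^3 log n). The chain computation relies on the fact that all candidates cross the same 2-interval α, which makes crossing transitive among them.

```python
def left_of(a, b):
    """Closed interval a = (lo, hi) lies completely to the left of b."""
    return a[1] < b[0]

def precedes(d1, d2):
    """Preceding relation: I1 < J1 < I2 < J2."""
    (i1, j1), (i2, j2) = d1, d2
    return left_of(i1, j1) and left_of(j1, i2) and left_of(i2, j2)

def crosses(d1, d2):
    """Crossing relation: I1 < I2 < J1 < J2."""
    (i1, j1), (i2, j2) = d1, d2
    return left_of(i1, i2) and left_of(i2, j1) and left_of(j1, j2)

def max_crossing_subset(cands):
    """Largest pairwise-crossing subset of cands, by a quadratic chain DP.

    Assumes every candidate crosses one common 2-interval, so that a chain
    of consecutive crossings is pairwise crossing.
    """
    cands = sorted(cands)      # ordered by the left endpoint of I
    chains = []                # chains[k]: a longest chain ending at cands[k]
    for k, g in enumerate(cands):
        prev = max((chains[m] for m in range(k) if crosses(cands[m], g)),
                   key=len, default=[])
        chains.append(prev + [g])
    return max(chains, key=len, default=[])

def longest_striped(two_intervals):
    """Return a longest striped sequence as a list of 2-intervals."""
    # Step 1: a dummy 2-interval omega that every input 2-interval precedes.
    top = max((hi for d in two_intervals for (_, hi) in d), default=0)
    omega = ((top + 1, top + 2), (top + 3, top + 4))
    # Sorting by the left endpoint of I conforms to the preceding order.
    dbar = sorted(two_intervals) + [omega]
    # Step 2: C[a, b] is a maximum stripe between backbone elements a < b.
    C = {}
    for a in dbar:
        for b in dbar:
            if precedes(a, b):
                cands = [g for g in dbar if crosses(a, g)
                         and (precedes(g, b) or crosses(g, b))]
                C[a, b] = max_crossing_subset(cands)
    # Step 3: A[b] ends with an empty last stripe, B[b] with stripe C[a, b].
    A, B = {}, {}
    for b in dbar:
        preds = [a for a in dbar if precedes(a, b)]
        if not preds:
            A[b] = B[b] = [b]
            continue
        a1 = max(preds, key=lambda a: len(B[a]))
        A[b] = B[a1] + [b]
        a2 = max(preds, key=lambda a: len(A[a]) + len(C[a, b]))
        B[b] = A[a2] + C[a2, b] + [b]
    # Step 4: take the larger of the two sequences and drop omega.
    best = max(A[omega], B[omega], key=len)
    return best[:-1]
```

On the four 2-intervals a = ((0, 1), (4, 5)), b = ((2, 3), (6, 7)), c = ((8, 9), (12, 13)), and d = ((10, 11), (14, 15)), the whole set is {<, ≬}-structured with backbone elements a and c and stripes {b} and {d}. It is therefore not striped, and the sketch returns a striped sequence of three 2-intervals, which is within the factor of 2 of the optimum of four.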
References
Bar-Yehuda R, Halldórsson MM, Naor J(S), Shachnai H, Shapira I (2002) Scheduling split intervals. In: Proceedings of the 13th annual ACM-SIAM symposium on discrete algorithms (SODA'02), pp 732–741
Blin G, Fertin G, Vialette S (2004) New results for the 2-interval pattern problem. In: Proceedings of the 15th annual symposium on combinatorial pattern matching (CPM'04), LNCS 3109, pp 311–322
Crochemore M, Hermelin D, Landau GM, Vialette S (2005) Approximating the 2-interval pattern problem. In: Proceedings of the 13th annual European symposium on algorithms (ESA'05), LNCS 3669, pp 426–437
Felsner S, Müller R, Wernisch L (1997) Trapezoid graphs and generalizations, geometry and algorithms. Disc Appl Math 74:13–32
Micali S, Vazirani VV (1980) An O(√|V| |E|) algorithm for finding maximum matching in general graphs. In: Proceedings of the 21st annual symposium on foundations of computer science (FOCS'80), pp 17–27
Vialette S (2004) On the computational complexity of 2-interval pattern matching problems. Theor Comp Sci 312:223–249
Yuan H, Yang L, Chen E (2005) Improved algorithms for largest cardinality 2-interval pattern problem. In: Proceedings of the 16th international symposium on algorithms and computation (ISAAC'05), LNCS 3827, pp 412–421