Lecture Notes in Computer Science 2568
Edited by G. Goos, J. Hartmanis, and J. van Leeuwen

Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Masami Hagiya Azuma Ohuchi (Eds.)
DNA Computing 8th International Workshop on DNA-Based Computers, DNA8 Sapporo, Japan, June 10-13, 2002 Revised Papers
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Masami Hagiya
University of Tokyo
Graduate School of Information Science and Technology, Department of Computer Science
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8565, Japan
E-mail: [email protected]

Azuma Ohuchi
Hokkaido University
Graduate School of Engineering, Division of Complex Systems Technology
N-13, W-8, Kita-ku, Sapporo, 060, Japan
E-mail: [email protected]

Cataloging-in-Publication Data applied for
A catalog record for this book is available from the Library of Congress.
Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): F.1, F.2.2, I.2.9, J.3
ISSN 0302-9743
ISBN 3-540-00531-5 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Boller Mediendesign
Printed on acid-free paper
SPIN: 10872239 06/3142 5 4 3 2 1 0
Preface
Biomolecular computing has emerged as an interdisciplinary field that draws together chemistry, computer science, mathematics, molecular biology, and physics. Our knowledge of DNA nanotechnology and biomolecular computing increases exponentially with every passing year. The international meeting on DNA Based Computers has been a forum where scientists with different backgrounds, yet sharing a common interest in biomolecular computing, meet and present their latest results. Continuing this tradition, the 8th International Meeting on DNA Based Computers (DNA8) focused on current theoretical and experimental results with the greatest impact. Papers and poster presentations were sought in all areas that relate to biomolecular computing, including (but not restricted to): algorithms and applications, analysis of laboratory techniques/theoretical models, computational processes in vitro and in vivo, DNA-computing-based biotechnological applications, DNA devices, error evaluation and correction, in vitro evolution, models of biomolecular computing (using DNA and/or other molecules), molecular design, nucleic acid chemistry, and simulation tools. Papers and posters with new experimental results were particularly encouraged. Authors who wished their work to be considered for either oral or poster presentation were asked to select one of two submission "tracks":

– Track A: Full Paper
– Track B: One-Page Abstract

For authors with late-breaking results, or who were submitting their manuscript to a scientific journal, a one-page abstract, rather than a full paper, could be submitted in Track B. Authors could optionally include a preprint of their full paper, for consideration only by the program committee. We received 49 submissions in Track A and 17 submissions in Track B. These submissions were then reviewed by the program committee members and by external reviewers. In principle, three committee members were allocated to each submission.
After the review reports were returned, all discussions pertaining to the final decisions were conducted online by the program committee members. The program committee wishes to thank the external reviewers, who spent much valuable time reading submissions. We finally selected 16 oral presentations from Track A; these were included as full papers in the preliminary proceedings. Note that the competition was extremely tough for Track A submissions: we selected only one third of them for oral presentation. In addition to the 16 oral presentations, we judged that 12 further Track A submissions were of good quality and relevance, and therefore decided to include these full papers in the preliminary proceedings. These papers were also selected for presentation as posters during the conference. As for Track B, we selected 7 oral presentations.
The meeting was held on June 10–13, 2002, at Hokkaido University, Japan. On the first day of the meeting, we had four tutorials in two tracks: one track on computer science and the other on biotechnology. Both tracks consisted of a basic tutorial held in the morning and an advanced one in the afternoon. The basic computer science tutorial was given by Takashi Yokomori (Waseda University) and was titled "Introduction to Natural Computing." The advanced one was given by John A. Rose (University of Tokyo) and was titled "Nucleic Acid Sequence Design." The basic biotechnology tutorial was given by Toshikazu Shiba (FUJIREBIO Inc.) and was titled "Introduction to Biotechnology." The advanced one was given by Nadrian C. Seeman (New York University) and was titled "DNA Self-assembly." The next three days were devoted to invited plenary lectures and regular oral and poster presentations. Invited plenary lectures were given by Willem P.C. Stemmer (Maxygen, Inc.), Tomoji Kawai (Osaka University), Akira Suyama (University of Tokyo), Tom Head (Binghamton University), and Bernard Yurke (Lucent Technologies). We were very honored to have these prominent researchers as plenary speakers. It is also our great pleasure to mention here that Tom Head, one of the invited speakers of the meeting, received the DNA Computing Scientist of 2002 award. The award ceremony was held at the banquet of the meeting. A paper by Tom Head, related to his invited lecture, is included in this volume. After the meeting, the full papers in the preliminary proceedings were revised by the authors and reviewed again by the program committee members and the external reviewers. The program committee thanks those reviewers again. This volume contains the final versions of the papers after the second review. Finally, the program committee wishes to thank all those who submitted papers and abstracts for consideration.
November 2002
Masami Hagiya
Organization
DNA8 was organized by Hokkaido University and CREST JST (Japan Science and Technology Corporation) in cooperation with NovusGene Inc. and the European Association for Theoretical Computer Science (EATCS).
Program Committee

Masami Hagiya, University of Tokyo, Japan (Program Chair)
Anne Condon, University of British Columbia, Canada
Lila Kari, University of Western Ontario, Canada
Thom LaBean, Duke University, USA
Giancarlo Mauri, Università degli Studi di Milano-Bicocca, Italy
John McCaskill, GMD-National Research Center for Information Technology, Germany
Gheorghe Păun, Institute of Mathematics of the Romanian Academy, Romania
John Rose, University of Tokyo, Japan
Ned Seeman, New York University, USA
Masayuki Yamamura, Tokyo Institute of Technology, Japan
Erik Winfree, California Institute of Technology, USA
David Wood, University of Delaware, USA
Organizing Committee

Azuma Ohuchi, Hokkaido University, Japan (Organizing Committee Chair)
Masahito Yamamoto, Hokkaido University, Japan
Toshikazu Shiba, Hokkaido University, Japan
Hidenori Kawamura, Hokkaido University, Japan
Masami Hagiya, University of Tokyo, Japan
Akira Suyama, University of Tokyo, Japan
Masayuki Yamamura, Tokyo Institute of Technology, Japan
External Reviewers

J. Ackermann, M. Arita, D. Besozzi, M. Daley, R. Deaton, C. Ferretti, A. Leporati, E. Losseva, V. Mitrana, A. Nishikawa, A. Paun, T. Ruecker, Y. Sakakibara, K. Sakamoto, H. Someya, G. Stefan, M. Takano, H. Uejima, K. Wakabayashi, G. Wozniak, M. Yamamoto, T. Yokomori, C. Zandron
Sponsoring Institutions Hokkaido University CREST JST (Japan Science and Technology Corporation) NovusGene Inc. European Association for Theoretical Computer Science (EATCS)
Table of Contents
Self-assembly and Autonomous Molecular Computation

Self-assembling DNA Graphs . . . . 1
Phiset Sa-Ardyen, Nataša Jonoska, Nadrian C. Seeman

DNA Nanotubes: Construction and Characterization of Filaments Composed of TX-tile Lattice . . . . 10
Dage Liu, John H. Reif, Thomas H. LaBean

The Design of Autonomous DNA Nanomechanical Devices: Walking and Rolling DNA . . . . 22
John H. Reif

Cascading Whiplash PCR with a Nicking Enzyme . . . . 38
Daisuke Matsuda, Masayuki Yamamura

Molecular Evolution and Application to Biotechnology

A PNA-mediated Whiplash PCR-based Program for In Vitro Protein Evolution . . . . 47
John A. Rose, Mitsunori Takano, Akira Suyama

Engineering Signal Processing in Cells: Towards Molecular Concentration Band Detection . . . . 61
Subhayu Basu, David Karig, Ron Weiss

Applications to Mathematical Problems

Temperature Gradient-Based DNA Computing for Graph Problems with Weighted Edges . . . . 73
Ji Youn Lee, Soo-Yong Shin, Sirk June Augh, Tai Hyun Park, Byoung-Tak Zhang

Shortening the Computational Time of the Fluorescent DNA Computing . . . . 85
Yoichi Takenaka, Akihiro Hashimoto

How Efficiently Can Room at the Bottom Be Traded Away for Speed at the Top? . . . . 95
Pilar de la Torre

Hierarchical DNA Memory Based on Nested PCR . . . . 112
Satoshi Kashiwamura, Masahito Yamamoto, Atsushi Kameda, Toshikazu Shiba, Azuma Ohuchi
Binary Arithmetic for DNA Computers . . . . 124
Rana Barua, Janardan Misra

Implementation of a Random Walk Method for Solving 3-SAT on Circular DNA Molecules . . . . 133
Hubert Hug, Rainer Schuler

Version Space Learning with DNA Molecules . . . . 143
Hee-Woong Lim, Ji-Eun Yun, Hae-Man Jang, Young-Gyu Chai, Suk-In Yoo, Byoung-Tak Zhang

DNA Implementation of Theorem Proving with Resolution Refutation in Propositional Logic . . . . 156
In-Hee Lee, Ji-Yoon Park, Hae-Man Jang, Young-Gyu Chai, Byoung-Tak Zhang

Universal Biochip Readout of Directed Hamiltonian Path Problems . . . . 168
David Harlan Wood, Catherine L. Taylor Clelland, Carter Bancroft
Nucleic Acid Sequence Design

Algorithms for Testing That Sets of DNA Words Concatenate without Secondary Structure . . . . 182
Mirela Andronescu, Danielle Dees, Laura Slaybaugh, Yinglei Zhao, Anne Condon, Barry Cohen, Steven Skiena

A PCR-based Protocol for In Vitro Selection of Non-crosshybridizing Oligonucleotides . . . . 196
Russell Deaton, Junghuei Chen, Hong Bi, Max Garzon, Harvey Rubin, David Harlan Wood

On Template Method for DNA Sequence Design . . . . 205
Satoshi Kobayashi, Tomohiro Kondo, Masanori Arita

From RNA Secondary Structure to Coding Theory: A Combinatorial Approach . . . . 215
Christine E. Heitsch, Anne E. Condon, Holger H. Hoos

Stochastic Local Search Algorithms for DNA Word Design . . . . 229
Dan C. Tulpan, Holger H. Hoos, Anne E. Condon

NACST/Seq: A Sequence Design System with Multiobjective Optimization . . . . 242
Dongmin Kim, Soo-Yong Shin, In-Hee Lee, Byoung-Tak Zhang

A Software Tool for Generating Non-crosshybridizing Libraries of DNA Oligonucleotides . . . . 252
Russell Deaton, Junghuei Chen, Hong Bi, John A. Rose
Theory

Splicing Systems: Regularity and Below . . . . 262
Tom Head, Dennis Pixton, Elizabeth Goode

On the Computational Power of Insertion-Deletion Systems . . . . 269
Akihiro Takahara, Takashi Yokomori

Unexpected Universality Results for Three Classes of P Systems with Symport/Antiport . . . . 281
Mihai Ionescu, Carlos Martín-Vide, Andrei Păun, Gheorghe Păun

Conformons-P Systems . . . . 291
Pierluigi Frisco, Sungchul Ji

Parallel Rewriting P Systems with Deadlock . . . . 302
Daniela Besozzi, Claudio Ferretti, Giancarlo Mauri, Claudio Zandron

A DNA-based Computational Model Using a Specific Type of Restriction Enzyme . . . . 315
Yasubumi Sakakibara, Hiroshi Imai

Time-Varying Distributed H Systems of Degree 2 Can Carry Out Parallel Computations . . . . 326
Maurice Margenstern, Yurii Rogozhin, Sergey Verlan
Author Index . . . . 337
Self-assembling DNA Graphs

Phiset Sa-Ardyen (1), Nataša Jonoska (2), and Nadrian C. Seeman (1)

(1) New York University, Department of Chemistry, New York, NY 10003
[email protected], [email protected]
(2) University of South Florida, Department of Mathematics, Tampa, FL 33620
[email protected]
Abstract. We present experimental results on the construction of non-regular graphs using junction molecules as vertices and duplex DNA molecules as edge connections.
1 Introduction
A variety of computational models based on DNA properties and Watson-Crick complementarity have been introduced recently. Several researchers have observed that the inherent three-dimensional structure of DNA and the self-assembly of complex three-dimensional structures can be used as computational devices as well as for improving the computational power of some models. DNA tiles that perform computation, made of double and triple cross-over molecules, have been proposed for self-assembly of two-dimensional arrays [9,14]. Such arrays made of DNA tiles have been obtained experimentally [10,15]. In the case of triple cross-over molecules, a simple XOR operation was experimentally confirmed [8]. Branched junction molecules and graph-like DNA structures as computational devices have been proposed by several authors [5,6,16,14,13]. Splicing of tree-like structures made of junction molecules and hairpins, based on the splicing model proposed by T. Head [4], was suggested in [16]. In [13], the authors propose to simulate Horn clause computation by self-assembly of three-junctions and hairpins. Computation of a variety of problems by actual construction of graphs from DNA molecules was proposed in [5,6]. All these models using branched junction molecules and construction of graphs are yet to be confirmed experimentally. Branched junction DNA molecules are already fairly well understood experimentally, and in fact they have been used to form certain three-dimensional DNA structures such as a DNA cube and a truncated octahedron [1,17]. Although these DNA constructs were made out of junction molecules, they all represent regular graphs, i.e. the degrees of all of the vertices are the same. The experimental design and construction of a non-regular graph solely from junction molecules (representing the vertices of the graph) and duplex molecules (representing the edges) has not been obtained.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 1–9, 2003.
© Springer-Verlag Berlin Heidelberg 2003
In this paper we show experimental progress in obtaining such non-regular DNA graph structures. A graph made of 5 vertices and 8 edges was chosen for self-assembly. For construction of the DNA graph we use junction molecules that represent the vertices and duplex molecules that represent the edges. The molecules have vertex-edge specific sticky ends and use WC complementarity for self-assembling the graph. In section 2 we explain the design of the oligos. The experimental results are shown in section 3.
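The sticky-end matching that drives this self-assembly is ordinary Watson-Crick (WC) complementarity: two single-stranded overhangs anneal when one is the reverse complement of the other. As an illustration only (the sequences and function names below are ours, not taken from the paper's design), the matching rule can be sketched in Python:

```python
# Illustrative sketch of Watson-Crick sticky-end matching.
# Sequences below are hypothetical, not the paper's designed oligos.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq: str) -> str:
    """Return the Watson-Crick reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def can_hybridize(sticky_end_a: str, sticky_end_b: str) -> bool:
    """True if the two sticky ends are exact WC complements (antiparallel)."""
    return reverse_complement(sticky_end_a) == sticky_end_b.upper()

# A 6-base sticky end anneals only to its reverse complement:
assert can_hybridize("GAATGG", "CCATTC")
assert not can_hybridize("GAATGG", "GAATGG")
```

Vertex-edge specific sticky ends amount to choosing these overhang sequences so that each junction arm hybridizes only with its intended edge duplex.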
2 Design of the Self-assembly
In [6,7] the graph depicted in Figure 1(a) was used as an example to show how the three-vertex colorability and the Hamiltonian path problems could be solved by construction of the graph. For our experiment we chose the same graph. Since the vertex v5 has degree two, for the experiment we used one duplex molecule to represent the vertex v5 and the edges connecting v5 with v1 and v4. Hence, the designed experiment uses 5 vertices and 8 edges as depicted in Figure 1(b).
Fig. 1. The graph to be DNA self-assembled.

The five vertices were designed to be junction molecules: 3-armed junctions for vertices v1, v3, v4 and v5, and a 4-armed junction molecule for v2. The sticky ends were designed such that once the molecules hybridize and are ligated to form the graph structure, the final ligation product is one cyclic molecule (possibly one or more knots), which can be distinguished from the partially formed linear products by two-dimensional gel electrophoresis techniques. This final structure is depicted in Figure 2. Six DNA strands are involved in the construction of each edge that connects two vertices of the graph: two strands from the junction molecule representing one of the vertices, two strands from the junction molecule representing the other vertex, and two strands forming the duplex molecule that represents the edge (see Figure 3). The distances between the vertices were designed such that e1 = {v1, v2}, e2 = {v2, v3}, e5 = {v3, v4}, and e8 = {v5, v1} are all 4 helical turns, i.e. the length
Fig. 2. The final DNA structure representing the graph in Figure 1(b). The double helix is not depicted, and the shaded areas represent the positions of the sticky ends. The stars indicate the 5′ ends that were labeled with γ-32P-ATP for the experiment whose results are shown in Fig. 7. The hairpins added to the four edges e3, e4, e6 and e7 are not presented.
Fig. 3. Six DNA strands involved in assembling an edge connection.
of these DNA segments is 42 base pairs. To attain the desired flexibility, the other distances were longer. The edges e3 = {v2, v5}, e4 = {v2, v4}, and e7 = {v3, v5} were chosen to be 6 helical turns, i.e. 63 base pairs. The longest segment, of 8 helical turns (i.e. 84 bp), is e6 = {v1, v4}. Each of the longer edge molecules e3, e4, e6, e7 contains an additional hairpin of one and a half helical turns in the middle of the molecule. The 3-junctions of the hairpins contain bulges of T's at each turn, as can be observed in Figure 4. These hairpins are expected to add flexibility to the edge molecules. For similar reasons, i.e. for increased flexibility of the junctions, bulges of two T's were added to the junction molecules at the positions where the arms of the junction meet (Figure 5). Shorter edges, obtained by connecting short duplex molecules with the arms of the appropriate junctions, use sticky ends of 6 bases. The sticky ends for connection of the longer edges were designed to be 8 bases long.
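The correspondence between helical turns and base-pair counts above follows from the standard B-DNA helical repeat of roughly 10.5 base pairs per turn. A minimal sketch of this arithmetic (the constant and function names are ours, not the paper's):

```python
# Edge-length arithmetic, assuming the approximate B-DNA helical
# repeat of 10.5 base pairs per full turn.

BP_PER_TURN = 10.5  # approximate B-DNA helical repeat

def edge_length_bp(helical_turns: float) -> int:
    """Length in base pairs of a duplex segment of the given number of turns."""
    return round(helical_turns * BP_PER_TURN)

# The edge lengths used in the design:
assert edge_length_bp(4) == 42  # e1, e2, e5, e8
assert edge_length_bp(6) == 63  # e3, e4, e7
assert edge_length_bp(8) == 84  # e6
```

Designing edges as whole numbers of helical turns keeps the sticky ends of adjacent junctions in the intended rotational alignment.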
Fig. 4. DNA sequences for the edges.

For the purposes of later, more detailed analysis, a unique restriction site was incorporated into the sequence of each edge. For the longer edge molecules, these restriction sites were included in the sequences of the hairpins. The sequences for each of the strands were designed with the program SEQUIN [11]. The sticky ends were designed by DNASequenceGenerator [3]. In using SEQUIN, special care was taken so that no "sticky end" sequence appears in the whole final product more than once. In the design of the 4-junction molecule that represents v2 we took advantage of the already well understood J1 molecule [12], and the first 8 bases of the arms of v2 contain the exact sequence of J1.
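The uniqueness constraint on sticky-end sequences can be checked by brute force over the final product sequence. The sketch below is illustrative (the toy product string is hypothetical, not the paper's actual sequences; the paper's check was done during design with SEQUIN):

```python
# Brute-force check that a sticky-end sequence occurs exactly once
# in an assembled product, counting overlapping occurrences.

def sticky_end_is_unique(product: str, sticky_end: str) -> bool:
    """True iff sticky_end appears exactly once in product."""
    count = 0
    start = 0
    while True:
        idx = product.find(sticky_end, start)
        if idx == -1:
            return count == 1
        count += 1
        start = idx + 1  # advance by one to count overlapping matches

# Hypothetical toy example:
product = "GAATGGCCTAAGTATGTAGAATTC"
assert sticky_end_is_unique(product, "CCTAAG")
assert not sticky_end_is_unique(product, "GAAT")  # occurs twice
```

A design in which some sticky-end sequence recurs elsewhere in the product would permit mis-ligations, so every candidate sticky end must pass this test against the full 1084-base product.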
3 Experimental Results
The shorter DNA oligonucleotides used in five of the vertices and four of the shorter edges were ordered from IDT, Inc., non-phosphorylated. The longer molecules were synthesized on an Applied Biosystems 380 automatic DNA synthesizer. In contrast to the shorter oligonucleotides, these were phosphorylated during their syntheses. All molecules were purified by electrophoresis.

3.1 Preliminaries
All oligonucleotides were subjected to phosphorylation using kinase labeling. First, junction molecules and edge molecules were annealed by heating each
Fig. 5. DNA sequences for the vertices.

solution to 90°C for two minutes and slowly cooling it to room temperature. Formation of each of the junction molecules and the edge molecules, as well as the stoichiometry, was established on 12% non-denaturing polyacrylamide gels. Figure 6 shows non-denaturing gels for two of the junctions, representing vertices v2 and v5. The lanes indicated by the arrows show the junction molecule with all of the strands. The remaining lanes show partial complexes that contain only one, two, or, in the case of the four-junction, three of the strands. From the lanes containing the junction molecule it is clear that almost all molecules are included in the formation of the complex and there are no stoichiometry problems. Similar experiments were performed with the other junctions and the hairpins. These results are not shown here. Four strands of the expected final product were chosen to be labeled with γ-32P-ATP. One strand at each of the junctions representing vertices v1, v3, v4 and v5 was labeled. The labeled strands are marked with a star in Figure 2. The labeling of a v2 strand was not successful and no ligation product was visible. All labels were chosen such that they would appear evenly spread throughout the cyclic molecule that represents the final product. The ligation of the whole graph was done by annealing all junction molecules and edge molecules separately first, according to the determined stoichiometry. Then all these complexes were mixed together and treated with T4 DNA ligase at room temperature overnight. The final single cyclic DNA molecule that represents the graph is 1084 bases long. The gel of the ligation product of these four ligations using four distinctly labeled
Fig. 6. Non-denaturing gels for the junction molecules representing vertices v2 and v5.

strands is presented in Figure 7. The first lane contains a linear marker, and lanes 2, 3, 4, and 5 contain ligation products with labeled strands of junctions at v1, v3, v4 and v5, respectively. It is clear that a band of approximately 1078 bases, i.e. the length of the expected final product, has been formed in each reaction. Having this band in all four lanes indicates that all four labeled strands are included in the formation of the molecules of this length.

3.2 Final Product Results
For obtaining the final cyclic molecule, the above procedure was slightly modified. There are 32 ligation sites in the self-assembly. Even with 90% ligase efficiency, we would not have more than about 3% of the final product. By labeling only a trace of the strands we faced the possibility that the final cyclic molecule might not be visible on the autoradiograph. The band of the expected size that is clearly visible in Figure 7 was eluted, but the eluted amount was hardly detectable, and additional experiments did not produce a visible product. For the two-dimensional electrophoresis we therefore modified the labeling protocol. The junction and edge molecules were annealed as before, but this time prior to the phosphorylation. After annealing, all junction and edge molecules were subjected to γ-32P-ATP labeling for one hour with kinase. To improve the phosphorylation, unlabeled ATP molecules were added for an additional 7 minutes. We expected that with this process the visibility of a small yield of the final circular product would improve. In Figure 8, a two-dimensional gel is presented. The first dimension was run on a 3.5% denaturing gel and the second dimension on 5%. The circled cyclic
Fig. 7. Ligation product of self-assembly of the whole graph run on a superdenaturing gel.
bands (gel areas) denoted A–H were extracted and eluted. The final cyclic product has a very complex secondary structure; in fact it might take the form of many topoisomers and may appear in the gel at several places off-diagonal. Its behavior could not be predicted, so several different bands that were clearly not linear were tested. After elution, the excised molecules were subjected to linearization (heating at 95°C for 20 minutes). The result of this linearization is presented on the right in Figure 8. Lanes 3 and 6, corresponding to the excised areas B and E from the two-dimensional gel, show a unique clear band of the expected size of approximately 1100 bases. There appears to be no product in the lanes corresponding to A, G, and H. On the other hand, there seem to be too many bands in the lanes corresponding to C and D. At this point we have no analysis of these extra bands; they may correspond to a dimer (double cover) molecule of the expected graph. Lane F shows two molecules, neither of the size that was expected. Since the position of the area marked F in the two-dimensional gel is somewhat below the curve of the other excised areas, it might not contain the target product.
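The yield estimate at the start of this subsection (32 ligation sites, roughly 90% per-site ligase efficiency, at most a few percent full product) can be reproduced with a simplified model that treats the ligation sites as independent, so the fraction of assemblies in which every site ligates is p^32 for per-site efficiency p:

```python
# Back-of-envelope yield model: the full cyclic product requires
# all ligation sites to close; treating sites as independent gives p**sites.

def full_product_yield(p: float, sites: int = 32) -> float:
    """Expected fraction of assemblies with every ligation site closed."""
    return p ** sites

yield_90 = full_product_yield(0.90)  # about 0.034, i.e. roughly 3%
assert 0.03 < yield_90 < 0.04
```

This independence assumption ignores cooperative or partially assembled intermediates, but it matches the paper's "no more than 3%" estimate to within rounding.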
Fig. 8. Two-dimensional denaturing gel of the ligation product (left) and a denaturing gel of the linearization of the bands extracted from the 2D gel (right).
4 Concluding Remarks
Since the oligonucleotides and the sticky ends of the junctions were designed so that the target molecule representing the graph is one circular molecule (or knot), the results in Figure 8 suggest that we sometimes form the expected product. Although our experiments up to this point are very encouraging, we have not confirmed with certainty that the desired cyclic molecule has been obtained. Such a confirmation would be obtained by sequencing the final product and comparing it with the sequences of the designed oligonucleotides. Since the yield of the resulting molecules is very low, the product extracted from the two-dimensional gel will need to be amplified by PCR. In that regard, the restriction sites at the hairpins of the longer edges can be utilized to specify the primers for the PCR reaction.

Acknowledgement. This work has been supported by grants GM-29554 from NIGMS, N00014-98-1-0093 from ONR, grants CTS-9986512, EIA-0086015, EIA-0074808, DMR-01138790, and CTS-0103002 from NSF, and F30602-01-2-0561 from DARPA/AFOSR.
References

1. J.H. Chen, N.C. Seeman, Synthesis from DNA of a molecule with the connectivity of a cube, Nature 350 (1991) 631-633.
2. J.H. Chen, N.R. Kallenbach, N.C. Seeman, A specific quadrilateral synthesized from DNA branched junctions, J. Am. Chem. Soc. 111 (1989) 6402-6407.
3. U. Feldkamp, S. Saghafi, H. Rauhe, DNASequenceGenerator: A program for the construction of DNA sequences, DNA Computing: Proceedings of the 7th International Meeting on DNA Based Computers (N. Jonoska, N.C. Seeman, editors), Springer LNCS 2340 (2002) 23-32.
4. T. Head, Formal language theory and DNA: an analysis of the generative capacity of recombinant behaviours, Bull. of Mathematical Biology 49 (1987) 735-759.
5. N. Jonoska, S. Karl, M. Saito, Creating 3-dimensional graph structures with DNA, DNA Computers III (H. Rubin, D. Wood, editors), AMS DIMACS Series 48 (1999) 123-135.
6. N. Jonoska, S. Karl, M. Saito, Three dimensional DNA structures in computing, BioSystems 52 (1999) 143-153.
7. N. Jonoska, 3D DNA patterns and computing, Pattern Formation in Biology, Vision and Dynamics (A. Carbone, M. Gromov, P. Prusinkiewicz, editors), World Scientific Publishing Company, Singapore (1999) 310-324.
8. C. Mao, T.H. LaBean, J.H. Reif, N.C. Seeman, Logical computation using algorithmic self-assembly of DNA triple-crossover molecules, Nature 407 (2000) 493-496.
9. J.H. Reif, Local parallel biomolecular computation, in DNA Based Computers III (H. Rubin, D.H. Wood, editors), AMS DIMACS Series 48 (1999) 217-254.
10. P.W.K. Rothemund, Using lateral capillary forces to compute by self-assembly, Proc. of Nat. Acad. Sci. (USA) 97 (2000) 984-989.
11. N.C. Seeman, De novo design of sequences for nucleic acid structural engineering, J. of Biomolecular Structure & Dynamics 8 (3) (1990) 573-581.
12. N.C. Seeman, N.R. Kallenbach, Nucleic acid junctions: a successful experiment in macromolecular design, Molecular Structure: Chemical Reactivity and Biological Activity (J.J. Stezowski, J.L. Huang, M.C. Shao, editors), Oxford Univ. Press, Oxford (1988) 189-194.
13. H. Uejima, M. Hagiya, S. Kobayashi, Horn clause computation by self-assembly of DNA molecules, DNA Computing: Proceedings of the 7th International Meeting on DNA Based Computers (N. Jonoska, N.C. Seeman, editors), Springer LNCS 2340 (2002) 308-320.
14. E. Winfree, X. Yang, N.C. Seeman, Universal computation via self-assembly of DNA: some theory and experiments, DNA Computers II (L. Landweber, E. Baum, editors), AMS DIMACS Series 44 (1998) 191-214.
15. E. Winfree, F. Liu, L.A. Wenzler, N.C. Seeman, Design and self-assembly of two-dimensional DNA crystals, Nature 394 (1998) 539-544.
16. Y. Sakakibara, C. Ferretti, Splicing on tree-like structures, DNA Computers III (H. Rubin, D.H. Wood, editors), AMS DIMACS Series 48 (1999) 361-375.
17. Y. Zhang, N.C. Seeman, The construction of a DNA truncated octahedron, J. Am. Chem. Soc. 116 (1994) 1661-1669.
DNA Nanotubes: Construction and Characterization of Filaments Composed of TX-tile Lattice
Dage Liu, John H. Reif, and Thomas H. LaBean
Department of Computer Science, Duke University, Durham, NC 27708 USA
{liudg, reif, thl}@cs.duke.edu
Abstract. DNA-based nanotechnology is currently being developed for use in biomolecular computation, fabrication of 2D tile lattices, and engineering of 3D periodic matter. Here we present recent results on the construction and characterization of DNA nanotubes – a new self-assembling superstructure composed of DNA tiles. Triple-crossover (TAO) tiles modified with thiol-containing dsDNA stems projecting out of the tile plane were utilized as the basic building block. TAO nanotubes display a constant diameter of approximately 25 nm and have been observed with lengths up to 20 microns. We present high-resolution images of the constructs from transmission electron microscopy (TEM) and atomic force microscopy (AFM), as well as preliminary data on successful metallization of the nanotubes. DNA nanotubes represent a potential breakthrough in the self-assembly of nanometer-scale circuits for electronics layout, since they can be targeted to connect at specific locations on larger-scale structures and can subsequently be metallized to form nanometer-scale wires. The dimensions of these nanotubes are also perfectly suited for applications involving interconnection of molecular-scale devices with macroscale components fabricated by conventional photolithographic methods.
1 Background & Introduction
1.1 Self-assembling DNA Nanostructures
Adleman’s initial report of the experimental implementation of an artificial computer utilizing DNA molecules as integral components has helped to spur widespread interest in the use of DNA in information-processing systems [1]. DNA-based computing can be thought of as a subfield of the broader endeavor known as DNA nanotechnology [2], [3], [4], [5], [6]. These enterprises seek to engineer synthetic DNA polymers that encode the information necessary for realization of desired structures or processes at the molecular level. Properly designed sets of DNA oligonucleotides are able to self-assemble into complex, highly organized structures. The major advantage of DNA as a self-assembling structural material over all other currently known chemical systems is DNA’s well-understood base-pairing rules, which essentially act as a programming language with which to input organizational instructions. In particular, B-form DNA, the structure adopted by double-stranded DNA (dsDNA) under standard solution conditions, consists of the famous anti-parallel double helix of complementary strands. Its specifically programmable molecular recognition ability, as well as the huge diversity of available sequences, makes DNA ideal for controlling the assembly of nanostructures. Annealing of complementary strands is achieved by simply heating dissolved oligonucleotide strands above their dissociation temperature and then cooling them very slowly, allowing strands to find their sequence complements and bind. Unpaired single-strand regions (ssDNA) extending from the ends of double-strand domains can be used to specifically bind other dsDNA displaying a complementary ssDNA sticky end. Regions of ssDNA can therefore act as address codes capable of orienting attached substructures to neighboring points in 3-space. If DNA is appended to nanoscale objects and chemical moieties, it can act as smart glue for organizing such objects and groups in space. An obvious advantage of DNA self-assembly for fabrication is the simultaneous creation of millions or billions of copies of a desired structure, owing to the massive parallelism inherent in molecular processes.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 10–21, 2003. © Springer-Verlag Berlin Heidelberg 2003

1.2 DNA Tiles, Lattices, and Templates
Fig. 1. Example DNA tiles. DAO: 2 double helices, 2 crossovers. TAE: 3 double helices, 4 crossovers. TAO tiles, which share important features of each, were used in the construction of the nanotubes.

DNA nanotechnology and some approaches to DNA-based computing rely on unusual structural motifs, including the immobile branched junction [2], [7], which is a stable analogue of the Holliday intermediate in genetic recombination [8]. Normal, linear double-strand DNA can only be used to form linear, 1D structures, while DNA complexes containing stable branched junction motifs
can be used to construct 2D and 3D lattices. The smallest stable complexes contain two branched junctions (also known as “crossovers”) connecting two double helices [9], where the helices are designed to be co-planar and are said to be linked by strand exchange at the crossover points. One strand from each helix crosses over and begins base-pairing on the opposite helix. These complexes are known as DX tiles; TX tiles contain three co-planar double helices linked at each of four crossover points. Examples of a DX and a TX tile are shown in Figure 1. DNA tiles can be programmed to assemble into large two-dimensional sheets or lattices by properly designing their sticky ends. Lattices composed of hundreds of thousands of DX or TX tiles and extending up to at least 10 microns on their long edge have been created and examined by AFM and TEM [10], [11]. Surface features of DX lattices have been manipulated not only by addition of hairpins directed out of the lattice plane, but also by enzymatic cleavage and by annealing of DNA carrying sticky ends complementary to projecting ssDNA [12]. Besides being attractive structural members for simple lattice formation, multi-helix tiles have been utilized as components of self-assembling DNA computers [13], [14], [15]. Shown in Figure 2 is a schematic of a simple periodic tile lattice composed of two tile types (A and B*) and a TEM image of an assembled lattice. The neighbor relations of tiles in the lattice are shown on the far left. Given corner-to-corner tile associations and matching rules as shown, where 1 binds to 1’, 2 to 2’, etc., assembled sheets exhibit the pattern shown, with alternating stripes of light (A) and dark (B*) tiles.
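The matching rules above can be expressed as a small sketch (the labels 1, 1’, 2, 2’ follow Figure 2, but the tile records and helper names are our own illustration; in the real system these labels are realized as sticky-end sequences):

```python
# Sketch of the corner-to-corner matching rules from Figure 2 (labels are
# illustrative, not actual sequences).  An edge binds a neighbor's edge
# only when the labels are complementary: 1 <-> 1', 2 <-> 2', etc.

def complementary(a, b):
    """Sticky-end labels match when one is the primed version of the other."""
    return a + "'" == b or b + "'" == a

# Reduce the 2D corner contacts to a 1D row of right/left edge labels.
A  = {"name": "A",  "right": "1", "left": "2'"}
Bs = {"name": "B*", "right": "2", "left": "1'"}

def can_follow(left_tile, right_tile):
    return complementary(left_tile["right"], right_tile["left"])

# Greedily grow a row: from A only B* can attach, and vice versa, so the
# assembly necessarily alternates light (A) and dark (B*) stripes.
row = [A]
for _ in range(7):
    nxt = Bs if can_follow(row[-1], Bs) else A
    assert can_follow(row[-1], nxt)
    row.append(nxt)

print("".join(t["name"] for t in row))
```

Because neither tile type can follow itself under these rules, any assembly grown from them alternates A and B* tiles, which is exactly the striped pattern observed in the sheets.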
Fig. 2. Schematic and platinum rotary shadow TEM image of a DNA lattice composed of two types of TAO tile: with (dark) and without (light) stem-loops directed out of the lattice plane. Stripes of dark B* tiles are approximately 28 nm apart, as expected. Note that individual tiles with 14 × 6.5 nm edge dimensions are discernible.

The TAO tiles used in the construction of the DNA nanotubes described here are very slightly modified versions of those described above and explained in detail previously [11]. One aim of further development of these tile and lattice structures is their use as templates upon which to organize desired patterns of materials other than DNA, for example metals, carbon nanotubes, proteins, and peptides. Such DNA templates may be useful as scaffolds for the
creation of nanoelectronic circuits and other desired objects and devices requiring nanometer-scale feature resolution. Toward this end, we tested various attachment chemistries for the modification of tiles and lattices, including the incorporation of thiol groups (–SH). Here we report on the interesting filamentous structures which resulted from some of these chemical modifications, an analysis of the nanotube structures, preliminary results on metallization of the tubes, and some speculation on their possible future usefulness as nanowire templates and as nano- to micro-scale interconnects.
2 Materials and Methods
2.1 DNA Sequence Design, Synthesis, Purification, and Annealing
The TAO tiles used here have been extensively described and characterized [11]. Briefly, oligonucleotide sequences were designed using a random hill-climbing algorithm to maximize the likelihood of formation of the desired base-pairing structures while minimizing the chances of spurious pairings; tiles were shown to form as designed by electrophoretic and chemical analyses; and large, flat lattice sheets were observed by AFM imaging. Custom oligonucleotides were ordered (Integrated DNA Technology) and gel purified after receipt. Sequences of the oligos used here (in the 5’ to 3’ direction) are:

Tile type A:
A1: TCGGCTATCGAGTGGACACCGAAGACCTAACCGCTTTGCGTTCCTGCTCTAC
A2: AGTTAGTGGAGTGGAACGCAAAGCGGTTAGGTCTTCGGACGCTCGTGCAACG
A3: ACGAGCGTGGTAGTTTTCTACCTGTCCTGCGAATGAGATGCCACCACAGTCACGGATGGACTCGAT
A4: TGCTCGGTAGAGCACCAGATTTTTCTGGACTCCTGGCATCTCATTCGCACCATCCGTGACTGTGGACTAACTCCGCTT
Tile type B*:
B1: CGAGCAATGAAGTGGTCACCGTTATAGCCTGGTAGTGAGCTTCCTGCTGTAT
B2: ACACAGTGGAGTGGAAGCTCACTACCAGGCTATAACGGACGATGATAAGCGG
B3a: AGCCGAATACAGCACCATCTTTTGATGGACTCCTGAATCGACGTAACTT
B3b: TTGTTACGTCTTTCTACTCGCACCTTCGCTGAGTTTGGACTGTGTCGTTGC
B4: ATCATCGTGGTTCTTTTGAACCTGACCTGCGAGGTATGTGATTTTTCACATACTTTAGAGATTCACCAAACTCAGCGAAGGACTTCAT
B4a: ATCATCGTGGTTCTTTTGAACCTGACCTGCGAGGTATGTGATTT
B4b: TTCACATACTTTAGAGATTCACCAAACTCAGCGAAGGACTTCAT
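The random hill-climbing design procedure mentioned above can be sketched as follows (a toy version, not the authors’ actual program: it scores a candidate set by counting unintended complementary 6-mers between strands and accepts random point mutations that do not worsen the score; a real designer must additionally preserve the intended base-pairing regions, which this sketch ignores):

```python
import random

BASES = "ACGT"
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a 5'->3' DNA sequence."""
    return s.translate(COMP)[::-1]

def spurious_score(seqs, k=6):
    """Count k-mers whose reverse complement also occurs in the pool --
    each such pair is a chance for unintended pairing during annealing."""
    kmers = set()
    for s in seqs:
        kmers.update(s[i:i + k] for i in range(len(s) - k + 1))
    return sum(1 for m in kmers if revcomp(m) in kmers)

def hill_climb(seqs, steps=2000, seed=0):
    """Accept random single-base mutations that never increase the score."""
    rng = random.Random(seed)
    seqs = list(seqs)
    best = spurious_score(seqs)
    for _ in range(steps):
        i = rng.randrange(len(seqs))
        j = rng.randrange(len(seqs[i]))
        trial = seqs[:i] + [seqs[i][:j] + rng.choice(BASES) + seqs[i][j + 1:]] + seqs[i + 1:]
        score = spurious_score(trial)
        if score <= best:
            seqs, best = trial, score
    return seqs, best

rng0 = random.Random(42)
start = ["".join(rng0.choice(BASES) for _ in range(30)) for _ in range(4)]
designed, final = hill_climb(start)
print(spurious_score(start), "->", final)  # score never increases
```

Tools in this spirit, such as the DNASequenceGenerator program cited in the design literature, add many further constraints (GC content, melting temperature, forbidden subwords) on top of this basic search loop.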
Strand B3a was ordered with (B3aSH) and without (B3a) a thiol group attached to its 5’ end, while B3b was made with (B3bSH) and without (B3b) a thiol on its 3’ end. The 5’ end of strand B3a and the 3’ end of B3b form a base-paired stem projecting out of the plane of the tile; likewise for B4a and B4b. Thiol-modified oligos were delivered with –SSH moieties and were reduced to –SH by treatment with dithiothreitol (DTT) and buffer exchanged into annealing buffer by gel permeation chromatography prior to use (NAP-5 columns, Pharmacia).
Oligos were annealed to form tiles and lattices in stoichiometric mixtures by heating to 95˚C for 5 minutes and slowly cooling to room temperature in a thermal cycler at 0.1˚C per minute (total annealing time was approximately 12 hours). Exact stoichiometry was determined, where necessary, by titrating pairs of strands designed to hydrogen bond and visualizing them by non-denaturing gel electrophoresis; the absence of monomer was taken to indicate that equimolar ratios had been achieved.

2.2 Polyacrylamide Gel Electrophoresis
Non-denaturing gels contained 8 to 10% acrylamide (19:1, acrylamide:bisacrylamide) in a buffer consisting of 40 mM Tris-HCl (pH 8.0), 2 mM EDTA, and 12.5 mM magnesium acetate (TAEMg), while denaturing gels, which also contained 8.3 M urea, used a Tris-borate EDTA buffer without magnesium. The DNA was dissolved in 10 µL of TAEMg or in water. Tracking dye (1 µL) containing TAEMg, 50% glycerol, and 0.2% each of Bromophenol Blue (plus Xylene Cyanol FF in non-denaturing gels) was added to the sample buffer. Gels were run on a Hoefer SE-600 electrophoresis unit at 4 V/cm at room temperature (non-denaturing) or 55˚C (denaturing), then exposed to X-ray film for up to 15 hrs or stained with Stains-All (0.01% Stains-All from Sigma, 45% formamide) or ethidium bromide, then photographed.

2.3 TEM (Direct, Negative Stain, Platinum Rotary Shadow)
Transmission electron micrographs were obtained on a Phillips 301 TEM. Direct examination of DNA structures was carried out by pipetting 2–4 µL of annealed DNA solution onto a copper grid which had previously been layered with carbon film. Negative staining involved application of ∼10 µL of 2% uranyl acetate solution to the grid. In all cases excess liquid was blotted from the sample by lightly touching an edge or corner of filter paper to the edge of the grid. Platinum rotary shadow samples were prepared by gently mixing annealed DNA sample with one volume of 80% glycerol, then spraying the mixture onto freshly cleaved mica using a microcapillary and compressed air. The sample was dried on the mica under high vacuum overnight in a Balzer evaporator, tilted to a 4° angle, and rotated 180° (around the normal to the sample plane) during deposition of a thin layer of platinum by thermal evaporation. The sample was then tilted to 90° (flat face toward the electrodes) and coated with a carbon layer. The carbon/platinum film was floated off the mica onto the surface of a domed droplet of buffer and finally captured on a copper grid. Samples were air dried prior to examination in the TEM.

2.4 AFM (Tapping Mode in Air)
Atomic force micrographs were obtained on a Digital Instruments Nanoscope III with a multimode head, operated in tapping mode in air. AFM tips (Silicon OTESTP)
were obtained from Digital Instruments. Samples were prepared by pipetting annealed DNA solution (∼2 µL) onto freshly cleaved mica (Ted Pella, Inc.), allowing it to adhere for 2 minutes, rinsing gently by doming a drop of milliQ water onto the mica, then air drying either under a stream of nitrogen or by gentle waving.
3 Experimental
In order to develop DNA tile self-assemblies as templates for nano-scale patterning of various materials, we tested chemically modified oligonucleotides in several attachment chemistry schemes. First, thiolated DNA strands were studied for reaction with monofunctional monomaleimido nanogold (1.4 nm) (Nanoprobes, Inc., Yaphank, NY) and with nanoparticulate gold colloid (4–10 nm) formed by the reaction of gold chloride with sodium borohydride. Next, oligos labeled with a free amine on the 5’ or 3’ end were reacted with monofunctional sulfo-N-hydroxysuccinimido nanogold (1.4 nm) (Nanoprobes, Inc.). Finally, thiol-modified oligos were studied under conditions which promoted the formation of disulfide and polysulfide structures between tiles and led to observations of DNA nanotube formation.
Fig. 3. AFM image of a TAO AB* nanotube. Stripes perpendicular to the long axis of the filaments are clearly visible and indicate closed ring structures in successive layers, rather than a spiral structure, which would have given stripes with a noticeable diagonal slant.
Previous imaging of flat TAO AB* lattice sheets demonstrated lattice fragments with widely varying dimensions and often with rough edges. In contrast, DNA nanotubes exhibit a uniform apparent diameter of approximately 25 nm over lengths of 10 to 20 microns. These structures were observed by AFM and by TEM of negatively stained samples. AFM peak-to-peak width measurements
indicated that individual subunits are separated by 27 nm. AFM height measurements (Figure 3) give a mean value of ∼4.4 nm, which is twice the height of the two-dimensional DNA nanostructure.
Fig. 4. Schematic drawing of a proposed structure for the DNA nanotubes. Shown are layers consisting of six tiles.
Figure 4 shows a section of the proposed structure of the tubes, with B* tile layers alternating with A tile layers; dsDNA helix axes are aligned parallel with the tube axis, sulfur groups are located inside the tube, and there are approximately 6 or 8 tiles per ring layer. Negatively stained TEM samples (Figures 5 and 6) have provided images of sufficiently high resolution to observe that the B* stripes (from outwardly projecting dsDNA stems) are perpendicular to the long axis of the fibers. This provides good evidence for an even number of tiles per ring layer and for flat rings rather than a spiral structure; based on the tube diameter, however, there may be 8 tiles per layer rather than 6, depending upon how much the tubes are flattened during sample preparation and the extent of feature-broadening artifacts in AFM examinations.
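A back-of-the-envelope check (our own arithmetic, not from the paper) is consistent with this picture, assuming each tile contributes its 6.5 nm width (from the 14 × 6.5 nm tile dimensions of Figure 2) to the ring circumference, with the helix axes running along the tube:

```python
import math

# Assumed per-tile contribution to the ring circumference (three-helix
# tile width from Fig. 2); the 14 nm dimension runs along the tube axis.
TILE_WIDTH_NM = 6.5

def ring_geometry(tiles_per_ring):
    circumference = tiles_per_ring * TILE_WIDTH_NM
    diameter = circumference / math.pi   # open cylindrical tube
    flattened = circumference / 2.0      # tube squashed flat on the substrate
    return diameter, flattened

for n in (6, 8):
    d, w = ring_geometry(n)
    print(f"{n} tiles/ring: diameter = {d:.1f} nm, flattened width = {w:.1f} nm")
```

A fully flattened 8-tile ring gives a width of about 26 nm, close to the observed ∼25 nm, whereas 6 tiles give about 19.5 nm; this is only a rough consistency check, since flattening may be partial and AFM widths are broadened by tip convolution.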
Fig. 5. Negative stained TEM images. The right panel is a zoomed-in view of the structure observed at the far left of the left-hand panel. It appears to show a portion of a long, well-formed nanotube which has torn open along, and parallel to, the main axis of the tube. B* tube stripes are visible.
It is interesting that negatively stained TEM reveals detailed structure in the fibers, with a unit periodicity of 27 nm, in good agreement with the modeled length of an AB* tile pair. The width of the fibers is around 25 nm. These observations indicate that the modified AB* tiles assembled periodically and that their further self-assembly led to the formation of regular DNA fibers. It should be pointed out that for self-assembled two-dimensional DNA nanostructure samples given the same negative staining treatment, only the outlines of the two-dimensional sheets could be discerned by TEM; no further detailed periodic unit structure could be identified (data not shown). The contrast difference between negatively stained TEM samples of 2D lattice sheets and of nanotubes indicates a significant structural difference between the assemblies. The hollow DNA tube structure may accommodate relatively more negative staining agent, and thus show stronger contrast and reveal more of the detailed periodic unit structure. The DNA tube structure revealed by AFM appears somewhat fatter than that imaged by TEM, which is due to the well-known tip convolution effect in AFM images.
Fig. 6. Negative stained TEM image. B* tile layers are again discernible, along with what appear to be merger or bifurcation structures.
Figure 6 demonstrates another structural feature observed with low but significant frequency: the bifurcation or merging of tubes. Further experiments are required to clarify whether these are actual forks in the tile superstructures or artifacts of the sample preparation and examination; however, they may turn out to be useful for targeted binding and interconnect formation if their occurrence can in some way be controlled. Burial of the sulfur moieties within the tubes makes logical sense, because disulfide bridges and polysulfide rings are the structures preferred by thiol groups under physiologic-like solution conditions such as those employed here. Internalized sulfur is also indicated by the lack of thiol available for gold binding on the outer surface of the tubes, and by the tubes’ ability to bind nanogold on their surface if amino groups are added
to the dsDNA stem on the face opposite the thiol-containing stem on the B* tiles (see Figure 9).
Fig. 7. Negative stained TEM image of a large colloidal gold particle apparently bound to the very end of a DNA nanotube where some exposure of sulfur could occur.
The possible existence of free thiols on the outer surface of the nanotubes was probed using two different gold reagents with reactivity toward sulfur: monomaleimido nanogold and fresh colloidal gold nanoparticles. The monofunctional nanogold failed to react significantly with the tube surfaces, while the colloidal gold displayed the interesting reactivity illustrated in Figure 7. With very low background levels of unbound gold, the bound gold particles showed a very high probability of attachment to the ends of nanotubes, and not to the outer surface anywhere else along the tubes’ length. The indication is that sulfur which is buried within the tubes becomes exposed to some extent at the open ends. This observation, besides offering prima facie evidence of the location of the sulfur groups, may also be exploited in future work on targeted binding or on the formation of electrical contacts with the ends of DNA nanotubes. To test the effect of thiol oxidation state on nanotube formation, the same set of oligos was annealed in the presence of 20 mM dithiothreitol (DTT). AFM observation of this sample showed a DNA nanostructure rather different from the nanotubes (Figure 8). It appears to be two sheets of self-assembled DNA nanostructure stacked one on top of the other; there may be interactions between –SH groups on one sheet and the other. Further experiments must be completed to clarify the interpretation of this observation. To test metallization of the DNA nanotubes, in order to produce functional nanowires and also to further characterize the orientation of TAO tiles in the tubular lattices, primary amine groups were added to the oligonucleotides (B4a and B4b) which base-pair to form the dsDNA stem projecting out of the tile plane on the face opposite that of the thiol-containing stem. According to our model of the nanotubes, this should place the primary amines on the outside of the tubes, accessible to reaction with sulfo-N-hydroxysuccinimido nanogold.

Fig. 8. AFM image of nanotube-forming oligonucleotides which were annealed in the presence of 20 mM DTT.
Fig. 9. AFM images of DNA nanotubes with 1.4 nm gold particles bound to the outer surface (left panel), with 1.4 nm gold followed by a 2 minute reaction with silver enhancement reagent (middle panel), and with 1.4 nm gold followed by 5 minute silver enhancement (right panel).
As shown in Figure 9, incorporation of the amine-modified oligo allowed reaction with the monofunctional nanogold reagent, which successfully bound 1.4 nm gold particles to the outside surface of the assembled tubes. Reaction with unmodified tubes resulted in no gold labeling (data not shown). Also shown in Figure 9 are progressive time points of chemical silver deposition onto the prebound gold nanoparticles. Silver deposition is one possible route to the complete metallization of DNA nanotubes for the formation of functional nanowires. The silver enhancement reagent does not deposit silver efficiently onto DNA nanotubes in the absence of bound nanogold.
4 Discussion and Conclusions
Use of a programmable, self-assembling polymer like DNA appears to be one of the most promising avenues available for positioning components for molecular electronics and for forming targeted connections which can subsequently be metallized into functional nanowires. The annealing of oligonucleotides has already been used to direct the specific association of gold nanospheres into 3D clusters [16], [17], [18]. Nanometer-scale gold rods have also been bound to one another via hybridization of oligonucleotides [19]. The assembly of a silver wire between two prefabricated leads by a DNA hybridization technique has also been reported [20]. However, previous studies have utilized only linear dsDNA; the available structures were therefore limited, and fine control of associations was not sufficiently diverse to allow for a wide selection of programmable superstructures. The length scale of the DNA nanotubes described here is intermediate between the molecular world, where distances are measured in Ångstroms (10⁻¹⁰ meters), and the macro world, the lower end of which can be measured in microns (10⁻⁶ meters). One eventual goal of this line of basic research and materials development is the fabrication of reliable interconnects useful in bridging between these disparate length scales.

Acknowledgements. This work was supported in part by research grants from DARPA/NSF (NSF-CCR-97-25021), from NSF (NSF-EIA-00-86015), and from a sponsored research agreement with Taiko Denki Co., Ltd., Tokyo.
References
1. Adleman, L.M. (1994) Molecular computation of solutions to combinatorial problems. Science 266, 1021-1024.
2. Seeman, N.C. (1982) Nucleic Acid Junctions and Lattices. J. Theor. Biol. 99, 237-247.
3. Seeman, N.C. (1999) “DNA engineering and its application to nanotechnology”, Trends in Biotechnology 17, 437-443.
4. Reif, J.H., LaBean, T.H., and Seeman, N.C. (2001) Challenges and Applications for Self-Assembled DNA Nanostructures. Proc. Sixth International Workshop on DNA-Based Computers, Leiden, The Netherlands, June 2000 (A. Condon and G. Rozenberg, eds.), Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg, vol. 2054, pp. 173-198.
5. LaBean, T.H. (in press, 2002) Introduction to Self-Assembling DNA Nanostructures for Computation and Nanofabrication. In CBGI 2001, Proceedings from Computational Biology and Genome Informatics, held 3/2001, Durham, NC, World Scientific Publishing.
6. Reif, J.H. (to appear, 2002) DNA Lattices: A Programmable Method for Molecular Scale Patterning and Computation. Special issue on Bio-Computation, Computer and Scientific Engineering Journal of the Computer Society, 2002.
7. Seeman, N.C., Chen, J.-H., Kallenbach, N.R. (1989) Electrophoresis 10, 345-354.
8. Holliday, R. (1964) Genet. Res. 5, 282-304.
9. Fu, T.-J. and Seeman, N.C. (1993) Biochemistry 32, 3211-3220.
10. Winfree, E., Liu, F., Wenzler, L.A., and Seeman, N.C. (1998) “Design and Self-Assembly of Two-Dimensional DNA Crystals”, Nature 394, 539-544.
11. LaBean, T.H., Yan, H., Kopatsch, J., Liu, F., Winfree, E., Reif, J.H. and Seeman, N.C. (2000) “The construction, analysis, ligation and self-assembly of DNA triple crossover complexes”, J. Am. Chem. Soc. 122, 1848-1860.
12. Liu, F.R., Sha, R.J. and Seeman, N.C. (1999) Modifying the surface features of two-dimensional DNA crystals, J. Am. Chem. Soc. 121, 917-922.
13. LaBean, T.H., Winfree, E., and Reif, J.H. (2000) “Experimental progress in computation by self-assembly of DNA tilings”, 5th DIMACS Workshop on DNA Based Computers, MIT, June 1999. DNA Based Computers V, DIMACS Series in Discrete Mathematics and Theoretical Computer Science (E. Winfree, ed.), American Mathematical Society, 2000.
14. Winfree, E., Yang, X., Seeman, N.C. (1998) “Universal Computation via Self-assembly of DNA: Some Theory and Experiments”. In DNA Based Computers II: DIMACS Workshop, June 10-12, 1996 (DIMACS vol. 44), L.F. Landweber and E.B. Baum, eds., American Mathematical Society, 191-213.
15. Mao, C., LaBean, T.H., Reif, J.H., and Seeman, N.C. (2000) “Logical computation using algorithmic self-assembly of DNA triple-crossover molecules”, Nature 407, 493-496.
16. Alivisatos, A.P., Johnsson, K.P., Peng, X., Wilson, T.E., Loweth, C.J., Bruchez Jr., M.P., and Schultz, P.G. (1996) Organization of ’nanocrystal molecules’ using DNA. Nature 382, 609-611.
17. Mirkin, C.A., Letsinger, R.L., Mucic, R.C., and Storhoff, J.J. (1996) “A DNA-Based Method For Rationally Assembling Nanoparticles Into Macroscopic Materials”, Nature 382, 607-609.
18. Mucic, R.C., Storhoff, J.J., Mirkin, C.A., Letsinger, R.L. (1998) “DNA-directed synthesis of binary nanoparticle network materials”, J. Am. Chem. Soc. 120, 12674-12675.
19. Mbindyo, J.K.N., Reiss, B.D., Martin, B.R., Keating, C.D., Natan, M.J., and Mallouk, T.E. (2001) Advanced Materials 13, 249-254.
20. Braun, E., Eichen, Y., Sivan, U., and Ben-Yoseph, G. (1998) DNA-templated assembly and electrode attachment of a conducting silver wire, Nature 391, 775-778.
The Design of Autonomous DNA Nanomechanical Devices: Walking and Rolling DNA
John H. Reif
Department of Computer Science, Duke University, Box 90129, Durham, NC 27708-0129. [email protected]
Abstract. We provide designs for the first autonomous DNA nanomechanical devices that execute cycles of motion without external environmental changes. These DNA devices translate across a circular strand of ssDNA and rotate simultaneously. The designs use various energy sources to fuel the movements, including (i) ATP consumption by DNA ligase in conjunction with restriction enzyme operations, (ii) DNA hybridization energy in trapped states, and (iii) kinetic (heat) energy. We show that each of these energy sources can be used to fuel random bidirectional movements that acquire, after n steps, an expected translational deviation of O(√n). For the devices using the first two fuel sources, the rate of stepping is accelerated over the rate of random drift due to kinetic (heat) energy. Our first DNA device, which we call walking DNA, achieves random bidirectional motion around a circular ssDNA strand by use of DNA ligase and two restriction enzymes. Our other DNA device, which we call rolling DNA, achieves random bidirectional motion without the use of DNA ligase or any restriction enzyme, instead using hybridization energy (or possibly just kinetic (heat) energy, at an unfeasibly low rate of resulting movement).
1 Introduction
Prior Work. Nanomechanical devices built of DNA have been developed using two distinct approaches: (i) Seeman’s group [MSS+99] used rotational transitions of dsDNA conformations between the B-form (right-handed) and the Z-form (left-handed), controlled by ionic effector molecules, and extended this technique to be DNA-sequence dependent [YZS+02]. (ii) Yurke and Turberfield [YTM+00, YMT00, TYM00] demonstrated a series of DNA nanomechanical devices that used fuel DNA strands acting as hybridization catalysts to generate a sequence of motions in another, tweezers-like strand of DNA; the two strands of DNA bind and unbind with the overhangs to alternately open and shut the tweezers.
Supported by DARPA/AFOSR F30602-01-2-0561, NSF ITR EIA-0086015, and DARPA/NSF CCR-9725021. Paper URL: http://www.cs.duke.edu/∼reif/paper/DNAmotor/DNAmotor.pdf
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 22–37, 2003. © Springer-Verlag Berlin Heidelberg 2003
These prior DNA nanomechanical devices have key restrictions that limit their use: (a) they can only execute one type of motion (rotational or translational). It seems feasible to re-design these DNA nanomechanical devices to have both translational and rotational motion, but this has yet to be done. (b) More importantly, these prior DNA devices require environmental changes, such as temperature cycling or treatment with biotin-streptavidin beads, to make repeated motions.
The Technical Challenge. A key challenge, which we address in this paper, is to make an autonomous DNA nanomechanical device that executes cycles of motion (rotational, translational, or both) without external environmental changes.
Our Results. We provide designs for two autonomous DNA nanomechanical devices. Both move across a circular ssDNA strand, which we call a road, and simultaneously exhibit both translational and rotational movement. However, they use distinct energy sources to fuel the movements. (i) Our first nanomechanical device design, which we call walking DNA, uses as its energy source ATP consumption by DNA ligase in conjunction with two restriction enzyme operations. Our walking DNA construction, though technically quite distinct and with a different objective, might be viewed as analogous to Shapiro’s recent autonomous DNA computing machine [BPA+01].
Fig. 1. (a) The DNA Walking Mechanism. (b) The DNA Rolling Mechanism.

(ii) Our second nanomechanical device design, which we call rolling DNA, makes no use of DNA ligase or any restriction enzyme. (Note: the illustration on the right in Figure 1 has been simplified; the small wheel-loop DNA strand would in fact be topologically linked (intertwined) with the road strand of DNA.) Instead, it uses as its energy source the hybridization energy of DNA in temporarily trapped states. This involves the use of fuel DNA and the application of DNA catalyst techniques to liberate DNA from these loop conformations and to harness their energy as they transit into lower-energy conformations. Unlike the prior work of [YTM+00, YMT00, TYM00], our resulting nanomechanical device is autonomous. We also observe that a similar movement, but possibly at an
John H. Reif
unusably low rate, can also be achieved with this basic design using simply kinetic (heat) energy. Potential Applications. In principle, these DNA nanomechanical devices can be incorporated at multiple sites of larger DNA nanostructures such as self-assembled DNA lattices. The DNA nanomechanical devices might be used to induce movements, to hold state information, and to sequence between distinct conformations. Possible applications include: (a) Array Automata: State information could be stored at each site of a regular DNA lattice (see [WLW98,S99,LYK+00,LWR00,MLR+00,RLS01,R01]), and additional mechanisms for finite state transitions would provide the capability of a cellular array automaton. (b) Nanofabrication: These capabilities might be used to selectively control nanofabrication stages. The size or shape of a DNA lattice may be programmed through the control of such sequence-dependent devices, and this might be used to execute a series of foldings (similar to Japanese paper folding techniques) of the DNA lattice to form a variety of 3D conformations and geometries.
2 Our Walking DNA Device
Here we describe a DNA nanomechanical device (see Figure 1a) that achieves autonomous bidirectional translational and rotational motion along a DNA strand. We will use, as the energy source, ATP consumption by DNA ligase in conjunction with restriction enzyme operations. We will use the following notation: (i) superscript R to denote the reverse of a sequence, and (ii) an overbar to denote the complement of an ssDNA sequence. Overview of Our Walking DNA Device Construction. The basic idea is to form a circular repeating strand R of ssDNA we call the road, which will be written in the 5′ to 3′ direction from left to right. The road (see Figure 2) will consist of an even number n of subsequences, which we call steppingstones, indexed from 0 to n−1 modulo n. The ith steppingstone consists of a length-L (where L is between 15 and 20 bases) sequence Ai of ssDNA. (In our constructions, the Ai repeat with a period of 2.)
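As an aside, the periodic structure of the road can be sketched in a few lines of code; the steppingstone sequences below are hypothetical, chosen only for illustration:

```python
# Illustrative sketch (not from the paper's protocol): the road of Section 2 as
# a string, built from n steppingstones whose sequences repeat with period 2.
def make_road(steppingstones, n):
    """Concatenate steppingstones A_0 .. A_{n-1}, taking indices modulo 2."""
    assert n % 2 == 0, "the construction takes n even"
    return "".join(steppingstones[i % 2] for i in range(n))

# Hypothetical length-15 steppingstone sequences, for illustration only.
A0 = "ACGTACGTACGTACG"
A1 = "TTGCATTGCATTGCA"
road = make_road([A0, A1], 6)  # A0 A1 A0 A1 A0 A1
```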
Fig. 2. The DNA Road.
Let us designate the direction 5′ to 3′ to be the direction right on the road R, and the direction 3′ to 5′ to be the direction left on the road R. At any given time, there is a unique partial duplex DNA strand Wi, which we call the ith walker, with 3′ ends Āi−1 and Āi that are hybridized to the consecutive (i−1)th and ith steppingstones Ai−1 and Ai, respectively, as illustrated on the left in Figure 3.
The Design of Autonomous DNA Nanomechanical Devices
Fig. 3. A Step of the DNA Walker
The Goal of the Device Construction. The construction below will have the goal of getting the ith walker Wi to re-form into another partial duplex DNA strand, namely the (i+1)th walker Wi+1, which is shifted one unit over to the left or the right. In the latter case of movement to the right, the movement results in the 3′ ends Āi and Āi+1 being hybridized to the consecutive ith and (i+1)th steppingstones Ai and Ai+1, respectively, as illustrated on the right in Figure 3 (the movement to the left is the reverse). Hence, the movement we will achieve is bidirectional translational movement, either in the 5′ to 3′ direction (from left to right) on the road, or in the 3′ to 5′ direction. We will describe in detail the movement in the 5′ to 3′ direction, and then observe that movement in the 3′ to 5′ direction is also feasible. We cycle back in 2 stages, so that Wi+2 = Wi for each stage i. To achieve the movements, we will use two distinct type-2 restriction enzymes. We will also make use of DNA ligase, which provides a source of energy (through ATP consumption) and a high degree of irreversibility. Simultaneous Translational and Rotational Movements. Due to the well-known secondary structure of B-form dsDNA (rotating 2π radians every approximately 10.5 bases), it follows that in each such step of translational movement, the walker will rotate around the axis of the road by approximately L2π/10.5 = Lπ/5.25 radians. Hence the DNA walker construction will execute a very simple type of motion with simultaneous translation and rotation. Avoiding Unwanted Interactions. To ensure there is no interaction between a walker and more than one distinct road at a time, we use a sufficiently low road concentration and solid-support attachment of the roads. To ensure there is no interaction between a road and more than one walker, we use a sufficiently low walker concentration. Details of Our Device Construction.
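The rotation estimate above is a one-line computation; the following sketch (illustrative only) evaluates it for the steppingstone lengths the construction considers:

```python
import math

# Sketch of the rotation estimate in the text: B-form dsDNA makes one full turn
# (2*pi radians) per ~10.5 bases, so a step of L bases rotates the walker by
# approximately L*2*pi/10.5 = L*pi/5.25 radians about the road axis.
def rotation_per_step(L, bases_per_turn=10.5):
    return 2 * math.pi * L / bases_per_turn

# For the steppingstone lengths considered (L between 15 and 20), each step
# corresponds to between roughly 1.4 and 1.9 full turns about the road axis.
turns_15 = rotation_per_step(15) / (2 * math.pi)  # ~1.43 turns
turns_20 = rotation_per_step(20) / (2 * math.pi)  # ~1.90 turns
```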
In the detailed description given below, we describe the sequence of movements involving all the components: walker, road, restriction enzymes, and ligase. We also provide a sequence of cartoon figures illustrating these movements. Definition of the Walker and Oligonucleotides Used in the Walking DNA Construction: For each i = 0, 1, we define the ssDNA (in the 5′ to 3′ direction): (a) Gi = BiR CiR C̄i−1 B̄i−1 Āi−1 and (b) Hi = Bi−1R Ci−1R C̄i B̄i ĀiR. Here the Bi, Ci, and the previously defined Ai are distinct oligonucleotides of low annealing cross-affinity. To ensure we cycle back in 2 stages, the subscripts of the Ai, Bi, Ci are taken modulo 2, and so: Ai+1 = Ai−1, Bi+1 = Bi−1, and Ci+1 = Ci−1
for each i. Note that Gi has subsequence BiR CiR C i−1 B i−1 and Hi has, as a R R subsequence, the reverse complement Bi−1 Ci−1 C i B i of that subsequence. We th define the i walker Wi to be the partial duplex resulting by hybridization of Gi and Hi at these reverse complementary subsequences, with ssDNA 3‘ overhangs Ai−1 and Ai , as given in Figure 4.
Fig. 4. (a) The DNA walker. (b) The DNA stepper.
We inductively assume that the ith walker Wi has the 3′ end Āi−1 of its component strand Gi hybridized to steppingstone Ai−1 on the road R and the 3′ end Āi of its component strand Hi hybridized to steppingstone Ai on the road R. Definition of the Stepper: To aid us in this movement, for each i = 0, 1 we include in solution the following partial duplex Si, which we call the ith stepper, illustrated on the right in Figure 4. Note that each ith steppingstone Ai subsequence will hybridize with a complementary subsequence Āi of the ith stepper Si. We assume that this occurs at each steppingstone, except the steppingstones where the walker’s ends are hybridized, as illustrated on the left in Figure 5.
Fig. 5. Cleavage of the DNA walker.
Restriction Enzyme Cleavage of the Walker: For each i = 0, 1, we use a type-2 restriction enzyme that matches the duplex subsequence containing Ci−1 Bi−1 and its complement C̄i−1 B̄i−1 within Wi, and then cleaves Wi just before C̄i and just after Ci, as illustrated on the right in Figure 5. This can result in the following products: (P1) an ith truncated walker TWi (still attached to the ith steppingstone Ai), consisting of a partial duplex with an ssDNA overhang (Ci)R at one 3′ end, as illustrated on the left in Figure 6, and (P2) a partial duplex, as illustrated on the right in Figure 6, which we can observe is the (i−1)th stepper Si−1, still attached to the (i−1)th steppingstone
Ai−1. (To see that this is the (i−1)th stepper Si−1, note that the right illustration in Figure 6 can be flipped left to right and top to bottom; since C̄i = C̄i−2, this cleavage product is identical to the above-defined (i−1)th stepper Si−1.) It has an ssDNA overhang (C̄i)R at one 3′ end.
Fig. 6. Products of the cleavage of the DNA walker.
We assume that the sequences are of sufficiently short length that the only strands with an ssDNA subsequence C̄i which can possibly hybridize with the ssDNA subsequence Ci of the above ith truncated walker TWi are the adjacent steppers Si−1 and Si+1. Reversals. It will be important to observe, when at the end of this section we establish the bidirectionality of the movement of the walker, that the walker has two possible (dual) restriction enzyme recognition sites, and so the dual use of the two restriction enzymes can also result in another pair of products, resulting in a reversal of movement. However, to simplify the discussion, we ignore these reversals for the moment. Stalls. Observe that it is certainly quite possible that the above cleavage operation might simply be reversed by re-hybridization of the subsequence C̄i of stepper Si−1 with the complementary subsequence Ci of the above truncated walker TWi (and possibly a subsequent ligation), re-forming the original walker Wi. We will call this restriction enzyme cutting and then re-hybridization a stall event for Wi. Each stall event has a constant probability q, where 0 < q < 1. The consecutive repetition of stall events for Wi might continue any number of times, but as we shall see, the likelihood drops rapidly with the number of repeated stall events. To see this, observe that the number of repeats has, by definition, a geometric probability distribution with geometric parameter p = 1 − q. This geometric probability distribution is known to have expectation E = (1 − p)/p = q/(1 − q), which is a fixed constant. Furthermore, the probability of sE repeats of consecutive stall events for Wi (that is, a factor s more than the expected number E) can be shown to drop exponentially with s, for s > 2. The Stepping of the Walker.
In the following, we consider the case where there have possibly already been a number of stall events for Wi, but now there is not a further repeat of a stall event for Wi. In this case, we show that what eventually occurs is the intended modification of the truncated walker TWi into the (i+1)th walker Wi+1. Observe that by flipping the left illustration of Figure 6 for TWi (the flip is top to bottom and left to right), the truncated walker TWi can be redrawn as illustrated on the right in Figure 7, and the (i+1)th stepper Si+1 is by definition as illustrated on the left in Figure 7.
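Returning briefly to the stall analysis above, the geometric statistics of repeated stall events can be checked numerically; the value of q below is an assumed illustrative value, not a measured one:

```python
import random

# Numerical check (illustrative) of the stall statistics: with a constant stall
# probability q per event, the number of consecutive stall repeats is geometric,
# with expectation q/(1-q), and long runs are exponentially unlikely.
def expected_stalls(q):
    return q / (1.0 - q)

def simulate_stall_runs(q, trials, rng):
    """Average length of a run of consecutive stall events over many trials."""
    total = 0
    for _ in range(trials):
        run = 0
        while rng.random() < q:  # each further stall repeat occurs w.p. q
            run += 1
        total += run
    return total / trials

rng = random.Random(0)
q = 0.5  # assumed value, for illustration only
mean_run = simulate_stall_runs(q, 100_000, rng)  # close to expected_stalls(0.5) = 1.0
```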
Fig. 7. Hybridization with the DNA walker.
This (i+1)th stepper Si+1 was assumed to be hybridized, via its complementary subsequence Āi+1, to the (i+1)th steppingstone Ai+1. This (i+1)th stepper Si+1 can also hybridize with the truncated walker TWi at their C̄i and Ci ssDNA overhangs, to form the duplex illustrated on the left in Figure 8, and DNA ligase can then concatenate the strands, forming the duplex illustrated on the right in Figure 8.
Fig. 8. Ligation within the DNA walker.
But the above duplex is, by definition, the (i+1)th walker Wi+1. We now claim that this (i+1)th walker Wi+1 has its 3′ ends hybridized to the consecutive steppingstones Ai and Ai+1 in its required position, as given on the right of Figure 9.
Fig. 9. State after one cycle of stepping by the DNA walker.
This follows since: (a) a 3′ end of the ith walker Wi was assumed to be hybridized with the complementary steppingstone subsequence Ai of the road R, so the same 3′ end of the ith truncated walker TWi remains hybridized with that steppingstone Ai, and the (i+1)th walker Wi+1 inherits this hybridization of that 3′ end; and (b) the (i+1)th steppingstone subsequence Ai+1 was assumed to be hybridized with a complementary subsequence Āi+1 of stepper Si+1, so it follows that the (i+1)th walker Wi+1 inherits this
hybridization of that 3′ end. The above-described movement of the ith walker Wi to the re-formed (i+1)th walker Wi+1 on the road R is always in the 3′ direction once ligation occurs, and it is very unlikely to be reversed after that. The above discussion of the feasibility of movement in the 3′ direction can be summarized as follows. We assumed the situation STATE(i) where: (a) walker Wi has the 3′ end of Gi hybridized to steppingstone Ai−1 and the 3′ end of Hi hybridized to steppingstone Ai, and (b) the steppingstone Ak subsequence is hybridized with a complementary subsequence Āk of stepper Sk, for each k where k < i − 1 or k > i. The above discussion has demonstrated that, given STATE(i), in an expected constant number of phases it is feasible to have a movement in the 3′ direction on the road to STATE(i+1). But this does not complete our description of all possible movements of the walking DNA construction. Feasible Reverse Movements. The above does not ensure the movement is only in the 3′ direction along the road. In the subsection titled “Restriction Enzyme Cleavage of the Walker” we described a pair of products (P1) and (P2) that can result from the use of the restriction enzymes described there. (To simplify the description of 3′ movements, we temporarily ignored other possible products in that discussion.) We now observe that the walker actually has two possible (dual) restriction enzyme recognition sites, so dual use of the two restriction enzymes defined there can also result in: (P1′) a partial duplex, consisting of the ith stepper Si (still attached to the ith steppingstone Ai) with an ssDNA overhang (Ci−1)R at one 3′ end, or (P2′) an (i−1)th truncated walker TWi−1 (still attached to the (i−1)th steppingstone Ai−1), consisting of a partial duplex with an ssDNA overhang (C̄i−1)R at one 3′ end.
By a similar argument to the above, it can easily be shown that there is also a feasible movement of the walker in the 5′ direction on the road, from STATE(i) to STATE(i−1). By symmetry, both these possible movements have equal likelihood. Summarizing, we have shown that there are two equally likely movements: (forward) in the 3′ direction on the road from STATE(i) to STATE(i+1), and (reverse) in the 5′ direction on the road from STATE(i) to STATE(i−1). Expected Drift Via Random Translational Movement and the Use of Latching. We have shown that the resulting movement forms a random, bidirectional translational movement along the road. Due to the symmetry of the construction, both translational movements have equal probability in either direction. It is well known (see Feller [F82]) from the theory of random walks in 1 dimension that the expected deviation after n steps is O(√n). We can easily modify the above design to include a “latching mechanism” that fixes the walker’s position at specified locations along the road. This can be done by appending to each 3′ end of the walker an additional “latching” sequence and also inserting the complements of these “latching” sequences at a specified pair of locations along the road, which can fix (via their hybridization) the walker’s location once the locations are reached and these “latching” hybridizations occur.
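The O(√n) drift quoted above can be illustrated with a short simulation of an unbiased ±1 walk (a sketch; the step and trial counts are arbitrary):

```python
import math
import random

# Sketch illustrating the O(sqrt(n)) expected drift: for an unbiased +/-1
# random walk, the mean absolute displacement after n steps approaches
# sqrt(2n/pi), i.e. roughly 0.8*sqrt(n).
def mean_abs_displacement(n_steps, trials, rng):
    total = 0
    for _ in range(trials):
        pos = 0
        for _ in range(n_steps):
            pos += 1 if rng.random() < 0.5 else -1
        total += abs(pos)
    return total / trials

rng = random.Random(1)
n = 400
est = mean_abs_displacement(n, 2000, rng)
pred = math.sqrt(2 * n / math.pi)  # about 15.96 for n = 400
```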
3 Our Rolling DNA Device
It is an interesting question whether bidirectional autonomous motion can be achieved without the use of DNA ligase or any restriction enzyme, and also without changes in the environmental condition of the system. Here we answer the question positively. We give a description of a DNA device (see Figure 1b), which we call Rolling DNA, that achieves bidirectional autonomous motion without the use of DNA ligase or any restriction enzyme and without changes in the environmental condition of the system. The motion is random and bidirectional, forming a random walk along a DNA strand. Overview of Our Rolling DNA Construction. Let A0, A1, B0, B1 each be distinct oligonucleotides of low annealing cross-affinity, consisting of L (L typically would be between 15 and 20) bases, and let Ā0R, Ā1R, B̄0R, B̄1R each be their respective reverse complements. Let a0, a1 be oligonucleotides derived from A0, A1 by changing a small number of bases, so that their annealing affinity with Ā0R, Ā1R, respectively, is somewhat reduced, but still moderately high. A hybridization between A0 and the reverse complementary sequence Ā0R (or between A1 and the reverse complementary Ā1R) will be called a strong hybridization, whereas a hybridization between a0 and Ā0R (or between a1 and Ā1R) will be called a weak hybridization. We assume that at the appropriate temperature, a strong hybridization will be able to displace a weak hybridization. We now define (using descriptive names) some DNA strands that will be of use: (i) The road (see Figure 10) consists of an ssDNA with a0, a1, a0, a1, a0, a1, . . . in the direction from 5′ to 3′, consisting of a large number of repetitions of the sequences a0, a1.
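The distinction between strong and weak hybridization can be illustrated with a toy match-counting model (this is not a thermodynamic model, and the sequences are hypothetical, for illustration only):

```python
# Toy model (not from the paper) of "strong" vs "weak" hybridization: a_i is
# derived from A_i by mutating a few bases, so it pairs with the reverse
# complement of A_i at fewer positions, giving a weaker but still substantial
# binding that a strong hybridization can displace.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def matched_pairs(seq, partner):
    """Count Watson-Crick pairs when seq is aligned against partner (antiparallel)."""
    rc = reverse_complement(partner)
    return sum(1 for x, y in zip(seq, rc) if x == y)

A0 = "ACGTACGTACGTACG"           # hypothetical 15-base sequence
a0 = "ACGTACGAACGTTCG"           # A0 with 2 bases changed
target = reverse_complement(A0)  # plays the role of A0's reverse complement
strong = matched_pairs(A0, target)  # all 15 positions pair
weak = matched_pairs(a0, target)    # only 13 of 15 positions pair
```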
Fig. 10. The DNA wheel’s road.
(ii) The wheel consists of a cyclic ssDNA of base length 4L with Ā0R, Ā1R, Ā0R, Ā1R in the direction from 5′ to 3′; this corresponds to Ā1R, Ā0R, Ā1R, Ā0R listed in the direction from 3′ to 5′. (Note: in the following we will illustrate the wheel with the subsequences at the top of the wheel labeled in the 5′ to 3′ direction, and the subsequences at the bottom of the wheel labeled in the 3′ to 5′ direction.) The wheel has two distinguished positions with respect to the road: (0) A type 0 wheel position, as illustrated on the left in Figure 11, where two of the wheel subsequences Ā0R, Ā1R are weakly hybridized to a consecutive pair of subsequences a0, a1 on the road. (Note that the left of Figure 11 lists Ā1R, Ā0R at the top of the circular DNA wheel in the 5′ to 3′ direction, whereas the pair below the wheel are written in the opposite 3′ to 5′ direction. Also, note that for
the sake of simplicity, in the left illustration of Figure 11 the subsequences Ā1R, Ā0R at the top of the wheel are illustrated as not hybridized to any other sequences, but in fact they may also be weakly hybridized to a further consecutive pair of subsequences a0, a1 of the road.) (1) A type 1 wheel position, as illustrated on the right in Figure 11, where two of the wheel subsequences Ā1R, Ā0R are weakly hybridized to a consecutive pair of subsequences a1, a0 on the road. (Note again that the right of Figure 11 lists Ā0R, Ā1R at the top of the circular DNA wheel in the 5′ to 3′ direction, whereas the pair below the wheel are written in the opposite 3′ to 5′ direction. Also, again note that for the sake of simplicity, in the illustration on the right in Figure 11 the subsequences Ā0R, Ā1R at the top of the wheel are illustrated as not hybridized to any other sequences, but in fact they may also be weakly hybridized to a further consecutive pair of subsequences a1, a0 of the road.)
Fig. 11. One step of the wheel’s movement.
Note: in the Figure 11 illustration, the binding of the wheel with the road has been simplified; the wheel DNA strand would in fact be topologically linked (intertwined) with the road strand of DNA. Avoiding Unwanted Interactions. To ensure there is no interaction between a wheel and more than one distinct road at a time (e.g., to ensure the wheel is not sandwiched between two road strands), we use a sufficiently low road concentration and solid-support attachment of the roads. To ensure there is no interaction between a road and more than one wheel, we use a sufficiently low wheel concentration. Random Translational Movement Fueled by Heat Energy. Note that the kinetics (similar to those occurring in branch migrations, classically found in Holliday junctions with exact symmetry, and very well understood) will allow for a slow rate of random movements of the wheel in the direction 5′ to 3′ or its reverse, resulting in transitions between these two position types. Note that these translational movements would be in both directions along the road. Due to the symmetry of the construction, both translational movements would have equal probability in either direction. As noted in the previous section, the expected deviation after n steps is O(√n). Again, it is possible to modify the above design to include a “latching mechanism” that fixes the wheel’s position at specified locations along the road (e.g., by hybridization with an additional pair of complementary ssDNA strands inserted into the wheel and also at
a specified pair of locations along the road, which can fix the wheel’s location once the locations are reached and these annealings occur). In each such step of translational movement, the wheel will simultaneously rotate around its center by 2π/4 = π/2 radians. Again, due to the well-known secondary structure of B-form dsDNA (rotating 2π radians every approximately 10.5 bases), it follows that the wheel will also rotate around the axis of the road by approximately L2π/10.5 = Lπ/5.25 radians. Hence the above Rolling DNA construction, even without the further elements described below, can be viewed as executing a very simple type of autonomous motion, which is both translational and rotational, fueled by kinetic (heat) energy. (The physics literature is full of schemes, some possibly violating the laws of thermodynamics, and others theoretically feasible but of very limited use, for harnessing heat energy to fuel nanomechanical motion. For a feasible method using heat energy, see [HLN01].) However, by the laws of thermodynamics, this random autonomous motion will eventually reverse itself. Hence, we feel it would likely be too slow to be of use if fueled only by heat energy, and we do not feel it could do useful work. We next face the challenge of accelerating this random walk movement via some other energy source. Accelerated Random Movement Fueled by DNA Hybridization. Our construction, further detailed below, will provide for an acceleration of this random walk movement by the use of fuel DNA strands, which provide a facility for energy exchange. The initial conformation of the fuel DNA is a state which temporarily traps its hybridization energy. We will describe the application of DNA catalyst techniques for liberating DNA strands from these loop conformations and harnessing their energy as they transition into lower energy conformations.
(Note: we wish to emphasize that the idea of using fuel DNA strands and hybridization catalysts to generate a sequence of motions is not a new idea; it was used by Yurke and Turberfield [YTM+00,YMT00,TYM00] to demonstrate a series of DNA nanomechanical devices such as their DNA tweezers. However, to induce repetitions of the motions by their DNA devices, their devices required external environmental changes; in particular, the use of biotin-streptavidin beads.) The main difference here from that prior work is that we require no external environmental changes to induce repetitions of the motions by our DNA devices. Instead, we will gradually consume the energy of the DNA fuel loop strands. (This was suggested by Turberfield as a possible mechanism for obtaining autonomous motion.) Details of Our Device Construction. (Note: in the following we will illustrate the fuel strands with the subsequences at the top of the stem loops labeled in the 5′ to 3′ direction, and the subsequences at the bottom of the stem loops labeled in the 3′ to 5′ direction.) We define the type 0 primary fuel strand, as illustrated on the left in Figure 12, to consist of an ssDNA of base length 4L with A1, A0, B0, Ā1R. Initially, we assume an initial loop conformation (prepared in a distinct test tube from the complementary fuel strand) of the type 0 primary fuel strand, as illustrated on the right in Figure 12, with its 5′ end segment
A1 annealed to its 3′ end segment Ā1R and its third segment B0 annealed to an additional strand B̄0R (which we call its protection segment).
Fig. 12. The primary fuel strand.
A type 0 complementary fuel strand, as illustrated on the left in Figure 13, consists of an ssDNA of base length 4L with A1, B̄0R, Ā0R, Ā1R, the reverse complement of the type 0 primary fuel strand. Similarly, we assume the following initial loop conformation of the type 0 complementary fuel strand, as illustrated on the right in Figure 13, with its 5′ end segment A1 annealed to its 3′ end segment Ā1R and its third segment Ā0R annealed to an additional strand A0 (which we call its protection segment). (The type 1 primary fuel strands and type 1 complementary fuel strands are defined identically to the type 0 primary and complementary fuel strands, respectively, but with the subscripts 0 and 1 exchanged.)
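At the level of symbolic segments, the relation between the two fuel strands is simply a reverse complement; the sketch below checks this, assuming the segment lists given above (an illustration only, writing ~X for the complement of a segment X and X^R for its reversal):

```python
# Symbolic sketch (illustrative, not from the paper) of the segment-level
# relation between the two fuel strands: the complementary fuel strand is the
# reverse complement of the primary, so their complete hybridization forms a
# perfect 4L-base duplex. "~X" is the complement of segment X, "X^R" its
# reversal, so the reverse complement of X is "~X^R".
def seg_revcomp(seg):
    # complementing toggles the "~" prefix; reversing toggles the "^R" suffix
    seg = seg[1:] if seg.startswith("~") else "~" + seg
    return seg[:-2] if seg.endswith("^R") else seg + "^R"

def strand_revcomp(segments):
    # the reverse complement of a concatenation also reverses segment order
    return [seg_revcomp(s) for s in reversed(segments)]

primary = ["A1", "A0", "B0", "~A1^R"]    # type 0 primary fuel strand, 5' to 3'
complementary = strand_revcomp(primary)   # type 0 complementary fuel strand
```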
Fig. 13. The type 0 complementary fuel strand.
Energetics of the Fuel Strands. Note that the full complementarity of the primary and complementary fuel strands of a given type implies that the duplex DNA resulting from their complete hybridization has lower free energy. So eventually, over a sufficiently long time interval, the free energy will drive these two species to a double-stranded duplex, which is their lowest energy equilibrium state. (For example, the Seeman lab has previously demonstrated [LGM+92] that when two 4-arm junctions are mixed together at an appropriately high temperature, and cooled, they will go to 4 duplexes.) The duplex DNA consisting of the complete hybridization of the type 0 primary and complementary fuel strands is as given in Figure 14. (The duplex consisting of the complete hybridization of the type 1 primary and complementary fuel strands is identical, but with the 0 and 1 subscripts exchanged.) However, we are concerned with the special case where the primary
Fig. 14. Complete hybridization between the fuel strands.
fuel strands and the complementary fuel strands are prepared separately, to form the initial conformations defined above. By setting a sufficiently low temperature, the equilibrium duplex state can be made to take, on average, any given time duration to reach. So we will use an appropriately chosen, sufficiently low temperature to ensure that initially the primary and complementary fuel strands of the same type will each hold their own state and are not likely to anneal to each other to form a duplex state for, say, 5 hours on the average. Many cycles of the motions described below will then be feasible within this limited time interval. The Sequence of Events of a Feasible Movement. Now suppose the wheel is in the type 0 position with respect to the road. Observe that the type 0 primary fuel strand can interact with the wheel in the type 0 position by the following hybridizations: (i) First, a hybridization of the second segment A0 of the type 0 primary fuel strand with the reverse complementary segment Ā0R of the wheel, resulting in the conformation illustrated on the left in Figure 15. (Note that if one or both of the subsequences Ā1R, Ā0R at the top of the wheel were previously weakly hybridized to a further consecutive pair of subsequences a1, a0 on the road, then these weak hybridizations will be displaced by the stronger hybridization of the second segment A0 of the type 0 primary fuel strand with the reverse complementary segment Ā0R of the wheel.) (ii) Then that single-segment hybridization extends to a hybridization of the first two segments A1, A0 of the type 0 primary fuel strand with the consecutive reverse complementary segments Ā1R, Ā0R of the wheel.
(Note again that the previous weak hybridizations of these wheel subsequences with subsequences of the road will be displaced by the strong hybridizations of subsequences of the primary fuel strand with the reverse complementary segments of the wheel.) (iii) This displaces the prior hybridization of the segment Ā1R of the wheel strand with the road segment a1 (which, recall, is of lower energy). As a consequence: (a) the wheel moves by one segment in the 5′ direction along the road, effecting a transition of the state of the wheel from the type 0 position to the type 1 position; (b) this also displaces the prior hybridization of the 5′ end segment A1 of the primary fuel strand with its 3′ end segment Ā1R, which now is exposed. This results in the conformation illustrated on the right in Figure 15. Next, a type 0 complementary fuel strand, consisting of an ssDNA of base length 4L with A1, B̄0R, Ā0R, Ā1R, is now susceptible to hybridizations with reverse complementary subsequences of the type 0 primary
Fig. 15. A rotation of the DNA wheel.
fuel strand, first at that fuel strand’s newly exposed 3′ end segment Ā1R and then at B0. This results in the conformation illustrated on the left in Figure 16.
Fig. 16. Hybridization of the DNA fuel strand with the wheel.
We will show below that the formation of a type 0 fuel strand duplex removes the type 0 fuel strands from the wheel, completing the step, as illustrated on the right in Figure 16. (Also, note that for the sake of simplicity, in the illustration on the right in Figure 16 the subsequences Ā0R, Ā1R at the top of the wheel are illustrated as not hybridized to any other sequences, but in fact they may also be weakly hybridized to a further consecutive pair of subsequences a1, a0 of the road.) The type 0 fuel strand duplex consists of the mutual hybridization of the entire type 0 primary fuel strand with the type 0 complementary fuel strand, as illustrated on the left in Figure 17. Its formation removes the type 0 fuel strands from the wheel. The same resulting duplex of the type 0 fuel strands is also illustrated on the right in Figure 17 (flipped top to bottom and left to right), so that it can be compared with the prior illustrations of the type 0 primary fuel strand:
Fig. 17. Flipping of the DNA fuel strand.
We conclude that: (a) there is a feasible transition of the state of the wheel from the type 0 position to the type 1 position, driven by the fuel strands of type 0, where the wheel moves by one segment in the 5′ direction along the road; and (b) there is also a similarly feasible transition of the state of the wheel from the type 1 position to the type 0 position, driven by the type 1 fuel strands, where again the wheel moves by one segment in the 5′ direction along the road (this is almost exactly the same as described above, but with 0 and 1 reversed). Other Feasible Movements. The above does not ensure the movement is only in the 5′ direction along the road. In fact, there are multiple other feasible interactions of the wheel with the fuel strands. By a similar argument, it can easily be shown that: (c) there is a feasible transition of the state of the wheel from the type 1 position to the type 0 position, driven by the type 0 fuel strands, where the wheel moves by one segment in the 3′ direction along the road; (d) there is a feasible transition of the state of the wheel from a type 0 position to a type 1 position, driven by the type 1 fuel strands, where the wheel moves by one segment in the 3′ direction along the road; and (e) for each i = 0, 1, there is a feasible hybridization of the exposed (top) portion of the wheel in a type i position with a fuel strand that makes no transition of the state of the wheel (the wheel remains in a type i position, though the fuel strand is transformed into its duplex state), so the wheel stalls for a moment in its movement along the road. The Accelerated Random Bidirectional Movement of the Wheel. Hence the movement of the wheel with respect to the road remains a random walk, with either direction or a momentary stall possible on each step with fixed probability. By the symmetry of the construction, it can be shown that each direction of movement (from 5′ to 3′, or its reverse) is equally likely.
Recall again that for unbiased random walks in one dimension, the expected translational deviation after n steps is O(√n). However, the fuel strands appear to considerably accelerate the rate at which steps of this random walk are executed (as compared to random movements fueled by kinetic (heat) motion). Hence the acceleration in the rate of these transitions may be of use, and the above has the potential of being a feasible design for an autonomous bidirectional DNA nanomechanical device without the use of DNA ligase or restriction enzymes.
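The O(√n) deviation is easy to check numerically. The following is a generic illustration, not part of the original construction: a walk that stalls with some fixed probability, as in case (e), and otherwise steps in either direction with equal probability (stall probability and trial counts are arbitrary choices):

```python
import random

def walk_deviation(n_steps, n_trials=2000, p_stall=0.2, seed=1):
    """Mean |displacement| after n_steps, where each step stalls with
    probability p_stall and otherwise moves +1 or -1 with equal chance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        pos = 0
        for _ in range(n_steps):
            r = rng.random()
            if r < p_stall:
                continue                       # momentary stall, case (e)
            pos += 1 if r < p_stall + (1.0 - p_stall) / 2.0 else -1
        total += abs(pos)
    return total / n_trials

# Quadrupling the number of steps roughly doubles the mean deviation,
# as the O(sqrt(n)) scaling predicts.
d100, d400 = walk_deviation(100), walk_deviation(400)
assert 1.5 < d400 / d100 < 2.6
```

Stalls only rescale the effective step count, so they leave the square-root scaling intact.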
4 Conclusion
A key open problem is to design and demonstrate a DNA device achieving unidirectional movement (either translational or rotational). We conjecture that such a device will need to use irreversible reactions (e.g., ligation). Acknowledgments. We wish to thank Tingting Jiang, Thom LaBean, Hao Yan, and Peng Yin for their valuable and insightful comments that led to many improvements in this paper. As stated above, Turberfield proposed to us the use of DNA fuel loop strands to provide energy for autonomous motion by DNA devices, and some possible mechanisms to use them, but these were distinct from what we have presented.
The Design of Autonomous DNA Nanomechanical Devices
37
References
[BPA+01] Benenson, Y., T. Paz-Elizur, R. Adar, E. Keinan, Z. Livneh, and E. Shapiro, Programmable and autonomous computing machine made of biomolecules, Nature 414, 430-434 (2001).
[F82] Feller, W., An Introduction to Probability Theory and Its Applications, John Wiley & Sons, New York (1971).
[HLN01] T. Humphrey, H. Linke, and R. Newbury, Pumping Heat with Quantum Ratchets, to appear in Physica E (2001), cond-mat/0103552.
[LYK+00] T. H. LaBean, Hao Yan, Jens Kopatsch, Furong Liu, Erik Winfree, John H. Reif, and Nadrian C. Seeman, Construction, analysis, ligation, and self-assembly of DNA triple crossover complexes, J. Am. Chem. Soc. 122, 1848-1860 (2000).
[LWR00] T. H. LaBean, E. Winfree, and J.H. Reif, Experimental Progress in Computation by Self-Assembly of DNA Tilings; Gehani, A., T. H. LaBean, and J.H. Reif, DNA-based Cryptography; Proc. DNA Based Computers V, Cambridge, MA, June 14-16, 1999, DIMACS Vol. 54, edited by E. Winfree and D.K. Gifford, American Mathematical Society, Providence, RI, pp. 123-140 (2000). URL: http://www.cs.duke.edu/∼reif/paper/DNAtiling/tilings/labean.pdf
[LGM+92] M. Lu, Q. Guo, L.A. Marky, N.C. Seeman, and N.R. Kallenbach, Thermodynamics of DNA Chain Branching, Journal of Molecular Biology 223, 781-789 (1992).
[MLR+00] C. Mao, T. LaBean, J.H. Reif, and N.C. Seeman, Logical Computation Using Algorithmic Self-Assembly of DNA Triple Crossover Molecules, Nature 407, 493-496 (2000).
[MSS+99] C. Mao, W. Sun, Z. Shen, and N.C. Seeman, A DNA Nanomechanical Device Based on the B-Z Transition, Nature 397, 144-146 (1999).
[RLS01] J.H. Reif, T.H. LaBean, and N.C. Seeman, Challenges and Applications for Self-Assembled DNA Nanostructures, Proc. Sixth International Workshop on DNA-Based Computers, Leiden, The Netherlands, June 2000, edited by A. Condon and G. Rozenberg, Lecture Notes in Computer Science, Vol. 2054, Springer-Verlag, Berlin Heidelberg, 2001, pp. 173-198. URL: http://www.cs.duke.edu/∼reif/paper/SELFASSEMBLE/selfassemble.pdf
[R01] J. H. Reif, DNA Lattices: A Programmable Method for Molecular Scale Patterning and Computation, to appear in the special issue on Bio-Computation, Computer and Scientific Engineering Journal of the Computer Society, 2001. URL: http://www.cs.duke.edu/∼reif/paper/DNAlattice/DNAlattice.pdf
[S99] N.C. Seeman, DNA Engineering and its Application to Nanotechnology, Trends in Biotech. 17, 437-443 (1999).
[WLW98] E. Winfree, F. Liu, L. A. Wenzler, and N.C. Seeman, Design and Self-Assembly of Two-Dimensional DNA Crystals, Nature 394, 539-544 (1998).
[YZS+02] H. Yan, X. Zhang, Z. Shen, and N.C. Seeman, A robust DNA mechanical device controlled by hybridization topology, Nature 415, 62-65 (2002).
[YTM+00] Yurke, B., Turberfield, A. J., Mills, A. P. Jr., Simmel, F. C., and Neumann, J. L., A DNA-fuelled molecular machine made of DNA, Nature 406, 605-608 (2000).
Cascading Whiplash PCR with a Nicking Enzyme

Daisuke Matsuda 1 and Masayuki Yamamura 2

1 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuda, Midori-ku, Yokohama 226-8502, Japan; [email protected]
2 CREST, Japan Science and Technology Corporation (JST), and Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology; [email protected]
Abstract. Whiplash PCR has been proposed as a unique mechanism for realizing autonomous inference machines and has been used in various applications of DNA computing. However, it is not easy to increase the step size within a single molecule because of back annealing. This paper proposes a scheme to cascade the results of WPCR from molecule to molecule by using a nicking enzyme. We also show preliminary experiments that produce output fragments continuously from WPCR.
1 Introduction
Whiplash PCR (WPCR) was proposed by Hagiya et al. as a unique mechanism to realize autonomous inference machines with single-stranded DNAs forming hairpin structures [1]. Early WPCR required thermal control like normal PCR. Sakamoto et al. established an isothermal protocol so that DNA molecules act in a completely autonomous way without any outside control [2]. They also proposed an elegant input method for WPCR [3]. WPCR is used not only in abstract computation but also in various applications. Wood et al. used WPCR to represent strategies for a game in their research on evolving intelligent agents [4]. Yamamura et al. proposed a memory-state copy procedure with WPCR in their research on aqueous computing with PNAs [5]. WPCR suffers from a difficulty called back annealing: the active end of the DNA hybridizes to already processed sites more frequently than to the intended ones. Rose et al. proposed a scheme to inhibit back annealing with bis-PNA [6]. Although this increases the step size, it is not easy to extend it with intramolecular activity alone. This paper proposes a scheme to cascade the results of WPCR from molecule to molecule. We expect it to ease the difficulty with the step size, and also to enable new applications by introducing intermolecular activities. This paper consists of four sections. Section 2 proposes a scheme to cascade WPCR with a nicking enzyme. Section 3 shows preliminary experiments that examine the feasibility of the proposed idea. Section 4 discusses practical and further issues.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 38–46, 2003.
© Springer-Verlag Berlin Heidelberg 2003
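The state-transition core of WPCR can be sketched symbolically: the 3' head of the hairpin-forming strand carries the current state, rule regions on the same strand encode (current, next) pairs, and each polymerization cycle appends the next state to the head. A minimal sketch with illustrative state names; the encoding below is an abstraction, not the actual sequence design:

```python
def wpcr_run(rules, start, max_steps=10):
    """Symbolically simulate Whiplash PCR state transitions.

    rules: dict mapping a current state to the next state, standing in
    for the rule regions on the strand.  The 3' head repeatedly anneals
    to a rule matching its current state and is extended by one state
    per cycle; with no matching rule, the machine halts.
    """
    head = [start]                 # states appended at the 3' end
    for _ in range(max_steps):
        state = head[-1]
        if state not in rules:
            break
        head.append(rules[state])  # polymerase appends the next state
    return head

# Three successive transitions, echoing the three-transition machine of Sec. 3
# (the state names here are purely illustrative):
assert wpcr_run({"s0": "s1", "s1": "s2", "s2": "s3"}, "s0") == ["s0", "s1", "s2", "s3"]
```

Back annealing is exactly what this abstraction hides: a real head can also re-anneal to already-used rule regions, which is why the intermolecular cascade proposed below is of interest.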
2 Scheme to Cascade WPCR
We designed a scheme to cascade WPCR by using a nicking enzyme.

2.1 Nicking Enzyme
Type II and Type I restriction enzymes are useful in DNA engineering because they cleave DNA at a specific site. Most restriction enzymes catalyze double-stranded cleavage on DNA substrates, but there are some proteins, called nicking enzymes, that cleave only one DNA strand and so introduce a nick into the DNA molecule. We used N.BstNBI as the nicking enzyme [7]. It recognizes a nonpalindromic sequence, 5'-GAGTC-3', and cleaves only the top strand, 4 base pairs away from its recognition sequence (5'-GAGTCNNNN^N-3').

2.2 Output Reaction and Cascading
Figure 1 shows the scheme to cascade the results of WPCR from molecule to molecule. In order to cascade WPCR, we must prepare a DNA sequence with a special output site. In the final step of WPCR, the molecular machine has an output fragment along with a recognition sequence of a nicking enzyme. The nicking enzyme makes a nick between the output fragment and the recognition sequence. Finally, the output fragment leaves the molecular machine. The output fragment can be considered a new data fragment that defines the state of the system; the state of the system is updated because the output fragment joins the set of data fragments. An ssDNA fragment can be used as input for other molecular machines [3]. A data fragment consists of two parts, a paste part and a definition part. The paste part hybridizes to the 3'-end of a rule region, and an initial primer is appended to the 3'-end of the rule region by polymerization using the definition part as a template. The extended rule region can work as a molecular machine for the next WPCR. At this point, rule regions have no complementary sequence of the definition part.
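The cascading step just described (the paste part anneals to the 3'-end of a rule region; the definition part templates the appendage of an initial primer) can be rendered symbolically. The strings below are illustrative placeholders, not real sequences:

```python
def extend_rule_region(rule_region, data_fragment):
    """Symbolic cascade step.

    A data fragment is a (paste, definition) pair.  If the paste part
    matches the 3' end of the rule region, polymerization templated by
    the definition part appends an initial primer region, yielding a
    machine ready for the next round of WPCR.  Returns None when the
    paste part cannot hybridize.
    """
    paste, definition = data_fragment
    if not rule_region.endswith(paste):
        return None                       # no hybridization, no new machine
    return rule_region + definition       # appended initial-primer region

# A matching data fragment activates the rule region ...
machine = extend_rule_region("RULE1|PASTE_A", ("PASTE_A", "|INIT"))
assert machine == "RULE1|PASTE_A|INIT"
# ... while a non-matching one leaves it inert.
assert extend_rule_region("RULE1|PASTE_B", ("PASTE_A", "|INIT")) is None
```

This string-matching abstraction captures only the combinatorics of the cascade; hybridization specificity and kinetics are outside its scope.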
3 Preliminary Experiments

We designed preliminary experiments to examine the feasibility of the proposed idea.

3.1 Materials and Methods
Molecular Machine. The DNA sequences used in this paper are listed in Table 1. The molecular machine Tran3, in the form shown in Figure 2, was commercially synthesized by Greiner Bio-One. Since the recognition sequence of the enzyme usually contains all four kinds of bases (A, T, C, G), a C18 spacer is used as the stopper in Tran3; this makes the enzymatic reaction on Tran3 possible. The complementary strand of s4 plays the role of the output fragment.
Fig. 1. The scheme to cascade results of WPCR from molecule to molecule. (a) A molecular machine in an initial state starts WPCR. (b) The nicking enzyme nicks the molecular machine between the output fragment and the recognition sequence, and the output fragment leaves the molecular machine. (c) The molecular machine restores the output fragment. (d) The output fragment is added to the set of data fragments that define the state of the system, and a new state is formed. (e) One of the rule regions is paired with one of the data fragments. (f) An initial primer is appended to the 3'-end of the rule region by polymerization using the data fragment as a template.
Table 1. DNA sequences used in this paper

No  Sequence
s0  CCGTCATCTTCTGCT
s1  TACCCTCCCTCACTT
s2  CGTCCTCCTCTTGTT
s3  GCTTGACTC
s4  CATTTCTGGCTCGTC
s5  TCGTGCCGTTCGTCCATTTCTGGCTCGTC
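Two design relations implicit in Table 1 can be checked mechanically: the N.BstNBI site 5'-GAGTC-3' sits on the strand complementary to s3 (our reading of the duplex orientation in Fig. 2), and s5 carries s4 at its 3' end, consistent with s4 later priming on the s4-s5 template. A small verification sketch:

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def nbstnbi_nick_sites(top):
    """Top-strand cut indices for N.BstNBI (5'-GAGTCNNNN^N-3'):
    the nick falls 4 bases after the 5-base recognition sequence."""
    return [i + 9 for i in range(len(top))
            if top.startswith("GAGTC", i) and i + 9 <= len(top)]

s3 = "GCTTGACTC"
s4 = "CATTTCTGGCTCGTC"
s5 = "TCGTGCCGTTCGTCCATTTCTGGCTCGTC"

# The recognition sequence appears on the strand complementary to s3 ...
assert "GAGTC" in revcomp(s3)
# ... and s5 ends with s4, so s4 can prime on the s4-s5 template.
assert s5.endswith(s4)
# Cut-rule check on a synthetic example: a site at index 0 cuts at index 9.
assert nbstnbi_nick_sites("GAGTCAAAATTTT") == [9]
```

Such checks are cheap insurance when sequences are hand-designed around an enzyme's recognition rule.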
Fig. 2. The structure of Tran3. s3 contains the recognition sequence of the nicking enzyme N.BstNBI. And Tran3 is nicked between s3 and s4.
Output Reaction from WPCR. At first, three successive transitions were performed in a reaction mixture containing Tran3 (7 pmol), dNTP (60 µmol each), Bst DNA Polymerase (4 units, BioLabs Inc.), Bst DNA Polymerase buffer, and N.BstNBI buffer. The thermal cycle of WPCR was 64°C for 1 min, a shift up to 80°C in 1 min, and 80°C for 5 min [3]; the WPCR consisted of 8 reaction cycles. Secondly, the reaction mixture that had undergone WPCR was incubated at 64°C for 90 min with the addition of N.BstNBI and the template s4-s5. In this experiment the output reaction is separated from WPCR, so we call this the "separated output reaction"; it serves as a control, in which the two partial reactions (WPCR and the output reaction) are performed separately. For the thermal and isothermal output reactions, N.BstNBI and the template s4-s5 were added to the reaction mixture before WPCR. In contrast to the separated output reaction, the "thermal output reaction" underwent the two partial reactions continuously, and the "isothermal output reaction" was performed at 64°C for 90 min.

3.2 Results
WPCR. The transition products were analyzed on a denaturing 10% polyacrylamide gel stained with ethidium bromide (Figure 4). As the result of three successive transitions, the reaction mixture is expected to contain three kinds of products (Figure 3), and Figure 4 shows exactly that. Therefore, the molecular machines completely underwent three successive transitions, and the C18 spacer functioned perfectly as the stopper. Output Reaction from WPCR. The product of the third transition has the double-stranded recognition sequence of N.BstNBI, which makes a nick between the output fragment s4 and s3. s4 leaves the molecular machine and plays the part of the primer on s4-s5, the template of the output product (Figure 5). There is no great difference among the results of the output reaction from WPCR under the three conditions (Figure 6).
4 Discussion and Conclusion
We proposed a scheme to cascade WPCR from molecule to molecule, and showed preliminary experiments.
Fig. 3. The products of the transitions. State1 is the product of the first transition, State2 that of the second, and State3 that of the third.
Fig. 4. Gel electrophoresis of the transition products. Lane 1 underwent no transition phase; lane 2 underwent WPCR. The band of lane 1 shows State1. Lane 2 shows State2 (lower band) and State3 (upper band).
Fig. 5. The model of the output reaction.
Fig. 6. The results of the output reaction from WPCR. Shown are the marker output product (lane M2), the output product of the separated output reaction (lane 1), that of the thermal output reaction (lane 2), and that of the isothermal output reaction (lane 3).
4.1 Discussion about Preliminary Experiments
Since the temperature condition of the output reaction and that of the incubation phase in WPCR are the same, the output reaction can be considered an extended reaction of WPCR. Therefore, it is natural that the output reaction is performed as an autonomous reaction, and that there is not much difference between the efficiency of the autonomous output reaction and that of the separated output reaction. The experiments show this qualitatively. Because output fragments are produced continuously, the output reaction has an amplifying character. However, a pilot experiment on the output reaction qualitatively shows that the concentration of the output products tends to converge (data not shown). This means that an equilibrium between released output fragments and back-annealed output fragments is formed. We suppose the concentration of the output products is determined by two factors: the concentration of the molecular machines in the final step of WPCR, and this equilibrium. In order to test this possibility, we will hereafter have to perform quantitative analysis. The output product of the isothermal output reaction was observed. In the previous paper [2], on the contrary, incubation at 80°C was recommended because products of transitions were not observed under the condition at 64°C. The differences between this paper and the previous paper are the polymerase used in the experiments and the target observed. Since the target in this paper is double-stranded DNA and is amplified by the output reaction, the efficiency of detection is improved. We suppose
those two factors make observation of WPCR at 64°C possible. In order to claim strictly that incubation at 64°C performs WPCR, we will hereafter have to observe the products of the transitions.
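The supposed equilibrium between released and back-annealed output fragments can be illustrated with a toy two-state model; the first-order form and the rate constants below are our own illustrative assumptions, not measurements from the paper:

```python
def free_fraction(k_on, k_off, steps=10000, dt=1e-3):
    """Iterate d[free]/dt = k_off*[bound] - k_on*[free], with the total
    fragment concentration normalized to 1.  k_off models release of
    the output fragment, k_on models back annealing."""
    free = 0.0
    for _ in range(steps):
        bound = 1.0 - free
        free += dt * (k_off * bound - k_on * free)
    return free

# The free fraction converges to k_off / (k_on + k_off), i.e. the
# output-product concentration plateaus rather than growing without bound.
f = free_fraction(k_on=2.0, k_off=1.0)
assert abs(f - 1.0 / 3.0) < 1e-3
```

The plateau depends only on the rate-constant ratio, which is consistent with the observation that the output concentration converges regardless of continued incubation.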
Fig. 7. Ideas to realize [IF A AND B THEN C] and [IF ¬A AND B THEN C]. The state of the system is defined by data fragments A and B.
4.2 Further Issues
We have a plan to apply the proposed method to implement an expert system for medical diagnosis. A molecular machine is synthesized from a rule region and a data fragment, so the concentration of each synthesized molecular machine depends on that of the data fragments. We expect the concentration of each data fragment to work as the certainty factor of expert systems. For example, we show two concrete ideas for realizing logical operations in Figure 7. These operations are performed with probabilities that depend on the data fragment concentrations; in this sense, the concentration serves as the certainty factor. The state of the system is defined by data fragments A and B. (a) shows an idea to realize [IF A AND B THEN C]: data fragment C is output by the molecular machine when data fragments A and B exist in the system. (b) shows an idea to realize [IF ¬A AND B THEN C]: in this idea, C is output by the molecular machine in the presence of B, while the output of C is blocked by the presence of A. Therefore, when both A and B exist, the output of C decreases gradually; that is, the output of C becomes comparatively negligible when observed after sufficient time. Consequently, the higher the certainty of A, the lower the certainty of C. We have proposed a scheme to cascade WPCR, and shown a reliability study on producing output fragments with a nicking enzyme. In the next stage, we must develop programming techniques as discussed. We expect expert systems to be a promising application field. Direct molecular I/O, which is one of the most fruitful characteristics of DNA computing, leads to micro diagnosis models in biotechnology and medicine.
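The certainty-factor reading of Figure 7 can be mimicked with a crude steady-state model; treating fragment concentrations as values in [0, 1] and using simple products is our simplification, not the paper's kinetics:

```python
def rule_and(a, b):
    """[IF A AND B THEN C]: C is output only when both fragments are present,
    so the output certainty scales with both concentrations."""
    return a * b

def rule_not_and(a, b):
    """[IF NOT-A AND B THEN C]: B enables the output of C, while A blocks it."""
    return (1.0 - a) * b

# The higher the certainty of A, the lower the certainty of C under the
# second rule, matching the qualitative behaviour described in the text.
assert rule_not_and(0.9, 0.8) < rule_not_and(0.1, 0.8)
assert rule_and(0.0, 0.8) == 0.0
```

A real implementation would have to account for the gradual, time-dependent decrease of C described above, which this static model ignores.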
References
1. M. Hagiya, M. Arita, D. Kiga, K. Sakamoto, and S. Yokoyama, "Towards parallel evaluation of Boolean µ-formulas with molecules", in (H. Rubin and D. H. Wood, Eds.) DNA Based Computers III (American Mathematical Society, 1999), 57-72.
2. Kensaku Sakamoto, Daisuke Kiga, Ken Komiya, Hidetaka Gouzu, Shigeyuki Yokoyama, Shuji Ikeda, Hiroshi Sugiyama, and Masami Hagiya, State Transitions by Molecules, BioSystems, Vol. 52, No. 1-3, 1999, pp. 81-91.
3. K. Komiya, K. Sakamoto, H. Gouzu, S. Yokoyama, M. Arita, A. Nishikawa, and M. Hagiya, "Successive State Transitions with I/O Interface by Molecules", in (A. Condon and G. Rozenberg, Eds.) DNA Computing, Springer-Verlag LNCS, Vol. 2054 (Springer, Berlin, 2001), 17-26.
4. David Harlan Wood, Hong Bi, Steven O. Kimbrough, Dongjun Wu, and Junghuei Chen, DNA Starts to Learn Poker, DNA7, 7th International Meeting on DNA Based Computers, Preliminary Proceedings, 2001, pp. 23-32.
5. Masayuki Yamamura, Yusuke Hiroto, and Taku Matoba, Another Realization of Aqueous Computing with Peptide Nucleic Acid, DNA7, 7th International Meeting on DNA Based Computers, Preliminary Proceedings, 2001, pp. 251-259.
6. J. A. Rose, R. J. Deaton, M. Hagiya, and A. Suyama, An Equilibrium Analysis of the Efficiency of an Autonomous Molecular Computer, Physical Review E, Vol. 65, No. 2-1, 2002, 021910, pp. 1-13.
7. Richard D. Morgan, Celine Calvet, Matthew Demeter, Refael Agra, and Huimin Kong, Characterization of the Specific DNA Nicking Activity of Restriction Endonuclease N.BstNBI, Biol. Chem., Vol. 381, 1123-1125 (2000).
A PNA-mediated Whiplash PCR-based Program for In Vitro Protein Evolution

John A. Rose 1,4, Mitsunori Takano 2, and Akira Suyama 3,4

1 Department of Computer Science, The University of Tokyo, [email protected]
2 Institute of Physics, The University of Tokyo, [email protected]
3 Institute of Physics, The University of Tokyo, [email protected]
4 Japan Science and Technology Corporation (JST-CREST)
Abstract. The directed evolution of proteins using an in vitro domainal shuffling strategy was proposed by J. Kolkman and W. Stemmer (Nat. Biotech. 19, 423 (2001)). Due to backhybridization during parallel overlap assembly, however, this method appears unlikely to be an efficient means of iteratively generating massive, combinatorial libraries of shuffled genes. Furthermore, recombination at the domainal level (30-300 residues) appears too coarse to effect the evolution of proteins with substantially new folds. In this work, the compact structural unit, or module (10-25 residues long), and the associated pseudo-module are adopted as the fundamental units of protein structure, so that a protein may be modelled as an N- to C-terminal walk on a directed graph composed of pseudo-modules. An in vitro method, employing PNA-mediated Whiplash PCR (PWPCR), RNA-protein fusion, and restriction-based recombination, is then presented for evolving protein sets with high affinity for a given selection motif, subject to the constraint that each represents a walk on a predefined pseudo-module digraph. Simulations predict PWPCR to be a reasonably high-efficiency method of producing massive, recombined gene libraries encoding for proteins shorter than about 600 residues.
1 Introduction
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 47–60, 2003.
© Springer-Verlag Berlin Heidelberg 2003

The generation of shuffled dsDNA libraries by DNase I digestion and parallel overlap assembly (POA) (i.e., DNA shuffling) has been established as a powerful means of implementing random recombination between homologous genes [1,2]. Simulations, however, predict that this approach is unlikely to be capable of evolving substantially novel protein folds, and that the non-homologous swapping of folded structures (i.e., exon or domain shuffling) is key for optimizing the search of protein sequence space [3]. Many proteins can be modelled as a string of non-overlapping, independently folding elements, or domains, each of which is typically 30-300 residues in length [4]. This has recently prompted the suggestion of protein evolution by in vitro domainal shuffling [5]. In this model, a polynucleotide species encoding for each domain is combined in solution with a set of
chimeric oligonucleotides, each of which encodes a domain-domain boundary. The iterated annealing, polymerase extension, and dissociation of this strand set (i.e., POA) then results in the production of a library of domain-shuffled dsDNAs. Although this operation preserves domainal integrity during initial shuffling, additional crossover following expression and screening requires either (1) DNase I digestion, which destroys the domainal integrity preserved by the initial process, or (2) a secondary, low-efficiency POA process, in which selected strands are used to implement the iterated extension of a set of added primers. In particular, the efficiency of POA is compromised by the high stability of backhybridized duplexes, relative to extendable hybrids [6,7], in a fashion similar to that reported for Whiplash PCR (WPCR) [8,9]. In a secondary domainal shuffling process, this inefficiency will be exacerbated by both the increased length of backhybrids, and the tendency for strands to stably hybridize away from extendable ends, due to the high sequence similarity conferred by the domainal shuffling strategy. Recombination at the domainal level, which focuses on swapping existing folds, also appears to be too coarse to effect the evolution of substantially new folds. Finally, it is not clear how an iterative DNA shuffling strategy can be designed to ensure that a high proportion of dsDNAs encode promoter and ribosomal binding sites, to facilitate the use of an in vitro system for expression and selection (e.g., RNA-protein fusion [10,11]), which is necessary for overcoming the transfection limit of about 10^9 strands [12]. In previous theoretical work, the efficiency of WPCR [13] was optimized [8,9], and the resulting architecture, PNA-mediated Whiplash PCR (PWPCR), was adapted to produce an in vitro genetic program for evolving approximate solutions to instances of Hamiltonian Path [14].
For related work, in which WPCR is used to evolve strategies for a simple version of Poker, readers are referred to [15]. In this work, PWPCR is combined with RNA-protein fusion [10,11] to implement a high-efficiency exon shuffling operation. For this purpose, the compact structural unit, or module [16], rather than the domain, is adopted as the basic element of protein structure, so that each shuffled protein represents a walk on a predefined graph, in which each vertex represents a pseudo-module contained in an initial protein set of interest.
2 Protein Representation

2.1 The Module Picture of Protein Architecture
The domain, or independently folding unit, is well established as a basic element of protein structure [4]. As pointed out by M. Go et al. [16,17], however, the frequent occurrence of introns within domains belies the view that each exon encodes a domain, and suggests the decomposition of domains into a set of smaller compact structural units, or modules, each of which forms a compact structural unit within the larger domain and corresponds roughly to an exon. Any globular protein with a known 3D structure can be decomposed into an N-terminal to C-terminal sequence of modules, by exploiting either the tendency for module junctions to be buried, or the tendency of modules to form a locally
compact unit [17]. Module and exon boundaries have been shown to correlate strongly in ancient proteins [18], supporting the view that ancient conserved proteins were evolved via exon (module) shuffling, as expected by the Exon Theory of Genes [19]. The basic module structure is that of a “unit-turn-unit”, with a length which correlates with the radius of the protein [18]. Typical module lengths vary from 10-25 residues (about half the mean exon length), and are clustered tightly around a mean of 18 residues. Interestingly, this mean length is very close to the optimal length (20 residues) of exchanged segments, reported for simulations of genetic recombination [3]. Longer modules reported for globins (up to 45 residues) are decomposable into a set of shorter modules [20]. The element between the approximate midpoints of adjacent modules (the pseudo-module), which has the general structure of coil-unit-coil, has also been investigated as a basic structural element of proteins [21].

2.2 The Pseudo-module Generating Graph
The module picture of protein structure suggests that a protein, P, which can be decomposed into an N- to C-terminal sequence of q+1 pseudo-modules, may be modelled as a q-step tour of the digraph, GP(V, E), composed of q+1 vertices, V = {Vi : i = 1, ..., q+1}, so that Vi represents the ith pseudo-module from P's N-terminus, and edges, E = {Ei,i+1 : i = 1, ..., q}, where Es,t is the directed edge between source and target vertices, Vs and Vt, respectively. This pseudo-module graph representation facilitates a discussion of the generation problem for sets of proteins derived from P by various forms of pseudo-module sampling. For instance, the protein set generated by the random sampling of q+1 pseudo-modules from P (colloquially referred to as “shuffling”) corresponds to the set of q-step walks on the fully interconnected graph, GP(V, E†), with edges, E† = {Ei,j : i, j ∈ {1, ..., q}}. This model conveniently generalizes to discuss the generation of protein sets by restricted forms of pseudo-module “shuffling”. In particular, it may be desirable to generate a set of proteins by pseudo-module shuffling between (or within) specific regions of P, while other regions remain unshuffled. This constrained sampling on P can be expressed as a set of edges, Es ⊆ E†, on GP(V, E†), which preserves the desired pseudo-module adjacencies. The protein set generated under the application of this sampling will then be equal to the set of random walks on the pseudo-module graph, GP(V, Es). The framework also generalizes straightforwardly to discuss the generation of protein sets by the pseudo-module sampling of multiple proteins. Consider the sampling of a set of n proteins, P = {Pk : k = 1, ..., n}, each of which can be expressed as a sequence of pseudo-modules, and thus as a tour of a simple pseudo-module graph, Gk(V(k), E(k)).
Furthermore, suppose that each protein, Pk ∈ P, is to be sampled subject to an internal sampling constraint, Es(k), and each pair of proteins, Pk, Pk′ ∈ P, is to be sampled subject to the pairwise constraint Es(k, k′), which specifies a set of allowed adjacencies (i.e., a set of edges) between Pk and Pk′. The set of proteins which may be generated by means of this sampling procedure is then identical to the set of random
walks on the pseudo-module graph, GP(V′, E′), where V′ = ∪k V(k), and E′ = (∪k Es(k)) ∪ (∪k,k′ Es(k, k′)).
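The constrained-sampling model just described reduces protein generation to enumerating directed walks on a pseudo-module digraph. A minimal sketch; the vertex names and the sampled edge set are illustrative, not taken from the paper:

```python
def walks(edges, start, length):
    """All directed walks of `length` steps on a pseudo-module digraph,
    given as a dict from source vertex to a list of target vertices.
    Each returned walk is the pseudo-module sequence of one generated
    protein, read N- to C-terminus."""
    if length == 0:
        return [[start]]
    result = []
    for nxt in edges.get(start, []):
        for tail in walks(edges, nxt, length - 1):
            result.append([start] + tail)
    return result

# A protein P = V1-V2-V3 whose sampling constraint also allows the
# extra edge V2 -> V1 generates two 2-step walks from V1:
edges = {"V1": ["V2"], "V2": ["V3", "V1"]}
assert walks(edges, "V1", 2) == [["V1", "V2", "V3"], ["V1", "V2", "V1"]]
```

The fully interconnected graph E† recovers unrestricted "shuffling" as a special case, simply by listing every vertex as a target of every other.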
3 An In Vitro Genetic Program for Protein Evolution
In this section, an in vitro method is presented for evolving sets of proteins with high affinity for a predefined selection motif, subject to the constraint that each represents a directed walk on a specified pseudo-module graph, GP(V′, E′). This method begins with initialization by self-assembly of a set of ssDNAs, each of which encodes a PWPCR program for the generation of a 3' DNA tail which encodes a walk on GP(V′, E′). Initialization is followed by the iterated application of a three-step cycle: (1) genotype generation by PWPCR [8] followed by parallel strand conversion to dsDNA, (2) fitness evaluation and selection, by the generation of a set of RNA-protein fusions [10,11] (i.e., the phenotypes), followed by selection based on affinity to an immobilized selection motif, and (3) recombination, using a restriction enzyme-based crossover operation.
Fig. 1. PWPCR representation and assembly (see text for details). (a) Generalized edge strand (top), initiation strand (middle), and termination strand (bottom) structure. (b) Targeted PNA2/DNA triplex formation (at codeword, X). (c) Process of assembly from shorter strands, for a strand which computes the path, V1 → V2 → V3.
3.1 Initialization
Initialization begins with the construction of four sets of ssDNA strands: (1) an edge strand for each edge, Vs → Vt, in the pseudo-module graph of interest, GP(V′, E′), where Vs and Vt denote the edge's source and target vertices, respectively; (2) an initialization strand for each vertex, Vi, in GP(V′, E′) which is
to serve as the initial state of a directed walk on GP(V′, E′); (3) a termination strand for each vertex, Vf, in GP(V′, E′), which may serve as the final state of a directed walk on GP(V′, E′); (4) a splinting strand for each vertex in GP(V′, E′). Edge strand structure is shown in Fig. 1(a; top structure), where codewords xs, xt, as, and at are edge-specific, while codewords W, X, Y, and Z are fixed for all strands. 5' to 3' orientation is indicated by a vertical fin at the strand's 3' terminus. An overscore denotes Watson-Crick complementation. as and at identify the source and target vertices of the encoded edge, respectively, and serve to direct the process of PWPCR strand assembly (Fig. 1(c)). W and Z encode the 5' and 3' halves of one strand of the restriction site of the restriction endonuclease BseDI (C↓CTTGG) [22], respectively. For convenience, W Z is illustrated as a light, x'ed rectangle. Together, these words enable directed recombination at this edge (cf. Sec. 3.4). X, xs, Y, and xt collectively implement the transition, Vs → Vt, by directing primer extension during PWPCR (cf. Sec. 2). xs and xt are the reverse complements of DNA triplet encodings of the transition's source and target pseudo-module vertices, respectively. X = (GGA)4 is the target site for formation of a PNA2/DNA triplex (Fig. 1b), a structure which is known to arrest polymerase extension [23], in order to implement the WPCR/PWPCR operation, polymerization stop. Y = X directs the insertion of word X between words xs and xt in the extending 3' tail of the assembled PWPCR strand, which will be used for targeted formation of an additional, backhybridization-inhibiting PNA2/DNA triplex during PWPCR. Initiation strand structure is shown in Fig. 1(a; middle structure), where code words xi and ai are strand-specific, while codewords Pro, P, X, Y, Z, and Ini are fixed for all strands. A black oval denotes 5' biotinylation.
P ro encodes a T7 promoter sequence [24] to direct transcription (c.f., Sec. 3.3). P directs primer annealing during PCR. ai identifies the initial vertex specified by the encoded strand, and directs the process of PWPCR strand assembly. X, xi , Y , and Ini collectively serve to initiate PWPCR by directing the appendage of codeword, xi to the 3’ terminus of the fully assembled strand (c.f., Sec. 3.2). Termination strand structure is shown in Fig. 1(a; bottom structure), where xf and af are strand-specific, while W , X, T , Y , Q, and Ini are fixed for all strands. af identifies the terminal vertex specified by the encoded strand, and serves to 3’-terminate the process of PWPCR strand assembly. X, T , Y , and xf collectively serve to terminate PWPCR by directing the appendage of the codeword, T to the 3’ terminus of the assembled strand at completion of PWPCR (c.f., Sec. 3.2). Q is a codeword for primer annealing during PCR. Ini encodes a Shine-Dalgarno sequence [25], spacer sequence, and start codon, and serves (following transcription) to initiate the translation process (c.f., Sec. 3.3). The splint strand for each vertex, Vi has structure, ai W Zai , where W Z is represented in Fig. 3 as a darkened, x’ed rectangle. Words, ai aid in splinting. The parallel assembly of the initial strand set, which encodes a set of walks on GP (V , E ), is most straightforwardly accomplished by an Adleman-like assembly process [26], implemented by mixing all strands under conditions appropriate for annealing and ligation (T4 DNA ligase). This process is illustrated in
52
John A. Rose, Mitsunori Takano, and Akira Suyama
Fig. 1(c), in the context of the path, V1 → V2 → V3 . In practice, a serial assembly process, beginning with a set of initiation strands, should be more efficient. The elimination of incompletely assembled strands may be accomplished by retaining only the strand set with (a) affinity to Ini, as determined by extraction using the dig/anti-dig system [27], and (b) a biotinylated 5’ end. The expansion of this architecture to enforce the assembly (and evolution) of PWPCR strands which compute deterministically is discussed in Sec. 5.
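The assembly step lends itself to a simple symbolic model: edge strands chain together whenever the anchor word at a growing strand's end matches the source anchor of another edge, and termination strands cap walks at a final vertex. The sketch below (all function and variable names invented for illustration) models only which walks the mixture can encode, not the annealing and ligation chemistry itself.

```python
# Symbolic sketch of the Adleman-like assembly of Sec. 3.1. Edge strands for
# Vs -> Vt are abstracted to vertex pairs; a strand is "ligatable" onto a
# growing walk when its source anchor (a_s) matches the walk's current end.
def assemble_walks(edges, initial, terminal, max_edges):
    """Enumerate the walks that initiation strands could grow into."""
    walks = []
    frontier = [[v] for v in sorted(initial)]   # initiation strands
    for _ in range(max_edges):
        next_frontier = []
        for walk in frontier:
            for (s, t) in edges:
                if s == walk[-1]:               # anchor words match: ligate
                    next_frontier.append(walk + [t])
        # termination strands cap walks ending at a terminal vertex
        walks += [w for w in next_frontier if w[-1] in terminal]
        frontier = next_frontier
    return walks

edges = [("V1", "V2"), ("V2", "V3"), ("V2", "V4")]
walks = assemble_walks(edges, initial={"V1"}, terminal={"V3", "V4"}, max_edges=3)
print(walks)  # -> [['V1', 'V2', 'V3'], ['V1', 'V2', 'V4']]
```

In the paper's terms, each appended vertex corresponds to annealing over the ai anchor words followed by ligation with T4 DNA ligase.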
Fig. 2. PWPCR-based generation of a dsDNA library (see text for details). (a) The basic PWPCR process. (b) Termination of PWPCR by addition of word, T . (c) Conversion to dsDNA by double primer extension.
3.2 PWPCR-based Generation of a dsDNA Library
The generation of a dsDNA library suitable for expression as a set of proteins on GP (V , E ) is initiated by applying a low salt (20 mM [Na+]), excess bis-PNA
wash to the initialized ssDNA mixture. This procedure has been reported to result in the rapid attainment of a near unity fractional saturation of dsDNA target sites (codeword X) [28], which is virtually irreversible, at temperatures beneath the PNA2 /DNA melting transition. This set of triplexes will be used to implement polymerization stop. The parallel generation of a protein-encoding 3’ tail for each library strand begins by applying a reaction temperature, Trx , which promotes hairpin formation at words, Ini and Ini. As shown in Fig. 2 (a; top structure) for a PWPCR strand encoding the walk, V1 → V2 → V3 , addition of DNA polymerase then results in the self-directed, 3’ extension (horizontal arrow) of each strand, to the adjacent PNA2 /DNA triplex (white oval). This process 3’ appends word X and word xi , which encodes the initial (N-terminal) pseudomodule in the strand’s encoded walk (x1 , in Fig. 2). Following extension, the ssDNA mixture is exposed to a low salt, excess bis-PNA wash, which results in the formation of a PNA2 /DNA triplex at word X (Fig. 2(a; middle structure)). As described in [8], this triplex destabilizes each newly extended hairpin, promoting denaturation and renaturation in the configuration required for the next extension (Fig. 2 (a; bottom structure)). The application of q − 1 further cycles of polymerase extension (q = 3, in Fig. 2) and bis-PNA treatment then facilitates the high-efficiency, self-directed appendage of q − 1 additional codewords to the 3’ end of each strand. PWPCR is terminated, for each strand, by the 3’ appendage of a termination codeword, T (Fig. 2(b)). The conversion to dsDNA is accomplished in parallel, for all terminated strands, as follows. Bis-PNAs are first denatured and washed (Fig. 2(c), process 3). Unwanted secondary structure is then removed by the annealing and polymerase extension of primer, Q (Fig. 2(c), process 4), which forms a protecting strand. 
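The cyclic, self-directed extension described above behaves like a string-rewriting loop: each strand carries an ordered list of transition rules, and every polymerase-extension/bis-PNA cycle appends one more codeword until the termination word is reached. A minimal sketch, with names invented and the hairpin and triplex chemistry abstracted away:

```python
# Toy model of PWPCR tail growth (Sec. 3.2): rules are (current, next)
# codeword pairs encoded 5'->3' on the strand; one codeword is appended
# per extension cycle, and word T terminates the process.
def pwpcr_extend(rules, initial_word, term_word="T", max_cycles=50):
    rule_map = dict(rules)
    tail = [initial_word]
    while tail[-1] != term_word and tail[-1] in rule_map and len(tail) < max_cycles:
        tail.append(rule_map[tail[-1]])
    return tail

# Strand encoding the walk V1 -> V2 -> V3, terminated by word T:
rules = [("x1", "x2"), ("x2", "x3"), ("x3", "T")]
print(pwpcr_extend(rules, "x1"))  # -> ['x1', 'x2', 'x3', 'T']
```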
The annealing and polymerase extension of a second primer, T (Fig. 2(c), process 5), then displaces the protecting strand (Fig. 2(c), process 6), and completes dsDNA conversion.

3.3 Fitness Evaluation and Selection via RNA-protein Fusion
The parallel evaluation and selection of the set of walks on GP (V , E ) represented by the dsDNA library is accomplished in a 7-step process which is essentially identical to the strategy for in vitro selection of proteins by RNA-protein fusion described in [12], which has been experimentally optimized for high efficiency. For clarity, this process (Fig. 3) is described in the context of a strand encoding the short walk, V1 → V2 → V3 . In the first step (Fig. 3(a)), the dsDNA library is used to direct the in vitro synthesis of an mRNA pool, using T7 RNA polymerase [24]. For each dsDNA, transcription is initiated at word, Pro, which encodes a T7 promoter. Each mRNA transcript (see Fig. 3(b)) is synthesized using the biotinylated strand as the sense strand, and encodes both (1) the r-region, and (2) the triplet code for the walk on GP (V , E ) encoded by the transcribed dsDNA. These two regions are separated by word Ini, which contains a Shine-Dalgarno sequence to direct ribosomal binding. Each mRNA transcript is then modified (Fig. 3(b), process
2) by the addition of a protecting cDNA strand, by the reverse transcriptase-mediated extension of the primer, Q. This strand eliminates undesired secondary structure, protects the mRNA transcript’s r-region from specific interaction during affinity-based selection, and copies the r-region into the cDNA protecting strand, which may be conveniently recovered following selection. Next, a short ssDNA strand containing a 5’ puromycin (terminal p), an antibiotic which inhibits translation by mimicking the aminoacyl end of tRNA, is attached to each mRNA’s 3’ end, by splint-mediated ligation (Fig. 3(b), process 3). This group will serve to direct the synthesis of an RNA-protein fusion (RPF) [11] during normal translation.
Fig. 3. Fitness Evaluation and Selection (see text for details): (a) Transcription; (b) 3’ puromycin attachment; (c) Translation and RNA-protein fusion; (d) Selection by affinity to a bound motif; (e) cDNA recovery and reinitialization by PCR.
Translation of the mRNA transcript set (Fig. 3(c), process 4) is accomplished using the E. coli in vitro translation system [29]. For each transcript, translation is initiated by ribosomal binding at the Shine-Dalgarno sequence [25] encoded by word, Ini. In the normal course of translation, which proceeds 5’→3’ along the mRNA, the ribosome then pauses at the RNA-DNA interface, allowing the 3’ puromycin to enter the ribosome’s A-site, and be covalently attached by the ribosome to the C-terminus of the nascent peptide, forming a stable RNA-protein fusion (Fig. 3(c), process 5). This process essentially results in the physical linkage of the encoded protein’s genotype and phenotype. The generated protein consists of a pseudomodule sequence, punctuated by short glycine coils (Gly4 ) generated by the embedded target sequences, X. This minor increase in the length of the coils at pseudomodule boundaries is expected to have a negligible impact on the structure and function of the folded protein. In Fig. 3(c), the protein encoded by the path, V1 → V2 → V3 has been expressed as a 3-helix bundle. The RPF library is then subjected to a fitness evaluation and selection procedure (also referred to as screening), based on affinity to an immobilized selection motif (Fig. 3(d), process 6). This motif may be a nucleic acid, protein, or small molecule of interest, such as a transition state analog (TSA) [30]. RPF formation is reported to have negligible impact on the binding specificity of the nascent protein [12]. Following denaturation and purification of the attached set of cDNAs from the selected RPF set (Fig. 3, process 7), the purified cDNA set is PCR amplified using the biotinylated primer, ProP and the 5’ digoxigenin-conjugated primer, IniQ. After amplification, which also implements point mutation (albeit somewhat problematically (c.f., Sec. 5)), the resulting dsDNA set is denatured, and the set of digoxigenin-conjugated strands are removed by (anti-dig) antibodies [27]. 
The remaining strands comprise a biotinylated ssDNA library, each element of which encodes a selected PWPCR program for generating a walk on the pseudo-module graph, GP (V , E ).

3.4 Restriction Enzyme-Based Recombination
Prior to recombination, a subset of pseudomodule-encoding vertices in GP (V , E ), at which crossover is to be implemented, should be identified. Single-point, two-parent crossover will then be performed on a subset of the biotinylated ssDNA library, extracted separately for each selected vertex, Vi , by application of the following elementary crossover procedure. For clarity, this basic crossover operation is illustrated in Fig. 4 for crossover implemented at V2 , between a strand encoding the path, V1 → V2 → V3 and a second strand (not shown), which encodes a path of the form, Vj → V2 → V4 , where Vj is some vertex in GP (V , E ) other than V2 . The first step (Fig. 4(a), process 1) is the extraction of a subset of the strands in the ssDNA library which encode a walk that visits Vi , by affinity extraction using the digoxigenin-conjugated primer, ai (a2 , in Fig. 4(a)). Crossover point formation within the codeword string, ai ZW ai of each strand is then initiated by annealing of primer ai W Zai (Fig. 4(b), process 2). Restriction with BseDI at the restriction site formed by this process at W Z (crossed box) (Fig. 4(b), process 3), then results in the division of the strand into a pair
of restriction fragments, where the biotinylated fragment encodes a partial path ending at vertex i, while the unbiotinylated fragment encodes a partial path beginning at vertex i. The low temperature annealing and ligation of random pairs of biotinylated and unbiotinylated restriction fragments then implements the desired 2-parent, single-point crossover operation at V2 . The viability of each progeny strand (in the sense of encoding a valid, connected walk on GP (V , E )) is ensured by the representation of each parent strand (c.f., Sec. 3.1). In particular, the assembly process ensures that the order of the edges traversed by an encoded walk is identical to the physical ordering (i.e., 5’ to 3’) of the PWPCR rules encoded for executing the traversal, on the strand. Recombination at Vi is completed by strand anchorage, followed by a wash. This elementary crossover operation is to be repeated on a separate extract removed from the ssDNA library for each selected vertex in GP (V , E ), or until the library is exhausted. The merging of all recombined extracts into a single tube completes recombination, and the first round of the genetic program. This process of in vitro evolution should be repeated until a protein having the desired affinity is produced, as determined during fitness-evaluation and selection (screening), or until some predetermined maximum number of rounds have been executed.
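Abstracting strands to their encoded vertex walks, the elementary crossover procedure amounts to cutting each extracted strand at Vi and re-ligating random prefix/suffix pairs. A sketch under that abstraction (all names invented for illustration):

```python
# Sketch of single-point, two-parent crossover at vertex vi (Sec. 3.4).
# Restriction yields a biotinylated prefix ending at vi and an unbiotinylated
# suffix beginning at vi; random re-ligation recombines the population.
import random

def crossover_at(walks, vi, rng):
    visiting = [w for w in walks if vi in w]        # affinity extraction at a_i
    prefixes = [w[: w.index(vi) + 1] for w in visiting]
    suffixes = [w[w.index(vi):] for w in visiting]
    rng.shuffle(suffixes)
    # re-ligation: join a prefix ending at vi to a suffix starting at vi
    return [p[:-1] + s for p, s in zip(prefixes, suffixes)]

rng = random.Random(0)
parents = [["V1", "V2", "V3"], ["V5", "V2", "V4"]]
children = crossover_at(parents, "V2", rng)
# Every child is a connected walk through V2 (e.g. ['V1', 'V2', 'V4']).
```

As in the paper, viability of the progeny follows from the representation: prefixes always end at the crossover vertex and suffixes always begin there.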
Fig. 4. Restriction site-based recombination of a PWPCR strand encoding the path, V1 → V2 → V3 , with a strand encoding a path of the form, Vj → V2 → V4 , where Vj is some vertex in GP (V , E ) other than V2 . The operation is implemented at the common vertex, V2 (see text for details).
4 The Efficiency of PWPCR-based Library Generation
The model of PWPCR efficiency reported in [8], which treats the extension of each strand as a Markov chain, and estimates the probability of extension during each effective encounter with DNA polymerase, was applied to estimate the efficiency of library generation by PWPCR. The adopted word lengths were: |xi | = 42 nucleotides (nt) and |X| = |Y | = 12 nt, for a uniform module length of 18 residues; |P | = |Q| = 18 nt; |W | = |Z| = 3 nt; |ai | = 10 nt; |T | = 15 nt; |Ini| = 42 nt. The following modifications were also adopted (values applied in [8] listed in parentheses): mean pseudomodule codeword GC content = 0.50 (8/15); [Mg2+ ] = 1.0 mM (1.5 mM); PNA target sequence, X = [G2 A]4 (A2 GA2 GA4 ). In [8], the loop lengths of extendable hairpins were modelled by a mean value, which increased as a function of extension number, r. Here, the 5’ to 3’ ordering of each strand’s set of rules for graph traversal allowed the explicit calculation of a loop length for each extendable hairpin, which decreases with r. This is significant, because the extension efficiency is a low-order polynomial of the ratio of extendable and backhybridized (i.e., previous) loop sizes.
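A back-of-envelope companion to this analysis (not the Markov-chain model of [8], whose loop-size dependence is the point of this section): if each effective polymerase encounter extends a waiting strand with probability ε, and each strand receives n encounters per thermal cycle, the per-strand completion fraction follows directly.

```python
# Toy completion-efficiency estimate: a q-edge program needs q + 2 extensions
# (initiation, q transitions, termination), each completing within a cycle
# with probability 1 - (1 - eps)**n_enc. The eps and n_enc values below are
# the orders of magnitude quoted in the text, not fitted model outputs.
def completion_fraction(q, eps=1e-2, n_enc=543):
    per_cycle = 1.0 - (1.0 - eps) ** n_enc
    return per_cycle ** (q + 2)

print(f"{completion_fraction(20):.2f}")  # -> 0.91 (the full model predicts 0.84)
```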
Fig. 5. The efficiency of PWPCR-based library generation. (a) Efficiency per strand-polymerase encounter, εr vs. extension process, r and Trx ; (b) completion efficiency per strand vs. path length, q (in edges).

Fig. 5(a) illustrates the efficiency per effective polymerase encounter, εr , predicted for a q = 20 implementation (22 extensions), as a function of r and Trx . The first extension (data not shown) is predicted to proceed with near-unity efficiency, at Trx < 71 ◦C. The predicted optimal Trx is nearly constant for r > 1, and ranges from 76.5 ◦C (r = 2) to 77.8 ◦C (r = 22). For each extension, 1 < r ≤ 22, application of the corresponding optimal Trx is predicted to result in an efficiency of εr ≈ 10−2 . Use of a constant Trx = 77.6 ◦C is predicted to maintain an efficiency of at least 82% of the optimal value, for all r > 1. The simple model of PWPCR kinetics employed in [8] was then applied to estimate the characteristic relaxation time to equilibrium, τ , for PWPCR hairpins. In particular, the predictions of an equilibrium model can be expected to be valid only under conditions for which 10τ < tenc , where tenc is the mean time
between successive polymerase encounters for a given strand. At Trx = 77.6 ◦C, predicted relaxation times range from 0.67 sec (r = 2) to 0.052 sec (r = 22), where the variation is due to the decreasing loop size of hairpins implementing successive extensions. The conditions suggested in [8] (27 units of Taq DNA polymerase, with a polymerization cycle time of 60 min), which implement a total of approximately 543 effective encounters/(strand cycle), yielding a mean time between encounters of tenc ≈ 6.7 sec, are therefore predicted to just meet the requirement for equilibrium. Fig. 5(b) illustrates the estimated completion efficiency per strand, χ(q), as a function of encoded path length, q (in edges) for a combinatorial PWPCR mixture of 1.2 × 1013 strands, under the buffer conditions of [8]. Simulations were performed assuming a uniform reaction temperature of Trx = 77.6 ◦C, which was predicted to be within 1 C◦ of the optimal Trx for each value of q. As shown, a q = 20 implementation is predicted to complete 84% of the encoded strands, a substantial improvement over [8].
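The quoted timing numbers can be checked directly (a sanity-check sketch; the 60-min cycle and ≈543 encounters per strand per cycle are the conditions cited from [8]):

```python
# Mean time between effective polymerase encounters, and the equilibrium
# criterion 10*tau < t_enc discussed above.
cycle_time_s = 60 * 60            # one polymerization cycle (60 min), seconds
encounters_per_cycle = 543        # effective encounters per strand per cycle
t_enc = cycle_time_s / encounters_per_cycle
print(round(t_enc, 1))            # -> 6.6 (quoted as ~6.7 sec in the text)

tau_slowest = 0.67                # slowest predicted relaxation time (r = 2)
print(round(10 * tau_slowest / t_enc, 2))  # -> 1.01: the criterion is only just met
```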
5 Practical Considerations
Based on the simulations presented in Sec. 4, the current method is predicted to be an efficient method of producing massively parallel, recombined gene libraries encoding for proteins shorter than about 600 residues, which compares favorably with the mean protein length of about 350 residues [4]. Although each round of crossover is certain to generate PWPCR strands encoding longer paths, the subset of recombined strands which encodes paths above a fixed threshold length can be removed by size fractionation [27]. The near constancy of the optimal Trx predicted with variations in q also indicates that a single Trx can be selected to implement PWPCR with near uniform efficiency, variations in strand length notwithstanding. A related and critical issue is the requirement that the thermal stabilities of the set of pseudomodule-encoding DNA words be both approximately uniform, and substantially lower than that of the adjacent PNA2 /DNA triplexes. The uniformity of pseudomodule codeword thermal stability can be ensured by exploiting the degeneracy in the 3rd nucleotide, inherent in the triplet code. Furthermore, as the melting transition of the 10 nt, 20% G/C PNA2 /DNA triplex is reported to occur in a very narrow (5 C◦ in width) temperature range, centered at 85 ◦C, use of the suggested Trx = 77.6 ◦C should result in negligible melting of the more stable (12 nt, 67% G/C) triplexes adopted for the current implementation. As pointed out in Sec. 3.3, the current architecture effectively implements point mutation during both PCR and PWPCR. Each point mutation, however, also results in the divergence of the mutated codeword and its matching, unmutated partner, on the strand. Due to the high stability reported for single mismatches in a longer Watson-Crick sequence [31], however, this effect should have a negligible impact on PWPCR efficiency, given modest mutation rates. 
The present implementation does not enforce the assembly or evolution of strands which encode for deterministic PWPCR (i.e., so that each strand encodes
for exactly 1 completed protein). If desired, determinism may be enforced in SAT Engine-like style [32], by the codeword assignment, āj = aj , ∀ j ∈ {1, · · · , q + 1}, combined with the use of a fully palindromic restriction site for BseDI (e.g., C↓CTAGG). Hairpins which encode multiple visits to any vertex, Vj will then form very stable hairpin loops at self-complementary words, aj ZW aj , which are then subject to digestion, prior to each round of PWPCR. The downside of this procedure is an increased potential for kinetic trapping in PWPCR hairpins, and a decreased recombination efficiency due to the potential for ligation of biotinylated pairs and unbiotinylated pairs of restriction fragments. Another important issue is the impact on protein structure and function of the Gly4 coils inserted between pseudomodule pairs. Given that these pairs are assumed by the basic model to be punctuated by turns, and glycine is a helix-breaker, the impact of Gly4 insertion on 2◦ structure is likely to be negligible, in general. The impact on 3◦ structure and function appears to be a more complex issue. For coils which primarily provide sufficient flexibility for a turn, a modest length increase is unlikely to impact overall protein functionality. For coils which contain an active residue, on the other hand, both length and placement within the coil may assume greater importance, and will have to be carefully considered during the decomposition of the parent protein into a pseudomodule graph.
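In silico, the determinism condition is simply that a strand's encoded walk visits no vertex twice; the proposed hairpin-digestion step acts as a physical filter for exactly this property. A sketch of the equivalent filter (names invented for illustration):

```python
# Filter mimicking the proposed digestion of nondeterministic strands:
# a walk that revisits any vertex would fold into a stable hairpin at the
# self-complementary anchor words and be digested before the next round.
def is_deterministic(walk):
    return len(set(walk)) == len(walk)      # no repeated vertex visits

library = [["V1", "V2", "V3"], ["V1", "V2", "V1"], ["V2", "V4"]]
survivors = [w for w in library if is_deterministic(w)]
print(survivors)  # -> [['V1', 'V2', 'V3'], ['V2', 'V4']]
```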
Acknowledgements The authors would like to thank M. Hagiya and K. Sakamoto of the University of Tokyo for helpful discussions, and S. Suzuki for help during preparation of the figures. Support provided by a JSPS Postdoctoral Fellowship and Grant-in-Aid, and by JST CREST.
References

1. Stemmer, W., “Rapid evolution of a protein in vitro by DNA shuffling”, Nature 370, 389 (1994).
2. Crameri, A., et al., “DNA shuffling of a family of genes from diverse species accelerates directed evolution”, Nature 391, 288 (1998).
3. Bogarad, L. and Deem, M., “A hierarchical approach to protein molecular evolution”, Proc. Natl. Acad. Sci. USA 96, 2591 (1999).
4. Doolittle, R., “The multiplicity of domains in proteins”, Annu. Rev. Biochem. 64, 287 (1995).
5. Kolkman, J. and Stemmer, W., “Directed evolution of proteins by exon shuffling”, Nat. Biotech. 19, 423 (2001).
6. Kaplan, P., et al., “Parallel overlap assembly for the construction of computational DNA libraries”, J. Theor. Biol. 188, 333 (1997).
7. Rose, J., The Fidelity of DNA Computation, Ph.D. Thesis, The University of Memphis, 155 (1999).
8. Rose, J., Deaton, R., Hagiya, M., and Suyama, A., “Equilibrium analysis of the efficiency of an autonomous molecular computer”, Phys. Rev. E 65, Article 021910 (2002); also in V. J. Biol. Phys. 3(2), (2002).
9. Rose, J., Deaton, R., Hagiya, M., and Suyama, A., “PNA-mediated whiplash PCR”, in Jonaska, N. and Seeman, N. (Eds.), DNA Computing (LNCS, Vol. 2340, Springer-Verlag, Berlin, 2002), 104.
10. Nemoto, N., Miyamoto-Sato, E., Husimi, Y., and Yanagawa, H., “In vitro virus: Bonding of mRNA bearing puromycin at the 3’-terminal end to the C-terminal end of its encoded protein on the ribosome in vitro”, FEBS Lett. 414, 405 (1997).
11. Roberts, R. and Szostak, J., “RNA-Peptide fusions for the in vitro selection of peptides and proteins”, Proc. Natl. Acad. Sci. USA 94, 12297 (1997).
12. Liu, R., Barrick, J., Szostak, J., and Roberts, R., “Optimized synthesis of RNA-Protein fusions for in vitro protein selection”, Methods Enzymol. 318, 268 (2000).
13. Sakamoto, K., et al., “State transitions by molecules”, BioSystems 52, 81 (1999).
14. Rose, J., Hagiya, M., Deaton, R., and Suyama, A., “A DNA-based in vitro genetic program”, J. Biol. Phys. 28, 493 (2002).
15. Wood, D. H., et al., “DNA starts to learn poker”, in Jonaska, N. and Seeman, N. (Eds.), DNA Computing (LNCS, Vol. 2340, Springer-Verlag, Berlin, 2002), 92.
16. Go, M., “Correlation of DNA exonic regions with protein structural units in Haemoglobin”, Nature 291, 90 (1981).
17. Go, M. and Nosaka, M., “Protein architecture and the origin of introns”, Cold Spring Harbor Symp. Quant. Biol. 52, 915 (1987).
18. de Souza, S., et al., “Intron positions correlate with module boundaries in ancient proteins”, Proc. Natl. Acad. Sci. USA, 14632 (1996).
19. Gilbert, W., “Why genes in pieces?”, Nature 271, 501 (1978).
20. Inaba, K., et al., “Structural and functional roles of modules in Hemoglobin”, J. Biol. Chem. 272, 30054-30060 (1997).
21. Inaba, K., Ishimori, K., Imai, K., and Morishima, I., “Structural and functional effects of pseudo-module substitution in Hemoglobin subunits”, J. Biol. Chem. 273, 8080 (1998).
22. REBase restriction enzyme database, http://rebase.neb.com.
23. Giovannangeli, C. and Helene, C., “Triplex-forming molecules for modulation of DNA information processing”, Curr. Opin. Molec. Ther. 2, 288 (2000).
24. Milligan, J. and Uhlenbeck, O., “Synthesis of small RNAs using T7 RNA polymerase”, Methods Enzymol. 180 (Acad. Press, 1989), 51.
25. Kozak, M., “Comparison of the initiation of protein synthesis in prokaryotes, eukaryotes, and organelles”, Microbiol. Rev. 47, 1 (1983).
26. Adleman, L., “Molecular computation of solutions to combinatorial problems”, Science 266, 1021 (1994).
27. Sambrook, J., Fritsch, E., and Maniatis, T., Molecular Cloning: A Laboratory Manual, 3rd Ed. (Cold Spring Harbor Press, Cold Spring Harbor, New York, 2001).
28. Kuhn, H., et al., “Kinetic sequence discrimination of cationic bis-PNAs upon targeting of double-stranded DNA”, Nuc. Acids Res. 26, 582 (1998).
29. Ellman, J., et al., “Biosynthetic method for introducing unnatural amino acids site-specifically into proteins”, Methods Enzymol. 202, 301 (1991).
30. Griffiths, A. and Tawfik, D., “Man-made enzymes-from design to in vitro compartmentalisation”, Curr. Opin. Biotech. 11, 338 (2000).
31. Peyret, P., et al., “Nearest-neighbor thermodynamics and NMR of DNA sequences with internal A·A, C·C and T·T mismatches”, Biochemistry 38, 3468 (1999).
32. Sakamoto, K., et al., “Molecular computation by DNA hairpin formation”, Science 288, 1223 (2000).
Engineering Signal Processing in Cells: Towards Molecular Concentration Band Detection Subhayu Basu, David Karig, and Ron Weiss Princeton University, Princeton, New Jersey {basu, dkarig, rweiss}@princeton.edu
Abstract. We seek to couple protein-ligand interactions with synthetic gene networks in order to equip cells with the ability to process internal and environmental information in novel ways. In this paper, we propose and analyze a new genetic signal processing circuit that can be configured to detect various chemical concentration ranges of ligand molecules. These molecules freely diffuse from the environment into the cell. The circuit detects acyl-homoserine lactone ligand molecules, determines if the molecular concentration falls between two prespecified thresholds, and reports the outcome with a fluorescent protein. In the analysis of the circuit and the description of preliminary experimental results, we demonstrate how to adjust the concentration band thresholds by altering the kinetic properties of specific genetic elements, such as ribosome binding site efficiencies or DNA-binding protein affinities to their operators.
1 Introduction
Cells are complex information processing units that respond in highly sensitive ways to environmental and internal signals. Examples include the movement of bacteria toward higher concentrations of nutrients through the process of chemotaxis, detection of photons by retinal cells and subsequent conversion to bioelectrical nerve signals, release of fuel molecules due to hormones that signal hunger, coordinated secretion of virulence factors and degradative enzymes by bacterial cells using quorum sensing molecules, and cell differentiation based on signal gradients. We strive to engineer cells that process internal and environmental information in novel ways by integrating protein-ligand interactions with synthetic gene networks. Applications include patterned biomaterial fabrication, embedded intelligence in materials, multi-cellular coordinated environmental sensing and effecting, and programmed therapeutics. These applications require synthesis of sophisticated and reliable cell behaviors that instruct cells to process information and make complex decisions based on factors such as extra-cellular conditions and current cell state. In this paper, we propose a new genetic signal processing circuit for detecting tunable ranges of chemical concentrations of ligand molecules that freely diffuse into the cell. The signal processing is performed in engineered Escherichia coli hosts to detect acyl-homoserine lactone (acyl-HSL) quorum sensing molecules.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 61–72, 2003. © Springer-Verlag Berlin Heidelberg 2003
Fig. 1. Gene Network for a chemical concentration band detector
The underlying mechanisms rely on ligand molecules binding to cytoplasmic proteins and the binding of cytoplasmic proteins to DNA segments that regulate the expression of other proteins. The genetic circuit consists of components that detect the level of acyl-HSL, determine if the level is within the range of two prespecified thresholds, and report the outcome with a fluorescent protein (Figure 1). The acyl-HSL signal is produced either by naturally occurring organisms [5] or by bacterial hosts engineered to secrete the molecules in order to perform cell-cell communications [12]. The acyl-HSL freely diffuses from the environment into cells and binds to the LuxR cytoplasmic protein, enabling the protein to form a dimer. In turn, the dimer complex binds to the DNA and activates the lux P(R) promoter. The activated transcription results in the expression of regulatory proteins (X and Y) that control two series of downstream promoters. Here, DNA-binding repressor proteins X, Y, W, and Z, and their promoter counterparts are carefully chosen or engineered to have desired kinetic properties. The sub-circuit originating from the X protein determines the lower threshold of the acceptable acyl-HSL range, while the sub-circuit originating from the Y protein determines the high threshold. If the acyl-HSL chemical concentration falls within the range, the reporter green fluorescent protein (GFP) is expressed at high levels and can be detected externally. In the analysis of the circuit and the description of preliminary experimental results, we demonstrate how to adjust these thresholds by altering the kinetic properties of specific genetic elements, such as the ribosome binding site (RBS) efficiencies or the DNA-binding protein affinities to their operators. The output of the band detection circuit can be coupled to other genetic circuits and regulatory responses. 
This circuit is therefore useful for monitoring many protein-ligand interactions that affect gene regulation and can serve as a modular component for a variety of signal processing tasks. For example, it can be used in cell-cell communications systems, in the detection of chemical gradients, and in synthetic cell aggregate systems that gather, process, and respond to environmental signals spanning large-scale areas. In the remainder of the paper, we describe relevant work and background (Section 2), introduce the design of the band detection circuit (Section 3), analyze the ability to modify band thresholds (Section 4), report on preliminary
experimental results that demonstrate the functioning of certain sub-components of this circuit (Section 5), and offer conclusions and a discussion of the issues in implementing the full circuit (Section 6).

Fig. 2. Genetic circuit to measure the device physics of an R3 /P3 cellular gate: digital logic circuit and the genetic regulatory network (Px : promoters, Rx : repressors, CFP/YFP: reporters)
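The intended band behavior can be sketched at steady state with Hill functions, under one plausible reading of Figure 1: the low branch inverts the HSL-driven signal once, the high branch inverts it twice, and GFP is expressed only when both resulting repressors are low. All thresholds, Hill coefficients, and the exact wiring below are invented for illustration, not taken from the experimental constructs.

```python
def act(s, K, n=2):                    # activating Hill function
    return s**n / (K**n + s**n)

def rep(s, K, n=2):                    # repressing Hill function
    return K**n / (K**n + s**n)

def gfp(hsl):
    x = act(hsl, K=0.3)                # low-threshold branch input
    y = act(hsl, K=10.0)               # high-threshold branch input
    z1 = rep(x, K=0.3)                 # one inversion: Z1 high when HSL is low
    z2 = rep(rep(y, K=0.3), K=0.3)     # two inversions: Z2 high when HSL is high
    return rep(z1 + z2, K=0.3)         # GFP repressed by both Z1 and Z2

low, mid, high = gfp(0.05), gfp(2.0), gfp(50.0)
assert mid > 5 * low and mid > 5 * high  # fluorescence only inside the band
```

Raising the low-branch K or lowering the high-branch K narrows the band, which is the tuning knob the paper implements via RBS efficiencies and operator affinities.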
2 Background

2.1 Quorum Sensing
Quorum sensing enables coordinated behavior among bacteria [2]. Specifically, acyl-homoserine lactones (acyl-HSLs) diffuse freely through cell walls and serve as intercellular communication signals. Significant accumulation of acyl-HSL results in interaction of this signal chemical with specific DNA-binding R-proteins. This bound complex then activates transcription of a certain gene or sets of genes. Synthesis of acyl-HSLs is mediated by specific I-genes. As an example, the quorum sensing system in Vibrio fischeri, which grows in a symbiotic relationship with sea organisms such as the Hawaiian sepiolid squid, regulates density-dependent bioluminescence. In this system, the luxI gene codes for production of LuxI, which is responsible for synthesis of 3-oxohexanoyl-homoserine lactone (3OC6HSL). The luxR gene codes for the LuxR protein, which binds to accumulated 3OC6HSL to activate gene transcription. Previously, we successfully transferred this quorum sensing mechanism to E. coli hosts for use in engineered cell-cell communications [12].

2.2 Synthetic Gene Networks
Other recent projects have also experimentally demonstrated forward-engineered genetic regulatory networks that perform specific tasks in cells. Becskei’s autorepressive construct[3] is a single gene that negatively regulates itself to achieve
Subhayu Basu, David Karig, and Ron Weiss
a more stable output. Gardner's toggle switch[6] is a genetic system in which two proteins negatively regulate the synthesis of one another. This system is bistable, and sufficiently large perturbations can switch the state of the system. Elowitz's repressilator[4] is a genetic system in which three proteins arranged in a ring negatively repress each other. This system oscillates between low and high values. For the above systems, the analysis and experimental results reveal that the genetic components must be matched to achieve correct system operation, as also discussed in [11].

2.3 Device Physics of Genetic Circuit Components
We have previously defined genetic process engineering as a method for genetically altering system components until their device physics are properly matched. These components can then be combined into more complex circuits that achieve the desired behavior. For example, ribosome binding sites (RBSs) can be mutated to alter the rate of translation of mRNA into protein. We constructed genetic circuits to measure the device physics of cellular gates[10], as shown in Figure 2. In one instance of this network, the R2/P2 component consists of a lacI/p(lac) gate and the R3/P3 component consists of a cI/λP(R) gate. In this case, the level of the IPTG inducer molecule (I2) controls the level of the cI input repressor. With matching gates, the logic interconnect of this circuit should result in YFP fluorescence intensities that are inversely correlated with the IPTG input levels. The lowest curve in Figure 2 (RBS R1) shows the transfer curve of a cI/λP(R) inverter prior to genetic process engineering. The transfer curve relates IPTG input concentrations to the median fluorescence of cells grown in a culture at these IPTG concentrations. The cells were grown for several hours in log phase until protein expression reached a steady state. The unmodified cI/λP(R) circuit responds weakly to variations in the IPTG inducer levels. Genetic process engineering was used to modify the cI/λP(R) inverter to obtain improved behavioral characteristics. RBS sequences significantly control the rate of translation from messenger RNA (mRNA) molecules to the proteins for which they code. We replaced the original highly efficient RBS of cI with a weaker RBS by site-directed mutagenesis and were able to noticeably improve the response of the circuit (Figure 2, RBS R2).
In further genetic process engineering (Figure 2, RBS R2/OpMut4), a one-base-pair mutation to the cI operator site of λP(R) yields a circuit with an improved inverse sigmoidal response to the IPTG signal. The signal processing circuits described in Section 3 are assembled by combining multiple genetic components with matching characteristics into compound circuits. To effectively synthesize these compound circuits, the device physics of circuit components can be engineered using site-directed mutagenesis and molecular evolution techniques. Note that the genetic constructs for digital logic inverters described in this section are also used for the low threshold component of the band detection circuit.
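The effect of weakening an RBS on an inverter's transfer curve can be sketched with a simple Hill-repression model. This is a minimal illustration only: the Hill parameters and RBS efficiencies below are assumed values, not the measured constants from [10].

```python
# Hedged sketch: a Hill-type model of a repressor-based inverter transfer curve.
# All parameters (beta, K, n, RBS efficiencies) are illustrative assumptions.

def inverter_output(repressor, beta=100.0, K=1.0, n=2.0):
    """Steady-state output of a repressor-controlled promoter (Hill repression)."""
    return beta / (1.0 + (repressor / K) ** n)

def transfer_curve(inputs, rbs_efficiency):
    """Map inducer input levels to outputs; the RBS efficiency scales how much
    repressor is translated per unit of input."""
    return [inverter_output(rbs_efficiency * x) for x in inputs]

inputs = [0.01, 0.1, 1.0, 10.0]
strong = transfer_curve(inputs, rbs_efficiency=1.0)  # original, highly efficient RBS
weak = transfer_curve(inputs, rbs_efficiency=0.2)    # weakened RBS (site-directed mutagenesis)
# With the weaker RBS, less repressor is made per unit of input, so the
# transition of the inverse sigmoid shifts toward higher input levels.
```

In this toy model, reducing the RBS efficiency stretches the inverter's usable input range, qualitatively mirroring the improvement reported for the RBS R2 mutant.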
Fig. 3. Low threshold and high threshold sub-circuits: (a) low threshold (mRNA concentrations of XY and Z1 versus [HSL]); (b) high threshold (mRNA concentrations of XY, W, and Z2 versus [HSL])
3 Design of a Chemical Concentration Band Detection Circuit
The proposed construction for a chemical concentration band detector circuit, shown in Figure 1, expresses high levels of the reporter gene (GFP) only when the concentration of the acyl-HSL signal is within a specific range. The genetic circuit consists of three subcircuits: a low threshold detector, a high threshold detector, and a negating combiner. The series of transcriptional regulators that originates from the repressor protein X determines the lower threshold. When acyl-HSL binds to the R-protein, the molecular complex activates transcription of protein X. In turn, high concentrations of protein X repress the production of protein Z from the P(X) promoter (labeled as Z1 because protein Z can also be expressed from the P(W) promoter). Figure 3(a) shows a simulation of the steady state response of P(X) transcription of protein Z with respect to varying HSL concentrations. The simulations illustrate the average messenger RNA (mRNA) levels for X and Z1 in a cell that contains the band detection circuit, with all units denoting µmolar concentrations. As shown, mRNAZ1 is high only if the level of HSL is low. Notice that P(R) is multi-cistronic, coding for both X and Y, which are expressed from mRNAXY. However, the translation rates of X and Y differ because the carefully chosen RBS efficiencies for the respective proteins differ. In [10] and Section 2, we demonstrate the ability to quantitatively modify the transfer curve characteristics of genetic components by using a variety of RBSs. The second subcircuit, which determines the high threshold, consists of transcriptional regulators that originate from protein Y. Figure 3(b) shows the steady state relationship between HSL input levels and mRNA levels for XY, W, and Z2. The RBS for Y is weaker than the RBS for X; therefore, Y is produced at lower rates than X for any given level of HSL (Figure 4(a)).
As a result of the lower rate of protein Y expression, protein W is still highly expressed in response to high HSL levels when P(X) is already inactivated. This allows P(W) to act
as the high threshold detector by transcribing Z only when the HSL input is above a certain level. The last subcircuit combines the output of the two preceding subcircuits and negates their total product to produce the band detector. Figure 4(b) illustrates how the level of Z in the cell is determined by both X and W. Logically, Z is high when either X or W is low (Z = X NAND W); the output protein Z functions as a band-reject circuit. Finally, Figure 5(a) shows how the combined level of Z from transcription of both P(X) and P(W) represses the final GFP output, resulting in a band detector. By tuning reaction kinetics such as RBS efficiencies, protein decay rates, and protein-operator affinities, this synthetic gene network can be configured to respond to different signal ranges. For example, Figure 5(b) shows how to modify the band detector to accept a lower threshold by choosing ribosome binding sites for proteins X and Y that are three-fold more efficient.

Fig. 4. Determinants of the band detector: (a) X versus Y expression; (b) X and W determine total Z

Fig. 5. Different band detectors from negating Z: (a) original band detector; (b) modified band detector

The simulation results described here were obtained by integrating ordinary differential equations that describe the biochemical reactions of the band detector circuit. Kinetic rates published in the literature for various components, as well as educated guesses[11], were used in the simulations. The simulations demonstrate the qualitative effects of modifying genetic components in the forward engineering of synthetic gene networks. In the graphs, the nonlinear response of the components is due to repressor protein dimerization and the existence of multiple operator sites. We have used both of these common genetic regulatory motifs in previous experiments[10]. The next section analyzes the ability to change band thresholds by modifying various kinetic parameters of the circuit components.

Fig. 6. Impact of modifying X and Y RBS efficiencies on band midpoint and width: (a) HSL midpoint of the band detector; (b) HSL width of the band detector
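The qualitative steady-state behavior of the three subcircuits can be sketched with Hill-type rate laws. This is a minimal illustration: the Hill parameters and the relative RBS efficiency of Y below are assumed values, not the kinetic constants used in the paper's ODE simulations.

```python
# Steady-state sketch of the band detection circuit (illustrative parameters).
# HSL activates X and Y (Y from a weaker RBS); X represses Z1, Y represses W,
# W represses Z2; total Z = Z1 + Z2 represses the GFP output.

def act(s, beta=10.0, K=1.0, n=2):
    """Hill activation (HSL-bound R-protein activating transcription)."""
    return beta * s**n / (K**n + s**n)

def rep(r, beta=10.0, K=1.0, n=2):
    """Hill repression."""
    return beta / (1.0 + (r / K)**n)

def gfp(hsl):
    x = act(hsl)              # protein X (efficient RBS)
    y = 0.3 * act(hsl)        # protein Y (weaker RBS, lower translation rate)
    w = rep(y)                # W is repressed by Y
    z = rep(x) + rep(w)       # Z = Z1 (low threshold) + Z2 (high threshold)
    return rep(z, beta=20.0)  # Z represses the GFP output

low, mid, high = gfp(0.1), gfp(1.0), gfp(10.0)
# GFP should peak for intermediate HSL levels: band detection.
```

At low HSL, Z1 is high; at high HSL, Z2 is high; only in between is total Z low enough for GFP to be expressed, reproducing the band shape of Figure 5(a) qualitatively.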
4 Forward Engineering of Band Characteristics
This section examines forward engineering of band characteristics through the modification of two genetic regulatory elements: RBSs and repressor/operator affinities. Two key characteristics of a band detection circuit are the midpoint of its detection range and the width of this range. The width is defined as the distance between the low and high cutoff points. A cutoff point is the HSL level at which the output of the band detector switches between what a downstream circuit considers high, or detectable, and what it considers low, or undetectable. For the analysis below, we define the cutoffs as the HSL input levels at which the output GFP concentration is 0.3 µM. We assume that GFP concentrations above this threshold can be reliably detected by flow cytometry. Figure 6 depicts the effects of simultaneously changing the RBS efficiencies of X (kxlate-X) and Y (kxlate-Y) on the midpoint and width of the band detector. The x and y axes represent the various RBS efficiencies, while the z axis depicts the HSL midpoint and HSL band width. As the RBS efficiency of protein X increases, a given HSL level results in additional translation of protein X. Thus, less mRNAZ1 is transcribed, and the low threshold component of the Z protein
curve shifts left. This shift increases the band width and moves the midpoint left. As the RBS efficiency for Y increases, the high threshold component of the Z protein curve also shifts left, causing the band width to decrease and the midpoint to move left. The impact of simultaneously altering the binding affinities of the X2 and Y2 dimer proteins to their respective promoters is illustrated in Figure 7. If an X2 dimer binds more readily to the promoter responsible for transcription of mRNAZ1, the low threshold component of the Z protein curve shifts left. This shift in turn causes the width of the band to increase and moves the midpoint to the left. As the strength of Y repression increases, the high threshold component of the Z protein curve shifts left, decreasing the band width and moving the midpoint left. A recurring trend in this analysis is that changes in the RBS efficiency and the repression strength of Y have a greater impact on the shape of the band than comparable changes for X. One reason the constants associated with Y have greater impact is that Y influences the Z protein curve through an additional gain stage resulting from the W repressor. Simulations reveal that if the binding constant of W to its promoter and the RBS efficiency of W are reduced, the impact of Y on the HSL midpoint and band width is also reduced (graphs not shown). Another reason for the discrepancy is that Y controls Z values that correspond to a higher range of the HSL input signal than the Z values controlled by X.

Fig. 7. Impact of modifying X and Y protein/operator binding affinities: (a) HSL midpoint for the band detector; (b) HSL width for the band detector
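The midpoint and width defined at the start of this section can be extracted from any simulated GFP-versus-HSL response by scanning for the 0.3 µM cutoff crossings. This is a sketch: the `band_response` function below is an assumed toy curve, not the paper's ODE output.

```python
# Hedged sketch: extracting band cutoffs, width, and midpoint from a
# simulated GFP response, using the 0.3 uM detectability threshold from the text.
import math

def band_response(hsl, center=1.0, spread=0.6, peak=2.0):
    """Toy band-shaped GFP response, log-normal in HSL (illustrative only)."""
    return peak * math.exp(-((math.log10(hsl) - math.log10(center)) / spread) ** 2)

def band_cutoffs(response, threshold=0.3, lo=-2.0, hi=2.0, steps=4000):
    """Scan HSL on a log grid and return (low_cutoff, high_cutoff): the HSL
    levels where the output crosses the detectability threshold."""
    grid = [10 ** (lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    detectable = [h for h in grid if response(h) >= threshold]
    return detectable[0], detectable[-1]

low_cut, high_cut = band_cutoffs(band_response)
width = high_cut - low_cut          # distance between low and high cutoffs
midpoint = (low_cut + high_cut) / 2  # center of the detection range
```

Sweeping a model parameter (e.g. an RBS efficiency) and recomputing `width` and `midpoint` for each setting reproduces the kind of surfaces plotted in Figures 6 and 7.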
5 Preliminary Experimental Results
In this section, we describe preliminary experimental results that demonstrate the functioning of certain subcircuits for performing band detection. Section 2 describes previous results of a low threshold subcircuit tunable by modifications of ribosome binding site efficiencies and repressor/operator affinities. In the next
two sections, we present new experimental data for detecting a high threshold and for responding to acyl-HSL signals.

Fig. 8. Plasmid maps of pCMB-2 (4397 bp) and pCMB-100 (4940 bp)

5.1 Implementation of a High Threshold Detection Component
To detect molecular concentrations above specific thresholds, we constructed the two plasmids in Figure 8 using standard DNA cloning techniques[1, 8]. The pCMB-2 plasmid contains an ampicillin resistance gene and a medium copy number origin of replication (ColE1 ORI). It also has a tet repressor (TetR) gene coding sequence that is transcribed from the ampicillin p(bla) promoter. TetR regulates the P(LtetO-1) promoter on the pCMB-100 plasmid (kanamycin resistance, p15A medium copy number origin of replication). The lac repressor (lacI) and cyan fluorescent protein (CFP) from Clontech are transcribed from the P(LtetO-1) promoter. CFP reports the lacI levels. When the anhydrotetracycline (aTc) inducer molecule is introduced into the cell, it binds to TetR and prevents TetR from repressing P(LtetO-1). The aTc inducer molecule functions as the external input to the high threshold circuit component. The lacI protein represses P(lac) and regulates the expression of cI, the λ repressor protein[7]. cI is a highly efficient repressor that exhibits dimerization and cooperative binding to its operator. Finally, the λP(R-O12-mut4) promoter[10] on pCMB-2 is repressed by cI and regulates the expression of the enhanced yellow fluorescent protein (EYFP) output of the circuit. Therefore, the relationship between the aTc input and the EYFP output describes the behavior of the high threshold component. To experiment with this circuit, we prepared twelve tubes, each with 2 ml of LB ampicillin/kanamycin solution at a different concentration of aTc. We transformed E. coli STBL2 cells from Invitrogen with both the pCMB-2 and pCMB-100 plasmids and picked a single colony into fresh media. The culture was distributed to the twelve different tubes, which were shaken at 250 RPM at 37°C
for approximately 6 hours until they reached an optical density at 600 nm of approximately 0.3. The cells were washed twice with 0.22 µm filter-sterilized phosphate buffered saline (PBS), and the fluorescence levels of the cells were measured using Fluorescence-Activated Cell Sorting (FACS)[9] on a FACSVantage flow cytometer. The machine has two argon excitation lasers, one set at 458 nm with an emissions filter of 485/22 nm for detecting CFP, and the other set at 514 nm excitation with an emissions filter of 575/26 nm for detecting EYFP. Filters were obtained from Omega Optical.

Fig. 9. High threshold component of the band detector: circuit design and experimental results

Figure 9 shows the median fluorescence level for the cell populations grown with different aTc inducer concentrations. The EYFP exhibits a strong sigmoidal relationship to the aTc input levels, with a sharp transition from low to high output. The results in Figures 9 and 2 demonstrate components that respond specifically to signals bounded by low or high thresholds. However, these subcircuits currently respond to either IPTG or aTc inputs, and their behavior cannot be compared directly. To synthesize an operational band detector, it is likely that we will need to reduce the strength of the response for the high threshold. The next section describes a circuit that responds to acyl-HSL levels.

5.2 Implementation of an Acyl-HSL Detect Circuit
We previously constructed plasmids for performing cell-cell communications[12]: pSND-1, pPROLAR.A122, and pRCV-3. The pSND-1 plasmid, which has a p(Lac) promoter and a luxI gene sequence, is used to produce acyl-HSL. This plasmid has a ColE1 replication origin and ampicillin resistance. The luxI gene encodes an acyl-homoserine lactone synthase that uses highly available metabolic precursors found within most gram-negative prokaryotic bacteria (acyl-ACP from the fatty acid metabolic cycle, and S-adenosylmethionine from the methionine pathway) to synthesize N-(3-oxohexanoyl)-3-amino-dihydro-2-(3H)-furanone, or 3OC6HSL. The pPROLAR.A122 plasmid is used as a negative control and contains only a p15A origin of replication and kanamycin resistance. The pRCV-3 plasmid contains a luxP(R) promoter followed by a GFP(LVA) coding sequence from Clontech, and a luxP(L) promoter followed by
a luxR coding sequence; pRCV-3 is used for quantifying the response to acyl-HSL signals. The results in [12] report on the relationship between acyl-HSL levels and the corresponding GFP fluorescence induced in pRCV-3. For high levels of an acyl-HSL extract, the level of fluorescence was observed to decrease, likely due to the toxicity of the extract. For this paper, we modified the experimental protocol to avoid using the acyl-HSL extract. Cells with pSND-1 plasmids were first grown at 37°C to an optical density of 0.3. These plasmids produced acyl-HSL, which diffused freely through the cell membrane. The cells were centrifuged at 6000 g, and the HSL-containing supernatant was extracted. At the same time, different cells with pPROLAR.A122 were grown at 37°C to an optical density of 0.3, and the supernatant was extracted following centrifugation at 6000 g. The pPROLAR.A122 supernatant was used to dilute the HSL-containing supernatant to various concentrations. This was done to keep the nutrient concentration the same in all of the tubes while varying the HSL concentration. Each dilution was aliquoted into a tube containing 1 ml of fresh LB with ampicillin and kanamycin. We also added ampicillin and kanamycin as appropriate to keep the final concentrations of these antibiotics the same as in the original 1 ml LB/ampicillin/kanamycin solution. A single colony of cells transformed with both the pRCV-3 and pPROLAR.A122 plasmids was selected and grown in the separate tubes at 37°C until the cultures reached an optical density of approximately 0.3. Next, the cells were washed and resuspended in 0.22 µm filter-sterilized PBS, and the fluorescence of the cell population was measured using a FACScan flow cytometer with argon laser excitation at 488 nm and an emissions filter of 530/30 nm.

Fig. 10. Median fluorescence of cell cultures with different levels of HSL
Figure 10 shows the median fluorescence of each of the different samples, which exhibits a direct sigmoidal relationship to increasing HSL concentrations. The cells did not exhibit any decrease in fluorescence as the HSL concentration was increased.
6 Conclusions
In this paper, we propose to couple protein-ligand interactions with synthetic gene networks for detecting molecular signal concentration bands. The analysis of the circuit identifies several factors that can help forward-engineer the circuit to acquire different thresholds for band detection. We also presented preliminary experimental results on the functioning of the low threshold component, the high threshold component, and the ability to respond to acyl-HSL signals. In the effort to build the complete version of this circuit, great emphasis will be placed on the quantitative characterization of the components, including the study of cell population statistics. To arrive at components with the appropriate device physics, we plan to employ site-directed mutagenesis and molecular evolution techniques. The mechanisms employed and lessons learned in building this particular circuit are likely to be beneficial for the broader goal of building a variety of novel signal processing circuits in cells.
References

[1] F. M. Ausubel, R. Brent, R. E. Kingston, D. D. Moore, J. G. Seidman, J. A. Smith, and K. Struhl. Short Protocols in Molecular Biology. Wiley, 1999.
[2] B. L. Bassler. How bacteria talk to each other: regulation of gene expression by quorum sensing. Current Opinion in Microbiology, 2:582–587, 1999.
[3] A. Becskei and L. Serrano. Engineering stability in gene networks by autoregulation. Nature, 405:590–593, June 2000.
[4] M. Elowitz and S. Leibler. A synthetic oscillatory network of transcriptional regulators. Nature, 403:335–338, January 2000.
[5] W. C. Fuqua, S. Winans, and E. P. Greenberg. Quorum sensing in bacteria: the LuxR-LuxI family of cell density-responsive transcriptional regulators. J. Bacteriol., 176:269–275, 1994.
[6] T. Gardner, R. Cantor, and J. Collins. Construction of a genetic toggle switch in Escherichia coli. Nature, 403:339–342, January 2000.
[7] M. Ptashne. A Genetic Switch: Phage Lambda and Higher Organisms. Cell Press and Blackwell Scientific Publications, Cambridge, MA, 2nd edition, 1986.
[8] J. Sambrook, E. F. Fritsch, and T. Maniatis. Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory Press, Plainview, NY, 1989.
[9] H. M. Shapiro. Practical Flow Cytometry. Wiley-Liss, New York, NY, 3rd edition, 1995.
[10] R. Weiss and S. Basu. The device physics of cellular logic gates. In NSC-1: The First Workshop on Non-Silicon Computing, Boston, Massachusetts, February 2002.
[11] R. Weiss, G. Homsy, and T. F. Knight Jr. Toward in-vivo digital circuits. In DIMACS Workshop on Evolution as Computation, Princeton, NJ, January 1999.
[12] R. Weiss and T. F. Knight Jr. Engineered communications for microbial robotics. In DNA6: Sixth International Workshop on DNA-Based Computers, DNA 2000, pages 1–16, Leiden, The Netherlands, June 2000. Springer-Verlag.
Temperature Gradient-Based DNA Computing for Graph Problems with Weighted Edges

Ji Youn Lee¹, Soo-Yong Shin², Sirk June Augh³, Tai Hyun Park¹, and Byoung-Tak Zhang²,³

¹ Cell and Microbial Engineering Laboratory, School of Chemical Engineering, ² Biointelligence Laboratory, School of Computer Science and Engineering, ³ Center for Bioinformation Technology (CBIT), Seoul National University, Seoul 151-742, Korea
{jylee, syshin, sjaugh, thpark, btzhang}@bi.snu.ac.kr
Abstract. We propose a method for encoding numerical data in DNA using a temperature gradient, introducing melting temperature (Tm) for this purpose. Melting temperature is a key characteristic for manipulating the hybridization and denaturation processes used in the main steps of DNA computing, such as the solution generation step and the amplification step. DNA strands of lower melting temperature tend to denature easily and can also be easily amplified by a slightly modified polymerase chain reaction, called denaturation temperature gradient polymerase chain reaction. Using these properties, we implement a local search molecular algorithm based on a temperature gradient, in contrast to conventional exhaustive search molecular algorithms. The proposed methods are verified by solving an instance of the travelling salesman problem. We could effectively amplify the correct solution, and the use of the temperature gradient made the detection of solutions easier.
1 Introduction
Since Adleman's pioneering work[1], DNA computing has been applied to various fields of research, including combinatorial optimization [9], massively parallel computing [7], Boolean circuit development [11], nanotechnology [17], huge databases [12], and so on. Though much of this research has achieved good results, most of these applications do not consider the problem of representing numerical data in DNA molecules. However, many real-world applications involve graph problems with weighted edges. Examples include the shortest path problem, the travelling salesman problem, the minimum spanning tree problem, and the Steiner tree problem [4]. A method for representing numerical data in DNA strands is thus one of the important issues in extending the field of DNA computing. Though some researchers have tried to represent numerical data using DNA, the results are not yet satisfactory.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 73–84, 2003.
© Springer-Verlag Berlin Heidelberg 2003
In this paper, we propose a novel encoding method that uses a temperature gradient to overcome the drawbacks of previous work. We represent numerical data using a melting temperature gradient and implement a local search molecular algorithm with the temperature gradient method. The travelling salesman problem (TSP) is used as a benchmark problem for the proposed technology. TSPs are interesting in that their solution requires representing path weights; this contrasts with the Hamiltonian path problem, where the connection cost is binary. Our bio-lab experiments show that the correct DNA strands are effectively amplified and that the detection of optimal solutions is made easier. In Section 2, we describe the molecular encoding of numerical data in DNA molecules. Section 3 explains the molecular algorithm and experimental procedure for solving the TSP. Section 4 presents the results and discusses them. Conclusions are drawn in Section 5.
2 Molecular Encoding for Edge Weights of a Graph

2.1 Previous Encoding Methods
Narayanan and Zorbalas first proposed DNA algorithms to solve travelling salesman problems [8]. Their method makes the length of the weight sequences in the edge sequences proportional to the edge costs. This method is inefficient for representing a wide range of weights because, in some cases, an edge sequence with a high weight must be very long. For the same reason, it is also inefficient for representing real values. In addition, encoding larger weights as longer sequences runs contrary to the biology: the longer the sequences, the more likely they are to hybridize with other DNA strands, even though the strands we want to find are the shortest ones. Yamamura et al. proposed a concentration control method to represent the weights [18]. The concentration of each DNA species is used as input and output data, i.e., the numerical data is encoded by DNA concentrations. This method enables local search among all candidate solutions instead of exhaustive search. However, the concentration control method has some drawbacks in detecting the solutions. One cannot be sure that the most intense band in the gel contains the optimal solutions, because the most intense band can be produced by a number of non-optimal partial solutions at low concentration rather than by optimal solutions at high concentration. In addition, it is technically difficult to extract a single optimal solution from the most intense band. In previous work [15], we proposed a method for representing real values in fixed-length DNA strands using GC content. Edge sequences contain two components: link sequences and weight sequences. The weight of an edge is represented by varying the proportion of A/T pairs and G/C pairs in the weight sequence. Since two hydrogen bonds form between A and T and three between G and C, the hybridizations between the G/C pairs are preferred to those between A/T pairs.
Therefore, the search procedure can be guided by including more A/T pairs to higher weight sequences and more G/C
pairs to smaller weight sequences. We verified this method by computer simulation.

2.2 Melting Temperature Control Encoding
Based on our previous work, we improve the weight representation method by allowing the hybridization process to be controlled by melting temperature. Though the number of hydrogen bonds is an important factor in deciding the thermal stability of DNA strands, it is not the only one. The melting temperature (Tm) is a more direct characteristic of the stability of a DNA hybrid, and Tm is determined by a more complex function of various factors, including GC content. We therefore adopt melting temperature to encode the DNA strands. Many methods have been proposed to determine the Tm value of a DNA duplex. In the absence of destabilizing agents, such as formamide or urea, Tm depends on three major parameters: the GC content and thermodynamic factors, the strand concentration, and the salt concentration. If we ignore the reaction conditions, Tm depends only on the GC content and thermodynamic factors. The classical method uses the GC content, the salt concentration, and the length [16]. More recently, a statistical method using thermodynamic parameters such as ∆H and ∆S has been proposed [5, 10]. The latter model is known as the nearest-neighbor (NN) model; it is more accurate and applicable to DNA duplexes of up to 108 bp. We use both the GC content method and the nearest-neighbor method to calculate Tm: if the sequences are short, the NN method is used; otherwise, the classical method is applied. Though the Tm of a DNA duplex of up to 108 bp can be accurately determined by the NN method, the Tm of the longer DNA strands formed after hybridization and ligation is calculated with the classical method rather than the NN method.
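One common form of the classical GC-content estimate can be sketched as follows. The coefficients vary between sources and are not necessarily the ones used by the authors; the two example sequences are weight sequences taken from Table 1.

```python
# Hedged sketch of a "classical" GC-content Tm estimate (a common
# Marmur/Schildkraut-style form; coefficients are illustrative assumptions).
import math

def gc_percent(seq):
    """GC content of a sequence, as a percentage."""
    return 100.0 * sum(1 for b in seq.upper() if b in "GC") / len(seq)

def tm_classical(seq, na_molar=1.0):
    """Tm estimate from GC content, salt concentration, and length."""
    return (81.5 + 16.6 * math.log10(na_molar)
            + 0.41 * gc_percent(seq) - 600.0 / len(seq))

low_gc = "ATGATAGATATGTAGATTCC"   # weight-3 sequence from Table 1 (30% GC)
high_gc = "GAGCTGGCTCCTCATCGCGC"  # weight-11 sequence from Table 1 (70% GC)
# Higher GC content raises the estimated Tm, which is the property the
# weight encoding exploits.
```

Note that the absolute values produced by such a formula differ from the NN-model figures in Table 1; only the ordering (GC-rich strands melt at higher temperatures) is relied on here.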
Fig. 1. Graph for the travelling salesman problem (left) and an example of its encoding in DNA (right). The vertex sequence runs 5' → 3' and the edge sequence 3' → 5'. The optimal path for this problem is 0 → 1 → 2 → 3 → 4 → 5 → 6 → 0.

The melting temperature control encoding scheme is illustrated for the travelling salesman problem in Fig. 1. The basic representation is similar to Adleman's
method [1]. The difference is that the edge sequences have a weight sequence part, W12 or W13, in the middle (refer to the right part of Fig. 1). To design sequences for weighted-graph problems with the proposed method, the vertex sequences are designed first. Each vertex sequence is designed to have a similar Tm, using both the nearest-neighbor and the classical method, so that the vertex sequences do not affect the hybridization fidelity. Then the edge sequences are generated. Each edge sequence consists of two parts: the link sequences (P1B, P2A, P3A) and the weight sequence (W12, W13). This scheme is motivated by [9]. The link sequences are based on the vertex sequences to be linked and also have a similar Tm, since they are determined by the vertex sequences. The weight sequences are designed to have varying Tm values according to the costs they represent. To produce more variation in generating low cost paths, the weight sequences with smaller weights have a lower Tm: DNA strands with a lower Tm can be easily denatured and also easily amplified by denaturation temperature gradient polymerase chain reaction. For example, as shown in Fig. 1, the first part (P1B) of the edge sequence is complementary to the last half (P1B) of the starting vertex sequence, and the last part (P2A) is complementary to the first half (P2A) of the ending vertex sequence. The weight sequence W12 with cost 3 contains more A/T pairs than the weight sequence W13 with cost 7, to lower its Tm.

2.3 Travelling Salesman Problems
We selected the travelling salesman problem as a benchmark. The travelling salesman problem (TSP) is to find a minimum weight (cost) path for a given set of vertices (cities) and edges (roads). In addition, the solution path must contain all the given cities, each exactly once, and must begin and end at a specified city [4]. We solve a TSP with 7 nodes, 23 edges, and 5 distinct weights, as shown in Fig. 1. For convenience, we represent the weights by the decimal numbers 3, 5, 7, 9, and 11. Example DNA sequences for solving the TSP are shown in Table 1. These sequences were generated by the NACST sequence generator (NACST/Seq) [6]. The sequences are generated to prevent mis-hybridization and unexpected secondary structure using seven constraints: similarity, H-measure, H-measure at the 3'-end, GC ratio, continuity, unexpected secondary structure, and melting temperature [14, 13]. For the vertex sequences, Tm ranges from 55°C to 60°C using the NN model with 1 M salt concentration and 10 nM oligomer concentration, GC content is restricted to 50%, and unexpected secondary structures such as hairpin formation are prohibited. In the weight sequences, Tm increases with the cost, and GC content varies from 30% to 70% (see Table 1). To avoid extreme cases, i.e., sequences consisting entirely of C or G, we set the range to 30%–70%. As a result of the evolutionary algorithm, NACST/Seq assigns 30% GC to weight 3, 40% to weight 5, and so on, increasing the GC content by 10% as the weight increases by two. After hybridization and ligation, long sequences are generated to which the NN model does not apply. Therefore, the evolutionary algorithm controlled the GC ratio more precisely than the melting temperature.
Temperature Gradient-Based DNA Computing for Graph Problems
Table 1. Vertex sequences and weight sequences for TSP. Tm is calculated by the NN method with 1M salt concentration and 10nM oligomer concentration.

Vertex sequences
No.  Sequence (5'→3')        Tm     GC%
0    AGGCGAGTATGGGGTATATC    60.73  50
1    CCTGTCAACATTGACGCTCA    59.24  50
2    TTATGATTCCACTGGCGCTC    59.00  50
3    ATCGTACTCATGGTCCCTAC    56.81  50
4    CGCTCCATCCTTGATCGTTT    58.13  50
5    CTTCGCTGCTGATAACCTCA    59.44  50
6    GAGTTAGATGTCACGTCACG    56.97  50

Weight sequences
Edge cost  Sequence (5'→3')        Tm     GC%
3          ATGATAGATATGTAGATTCC    47.89  30
5          GGATGTGATATCGTTCTTGT    54.62  40
7          GGATTAGCAGTGCCTCAGTT    58.37  50
9          TGGCCACGAAGCCTTCCGTT    64.51  60
11         GAGCTGGCTCCTCATCGCGC    68.88  70
Additionally, the higher weight sequences have higher melting temperatures with the NN method.
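The GC percentages listed in Table 1 can be verified directly from the sequences; a minimal check (sequence strings copied from the table):

```python
def gc_percent(seq):
    """Percentage of G/C bases in a DNA sequence."""
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)

assert gc_percent("AGGCGAGTATGGGGTATATC") == 50.0  # vertex 0
assert gc_percent("ATGATAGATATGTAGATTCC") == 30.0  # weight sequence, cost 3
assert gc_percent("GAGCTGGCTCCTCATCGCGC") == 70.0  # weight sequence, cost 11
```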
3 Molecular Algorithms for TSP

3.1 Previous Work
Most of the existing DNA computing methods for solving combinatorial optimization problems are similar to Adleman's in several ways. First, all possible solutions are generated, and then the correct solutions among them are selected. The previous works [8, 15, 18] on travelling salesman problems or shortest path problems are also similar to each other; they differ only in their weight encoding schemes, such as sequence length, GC content, and strand concentration. The former two studies were not verified by lab experiments; the last method [18] was demonstrated by lab experiments, though the results were not satisfactory, and it biases the search space by concentration control.
3.2 Temperature Gradient Algorithms
In order to utilize the massive parallelism of biochemical materials and to guide the search, it is useful to introduce some biases in the experimental steps. For this purpose, we impose constraints on the DNA concentrations and the PCR step. By varying the concentration of each oligomer strand [18], we can generate more paths that contain a smaller sum of weights. Also, by modifying the PCR protocol, we can specifically amplify and detect the correct solution more easily. DNA strands of low melting temperature can
Ji Youn Lee et al.
be more strongly amplified in PCR by varying the denaturation protocol. Usually, the annealing temperature is varied to optimize the reaction conditions. But by denaturing the dsDNA at a lower temperature in the starting cycles and gradually increasing the denaturation temperature, we can specifically amplify the DNA strands with a lower melting temperature. So we can amplify the correct solutions more than other solutions that have the same length but a higher melting temperature. The whole procedure to solve TSP is shown in Fig. 2. In step 1, we generate the solution set. Steps 2 ∼ 5 find the solutions satisfying the restrictions (the starting vertex is the same as the end vertex). Step 6 separates the sequences visiting all vertices. In steps 7 ∼ 8, we amplify the solutions based on their path costs and find the optimal solution. Finally, we check the path of the solution.

Step 1: Hybridization and ligation [Generation of random paths]
Step 2: PCR with primer vout (comp.) [Check the end vertex]
Step 3: Gel electrophoresis: separation by size and elution
Step 4: PCR with primer vin [Check the start vertex]
Step 5: Gel electrophoresis: separation by size and elution
Step 6: Affinity separation with a biotin-avidin magnetic beads system [Check each vertex]
Step 7: DTG-PCR with vin and vout (comp.) [Amplify the correct solutions more]
Step 8: Gel electrophoresis: separation by size and Tm (TGGE)
Step 9: Sequencing and readout
Fig. 2. The molecular algorithm for solving TSP.
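For reference, the task these molecular steps implement corresponds to the following conventional exhaustive search. This is only a sketch: the 3-city edge list below is hypothetical, since the 23-edge instance of Fig. 1 is not reproduced in the text.

```python
from itertools import permutations

def min_weight_tour(n, weights):
    """Minimum-cost tour starting and ending at city 0, visiting every city
    exactly once; `weights` maps directed edges (u, v) to costs."""
    best_cost, best_path = float("inf"), None
    for perm in permutations(range(1, n)):
        path = (0,) + perm + (0,)
        cost = 0
        for u, v in zip(path, path[1:]):
            if (u, v) not in weights:
                break  # missing edge: not a valid tour
            cost += weights[(u, v)]
        else:  # every edge existed
            if cost < best_cost:
                best_cost, best_path = cost, path
    return best_cost, best_path

# Hypothetical 3-city example (edge weights are made up for illustration):
w = {(0, 1): 3, (1, 2): 3, (2, 0): 5, (0, 2): 7, (2, 1): 3, (1, 0): 3}
assert min_weight_tour(3, w) == (11, (0, 1, 2, 0))
```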
Oligomer Synthesis All 7 vertices and 5 weights were designed as 20-mer ssDNA, as listed in Table 1, and the 23 edges as 40-mer ssDNAs according to their vertex and weight sequences. All oligomers were 5'-phosphorylated. Additional 5'-biotinylated vertex strands were prepared for affinity separation. All oligomers were synthesized at Bioneer Co.

Hybridization and Ligation Edge and weight oligomers were added in amounts depending on their weights: taking weight 3 as 100%, the amount decreased by 20% for each increase of 2 in the weight. The oligomer mixture was heated to 95°C, slowly cooled to 20°C at 1°C per minute, and stored at 4°C. We mixed 5 µl of the hybridization mixture with 350 units of T4 DNA ligase (TaKaRa, Japan) and ligase buffer (66mM Tris-HCl, pH 7.6, 6.6mM MgCl2, 10mM DTT, 0.1mM ATP), added H2O to a total volume of 10 µl, and incubated the reaction mixture for 16 hours at 16°C.
PCR Amplification and Gel Electrophoresis All PCR amplifications were performed on a PTC-200 DNA Engine (MJ Research, MA, USA). For normal amplification, 0.5 µM of primer and AccuPower PCR PreMix (Bioneer, Korea), containing 1 unit of Taq DNA polymerase in 10mM Tris-HCl, pH 9.0, 1.5mM MgCl2, 40mM KCl, and 0.25mM of each dNTP, were dissolved in distilled water to a total volume of 20 µl. PCR was run for 34 cycles of 95°C for 30 seconds, (Tm − 5)°C for 30 seconds, and 72°C for 30 seconds. Initial denaturation and the final prolonged polymerization were each executed for 5 minutes. All gel electrophoresis was performed with 2% Agarose-1000 (GibcoBRL, NY, USA) in 0.5X tris-borate-EDTA buffer, and the gel was stained with ethidium bromide. A 50bp DNA Ladder (GibcoBRL, NY, USA) was used as a marker.

Affinity Separation We roughly followed the affinity separation protocol of Adleman's experiments. The ssDNA for the affinity purification was produced by replacing the forward primer with a 5'-biotinylated analog. The amplified product was annealed to streptavidin paramagnetic particles (Promega, Madison, WI) by incubating in 200ml of 0.5× saline sodium citrate (SSC) for 1 hour at room temperature with constant shaking. Particles were washed three times in 300ml of 0.5× SSC and then heated to 80°C in 100ml of ddH2O for 5 minutes to denature the bound dsDNA. The aqueous phase with ssDNA was retained. For affinity separation, 1 nmol of 5'-biotinylated vertex strands was annealed to particles as above and washed three times in 300ml of 0.5× SSC for 1 hour at room temperature with constant shaking. Particles were washed four times in 300ml of 0.5× SSC to remove unbound ssDNA and then heated to 80°C in 100ml of ddH2O for 5 minutes to release the ssDNA bound to the complementary vertex 1. The aqueous phase with ssDNA was retained. This process was then repeated for each vertex.
Denaturation Temperature Gradient PCR For denaturation temperature gradient PCR (DTG-PCR), the denaturation temperature was initially set at 70°C, gradually increased by 1°C per cycle, and then kept at 95°C for the remaining 10 cycles. The other conditions were identical to the normal PCR described in the PCR amplification section.

Temperature Gradient Gel Electrophoresis For TGGE analysis, we used an 8% denaturing polyacrylamide gel containing 20% formamide and 4.2M urea. The electrophoresis was performed over a 55°C ∼ 70°C temperature range at 200 V for three hours. After the electrophoresis, the gel was silver stained.

Cloning and Sequencing After the TGGE, the DNA strands in the main band were excised, eluted, and PCR amplified. This PCR product was cloned into pBluescript SK+ and sequenced using the T7 primer.
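The denaturation schedule and its selective effect can be sketched numerically. The all-or-nothing melting model and the example Tm values below are simplifying assumptions for illustration, not the actual reaction thermodynamics.

```python
def dtg_profile(start=70, final=95, hold_cycles=10):
    """Denaturation temperature per cycle: ramp 1 C per cycle from `start`
    to `final`, then hold `final` for `hold_cycles` further cycles."""
    return list(range(start, final + 1)) + [final] * hold_cycles

def amplify(tm_values, profile):
    """Toy model: a strand doubles in a cycle only if the cycle's
    denaturation temperature reaches its melting temperature."""
    copies = {tm: 1.0 for tm in tm_values}
    for temp in profile:
        for tm in copies:
            if temp >= tm:
                copies[tm] *= 2.0
    return copies

# Two hypothetical strands of equal length but different Tm:
copies = amplify([85.0, 90.0], dtg_profile())
assert copies[85.0] > copies[90.0]  # the lower-Tm strand ends up more amplified
```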
Fig. 3. Gel electrophoresis on a 2% agarose gel. The lanes contain: lane M: DNA size marker (50 bp ladder); (A) lanes 1, 2: first ligation result, lane 3: oligomer mixture; (B) lane 1: second ligation result; (C) lanes 1, 2: first PCR with Vout result; (D) lane 1: second PCR with Vin result.
4 Results and Discussion

4.1 Random Path Formation and Size Sieving
The starting point for solving TSP is generating the initial pool by random path formation. Searching for a correct solution in the candidate pool is an important step in solving the problem, and generating more possible solutions in the candidate pool is readily achievable if we utilize the advantages of molecular computing. The rate of a biochemical reaction is related to the reaction constant and the reactant concentrations, so a concentration gradient can be useful for obtaining the correct solution. This approach was already tried by [18]. By increasing the concentration of DNA sequences with smaller weights and vice versa, paths that contain a smaller sum of weights are generated more frequently. In our work, the concentration gradient was only a tool to generate more possible solutions. The hybridization and ligation results are shown in Fig. 3 (A). Compared with the oligomer mixture, the ligated DNA strands became elongated. But,
there are few copies around 300 bp, which is the length of paths including 8 vertices. So we executed the ligation reaction again and obtained an upward-shifted ligation product, as shown in Fig. 3 (B). However, shorter DNA strands still mainly occupy the ligation product. Ligation is an essential step in generating random paths, so efficient ligation is needed to produce DNA strands long enough to solve large problems.
4.2 PCR with In and Out Vertex and Affinity Purification
After the hybridization and ligation, double-stranded DNAs with sticky ends were generated. We cannot use a primer pair, because the start and end points are identical in TSP, so the two primers would be exactly complementary to each other. We therefore executed PCRs with only one primer. First, we executed PCR with a primer that is complementary to the end vertex (vertex 0). Blunt-ended double-stranded DNAs that end with vertex 0 are generated by this PCR. Subsequently, we executed a second PCR with vertex 0, and we could amplify the DNAs that start with vertex 0. We sieved the PCR product by 2% agarose gel electrophoresis (Fig. 3 (C), (D)). The second PCR product appeared as several bands in the gel. We excised and eluted the band around 300 bp and amplified it with 5'-biotinylated vertex 0 as a primer. The amplified product was used in the following affinity separation with streptavidin paramagnetic particles.
4.3 Denaturation Temperature Gradient PCR
The DNA strands that passed the implementation procedures up to step 6 have the same length and contain the same oligomer segments. So they have similar characteristics and cannot be easily separated. However, we designed the weight sequences to have different GC contents, so the Tm of the correct solution is the lowest among the candidate solutions. Therefore, if we decrease the denaturation temperature to a certain level, mainly the DNA strands of correct solutions will be denatured at that temperature and be amplified. As the denaturation temperature increases, other DNA strands will also be amplified. But the amount of correct solutions increases exponentially cycle by cycle and comes to occupy the major part of the solution. By this simple modification of typical PCR, we can amplify the correct solution and detect it easily. The effectiveness of DTG-PCR is shown by the relative amplification in Fig. 4 (A). Shorter DNA strands in lane 2 are more amplified by DTG-PCR compared with the normal PCR product in lane 1 (see the boxed area in Fig. 4). After the DTG-PCR, we obtained the DNA strands shown in Fig. 4 (B). This band might contain four different DNA strands of the possible Hamiltonian paths: '0 → 1 → 2 → 3 → 4 → 5 → 6 → 0' (sum of weights: 21), '0 → 1 → 6 → 5 → 4 → 3 → 2 → 0' (31), '0 → 2 → 1 → 3 → 4 → 5 → 6 → 0' (27), and '0 → 2 → 3 → 4 → 5 → 6 → 1 → 0' (31). These strands have identical lengths and cannot be separated by normal gel electrophoresis. However, we can separate them by temperature gradient gel electrophoresis (TGGE), because they have distinct
melting behaviors. Similar work has been done using denaturing gradient gel electrophoresis (DGGE) [3].

Fig. 4. Gel electrophoresis on a 2% agarose gel. The lanes contain: lane M: DNA size marker (50 bp ladder); (A) lane 1: normal PCR result, lane 2: DTG-PCR result; (B) lane 1: the final DTG-PCR result; (C) TGGE separation of solutions.
4.4 Temperature Gradient Gel Electrophoresis and Sequencing
TGGE is analogous to DGGE, and both are based on the melting characteristics of a DNA strand. TGGE is an extremely sensitive method: even fragments differing by only one nucleotide in sequence can be separated. Usually this property is used to detect point mutations, but we utilized this electrophoresis method to separate the correct solution from the other possible Hamiltonian paths. The GC content of the strands varies from 40.67% to 44.00%, and the melting temperatures of the DNA strands are, respectively, 84.80°C, 86.51°C, 85.68°C, and 86.62°C by the NN model [5] with 1M salt concentration and 10nM oligomer concentration, and 92.69°C, 94.05°C, 93.51°C, and 94.05°C by the GC content method. The PCR mixture containing those possible Hamiltonian paths was separated by TGGE, and one main band and other sub-bands were observed in Fig. 4 (C). The DNA strands in the main band were cloned and sequenced, and the result showed that the path 0 → 1 → 2 → 3 → 4 → 5 → 6 → 0 has the lowest cost. The sequencing result confirmed the whole experimental procedure.
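The readout can be summarized with the costs and NN-model Tm values quoted in the text, assuming "respectively" follows the order in which the four paths were listed earlier: the lowest-Tm strand, which DTG-PCR enriches, is exactly the lowest-cost path.

```python
# Candidate Hamiltonian paths with (path cost, NN-model Tm in C) as quoted.
candidates = {
    "0-1-2-3-4-5-6-0": (21, 84.80),
    "0-1-6-5-4-3-2-0": (31, 86.51),
    "0-2-1-3-4-5-6-0": (27, 85.68),
    "0-2-3-4-5-6-1-0": (31, 86.62),
}
lowest_tm = min(candidates, key=lambda p: candidates[p][1])
lowest_cost = min(candidates, key=lambda p: candidates[p][0])
assert lowest_tm == lowest_cost == "0-1-2-3-4-5-6-0"
```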
4.5 Scalability
The solvable problem size is restricted by the capabilities of current biochemical tools. In the case of solving TSP, PCR is the critical step in the procedure. We can amplify DNA strands of up to 40,000 bases with current PCR technology. Even considering this limitation, we can roughly scale up to 1,000-city problems if we apply this temperature gradient encoding method. Another important point for scalability is the ability of separation and detection. We
can separate DNA strands whose GC content differs by 0.49% with the current resolution of TGGE, and this difference is easily achievable with our encoding method. However, there exist some limitations in scaling up to 1,000-city problems. The first and critical problem is that it is impossible to discriminate solutions which have identical melting temperatures; discrimination also fails when two solutions' melting temperatures differ by less than 0.2°C. The second problem is the presence of elaborate and time-consuming implementation steps. From the experimental view, 1,000 rounds of affinity separation are almost impossible; this is the most time-consuming step in our implementation. But if we introduce the affinity separation method proposed by [2], the affinity separation steps can be automated with high efficiency and become easier. Another possible obstacle in our encoding method arises when DNA strands with stable mismatches have a lower melting temperature and are amplified exclusively. So sophisticated sequence design to avoid stable mismatches is needed.
5 Conclusions
We introduced a temperature gradient method to solve graph problems with numerical data. This method can overcome the restrictions of other encoding methods and can be easily implemented by simple modification of existing experimental techniques. The correct solution is relatively more strongly amplified during the reaction and can be easily detected. This method drives the DNA pool to contain more correct solutions, rather than searching randomly for the correct solution in the DNA pool. We showed the feasibility of our computing method by solving the travelling salesman problem with adequate biochemical tools. The combination of the orthogonal design of DNA sequences and denaturation temperature gradient PCR provides a novel method to solve general graph problems with weighted edges. It is applicable to any Tm-involved implementation step to amplify low melting temperature DNA strands.
Acknowledgments This research was supported in part by the Ministry of Education & Human Resources Development under the BK21-IT Program and by the Ministry of Commerce, Industry and Energy through the Molecular Evolutionary Computing (MEC) Project. The RIACT at Seoul National University provided research facilities for this study.
References

[1] L. M. Adleman. Molecular computation of solutions to combinatorial problems. Science, 266:1021–1024, 1994.
[2] R. S. Braich, N. Chelyapov, C. Johnson, P. W. K. Rothemund, and L. Adleman. Solution of a 20-variable 3-SAT problem on a DNA computer. Science, 296:499–502, 2002.
[3] J. Chen, E. Antipov, B. Lemieux, W. Cedeño, and D. H. Wood. In vitro selection for a MAX 1s DNA genetic algorithm. In Proceedings of the 5th International Workshop on DNA-Based Computers, pages 23–46, 2000.
[4] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[5] J. SantaLucia Jr. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. USA, 95:1460–1465, 1998.
[6] D. Kim, S.-Y. Shin, I.-H. Lee, and B.-T. Zhang. NACST/Seq: A sequence design system with multiobjective optimization. In Preliminary Proceedings of the 8th International Meeting on DNA Based Computers, pages 241–251, 2002.
[7] Q. Liu, L. Wang, A. G. Frutos, A. E. Condon, R. M. Corn, and L. M. Smith. DNA computing on surfaces. Nature, 403:175–179, 2000.
[8] A. Narayanan and S. Zorbalas. DNA algorithms for computing shortest paths. In Proceedings of Genetic Programming 1998, pages 718–723, 1998.
[9] Q. Ouyang, P. D. Kaplan, S. Liu, and A. Libchaber. DNA solution of the maximal clique problem. Science, 278:446–449, 1997.
[10] R. Owczarzy, P. M. Vallone, F. J. Gallo, T. M. Paner, M. J. Lane, and A. S. Benight. Predicting sequence-dependent melting stability of short duplex DNA oligomers. Biopolymers, 44:217–239, 1998.
[11] G. G. Owenson, M. Amos, D. A. Hodgson, and A. Gibbons. DNA-based logic. Soft Computing, 5(2):102–105, 2001.
[12] J. H. Reif, T. H. LaBean, M. Pirrung, V. S. Rana, B. Guo, C. Kingsford, and G. S. Wickham. Experimental construction of a very large scale DNA database with associative search capability. In Proceedings of the 7th International Workshop on DNA-Based Computers, pages 241–250, 2001.
[13] S.-Y. Shin, D. Kim, I.-H. Lee, and B.-T. Zhang. Multiobjective evolutionary algorithms to design error-preventing DNA sequences. Technical Report BI-02-003, School of Computer Science and Engineering, Seoul National University, Seoul, Korea, March 2002.
[14] S.-Y. Shin, D.-M. Kim, I.-H. Lee, and B.-T. Zhang. Evolutionary sequence generation for reliable DNA computing. In Proceedings of the Congress on Evolutionary Computation 2002, pages 79–84, 2002.
[15] S.-Y. Shin, B.-T. Zhang, and S.-S. Jun. Solving traveling salesman problems using molecular programming. In Proceedings of the Congress on Evolutionary Computation 1999, pages 994–1000, 1999.
[16] J. G. Wetmur. DNA probes: applications of the principles of nucleic acid hybridization. Crit. Rev. Biochem. Mol. Biol., 26:227–259, 1991.
[17] E. Winfree, F. Liu, L. A. Wenzler, and N. C. Seeman. Design and self-assembly of two-dimensional DNA crystals. Nature, 394(6693):539–545, 1998.
[18] M. Yamamura, Y. Hiroto, and T. Matoba. Solutions of shortest path problems by concentration control. In Proceedings of the 7th International Workshop on DNA-Based Computers, pages 231–240, 2001.
Shortening the Computational Time of the Fluorescent DNA Computing

Yoichi Takenaka¹ and Akihiro Hashimoto²

¹ Graduate School of Information Science and Technology, Osaka University, Machikaneyama 1-3, Toyonaka, Osaka 560-8531, Japan, [email protected]
² Faculty of Informatics, Osaka Gakuin University, Kishibe-Minami 2-36-1, Suita, Osaka 564-8511, Japan, [email protected]
Abstract. We present a method to shorten the computational time of fluorescent DNA computing. Fluorescent DNA computing was proposed to solve intractable computational problems such as SAT problems. It uses two groups of fluorescent DNA strands: one group represents that a constraint of the given problem is satisfied, and the other represents that a constraint is unsatisfied. The calculation is executed by hybridizing them competitively to DNA beads or spots on a DNA microarray. Though the biological operation used in fluorescent DNA computing is simple, it needs as many beads or spots on the microarray as there are candidate solutions. In this paper, we prove that one bead or spot can represent plural candidate solutions through the SAT problem, and show the algorithm and an experimental result of the fluorescent DNA computing.
1 Introduction
Since Adleman demonstrated the possibility of solving NP-complete problems by using DNA, biomolecular computation has become a new vista of computation that bridges computer science and biochemistry [1]. DNA provides a massive computational parallelism that allows us to tackle intractable combinatorial problems exhaustively. In the area of formal language theory, a great number of theoretical studies [3,12,13] were inspired by this experimental possibility. Recently, there is growing interest in using SAT as a practical tool for solving real-world problems. In fact, some problems are reported to be solved more efficiently by a SAT solver than by specialized algorithms [2,7]. Many researchers on DNA computing also report their studies through SAT. In 1998 and 2000, Smith and Liu proposed a surface-based approach to DNA computation [8,16]. In 2000, Sakamoto et al. proposed a fluorescent DNA computing that uses DNA hairpin formation, whose computational paradigm differs from the Adleman-Lipton paradigm [14]. In 2001, Takenaka et al. proposed a fluorescent DNA computing that uses DNA beads, fluorescent DNA, and the mix & divide method [11,17]. M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 85–94, 2003. © Springer-Verlag Berlin Heidelberg 2003
In this paper, we present an algorithm that improves the computational time of fluorescent DNA computing through the SAT problem. The fluorescent DNA computing uses fluorescent DNA strands and a DNA microarray for its calculation. DNA microarrays are used to measure the expression of thousands of genes simultaneously [9,15]. There are many spots on the microarray, and each spot holds multiple copies of one DNA strand. DNA strands can be labeled by fluorescers such as Cy3 and Cy5. When we hybridize two types of labeled DNA strands on a microarray, we can measure the proportion of the number of Cy3-labeled DNA strands to Cy5-labeled strands by the intensity of the fluorescence. In the fluorescent DNA computing, each candidate solution for a given problem corresponds one-to-one to a DNA strand that is spotted on the microarray. Cy3-labeled DNA strands are made if the candidate solution satisfies a constraint of the problem, and Cy5-labeled ones if the candidate solution doesn't satisfy one of the constraints. After competitive hybridization of these strands on the microarray, the spots with only Cy3 fluorescence hold the strands that correspond to the true solution. As the biological operation used in the fluorescent DNA computing is simple, it holds the potential to solve large-scale problems. In fact, Takenaka et al. show the way to solve a 24-variable SAT problem with DNA beads, which are one of the derivative technologies of DNA microarrays; a DNA bead corresponds to a spot on a microarray [17]. However, it has a shortcoming: it takes exponential time to read out the solution, because one spot represents only one candidate solution. In this paper, we present an algorithm to avoid this shortcoming by letting one spot represent plural candidate solutions.
2 Definitions
This section introduces the notation used in the paper. Variables are denoted x1, · · · , xn, and can be assigned truth-values 0 (False) or 1 (True). A literal l is either a variable x or its negation ¬x. A CNF formula ϕ is a conjunction of clauses ω1, · · · , ωm, where ωi is a disjunction of literals. A clause is said to be satisfied if at least one of its literals assumes value 1, and unsatisfied if all of its literals assume value 0. A formula is said to be satisfied if all its clauses are satisfied, and unsatisfied if at least one clause is unsatisfied. The SAT problem is to decide whether there exists a truth assignment to the variables such that the formula becomes satisfied. There are 2^n assignments of the n variables, denoted y1, · · · , y2^n. We define a function fϕ(y) that returns the number of clauses in the CNF formula ϕ that the assignment y doesn't satisfy. In the following, we use the CNF formula (1) as a SAT instance for explanation. It contains 5 variables and 11 clauses. ϕ =
(x2 ∨ x4 ∨ x5) ∧ (x2 ∨ x3 ∨ x4) ∧ (x1 ∨ x2 ∨ x5) ∧ (x1 ∨ x2 ∨ x3) ∧ (x1 ∨ x3 ∨ x5) ∧ (x3 ∨ x4 ∨ x5) ∧ (x2 ∨ x3 ∨ x4) ∧ (x1 ∨ x2 ∨ x5) ∧ (x1 ∨ x2 ∨ x4) ∧ (x1 ∨ x4 ∨ x5) ∧ (x3 ∨ x4 ∨ x5)    (1)
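The function fϕ can be written directly, encoding clauses DIMACS-style as signed integers. Since the overlines marking negated literals in formula (1) are typeset in the original and not recoverable here, a small hypothetical formula is used for the example.

```python
def count_unsatisfied(clauses, assignment):
    """f_phi(y): number of clauses that assignment y (dict var -> 0/1)
    leaves unsatisfied. A literal +i means x_i, -i means its negation."""
    def literal_true(lit):
        value = assignment[abs(lit)]
        return value == 1 if lit > 0 else value == 0
    return sum(1 for clause in clauses
               if not any(literal_true(lit) for lit in clause))

phi = [[1, -2], [2, 3], [-1, -3]]       # hypothetical 3-variable CNF
y = {1: 1, 2: 0, 3: 1}                  # assignment x1=1, x2=0, x3=1
assert count_unsatisfied(phi, y) == 1   # only the clause (-1, -3) fails
```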
Table 1. Assignments and their numbers of unsatisfied clauses

x5 = 0:
x1x2 \ x3x4    00   01   11   10
00              1    1    1    1
01              2    1    3    2
11              1    1    2    0
10              1    1    1    1

x5 = 1:
x1x2 \ x3x4    00   01   11   10
00              2    3    2    1
01              2    1    2    1
11              2    1    2    1
10              1    1    1    1
Table 1 shows the value of fϕ (y) for each assignment. The CNF formula (1) is satisfiable and a true assignment is (x1 x2 x3 x4 x5 ) = (11100).
3 Fluorescent DNA Computing
DNA strands can be labeled by fluorescers such as Cy3, Cy5, and R110. Assume there are many copies of a DNA strand, they are divided into two groups, and the DNA strands in each group are labeled by a different fluorescer. We can measure the proportion of the numbers of DNA strands between the two groups by this labeling technique and "competitive hybridization". The following steps measure the proportion.

1. Label the strands of each group by a different fluorescer respectively.
2. Spot the complementary strands on the surface of a glycidal methacrylate plate.
3. Hybridize the strands of the two groups to the spot competitively.
4. Wash out the strands that don't hybridize to the spot.
5. Measure the intensities of the two fluorescences.

The proportion of the intensities is equivalent to the proportion of the numbers of DNA strands between the two groups. We call this measurement "competitive hybridization". Competitive hybridization is used in microarray technology, which can provide scientists with genome-scale insight into the state of a cell under different conditions. There are many spots on the microarray, and each spot holds multiple copies of one DNA strand or one cDNA of a gene. A single microarray experiment measures the relative level of expression of every gene in the organism at the same time [4]. The fluorescent DNA computing uses competitive hybridization and a microarray to calculate the solution. Each spot of the microarray represents a candidate solution for a given problem. Consider candidates to be the DNA strands that encode candidate solutions, and co-candidates to be the complementary strands of candidates. The DNA computing by competitive hybridization requires the following five steps.
Fig. 1. Complement strands made in the second and third step
1. Make a microarray each of whose spots holds multiple copies of a candidate.
2. Synthesize all the co-candidates, labeled with a fluorescer.
3. For each constraint of the problem, synthesize the co-candidates that don't satisfy the constraint and label them with another fluorescer.
4. Competitively hybridize the two groups of co-candidates to the microarray.
5. Measure the intensity of the two fluorescent lights of each spot.

We describe these five steps through the formula (1). As this instance has five variables, there are a total of 2^5 = 32 candidate solutions. In the first step, we synthesize all the candidates and attach them to the spots respectively. In the second step, we synthesize the co-candidates and label them with a fluorescer; without loss of generality, we use Cy5. The left side of Fig. 1 shows the complementary strands to be synthesized. In the figure, each square represents a co-candidate, and the number in the square means the assignment of (x1 x2 x3 x4 x5). The constraints of the SAT are the clauses to be satisfied. In the third step, for each clause, we synthesize the complementary strands of the candidates that don't satisfy the clause. For the first clause (x2 ∨ x4 ∨ x5), we synthesize the complementary strands of (x1 x2 x3 x4 x5) = (00000), (00100), (10000), (10100). The right side of Fig. 1 shows the complementary strands made in this step. Throughout the step, the co-candidate of the assignment yi is synthesized fϕ(yi) times. For example, the strand of (x1 x2 x3 x4 x5) = (01000) is synthesized for the second and third clauses. Without loss of generality, we use Cy3 as the fluorescer. In the fourth step, we competitively hybridize the co-candidates synthesized in the second and third steps to the candidates on the microarray. After competitive hybridization, the spots with the true assignment hold only Cy5-labeled strands, and the spots with a false assignment hold both Cy5-labeled and Cy3-labeled strands.
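The readout implied by steps 4 and 5 can be sketched with the fϕ values of Table 1: a spot carrying one assignment receives one unit of Cy5 signal and fϕ(y) units of Cy3 signal, so a satisfying assignment is exactly a spot with zero Cy3 signal. The three assignments below take their fϕ values from Table 1; treating these as directly proportional signal units is a simplifying assumption.

```python
# Cy3 signal per spot is proportional to f_phi(y); values from Table 1.
cy3_signal = {"11100": 0, "01000": 2, "00000": 1}
true_assignments = [y for y, cy3 in cy3_signal.items() if cy3 == 0]
assert true_assignments == ["11100"]
```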
As the assignment (x1 x2 x3 x4 x5) = (11100) holds the proportion Cy5/Cy3 = 1/0, it is a true assignment and the formula (1) is satisfiable. Now we count the computational time of the fluorescent DNA computing. As the computing uses O(2^n) spots on the DNA microarray in the first step,
Fig. 2. DNA strands are synthesized by eight rounds of combinatorial synthesis, where each word, W1–W8, is added in a separate column of a DNA synthesizer.
it seems to need O(2^n) time. In fact, it takes O(n) time with DNA beads and mix & divide combinatorial synthesis [17]. A DNA bead is approximately a 5 µm diameter glycidal methacrylate bead that carries multiple copies of a DNA sequence. In [10], DNA strands are composed of eight words (W1 ∼ W8) and are synthesized by eight rounds of the mix & divide method; Figure 2 shows the schema of the method. The second step takes O(n) time, and the third step takes O(nm) time if you encode according to the mix & divide method used in the first step. The fourth step needs constant time. In the fifth step, as measurement of a fluorescent light takes constant time per spot and there are 2^n spots, it takes O(2^n) time. The total computational time is dominated by the fifth step. Therefore, it takes O(2^n) time.
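The mix & divide synthesis can be sketched abstractly: each round splits the pool, appends a 0-word in one column and a 1-word in the other, and remixes, so n rounds produce all 2^n candidates. The string encoding below stands in for the actual DNA words.

```python
def mix_and_divide(n):
    """Abstract mix & divide: n split/append/mix rounds yield all 2^n
    candidate words."""
    pool = [""]
    for _ in range(n):
        pool = [s + "0" for s in pool] + [s + "1" for s in pool]
    return pool

pool = mix_and_divide(5)
assert len(pool) == 32 and "11100" in pool  # 5 rounds cover all 2^5 candidates
```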
4 Idea and Mathematical Background
The bottleneck of the fluorescent DNA computing is the number of spots on the microarray. As a spot (or a bead) represents one candidate solution, the time required to measure all the fluorescent light becomes O(2^n). If one spot can represent plural candidate solutions, the computational time will be shortened. This is the idea behind shortening the computational time of fluorescent DNA computing. To realize the idea, we bias the quantity of synthesized DNA strands (co-candidates) between the second and third steps of the fluorescent DNA computing in Section 3. Let the quantity of synthesized DNA per assignment for each clause in the third step be α times that in the second step. The proportion of the two fluorescers Cy5/Cy3 then becomes 1/(α fϕ(y)) for an assignment y. Table 2 shows the proportions 1/(α fϕ(y)) for formula (1).

Table 2. Proportions of the two fluorescers Cy5/Cy3

x5 = 0:
x1x2 \ x3x4    00     01     11     10
00             1/α    1/α    1/α    1/α
01             1/2α   1/α    1/3α   1/2α
11             1/α    1/α    1/2α   1/0
10             1/α    1/α    1/α    1/α

x5 = 1:
x1x2 \ x3x4    00     01     11     10
00             1/2α   1/3α   1/2α   1/α
01             1/2α   1/α    1/2α   1/α
11             1/2α   1/α    1/2α   1/α
10             1/α    1/α    1/α    1/α

Let β be the number of candidate solutions that are represented by one spot, and assume the spot holds the same amount of each DNA strand (candidate). Then we define the value γ as the proportion of the two fluorescers Cy5/Cy3, as shown in equation (2):

γ = ( Σ_{i=1}^{β} 1/(1 + α f(y_i)) ) / ( Σ_{i=1}^{β} α f(y_i)/(1 + α f(y_i)) )    (2)
As f(y_i) becomes larger, the numerator becomes smaller and the denominator becomes larger. Therefore, γ decreases monotonically as f(y_i) increases. Equation (3) shows the value of γ for formula (1) with β = 32:

\gamma = \frac{1 + \frac{20}{\alpha+1} + \frac{9}{2\alpha+1} + \frac{2}{3\alpha+1}}{0 + \frac{20\alpha}{\alpha+1} + \frac{18\alpha}{2\alpha+1} + \frac{6\alpha}{3\alpha+1}}.    (3)
Using this γ, we prove the existence of a threshold T satisfying the following condition:

The formula is
  satisfied by an assignment in the spot        if γ > T,
  unsatisfied by every assignment in the spot   otherwise.    (4)

Consider the value γ of a spot in two cases:

Case 1 (γ_upper): the minimum value of γ when the formula is satisfied by an assignment in the spot.
Case 2 (γ_lower): the maximum value of γ when the formula is unsatisfied by every assignment in the spot.

The threshold T exists if γ_upper ≥ γ_lower.

Case 1: The formula is satisfied by an assignment in the spot. Without loss of generality, let y_1 be the assignment that satisfies the formula, i.e., f(y_1) = 0. The lower bound of γ_upper is calculated as follows:
Shortening the Computational Time of the Fluorescent DNA Computing
\gamma = \frac{1 + \sum_{i=2}^{\beta} \frac{1}{1+\alpha f(y_i)}}{0 + \sum_{i=2}^{\beta} \frac{\alpha f(y_i)}{1+\alpha f(y_i)}}    (5)

\geq \frac{1 + \sum_{i=2}^{\beta} \frac{1}{1+\alpha m}}{\sum_{i=2}^{\beta} \frac{\alpha m}{1+\alpha m}}  (where m is the number of clauses)    (6)

= \frac{1 + (\beta-1)\frac{1}{1+\alpha m}}{(\beta-1)\frac{\alpha m}{1+\alpha m}}    (7)

> \lim_{m \to \infty} \frac{1 + (\beta-1)\frac{1}{1+\alpha m}}{(\beta-1)\frac{\alpha m}{1+\alpha m}}    (8)

= \frac{1}{\beta-1}    (9)

\Longrightarrow \gamma_{upper} > \frac{1}{\beta-1}    (10)
Case 2: The formula is not satisfied by any assignment in the spot. By the condition of Case 2, each assignment fails to satisfy at least one clause; therefore, f(y_i) ≥ 1 for every i. As the value of γ increases monotonically as f(y_i) decreases, γ attains its largest value when f(y_i) = 1 for every i. Therefore,

\gamma_{lower} = \frac{\sum_{i=1}^{\beta} \frac{1}{1+\alpha \cdot 1}}{\sum_{i=1}^{\beta} \frac{\alpha \cdot 1}{1+\alpha \cdot 1}}    (11)

= \frac{\beta/(1+\alpha)}{\beta\alpha/(1+\alpha)} = \frac{1}{\alpha}    (12)

The threshold T exists if γ_upper > γ_lower holds. From the two cases, the inequality holds if 1/(β−1) is larger than 1/α. As α and β are positive constants, T exists if α > β − 1, and T satisfies γ_upper ≥ T > γ_lower. QED
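The threshold argument is also easy to check numerically. The sketch below (ours, not the paper's) compares the worst satisfied spot (one candidate with f = 0, the remaining β − 1 failing all m clauses) against the best unsatisfied spot (every candidate with f = 1), for several α > β − 1:

```python
from fractions import Fraction

def gamma(f_values, alpha):
    # Cy5/Cy3 ratio of a spot, per equation (2)
    num = sum(Fraction(1, 1 + alpha * f) for f in f_values)
    den = sum(Fraction(alpha * f, 1 + alpha * f) for f in f_values)
    return num / den

beta, m = 8, 50
for alpha in (beta, 2 * beta, 10 * beta):              # any alpha > beta - 1
    worst_sat  = gamma([0] + [m] * (beta - 1), alpha)  # one satisfying candidate
    best_unsat = gamma([1] * beta, alpha)              # every candidate fails one clause
    assert best_unsat == Fraction(1, alpha)            # equation (12)
    assert worst_sat > Fraction(1, beta - 1) > best_unsat  # bounds (10) and (12)
```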
5 Algorithm
Using the idea described in Section 4, we show the fluorescent DNA computing for SAT. Let n be the number of variables in the formula, m the number of clauses, δ the number of spots on a microarray, β = 2^n/δ, and α a constant larger than β − 1. Call the DNA strands that encode candidate solutions candidates, and the complementary strands of candidates co-candidates. The algorithm of the fluorescent DNA computing with our idea is as follows.

1. Make δ disjoint groups of β candidates. Then attach each group of candidates to a spot on a microarray.
2. Synthesize all the co-candidates, labeled with a fluorescer.
3. For each clause ω, synthesize the co-candidates that do not satisfy the clause. Then adjust the amount of these DNA strands to α/2^p times the amount synthesized in step 2, where p is the number of literals in ω. Label them with another fluorophore.
4. Competitively hybridize the two groups of co-candidates to the microarray.
5. Measure the intensity of the two fluorescences for each spot.
6. If the proportion γ of the two fluorescences for some spot is larger than the threshold T, the formula is satisfiable. Otherwise, the formula is unsatisfiable.

We use as the threshold T the average of γ_upper and γ_lower, where γ_upper and γ_lower are calculated by equations (7) and (12), respectively. The algorithm can only determine whether the formula is satisfiable. When we need to find an assignment that satisfies the formula, we apply binary search after the above algorithm. Suppose the algorithm determines that a spot with β candidates holds an assignment that satisfies the formula. We divide the β candidates into two groups and call the algorithm again. This reduces the number of candidate solutions to β/2; we repeat it until the number of candidate solutions becomes one.

In the following, we show how the algorithm solves SAT through formula (1). Let the parameters be α = 4, β = 4, and δ = 8. First, we encode the 32 candidate solutions into 40-mer DNA sequences. The first 20-mer half encodes the three variables x5–x3, and the second 20-mer half encodes the two variables x2–x1. We use "DNASequenceGenerator" [5] to determine the 20-mer sequences. Table 3 shows the encoding scheme. For example, the assignment (x5 x4 x3 x2 x1) = (00000) is encoded by AAAGCTCGTCGTTTAAGGAATAGGTGCGTGCATAACTGGG. In the first step, we make a microarray each of whose spots holds four kinds of candidate solutions. Fig. 3 shows the spots and the candidate solutions attached to them. The four assignments in a spot have the same values of (x5 x4 x3).
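In software, the steps above amount to grouping assignments into spots, computing each spot's γ from equation (2), and comparing against T. The sketch below (ours) simulates this for a small hypothetical 3-variable formula — not formula (1) from the paper; the clause encoding and the choices of α and β are illustrative assumptions:

```python
from fractions import Fraction
from itertools import product

# hypothetical formula: (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
clauses = [((1, True), (2, True)), ((1, False), (3, True)), ((2, False), (3, False))]

def f(y):
    # number of clauses the assignment y (dict var -> bool) fails to satisfy
    return sum(not any(y[v] == want for v, want in cl) for cl in clauses)

alpha, beta = 4, 4                      # choose alpha > beta - 1
assignments = [dict(zip((1, 2, 3), bits))
               for bits in product((False, True), repeat=3)]
spots = [assignments[i:i + beta] for i in range(0, len(assignments), beta)]

def gamma(spot):                        # equation (2)
    num = sum(Fraction(1, 1 + alpha * f(y)) for y in spot)
    den = sum(Fraction(alpha * f(y), 1 + alpha * f(y)) for y in spot)
    return num / den

m = len(clauses)
g_upper = (1 + Fraction(beta - 1, 1 + alpha * m)) / Fraction((beta - 1) * alpha * m, 1 + alpha * m)
g_lower = Fraction(1, alpha)            # bounds from equations (7) and (12)
T = (g_upper + g_lower) / 2
satisfiable = any(gamma(spot) > T for spot in spots)
```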
We use Cy5 and Cy3 as the fluorescers in the second and third steps. The amount of DNA strands per clause generated in the third step is α/2^3 = 4/8 = 1/2 times the amount generated in the second step. Before hybridization, we adjust the concentration of the DNA strands to 100 µM. After hybridization in the fourth step, the ideal and experimental values of γ for each spot are as described in Table 4. As γ_upper = 4/11 and γ_lower = 1/4, the threshold T is 27/88 ≈ 0.307. Because the experimental value of γ for spot B is 0.335, which is larger than T, we can determine that spot B holds an assignment that satisfies the formula.

The algorithm takes O(nm + δ) time to determine whether the formula is satisfiable, and O(nm log n + δ) time to find a satisfying assignment. As δ = 2^n/β, the computation time is O(2^n) when β = 1, and when β = O(2^n), the computation time becomes O(nm log n). Therefore, we should enlarge β to reduce the computation time.
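The binary-search refinement can be sketched as follows, with one microarray run abstracted as an oracle telling whether a group of candidates contains a satisfying assignment (i.e., whether its spot's γ exceeds T); the names here are ours:

```python
def find_satisfying(candidates, spot_is_satisfiable):
    """Halve the candidate group and re-run the spot test until one candidate
    remains; assumes the initial group contains a satisfying assignment."""
    group = list(candidates)
    while len(group) > 1:
        half = group[:len(group) // 2]
        group = half if spot_is_satisfiable(half) else group[len(group) // 2:]
    return group[0]
```

With β candidates per spot this performs ⌈lg β⌉ rounds, each one run of steps 2–5 of the algorithm.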
Fig. 3. Four candidate solutions are attached on each spot.

Table 3. Variables and corresponding DNA sequences

(x5 x4 x3)   Sequence
(000)        AAAGCTCGTCGTTTAAGGAA
(001)        GAAGCCTACTGTACTCTGCG
(010)        TATCGTGATTTGGAGGTGGA
(011)        CAGCCACGTAGTAGAGCTAG
(100)        TACCCAATCGAACTGATAAG
(101)        TCGGTCAACGGAGGGGGCTC
(110)        GCGTTTTTGCGAGGCATGTG
(111)        GCCTAAAGAATTGATCGCTT

(x2 x1)      Sequence
(00)         TAGGTGCGTGCATAACTGGG
(01)         CTAAGTGCGGCTGCATGACC
(10)         TGGGGTTTTATCTTACGACC
(11)         TGAGATTTTTAACGCCGTTA

6 Conclusion
We present a method to shorten the computation time of fluorescent DNA computing for satisfiability problems. The method allows one bead or one spot to represent several candidate solutions. We then show an experimental result of the algorithm. As the computation time of fluorescent DNA computing depends on the number of beads or spots, the method can dramatically reduce that time.
References

1. Adleman, L.: Molecular computation of solutions to combinatorial problems. Science 266 (1994) 1021–1024.
2. Bejar, R., Manyà, F.: Solving the round robin problem using propositional logic. In: Proc. 17th National Conference on Artificial Intelligence (AAAI 2000), AAAI Press/MIT Press (2000) 262–266.
Table 4. The ideal and experimental values of γ for each spot.

Spot   ideal γ   experimental γ
A      0.216     0.282
B      0.607     0.335
C      0.250     0.255
D      0.172     0.235
E      0.154     0.124
F      0.250     0.213
G      0.204     0.253
H      0.154     0.234
3. Calude, C.S., Păun, G.: Computing with Cells and Atoms. Taylor & Francis, London (2000).
4. DeRisi, J.L., Iyer, V.R., Brown, P.O.: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278 (1997) 680–686.
5. Feldkamp, U., Saghafi, S., Banzhaf, W., Rauhe, H.: DNASequenceGenerator: a program for the construction of DNA sequences. DNA7, 7th International Meeting on DNA Based Computers, Preliminary Proceedings (2001) 179–188.
6. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman and Company (1991).
7. Kautz, H., Selman, B.: Pushing the envelope: planning, propositional logic, and stochastic search. In: Proc. 13th National Conference on Artificial Intelligence (AAAI'96), AAAI Press/MIT Press 2 (1996) 1194–1201.
8. Liu, Q. et al.: DNA computing on surfaces. Nature 403 (2000) 175–179.
9. Lockhart, D. et al.: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnol. 14 (1996) 1675–1680.
10. Brenner, S. et al.: In vitro cloning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs. PNAS 97 (2000) 1665–1670.
11. Brenner, S. et al.: Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnology 18 (2000) 630–634.
12. Păun, G., Rozenberg, G., Salomaa, A.: DNA Computing: New Computing Paradigms. Springer-Verlag (1998).
13. Rozenberg, G., Salomaa, A.: Handbook of Formal Languages, vols. 1–3. Springer-Verlag (1997).
14. Sakamoto, K., Gouzu, H., Komiya, K., Kiga, D., Yokoyama, S., Yokomori, T., Hagiya, M.: Molecular computation by DNA hairpin formation. Science 288 (2000) 1223–1226.
15. Schena, M., Shalon, D., Davis, R.W., Brown, P.O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270 (1995) 467–470.
16. Smith, L.M. et al.: A surface-based approach to DNA computation. J. Computational Biology 5 (1998) 255–267.
17. Takenaka, Y., Hashimoto, A.: A proposal of DNA computing on beads with application to SAT problems. DNA7, 7th International Meeting on DNA Based Computers, Preliminary Proceedings (2001) 331–339.
How Efficiently Can Room at the Bottom Be Traded Away for Speed at the Top? (Extended Abstract)

Pilar de la Torre

Department of Computer Science, University of New Hampshire, Durham, New Hampshire 03824, USA
[email protected]
Abstract. Given exponential 2^n space, we know that an Adleman-Lipton [1,9] computation can decide many hard problems – such as boolean formula and boolean circuit evaluation – in a number of steps that is linear in the problem size n. We wish to better understand how to design biomolecular algorithms that trade away "weakly exponential" 2^{n/c}, c > 1, space to achieve low running times, and to analyze the efficiency of their space-time utilization relative to that of their best extant classical/biomolecular counterparts. We present deterministic and probabilistic parallel algorithms for the Covering Code Creation and k-SAT problems, based on the biomolecular setting as abstracted by a randomized framework that augments the sticker model of Roweis et al. [13]. We illustrate the power of the randomized framework by analyzing the space-time efficiency of these biomolecular algorithms relative to the best extant classical deterministic/probabilistic algorithms [6,14], which inspired ours. This work points to the proposed randomized sticker model as a logical tool of independent interest.
We consider hard computational problems for which the time required by the best extant classical algorithm is "weakly exponential", that is 2^{n/c}, where c > 1. We wish to better understand how to design biomolecular algorithms that trade away weakly exponential space to achieve as low a running time as possible, and to analyze the efficiency of their space-time utilization relative to that of their best extant classical/biomolecular counterparts. A bit more formally, let a and b denote classical/biomolecular algorithms with space-time product requirements st_a(n) = s_a(n) × t_a(n) and st_b(n) = s_b(n) × t_b(n), respectively. We propose to compare their relative efficiency by comparing their respective space-time product requirements. We will say that algorithm a is space-time (product) efficient relative to algorithm b up to at most an f(n) factor if st_a(n) = O(st_b(n)f(n)). We will say that algorithm a is at least an f(n) factor more space-time (product) efficient than algorithm b if st_a(n)f(n) = O(st_b(n)). Given a problem for which the best currently known classical time bound is at most weakly exponential, we strive

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 95–111, 2003.
© Springer-Verlag Berlin Heidelberg 2003
to come up with biomolecular/classical algorithmic solutions that: (1) are as fast as possible, (2) are space-time product efficient relative to the best extant classical/biomolecular algorithm up to at most a factor f(n) that is bounded by a polynomial in n, and (3) have f(n) as slow-growing a function of n as possible. We propose a natural framework for probabilistic computation which consists of the sticker model of Roweis et al. [13] augmented by two probabilistic primitives: rand-initialize and rand-partition. The intent is for these primitives to abstract the randomness enjoyed by biomolecules and to offer logical access to, and control of, randomness to the algorithm designer. Based on ideas which this writer perceives are part of the DNA computing community folklore, we present thought experiments that describe possible physical implementations for these primitives in the spirit of the sticker model. We illustrate the power of this randomized sticker setting by presenting deterministic and probabilistic parallel algorithms for Covering Code Creation and k-SAT that exploit parallelism not only at the string (biomolecule) level, but also at the set-of-strings (tube) level and at the bits-within-string level. Notable among our contributions are the following: We develop biomolecular implementations of the previously best classical deterministic and probabilistic algorithms for the two computational problems [6,14] and analyze their relative space-time efficiency. While attempting to do that, we discovered a classical algorithm for Covering Code Creation that is, by an exponential factor, more space-time efficient than the previously best algorithm [6].
We present a randomized biomolecular algorithm that runs in constant time and is, by a factor of n, more space-time efficient than the best previously known randomized classical Covering Code Creation algorithm [6], and is an exponential factor more space-time efficient than the best currently known deterministic classical or biomolecular competitor. We analyze relative space-time efficiency among a collection of new and previously known classical/biomolecular algorithms. The results of the analysis are summarized in Figure 1: for the Covering Code Creation problem in TABLE I, and for the k-SAT problem in TABLE II. Our results show that, for each of the problems considered, the best known probabilistic sticker algorithm is, by an exponential factor, more space-time efficient than the best known deterministic sticker or classical competitor. This and prior related work, such as that in [2,13,3,7], point to biomolecular settings for probabilistic algorithm design as logical tools of independent interest that deserve, and remain, to be systematically investigated.
[TABLE I (Covering Code Creation Problem) and TABLE II (k-SAT Problem): the column layout is sec, ref, model, covering code size (TABLE I only), algorithm's space, algorithm's time, algorithm's space×time. TABLE I covers the classical deterministic algorithms of Dantsin et al. [6], the sticker deterministic algorithms of Sections 3.3–3.4 of this paper (♠), the classical randomized algorithm of Dantsin et al. [6] (♣), and this paper's sticker randomized algorithm using rand-initialize (♣); the exponential entries are of the forms 2^{[1−h(ρ)]n}, 2^{[1−h(ρ)+δ]n}, and 2^{[1−h(ρ)+o(1)]n}, with times such as Θ(1) and Θ(n). TABLE II covers the classical deterministic algorithms of Dantsin et al. [6], the sticker and classical deterministic algorithms of Section 4.1 (♠), Schöning's classical randomized algorithm [14] (♣), and this paper's sticker randomized algorithm using rand-initialize and rand-partition (♣); the exponential entries are of the forms (2 − 2/(k+1))^n, (2 − 2/(k+1) + o(1))^n, and (2 − 2/k)^n, with times such as Θ(n), Θ(mn), and Θ(knm).]

Fig. 1. The results of our relative space-time efficiency analysis, among a collection of new and previously known classical/biomolecular algorithms for the Covering Code Creation problem, are summarized in TABLE I and those for the k-SAT problem in TABLE II. Each row of a table corresponds to an algorithm, where the entry in column one indicates the section discussing the algorithm in the article cited in column two. For the table of a given problem, a row annotated with a ♣ (randomized) or ♠ (deterministic) symbol indicates that its corresponding algorithm has been identified as best of its type based on space-time relative efficiency. Each table entry corresponding to an exponential expression has been represented by a simplified expression that is within a polynomial factor in n of the actual entry.
Related Work

The "room at the bottom" in our title is intended to evoke Feynman's seminal article [8], referenced by Adleman in [1]. Our algorithmic ideas build on those of Schöning on probabilistic k-SAT [14], and of Dantsin et al. [6] on covering code creation and deterministic k-SAT. Our formulation of the augmented probabilistic sticker model setting and biomolecular algorithms was inspired, and enlightened, by the DNA-based models of Lipton [9], Adleman [1], Bach et al. [2], and Roweis et al. [13], and by the biomolecular algorithm realizations of Chen and Ramchandran [3] and Diaz et al. [7] for classical probabilistic k-SAT algorithms [12,11,14].
1 Augmented Model
We briefly review the basic concepts behind the standard sticker model and explain how we adapt it to describe our algorithms. Our proposed augmentation to the standard sticker model consists of adding two randomized operations, rand-initialize and rand-partition, whose definition and proposed physical realization will be explored below. These are aimed at providing the algorithm designer with access to, and control of, the native randomness enjoyed by DNA molecules. Each of them abstracts a modeling assumption that this writer, as a newcomer to the field, perceives as part of the folklore within the DNA computing literature. The significance of these operations is illustrated by the role they play in the biomolecular algorithmic solutions to two computational problems: the Minimal Covering Code Creation Problem in Subsection 3.5 and the k-SAT Problem in Subsection 4.2.

1.1 Operations
As in the standard sticker model, the operations of the augmented model are partitioned into blind operations and boundary operations.

Blind Operations. These are the standard combine, separate, set, and clear, to which we add the following rand-partition operation. Let S be a multiset of strings and d a divisor of |S|.
• rand-partition(1:d, S; A, B) creates two multisets A and B such that: (1) A ∪ B = S, and (2) each string instance s in S belongs to A with probability 1/d. For d = 2, rand-partition will also be denoted rand-halving(S; A, B).

Boundary Operations. These are the two standard operations read and initialize, to which we add the following rand-initialize operation. Let K, L, and N be integers satisfying L ≤ K and N ≤ 2^L.
• rand-initialize(K, L, N; M_rand-init) creates the (K, L, N) randomized library multiset M_rand-init = S_{L,N} {0^{K−L}}, where S_{L,N} consists of N strings drawn one at a time independently at random with replacement from {0,1}^L.
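A software reading of the two primitives can make their semantics concrete; this sketch is ours, and only mirrors the multiset behavior, not the biomolecular realization discussed in Section 2:

```python
import random

def rand_partition(d, S):
    """rand-partition(1:d, S; A, B): each string instance of S lands in A
    independently with probability 1/d; B gets the rest, so A and B
    together hold exactly the instances of S."""
    A, B = [], []
    for s in S:
        (A if random.random() < 1.0 / d else B).append(s)
    return A, B

def rand_initialize(K, L, N):
    """rand-initialize(K, L, N): N strings drawn uniformly at random with
    replacement from {0,1}^L, each padded with 0^(K-L)."""
    return [''.join(random.choice('01') for _ in range(L)) + '0' * (K - L)
            for _ in range(N)]
```

`rand_partition(2, S)` is the rand-halving special case.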
For future reference, let us recall that initialize(K, L; M_init) creates the (K, L) library set M_init = {0,1}^L {0^{K−L}}, for K and L satisfying L ≤ K [13]. It should be noted that the rand-initialize operation can be realized as a computation that requires one initialize, O(L) set operations, and O(L) rand-partition operations. Our goal, however, is to obtain constant-time physical realizations for initialize and rand-initialize.

1.2 Computation and Control
Computation. A computation in the augmented stickers setting begins with either an initialize(K, L; M) operation or a rand-initialize(K, L, N; M) operation, followed by a sequence of blind operations. Starting from the "initial set" M, this sequence of operations is to yield a designated "answer set". We consider two possible ways of ending a computation, depending on the overall purpose of the computation. If the purpose is to decide a language, as in the standard sticker model, then the computation ends by applying the read operation to the "answer set". In our augmented setting, however, the purpose of a computation may also be to generate a multiset of strings, in which case the answer is the "answer set" itself.

Control. The selection and execution of the sequence of blind operations is provided by a sequential or a parallel classical algorithm. Through the configuration of sets and modification of their bit strings, the algorithm can exploit parallelism at two levels: tube parallelism at the set-of-strings level, and molecule parallelism at the string level. In particular, our algorithms exploit tube parallelism via vector-parallel versions of the blind operations, denoted combine, separate, set, clear, and rand-partition. They operate on µ-tuples of sets of strings A = A_1, ..., A_µ.
2 Physical Realization
Possible constant-time physical realizations of the standard sticker operations are offered in [13]. Our primary goal here is to come up with a constant-time realization for each of the operations rand-partition and rand-initialize that is compatible with those for the standard sticker operations. As it turns out, operation rand-initialize yields a biomolecular algorithm for the Covering Code Creation Problem, presented in Section 3.5. This algorithm is trading-efficient with respect to the best known classical algorithm for the problem. Specifically, it has as low a running time as possible, its space-time product is polynomially equivalent to that of the best known classical Covering Code Creation algorithm, and it uses a constant number of tubes. Due to space limitations, we outline one physical realization setting below and postpone discussion of alternative settings to the journal version.
2.1 Augmented Physical Setting
As in the original sticker model physical setting, information is represented in terms of memory strands and sticker strands. The augmented physical setting, however, employs an additional type of strand that we term a coin strand, or simply a coin. A memory strand consists of a sequence r_0 r_1 ... r_L r_{L+1} ... r_K of K + 1 regions, each of which is M bases long. The first region, r_0, is a universal region for which there is no sticker strand available. Based on region r_0, and assisted by an appropriately designed enabling probe, the memory complexes within a given mixture can be separated from sticker strands and coin strands. Sticker strands are M-base-long strands r̄_1, ..., r̄_K, where each r̄_i is complementary to region r_i. Coin strands are also M-base-long strands κ_1, κ_2, ..., κ_K. Each κ_i is complementary to sticker strand r̄_i, and hence identical to region r_i of each memory strand.

Randomized Partition. A possible realization of rand-halving(S; A, B) can be based on volume halving: divide the contents of S by pouring one half into A and the rest into B. Analogously, for d > 2, a corresponding volume division yields a realization for rand-partition(1:d, S; A, B).

Randomized Initialization. We now outline a thought experiment that points to how one would attempt to generate the library M_rand-init = S_{L,N} {0^{K−L}}, specified by operation rand-initialize(K, L, N; M_rand-init), in a stickers-and-coins setting; it is meant to be understood in the context of the physical realization of the sticker model operations as presented in [13]. This experiment was inspired in part by Roweis et al. [13] and their original implementation of the library {0,1}^L {0^{K−L}}, which, remarkably enough, taps into the native randomness of the biomolecular setting.

Thought Experiment. The idea behind this logical experiment for generating a (K, L, N) randomized library set can be viewed as being composed of two phases: strand synthesis and bit extraction.
Strand Synthesis Phase: First, we create two tubes A and B. Tube A contains N strands, synthesized identical copies of the memory strand r_0 r_1 r_2 ... r_L r_{L+1} ... r_K. Tube B comprises, for each i from 1 to L, N coin strands κ_i resulting from synthesizing N identical copies of region r_i. Second, we set bits 1, ..., L to 1 (on) in all the memory strands in A. This could be done by first adding excess sticker strands r̄_1, ..., r̄_L, and then removing the excess sticker-strand copies by isolating into tube C the memory complexes based on the universal region r_0.

Bit Extraction Phase: As the initial step, we combine the contents of B and C into a tube R. Within R, the native randomness enjoyed by molecules is first going to be triggered and subsequently extracted as random bits. By raising the temperature of the mixture in R, we trigger the dissociation of all sticker strands. At this point, for each sticker strand in R there are two complementary
regions available in R: one as a coin strand and the other as a sub-strand of a memory strand. In this model, each of them is capable of competing for the sticker-strand attachment with identical probability of success. Next, by letting the solution cool back down, we enable the competition. This causes each sticker to randomly choose to attach itself to either a memory strand or a coin strand. In this way, one random bit gets recorded on each of the bit regions 1, ..., L of each memory strand. Finally, by a separation based on the universal region r_0, the memory complexes are isolated into the mother tube M_rand-init. This concludes the description of our thought experiment. We thus model each recorded bit as a random bit (which comes up heads with probability one-half and tails with probability one-half) independently across memory strands and across bit positions 1, ..., L within a memory strand. Thus, at this point, the contents of M_rand-init encode the set S_{L,N} {0^{K−L}} that we have called the (K, L, N) randomized library multiset.
3 The Covering Code Creation Problem
This section considers two computational problems about covering codes. It presents deterministic classical and sticker-model-based algorithms for the δ(n)-Almost Minimal Covering Creation problem, and a probabilistic sticker-model algorithm for the Minimal Covering Creation problem.

3.1 Background
A code of length n is a subset of the Hamming space H_n = {0,1}^n of all n-bit strings, called points or codewords. The Hamming distance between points x and y is the total number d(x,y) of bit positions in which x and y differ. The sphere or ball of center x, radius r, and normalized radius ρ = r/n is the subset B(x,r) = {y : d(x,y) ≤ r} of H_n. We say that ball B(x,r) covers point y if y belongs to B(x,r). The volume of B(x,r) in H_n is

V(n,r) = |B(x,r)| = \sum_{i=0}^{r} \binom{n}{i}.

A set S covers x if x belongs to S. A family F of subsets of a set X is a set cover of X if every element of X is covered by some set in F. A covering code, or simply a covering, of length n and radius α is a length-n code C such that F = {B(x,α) : x ∈ C} is a set cover of H_n. We say that C has covering radius r if r is the smallest α such that {B(x,α) : x ∈ C} is a set cover of H_n. Let K(n,r) denote the size of the smallest covering code with length n and radius r = ρn. If ρ satisfies 0 < ρ < 1/2, then to within a polynomial factor in n, V(n,r) and K(n,r) can be estimated in terms of 2^{h(ρ)n} and β(n), where h(x) is the binary entropy function (h(x) = −x lg x − (1−x) lg(1−x) for 0 < x < 1) and β(n) = [nρ(1−ρ)]^{1/2}.

Lemma 1 ([4]). If n and r = ρn are integers satisfying 0 < ρ < 1/2, then

2^{[1−h(ρ)]n} ≤ 2^n/V(n,r) ≤ K(n,r) ≤ n2^n/V(n,r) ≤ ν(n),

where ν(n) = nβ(n)2^{[1−h(ρ)]n} differs from K(n,ρn) by at most a polynomial factor.
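These quantities are easy to tabulate. The sketch below (ours) computes V(n,r) and h(ρ) and checks the first inequality of Lemma 1 for a small case:

```python
from math import comb, log2

def V(n, r):
    # volume of a Hamming ball of radius r in {0,1}^n
    return sum(comb(n, i) for i in range(r + 1))

def h(x):
    # binary entropy, defined for 0 < x < 1
    return -x * log2(x) - (1 - x) * log2(1 - x)

n, r = 20, 5
rho = r / n
# 2^{[1-h(rho)]n} <= 2^n / V(n,r): the first inequality of Lemma 1
assert 2 ** ((1 - h(rho)) * n) <= 2 ** n / V(n, r)
```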
These bounds on K(n,ρn) motivate the following definitions.

Definition 1. (1) A covering code C of length n and normalized radius ρ is called minimal if 1 ≤ |C|/2^{[1−h(ρ)]n} ≤ poly(n), that is, its size differs from 2^{[1−h(ρ)]n} by at most a polynomial factor. (2) Let δ(n) be a function such that 0 < δ(n) ≤ d < 1 for some constant d. A covering code C of length n and normalized radius ρ is called δ(n)-almost-minimal if 1 ≤ |C|/2^{[1−h(ρ)]n} ≤ 2^{δ(n)n}, that is, its size differs from 2^{[1−h(ρ)]n} by at most the exponential factor 2^{δ(n)n}. (3) If δ(n) = δ is a positive constant, we say that C is almost minimal. In particular, if δ(n) = c lg n/n for some constant c > 0, then C is minimal.

As observed in [6], the problem of creating a minimal covering for a given length n and radius ρn can be viewed as that of finding a subcollection of F = {B(x,ρn) | x ∈ H_n} with the minimum number of balls that covers X = H_n, which is an instance of the Minimum Set Cover problem.

Definition 2. Problem. Minimum Set Cover
Input: A set X and a family F of subsets of X that covers X.
Output: A minimum-size subfamily of F that covers X.

The well-known classical Greedy_Set_Cover(X, F; U) algorithm finds a set cover whose size is within 1 + ln|X| of optimal in worst-case time O(\sum_{S \in F} |S|) (see, for example, [5]). Let Greedy_Code(n, ρ; C, |C|) denote the classical algorithm that, given n and ρ, computes a covering code C of length n, radius ρn, and cardinality |C| as follows: first it invokes Greedy_Set_Cover(X, F; U) with X = H_n and F = {B(x,ρn) : x ∈ H_n}; subsequently it assigns to C the collection of codewords that are centers of balls in U; and it concludes by determining the number |C| of elements in C. The covering C created by this algorithm is minimal, because |C| is within a factor 1 + ln|H_n| = θ(n) of the minimum and hence, by Lemma 1, differs from 2^{[1−h(ρ)]n} by at most a polynomial factor. The running time of this algorithm is within a polynomial factor of θ(V(n,r)2^{2n}) = θ(2^{[2+h(ρ)]n}), which is superexponential when ρ satisfies 0 < ρ < 1/2.

Lemma 2 ([6]). For each n ≥ 1 and 0 < ρ < 1/2, a minimal covering C of length n and radius ρn can be created by the deterministic classical algorithm Greedy_Code(n, ρ; C) in Ω(2^{[2+h(ρ)]n}) worst-case time.

As we will discuss next, it is possible to achieve subexponential computing times if we allow the covering to be almost minimal. Systematic treatment of codes, coverings, and the Minimum Set Cover problem can be found, for example, in [4,10,5] and references therein.
3.2 Covering Creation Problems
We formulate the optimization covering creation problem as follows.

Definition 3. Problem. Minimal Covering Creation
Input: Integers n and r = ρn satisfying n ≥ 1 and 0 < ρ < 1/2.
Output: A covering C, of length n and radius ρn, that is minimal (i.e., |C| differs from 2^{[1−h(ρ)]n} by at most a polynomial factor poly(n)).

The approximation version of the problem is specified by a parameter δ(n) satisfying 0 < δ(n) ≤ d < 1 for some constant d.

Definition 4. Problem. δ(n)-Almost Minimal Covering Creation
Input: Integers n ≥ 1 and r = ρn, where 0 < ρ < 1/2.
Output: A covering C, of length n and radius ρn, that is δ(n)-almost minimal (i.e., |C| differs from 2^{[1−h(ρ)]n} by at most a factor 2^{δ(n)n}).

3.3 Covering Creation: Classical Deterministic Algorithms
Two main techniques are often used in coding theory to find good codes: randomization, which we will apply in Section 3.4, and divide-and-conquer, which is behind the deterministic algorithms we are about to describe. The deterministic divide-and-conquer algorithms build long codes from short codes by code concatenation (direct sum). The classical covering creation algorithm presented in [6] proceeds in two phases: phase one creates a minimal covering C_λ of short length λ = λ(n) by applying the greedy set cover algorithm; phase two generates the long length-n covering C_n by concatenating n/λ copies of C_λ.

Exploring an Algorithmic Space-Time Trade-Off. In [6], two cases of the above construction are considered: (1) the case λ(n) = c, where c is a positive constant (with respect to n), and (2) the case λ(n) = n/c. For the case λ = n/c, the resulting covering code C_n solves the Minimal Covering Creation Problem, but it requires exponential space. For the case λ = c, C_n solves the δ(n)-Almost Minimal Covering Creation Problem where δ is a constant, but it requires a different c (and hence a different algorithm, where c depends on δ) for each given δ satisfying 0 < δ < 1/2. In the following reformulation of the results in [6] we let the parameter λ = λ(n) remain temporarily undetermined. In this way we uncover a new covering code algorithm that we will consider as case (3) λ(n) = (1/3) lg n. As Corollary 1 below shows, this algorithm compares favorably with the other two in that the size of the covering it computes is exponentially smaller than that produced by the algorithm corresponding to case λ = c, and it requires only polynomial workspace, unlike the exponential requirement of the algorithm corresponding to case λ = n/c.
104
Pilar de la Torre
We now describe a family of algorithms parameterized by λ = λ(n), each of which creates a covering code of length n and radius ρn based on λ. Each builds such a covering Cn in two stages: the first stage computes a good short code Cλ and the second generates the power code Cn = Cλ^{n/λ}.

c-Covering Code(n, ρ; Cn, |Cc|)
  assume: n ≥ 1, 0 < ρ < 1/2, c | n, and an available copy of Cc computed offline by Greedy Code(c, ρ; Cc, |Cc|).
  1. Compute Cn = Cc^{n/c}.

Flex Code(n, ρ, λ(n); Cn, |Cλ(n)|)
  assume: n ≥ 1, 0 < ρ < 1/2, λ(n) | n.
  1. Greedy Code(λ(n), ρ; Cλ(n), |Cλ(n)|)
  2. Compute Cn = Cλ(n)^{n/λ(n)}.
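The two-stage construction can be sketched in Python. This is an illustrative toy, not the paper's algorithm: the greedy step here covers Hamming balls directly rather than going through the general set cover machinery, and all function names (`greedy_covering`, `power_code`) are mine.

```python
import itertools

def hdist(u, v):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(u, v))

def greedy_covering(lam, r):
    """Greedy set cover specialized to Hamming balls: repeatedly pick the
    word whose radius-r ball covers the most still-uncovered points of
    {0,1}^lam. Only practical for small lam, as in phase one."""
    space = list(itertools.product((0, 1), repeat=lam))
    uncovered = set(space)
    code = []
    while uncovered:
        best = max(space, key=lambda w: sum(hdist(w, u) <= r for u in uncovered))
        code.append(best)
        uncovered = {u for u in uncovered if hdist(best, u) > r}
    return code

def power_code(short_code, blocks):
    """Direct sum: concatenate `blocks` codewords of the short code in all
    possible ways, giving C_n = C_lam^{n/lam}."""
    return [sum(parts, ()) for parts in itertools.product(short_code, repeat=blocks)]

c4 = greedy_covering(4, 1)   # short covering: length 4, radius 1
c8 = power_code(c4, 2)       # long covering: length 8, radius 2
```

Concatenating radius-ρλ pieces yields radius ρn overall, since the per-block distances add.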
Lemma 3. Let 0 < ρ < 1/2, λ = λ(n, ρ) a positive integer, and n a multiple of λ. Let Cλ be the covering of length λ, radius ρλ, and size at most λ^{3/2} [ρ(1 − ρ)]^{1/2} 2^{[1−h(ρ)]λ} computed by algorithm Greedy Code in time O(2^{3λ}) and space O(2^λ). Then:
1. Cλ^{n/λ} is a code of length n, radius ρn, and size at most 2^{[1−h(ρ)]n} 2^{(3 lg λ / 2λ) n}.
2. It can be computed by c-Covering Code if λ = c, and by Flex Code otherwise.
3. The workspace requirement for this classical computation is O((n/λ(n)) 2^{3λ(n)}), which is polynomial for λ(n) = O(lg n) and exponential otherwise.
4. The time requirement is O(2^{3λ} + 2^{[1−h(ρ)]n} 2^{(3 lg λ / 2λ) n}).
The following corollary gives us bounds for the size of the covering computed by the algorithms under consideration and their time and space requirements. The algorithms corresponding to parameter values λ(n) = c and λ(n) = n/c, where c is a constant, were considered in [6].

Corollary 1.
1. Given an integer n multiple of 6, a covering code of length n, radius ρn, and size at most poly(n) 2^{[1−h(ρ)]n} can be computed by Flex Code with λ(n) = n/6. This algorithm requires time poly(n) 2^{[1−h(ρ)]n}, and exponential workspace Ω(2^{n/2}).
2. Given δ > 0 there is an integer constant c = c(δ, ρ) such that for each multiple n of c a covering code of length n, radius ρn, and size at most poly(n) 2^{[1−h(ρ)]n} 2^{δn} can be computed by the classical algorithm c-Covering Code. This algorithm requires poly(n) 2^{[1−h(ρ)]n} 2^{δn} time, and polynomial O(n) workspace.
3. Given an n multiple of the integer (1/3) lg n, a covering code of length n, radius ρn, and size at most 2^{[1−h(ρ)+o(lg lg n / lg n)]n} is computed by the classical algorithm Flex Code with λ(n) = (1/3) lg n. It requires 2^{[1−h(ρ)+o(lg lg n / lg n)]n} time, and polynomial Θ(n) workspace.
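The bound 2^{[1−h(ρ)]n} that recurs in these statements is easy to evaluate numerically; the following sketch (function names are mine, not from the paper) computes the binary entropy h and the resulting sphere-covering size estimate.

```python
import math

def h(p):
    """Binary entropy h(p) = -p lg p - (1-p) lg(1-p), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def covering_size_bound(n, rho):
    """Sphere-covering estimate 2^{[1 - h(rho)] n} for the size of a minimal
    covering code of length n and radius rho * n."""
    return 2.0 ** ((1.0 - h(rho)) * n)

print(h(0.5))                       # 1.0: entropy is maximal at p = 1/2
print(covering_size_bound(20, 0.25))
```

For n = 20 and ρ = 1/4 the estimate is 2^{(1−h(0.25))·20} ≈ 2^{3.77}, i.e., fewer than 14 codewords suffice up to polynomial factors.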
How Efficiently Can Room at the Bottom Be Traded Away

3.4 δ(n)-Almost Minimal Covering Creation
This section gives two sticker model algorithms for the Covering Code problem.

Power Code(λ, Cλ, n, L′; T1)
  1. Let T1 be a sticker memory library (K, L) with K = L + n and L = (n/λ) lg |Cλ| + L′. That is, T1 = {0, 1}^{(n/λ) lg |Cλ| + L′} {0}^n.
  2. For t = 1 to n/λ {
     2.1 Separate T1 into tubes T1, . . . , T|Cλ|, based on bits (t − 1) lg |Cλ| + 1, . . . , t lg |Cλ|:
         For l ← 1 to lg |Cλ| {
           For all σ, 1 ≤ σ ≤ 2^l, in parallel { separate(Tσ, (t − 1) lg |Cλ| + l; T2σ−1, T2σ) } }
     2.2 For each w ∈ Cλ = {w1, . . . , w|Cλ|} and each strand s in Tj, copy wj into s's substrand s_{L+(t−1)λ+1} · · · s_{L+tλ}:
         For all j, 1 ≤ j ≤ |Cλ|, in parallel {
           For l ← 1 to λ { If the lth bit of wj equals 1, then set(Tj, L + (t − 1)λ + l) } }
     2.3 Parallel-Merge(T1, . . . , T|Cλ|; T1) }

c-Code-Sticker(n, ρ, L′; T1, |Cc|)
  assume: n ≥ 1, 0 < ρ < 1/2, c | n, a copy of Cc computed offline by Greedy Code(c, ρ; Cc, |Cc|).
  1. Power Code(c, Cc, n, L′; T1).
Flex Code-Sticker(n, ρ, λ(n), L′; T1, |Cλ(n)|)
  assume: n ≥ 1, 0 < ρ < 1/2, λ(n) | n.
  1. Greedy Code(λ(n), ρ; Cλ(n), |Cλ(n)|)
  2. Power Code(λ(n), Cλ(n), n, L′; T1)
They are obtained from the classical algorithms above by replacing their step 2 with the preceding sticker model implementation of the computation of Cn = Cλ^{n/λ} from Cλ. Given a subset L of {0, 1}^K and an integer µ, multi-suffixµ(L) (respectively, multi-prefixµ(L)) will denote the multiset of strings of length µ, each of which is the length-µ suffix (prefix) of one of the strings from multiset L. Given an integer n > 1 and a covering code Cλ of length λ = λ(n) as input, the procedure Power Code generates as output a set T1 of sticker model memory complexes. These complexes represent a binary language L of length K = (n/λ) lg |Cλ| + L′ + n with the following properties: (1) multi-suffixn(L) is the multiset of strings representing elements of Cn, each of which appears with multiplicity 2^{L′}, and (2) multi-prefix_{K−n}(L) = {0, 1}^{(n/λ) lg |Cλ| + L′}.

Theorem 1. Given δ > 0 and ρ satisfying 0 < ρ < 1/2, there is a constant c = c(δ, ρ) such that for each multiple n of c the invocation c-Code-Sticker(n, ρ, L′; T1, |Cc|) computes a covering code of length n, radius ρn, and size at most 2^{[1−h(ρ)]n} 2^{δn}. The algorithm has the following requirements.
1. The time required by the classical preprocessing is O(1), and that required by the biomolecular phase is Θ(n) operations from {initialize, set, clear, combine, separate}.
2. The classical phase requires O(1) space. The biomolecular phase requires O(n) tubes of strands of length O(n + L′) and O(2^{[1−h(ρ)+δ]n} poly(n)) space.
3. The space-time product required is asymptotically equal to that of the classical algorithm of Dantsin et al. in [6], which the current biomolecular algorithm implements.
Theorem 2. Given a number ρ such that 0 < ρ < 1/2, for each multiple n of (1/3) lg n, the invocation Flex Code-Sticker(n, ρ, λ(n), L′; T1, |Cλ(n)|) computes a covering code of length n, radius ρn, and size at most 2^{[1−h(ρ)+o(lg lg n / lg n)]n}. The algorithm has the following requirements.
1. The time required is O(n) for classical preprocessing and O(n) biomolecular operations from {initialize, set, clear, combine, separate}.
2. The space required by the preprocessing phase is Θ(n). The biomolecular phase requires 2^{[1−h(ρ)+o(lg lg n / lg n)]n} space, and O(n) tubes of strands of length O(n + L′).
3. The space-time product requirement is asymptotically equal to that of the best known polynomial space classical algorithm for the problem (which is our algorithm in Part 3 of Corollary 1), which the current biomolecular algorithm implements. The algorithm is an exponential number of times more space-time efficient than the best previously known algorithms [6].

3.5 Minimal Covering Creation Problem
We now describe a beautiful classical probabilistic algorithm from [6] that, with high probability, solves the Minimal Covering Creation problem. We present a rand-initialize based probabilistic sticker model algorithm that implements it, is space-time efficient with respect to it up to constant factors, and is Θ(n) times faster. The size of a covering created by these algorithms is, in fact, within a factor n of minimum. The probabilistic algorithm of Dantsin et al. [6] is implied by the following fact, which is the heart of their probabilistic proof for the upper bound on K(n, ρ) stated in Lemma 1.

Lemma 4 ([6]). Let S be a set of n 2^n / V(n, r) elements that have been chosen uniformly at random with replacement from Hn, where n > 1. Then S is a covering code, of length n and radius at most r, with probability at least 1 − 2^n/e^n.

As observed in [6], given n ≥ 1 and ρ satisfying 0 < ρ < 1/2, Lemma 4 yields a classical algorithm which, with high probability, computes a minimal covering code of length n, radius ρn, and size n 2^n / V(n, r). By Lemma 1, this size is bounded by, and polynomially equivalent to, ν(n) = n β(n) 2^{[1−h(ρ)]n}.

Rand Code(n, ρ; Crand) {
  Set Crand = ∅ and repeat the following step ν(n) times:
  STEP: For i from 1 to n { Set bi to a random bit }; Add b1 · · · bn to Crand
}
We observe that the covering code Crand created by Rand Code is none other than a multiset Sn,ν(n) of ν(n) strings drawn uniformly and independently at random with replacement from Hn . We therefore conclude that the following randomized algorithm, based on the sticker model augmented with operation rand-initialize, implements Rand Code. Rand Code-Sticker(n, ρ; Trand ){ Invoke rand-initialize(n, n, ν(n); Trand ) }
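Lemma 4's construction is direct to simulate classically; the sketch below (the name `rand_code` is mine) draws the prescribed number of random words.

```python
import math
import random

def rand_code(n, rho, seed=0):
    """Draw nu = ceil(n * 2^n / V(n, r)) words of {0,1}^n uniformly at
    random with replacement; by Lemma 4 this is a radius-r covering with
    probability at least 1 - 2^n / e^n."""
    rng = random.Random(seed)
    r = int(rho * n)
    V = sum(math.comb(n, i) for i in range(r + 1))   # ball volume V(n, r)
    nu = math.ceil(n * 2 ** n / V)
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(nu)]

code = rand_code(10, 0.3)
print(len(code))   # ceil(10 * 1024 / 176) = 59 words
```

For n = 10 and ρ = 0.3 we have r = 3 and V(10, 3) = 1 + 10 + 45 + 120 = 176, so the pool has 59 words; the biomolecular version creates all of them in a single rand-initialize step.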
Our conclusions are summarized below. Let us recall that the functions h(·) and β(·) were defined in Section 3.1.

Theorem 3. For every n ≥ 1 and ρ satisfying 0 < ρ < 1/2, an invocation of the probabilistic sticker model algorithm Rand Code-Sticker(n, ρ; Trand), based on the primitive rand-initialize, creates a minimal covering code Trand of length n, radius ρn, and size at most ν(n) = n β(n) 2^{[1−h(ρ)]n}.
1. The time required is one rand-initialize operation, which takes O(1) physical biomolecular steps.
2. The space required is n β(n) 2^{[1−h(ρ)]n}, using Θ(1) tubes of strands of length Θ(n).
3. Its space-time product is Θ(n) times smaller than that of the best known randomized classical algorithm for the Minimal Covering Creation problem, which the current biomolecular algorithm implements, and its running time is as low as possible. It is Θ(n) times more space-time efficient than the best known classical algorithm.
4 The k-SAT Problem
This section summarizes our results on deterministic and randomized sticker model based algorithms for the k-SAT problem. Due to space limitations we only outline the algorithms and summarize the results of the analysis, and leave their elaboration for the journal version of this extended abstract.

Definition 5. Problem: k-SAT
Input: A k-CNF Boolean formula F = F(X1, . . . , Xn) = γ1 ∧ · · · ∧ γm of n variables, where each of the m clauses γi = l_{i,1} ∨ · · · ∨ l_{i,k} has exactly k ≥ 3 literals and each literal l_{i,j} is either one of the variables Xν or its negation ¬Xν.
Decision Output: Either "satisfiable", if F(x1, . . . , xn) = 1 for some truth assignment x = (x1, . . . , xn) in Hn = {0, 1}^n, or "unsatisfiable".
Search Output: Either an x = (x1, . . . , xn) in Hn such that F(x1, . . . , xn) = 1, if one exists, or "unsatisfiable".

Representing F as an m-tuple γ1, . . . , γm specifies a fixed order among the clauses. All of the following k-SAT algorithms can be viewed as having the following two-phase structure, which offers a remarkable degree of parallelism.

k-SAT Multiroot-Local-Search(n, m, F = γ1, . . . , γm)
  1. Roots-Generation: Generate N initial assignments in a set I ⊆ Hn = {0, 1}^n.
  2. Local-Search: For all x in I, invoke a Local Search(n, m, F = γ1, . . . , γm, x) of r steps.
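The two-phase template can be made concrete with a Schöning-style random walk as the local search. The clause encoding (tuples of signed 1-based variable indices) and all helper names below are mine, and the sketch is purely classical, not the sticker model implementation.

```python
import random

def satisfies(formula, x):
    """A clause is a tuple of signed 1-based variable indices; +v means X_v,
    -v means its negation. x is a 0/1 assignment tuple."""
    return all(any((x[abs(l) - 1] == 1) == (l > 0) for l in c) for c in formula)

def local_search(formula, x, steps):
    """Random-walk local search: while some clause is unsatisfied, flip the
    variable of a uniformly random literal of an unsatisfied clause."""
    x = list(x)
    for _ in range(steps):
        if satisfies(formula, x):
            return tuple(x)
        unsat = [c for c in formula
                 if not any((x[abs(l) - 1] == 1) == (l > 0) for l in c)]
        lit = random.choice(random.choice(unsat))
        x[abs(lit) - 1] ^= 1
    return tuple(x) if satisfies(formula, x) else None

def multiroot_local_search(formula, n, num_roots, steps, seed=0):
    """Phase 1: generate roots (here by random sampling, as in Schoening's
    algorithm). Phase 2: run the local search from every root."""
    random.seed(seed)
    for _ in range(num_roots):
        root = [random.randint(0, 1) for _ in range(n)]
        res = local_search(formula, root, steps)
        if res is not None:
            return res
    return None

F = [(1, 2, 3), (-1, 2, 3), (1, -2, 3), (1, 2, -3)]
print(multiroot_local_search(F, 3, num_roots=20, steps=9))
```

In the derandomized variants of [6], the random roots are replaced by the codewords of a covering code and the walk is replaced by an exhaustive search of depth n/(k + 1) inside each Hamming ball.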
In Schöning's k-SAT algorithm [14], Phase 1 is performed by probabilistic sampling and Phase 2 by a random-walk based local search involving r = 3n steps, where n is the number of variables in the input instance. For each of the two k-SAT algorithms in Dantsin et al. [6], Phase 1 is performed by one of their two deterministic covering code creation algorithms (which correspond to cases λ(n) = n/c and λ(n) = c in Part 1 and Part 2 of Corollary 1), and Phase 2 by a derandomized version of Schöning's local-search algorithm, for the m-clause k-SAT instance F, consisting of r = n/(k + 1) steps.

4.1 Deterministic Sticker Model Based k-SAT Algorithms
This section summarizes our results on two sticker model based k-SAT algorithms, each of which is expressed, through the parameter λ(n), by the parametrized algorithm k-SAT-SKINNY Sticker(n, λ(n), m, F = γ1, . . . , γm). For λ(n) = c, our biomolecular algorithm implements the classical deterministic k-SAT algorithm of Dantsin et al. [6], which uses algorithm c-Covering Code to perform Phase 1. For λ(n) = (1/3) lg n, our biomolecular algorithm implements a new classical k-SAT algorithm that results from the algorithm of Dantsin et al. by using Flex Covering Code(n, ρ, λ(n) = (1/3) lg n; Cn) to perform Phase 1.

Algorithm: k-SAT-SKINNY Sticker(n, λ(n), m, F = γ1, . . . , γm)
  Set L′ = (n/(k + 1)) (1 + ⌈lg k⌉).
Phase 1. Roots-Generation:
  If λ(n) = c then c-Code-Sticker(n, ρ, L′; MCn, |Cλ(n)|).
  If λ(n) = (1/3) lg n then Flex Code-Sticker(n, ρ, λ(n), L′; MCn, |Cλ(n)|).
  Set I′ = 1 + (n/λ(n)) lg |Cλ(n)|.

Phase 2. Local-Search:
  Local Search SKINNY-Sticker(MCn, I′, L′, m, γ1, . . . , γm) {
    V ← MCn;
    For level ← 1 to r do {
      Split by Unsat Clause(I′, L′, level, V, m, γ1, . . . , γm; U1, . . . , Um, Sanswer)
      Read(Sanswer): If Sanswer ≠ ∅, then { Report a truth assignment from Sanswer and Halt }
      else {
        Flip Assignments(I′, L′, level, U1, . . . , Um, m, γ1, . . . , γm; V1, . . . , Vm)
        Parallel-Merge(V1, . . . , Vm; V) } }
    Report "unsatisfiable". }

  Split by Unsat Clause(I′, L′, level, V, m, γ1, . . . , γm; U1, . . . , Um, R) {
    L ← I′ + L′
    For µ ← 1 to m do {  comment: Separate into Uµ the assignments failing to satisfy γµ.
      For κ ← 1 to k do {
        Let Xµκ be the variable in γµ's κth literal lκ.
        If lκ = Xµκ then separate(R, L + µκ; R, Runsat) else separate(R, L + µκ; Runsat, R)
        combine(Runsat, Uµ; Uµ) } } }
Flip Assignments(I′, L′, level, U1, . . . , Um, m, γ1, . . . , γm; U′1, . . . , U′m) {
  L ← I′ + L′;
  For all µ, 1 ≤ µ ≤ m, in parallel do {  comment: Flip bit values in Uµ.
    Partition by Flipping Variable(I′, level, Uµ, γµ; Vµ,1, . . . , Vµ,k);
    For κ ← 1 to k in parallel do {
      Let Xν be the variable corresponding to γµ's κth literal
      separate(Vµ,κ, L + ν; Aµ,κ, Bµ,κ);
      clear(Aµ,κ, L + ν); set(Bµ,κ, L + ν) }
    Parallel-Merge(Vµ,1, . . . , Vµ,k; U′µ) } }

Partition by Flipping Variable(I′, level, Uµ, γµ; Uµ,1, . . . , Uµ,k)
  For i ← 1 to 1 + ⌈lg k⌉ do {
    Set νi = i + (level − 1)(1 + ⌈lg k⌉) + I′;
    Separate(Uµ,1, . . . , Uµ,2^i, ν1, . . . , ν2^i; Uµ,1, . . . , Uµ,2^i, . . . , Uµ,1+(k/2^i), . . . , Uµ,2^i+(k/2^i)) }
Theorem 4. Given ε > 0 and k ≥ 3 there is an integer constant c′ = c′(ε, k) such that, for each multiple n of c′, a k-SAT-SKINNY Sticker(n, λ(n), m, F) invocation, with λ(n) = c′, solves k-SAT instances F of n variables and m clauses. The algorithm has the following requirements.
1. The required time is Θ(n) classical preprocessing and a sequence of O(knm) biomolecular operations from {initialize, set, clear, combine, separate}.
2. The space required is Θ(n) for the classical preprocessing. The biomolecular phase requires [2 − 2/(k + 1) + ε]^n poly(n) space and O(km) tubes of strands of length O(n lg k).
3. The space-time product requirement is within a polynomial factor O(n) of that of the best previously known polynomial space classical algorithm (which is the algorithm of Dantsin et al. [6]) that the current biomolecular algorithm implements.

Theorem 5. Let k ≥ 3, λ(n) = (1/3) lg n, and n be a multiple of λ(n). Invoking algorithm k-SAT-SKINNY Sticker(n, λ(n), m, F) solves k-SAT instances F of n variables and m clauses. The algorithm has the following requirements.
1. The time required is O(n) classical preprocessing and a sequence of O(knm) operations from {initialize, set, clear, combine, separate}.
2. The space required is Θ(n) preprocessing workspace. The biomolecular phase requires O(km) tubes of strands of length O(n lg k), and space within a polynomial factor of [2 − 2/(k + 1) + o((lg lg n)/ lg n)]^n.
3. Its space-time product requirement is smaller, by at least the exponential factor [1 + ε(k + 1)/(k − 1)]^n, than that of the best previously known polynomial space deterministic classical algorithm for k-SAT, which is presented in Sec. 4 of [6]; this makes k-SAT-SKINNY Sticker exponentially more space-efficient than that algorithm.

4.2 Probabilistic Sticker Model Based k-SAT Algorithm
This section summarizes our results on the probabilistic sticker model based implementation of Schöning's k-SAT algorithm [14]. This classical algorithm has probabilistic elements in both of its phases. The creation of the pool of roots for the multiroot local-search is implemented by a single invocation of rand-initialize. The randomized k-way choice of a literal within an unsatisfied clause, required by the local-search update of the assignments, is implemented using the rand-partition primitive.
Algorithm: k-SAT-Rand-Sticker(n, m, F = γ1, . . . , γm, α)

Phase 1. Roots-Generation:
  Set K = n, L = n, and N = α pSch(n) (2 − 2/k)^n.
  Invoke rand-initialize(n, K, L, N; M).

Phase 2. All-Sources Local-Search:
  All-Sources Local-Search Randomized-Sticker(M, n, m, γ1, . . . , γm) {
    Repeat 3n times {
      Separate by First Unsat Clause(M, m, γ1, . . . , γm; M1, . . . , Mm, Sanswer)
      Read(Sanswer): if Sanswer ≠ ∅, then report an element from Sanswer, and Halt.
      else {
        Flip Unsat Assignments(M1, . . . , Mm, m, γ1, . . . , γm; M′1, . . . , M′m)
        Parallel-Merge(M′1, . . . , M′m; M) } }
    Report: "probably unsatisfiable" }

  Separate by First Unsat Clause(V, m, γ1, . . . , γm; M1, . . . , Mm, R) {
    For µ ← 1 to m do {  comment: Separate from R into Mµ all truth assignments not satisfying γµ.
      For κ ← 1 to k do {
        Let Xµκ be the variable in γµ's κth literal lκ.
        If lκ = Xµκ then separate(R, µκ; R, Runsat) else separate(R, µκ; Runsat, R)
        combine(Runsat, Mµ; Mµ) } } }

  Flip Unsat Assignments(M1, . . . , Mm, m, γ1, . . . , γm; M′1, . . . , M′m) {
    For all µ, 1 ≤ µ ≤ m, in parallel do {  comment: Flip bit values guided by γµ.
      For κ ← 1 to k do { rand-partition(1 : k − κ, Mµ; Mµ,κ, Mµ) }
      For κ ← 1 to k in parallel do {
        Let Xµκ be the variable in γµ's κth literal lκ.
        separate(Mµ,κ, µκ; Aµ,κ, Bµ,κ)
        clear(Aµ,κ, µκ); set(Bµ,κ, µκ) }
      Parallel-Merge(Mµ,1, . . . , Mµ,k; M′µ) } }
Theorem 6. An invocation of k-SAT-Rand Sticker(n, m, F, α) succeeds in finding a satisfying assignment for the k-SAT instance F of n variables and m clauses, if one exists, with high probability (1 − e^{−α}). The requirements of this algorithm are as follows.
1. The time required is O(n) rand-partition operations and Θ(knm) operations from {set, clear, combine, separate}.
2. The space required is N = α pSch(n) (2 − 2/k)^n, using Θ(kmn) tubes of strands of length O(n).
3. The space-time product requirement is within a polynomial O(km) factor of that of the best currently known classical randomized k-SAT algorithm [14], which the current algorithm implements. Hence k-SAT-Rand Sticker is space-time efficient relative to the currently best known classical counterpart up to an O(km) factor.

Acknowledgments. I am grateful to Anne Condon for many fruitful discussions. Thanks are due to Mark Bochert and an anonymous referee for suggestions that improved the quality of the presentation.
References
1. L. M. Adleman. Molecular computation of solutions to combinatorial problems. Science, 266:1021–1024, November 11, 1994.
2. E. Bach, A. Condon, E. Glaser, and C. Tanguay. DNA models and algorithms for NP-complete problems, 1996.
3. K. Chen and V. Ramachandran. A space-efficient randomized DNA algorithm for k-SAT. In DNA Computing, 6th International Workshop on DNA-Based Computers, DNA 2000, Leiden, The Netherlands, June 2000, volume 2054 of LNCS. Springer-Verlag, Berlin, Heidelberg, New York, 2000.
4. G. Cohen, I. Honkala, S. Litsyn, and A. Lobstein. Covering Codes. North-Holland, 1997.
5. T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. McGraw-Hill, second edition, 2001.
6. E. Dantsin, A. Goerdt, E. A. Hirsch, R. Kannan, J. Kleinberg, C. Papadimitriou, P. Raghavan, and U. Schöning. A deterministic (2 − 2/(k + 1))^n algorithm for k-SAT based on local search. To appear in Theoretical Computer Science.
7. S. Diaz, J. L. Esteban, and M. Ogihara. A DNA-based random walk method for solving k-SAT. In DNA Computing, 6th International Workshop on DNA-Based Computers, DNA 2000, Leiden, The Netherlands, June 2000, volume 2054 of LNCS. Springer-Verlag, Berlin, Heidelberg, New York, 2000.
8. R. P. Feynman. There's plenty of room at the bottom. In Miniaturization, D. H. Gilbert, editor, Reinhold, New York, pages 282–296, 1961.
9. R. J. Lipton. DNA solution of hard computational problems. Science, 268:542–545, 1995.
10. F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. North-Holland, Amsterdam, 1977.
11. R. Paturi, P. Pudlak, M. E. Saks, and F. Zane. An improved exponential-time algorithm for k-SAT. In IEEE Symposium on Foundations of Computer Science, pages 628–637, 1998.
12. R. Paturi, P. Pudlak, and F. Zane. Satisfiability coding lemma. Chicago Journal of Theoretical Computer Science, 1999.
13. S. Roweis, E. Winfree, R. Burgoyne, N. V. Chelyapov, M. F. Goodman, P. W. K. Rothemund, and L. M. Adleman. A sticker-based model for DNA computation. Journal of Computational Biology, 5(4):615–629, 1998.
14. U. Schöning. A probabilistic algorithm for k-SAT and constraint satisfaction problems. In Proceedings of the 40th Symposium on Foundations of Computer Science, pages 410–414, 1999. IEEE Computer Society Press, Los Alamitos, CA.
Hierarchical DNA Memory Based on Nested PCR

Satoshi Kashiwamura¹, Masahito Yamamoto¹, Atsushi Kameda², Toshikazu Shiba³, and Azuma Ohuchi³

¹ Graduate School of Engineering, Hokkaido University, North 13, West 8, Kita-ku, Sapporo 060-8628, Japan
  {kashiwa, masahito}@dna-comp.org
  http://ses3.complex.eng.hokudai.ac.jp/index.html
² Japan Science and Technology Cooperation (JST), Honmachi 4-1-8, Kawaguchi 332-0012, Japan
  [email protected]
³ CREST, Japan Science and Technology Cooperation (JST) and Graduate School of Engineering, Hokkaido University, North 13, West 8, Kita-ku, Sapporo 060-8628, Japan
  {shiba, ohuchi}@dna-comp.org
Abstract. This paper presents a hierarchical DNA memory based on nested PCR. Each DNA strand in the memory consists of address blocks and a data block. In order to access specific data, we specify the order of the address primers, and nested PCR is performed using these primers. Our laboratory experiments are also presented to demonstrate the feasibility of the proposed memory.
1 Introduction
DNA is an excellent memory unit because it has the capability to store vast amounts of data in a very small space. Focusing on this aspect of DNA, our study aims to propose and construct a DNA memory with high capacity at an extremely minute scale. We present an implementation of a hierarchical DNA memory based on nested PCR, which is called Nested Primer Molecular Memory (NPMM). In an NPMM, a data block sequence is concatenated with address block sequences. Address blocks consist of primer-annealing sites for the addressing of data. In order to access specific data, we specify the order of address primers, and nested PCR is performed using these primers. In the NPMM system, a high level of reaction specificity and a very secure memory system can be achieved. In this paper, we focus on the addressing of the data but do not discuss the encoding of the data into DNA sequences or the decoding of the sequences. The structure and design strategy of NPMM are described, and our laboratory experiments are presented in order to demonstrate the feasibility of NPMM.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 112–123, 2003.
© Springer-Verlag Berlin Heidelberg 2003
2 Related Works
The most famous work on molecular memory may be that of Head et al., who proposed a memory with an aquatic feature. In their approach, any problem can be solved by handling the aqueous memory appropriately; this is called aqueous computing [1][2]. Their idea overcomes the overhead of the sequence design problem, which is a serious issue in conventional DNA computation. On this point, their memory is an excellent breakthrough, but it does not focus on achieving the high memory capacity and compactness of DNA. In fact, it may be difficult to enlarge their memory system experimentally. On the other hand, Baum was the first to propose constructing a memory (in his word, a database) by focusing on the high memory capacity and compactness of DNA [3]. In his model, a massively parallel associative search in a vast database made of DNA can be realized by affinity separation methods, but he did not show the feasibility of his idea experimentally. Reif et al. refined the previous memory idea by using various information-theoretic coding techniques [4], and a large DNA database was constructed experimentally [5]. They constructed a DNA database as a random pool built by the random assembly of short subsequences; we have doubts about the effectiveness of such a random pool. Here, we propose a new DNA memory system based on PCR. Our system has advantages that were not present in the previous models, as described in the next section. In this paper, we do not discuss associative search in our model, but it will become possible with a few additional devices.
3 Nested Primer Molecular Memory
In the case of conventional memory, data is encoded into a {0,1} bit sequence and then stored in some medium. In the case of Nested Primer Molecular Memory (NPMM), data is encoded into an {A,T,G,C} base sequence and then stored in a DNA strand. NPMM is an aqueous solution in which DNA (data) molecules are mixed, and each DNA sequence in NPMM contains regions expressing the stored data and the data address. If we specify the address of target data, we can retrieve the target data from NPMM.

3.1 Data Structure of NPMM
NPMM consists of some DNA sequences in which data are stored. Each DNA sequence consists of three types of blocks: data block, address blocks, and Re block. – Data block is the site where data are encoded into a base {A,T,G,C} sequence.
– Address blocks are the sites where addresses are specified; they consist of several blocks at the 5' end of the data block.
– Re block is the site to which the reverse primer hybridizes; it is a sequence common to all DNA sequences in NPMM.

In this paper, the number of address blocks is fixed to three (named A block, B block, C block) and the number of sequences in each address block is three (labeled 0, 1, 2). The oligo (working as a forward primer) that hybridizes to the sequence labeled i (i ∈ {0,1,2}) in the X block (X ∈ {A,B,C}) is denoted by Xi. We define the reverse primer as Re, the data sequence following the Ai, Bj, Ck annealing sites as dataijk, and the overall DNA sequence as templateijk (Fig. 1).
Fig. 1. Data structure of templates.
We define P and T as P = {Ai, Bj, Ck, Re | i, j, k ∈ {0, 1, 2}} and T = {templateijk | i, j, k ∈ {0, 1, 2}}.

3.2 Addressing the Data
The addressing procedure is as follows: select Re ∈ P; then, once per address block, select a forward primer p ∈ P and perform PCR using p and Re. By following this procedure, we obtain a single sequence coding the target data. The set of forward primers used, together with their order, serves as the data address in NPMM, and we denote the address by [p1, p2, . . .], where pi is the forward primer used in the ith PCR. To deepen our understanding, we show an instance of extracting data101 from NPMM in Fig. 2. The address of data101 is [A1, B0, C1]. For the first PCR, A1 and Re are used in NPMM. After this PCR, the diluted solution consists of the nine DNA sequences containing an A1 annealing site, because only these nine are amplified. For the second PCR, B0 and Re are used in the diluted solution after the first PCR; only the three DNA sequences containing a B0 annealing site are amplified out of the nine. Finally, C1 and Re are used in the diluted solution after the second PCR. As a result, only the sequence coding data101 remains in the solution; hence we can obtain the target data data101.
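The 27 → 9 → 3 → 1 funnel described above can be mimicked in silico by treating a tube as the set of templates present and a PCR step as a filter on the forward primer's annealing site. This is an idealization that ignores concentrations and reaction errors, and all names below are mine.

```python
import itertools

def make_tube():
    """All 27 templates: each is identified by its address-block labels."""
    return {('A%d' % i, 'B%d' % j, 'C%d' % k)
            for i, j, k in itertools.product(range(3), repeat=3)}

def pcr(tube, forward_primer):
    """Idealized PCR: only templates carrying the primer's annealing
    site are amplified (kept)."""
    return {t for t in tube if forward_primer in t}

tube = make_tube()
sizes = [len(tube)]
for p in ('A1', 'B0', 'C1'):   # the address [A1, B0, C1] of data101
    tube = pcr(tube, p)
    sizes.append(len(tube))
print(sizes)   # [27, 9, 3, 1]: each round narrows the pool by a factor of 3
```

Each PCR round eliminates the templates lacking the chosen annealing site, so after three rounds only the template addressing data101 remains.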
Hierarchical DNA Memory Based on Nested PCR A1,Re
B0,Re
C1,Re
PCR
PCR
PCR
templateijk
template1jk
i, j , k Î {0,1,2}
j , k Î {0,1,2}
template10k k Î {0,1,2}
27 mix
9 mix
3 mix
115
template101
Fig. 2. Target data sequence coding data101 is extracted from a 27-templates mix by [A1 , B0 , C1 ]. (The shortening of the length of templates by PCR is omitted here.)
3.3 Merits of NPMM
NPMM provides a high level of data security. It is impossible to read out the target data unless all primers are available. Accordingly, each primer works as a key, and each key is independent of the other keys; in other words, the primers work as distributed keys. Furthermore, even if one obtained all of the primers, the order in which primers specify an address remains a matter of permutation: even in the small example shown in Fig. 2, there are 5040 (= 10P4) ways to select and order the primers. Since one must obtain the primers as keys and must know the order of the primers to read out specific data, a very secure memory system with distributed keys can be realized.

NPMM has a large capacity with a high reaction specificity. The number of primers we must design is much smaller than the number of data stored in NPMM. Because of the abundant use of PCR, a very high specificity of primers is absolutely essential for NPMM; needless to say, an increase in the number of primers to design decreases their reaction specificity. NPMM realizes a large memory capacity in spite of a small number of primers: in fact, the capacity of NPMM grows exponentially with the number of address blocks. The memory capacity of NPMM is determined as follows:

M (bit) = 2 × Data (bp) × Primer^Block

where M is the memory capacity of NPMM, Data is the length of the sequence in the data block, Block is the number of address blocks, and Primer is the number of primers in each address block. One base is equal to 2 bits because one base takes one of 4 values (A, T, G, C). In addition, using nested PCR raises the specificity of the final reaction products [9].
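Plugging the paper's parameters into the capacity formula, and into the permutation count for the key space, checks the numbers; the helper name is mine.

```python
import math

def npmm_capacity_bits(data_bp, primers_per_block, blocks):
    """M (bit) = 2 * Data (bp) * Primer^Block; one base carries 2 bits."""
    return 2 * data_bp * primers_per_block ** blocks

bits = npmm_capacity_bits(20, 3, 3)
print(bits, bits // 8)        # 1080 bits = 135 bytes, the designed NPMM size
print(math.perm(10, 4))       # 5040 orderings of 4 primers chosen from 10
```

With 20-bp data blocks, 3 primers per block, and 3 address blocks, the capacity is 2 × 20 × 3³ = 1080 bits, and an attacker holding all 10 primers still faces 5040 candidate primer orderings.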
Ease of extracting the target data from NPMM. We can extract the target data from NPMM by using only PCR and sequencing. PCR, as well as sequencing, is a very sophisticated, commonly used, automated, and easy tool. Therefore, anyone can extract the target data from NPMM easily, without special tools or experimental skill.

3.4 Potential Applications
NPMM can be used as a medium in which very secure data is stored. NPMM can also be used as a medium to store data that is not usually referenced but should not be erased (e.g., accumulated log files). Another application is a medium for storing human genetic information. Advances in genetics have posed privacy problems in recent years. At present, an individual's genetic information is read out and then converted to digital data, and this digital data is stored in some medium. Since the genetic information has been read out once, it may leak out. With NPMM, on the other hand, an individual's genetic information could be saved as a raw gene, without reading and translating it. Accordingly, NPMM would help protect the privacy of genetic information.
4 NPMM Design

In order to investigate the feasibility of NPMM, we designed a small-sized NPMM. The evaluation function for the sequence design was originally formulated by consulting some references [6][7][8].

4.1 Size of NPMM
In this paper, we set the size of NPMM as follows: 27 (= 3³) kinds of data can be stored in this NPMM, and the memory capacity is 135 bytes. The number of primers to be designed is 10 (= 3 × 3 + 1). The other parameters are as follows.
– length of data sequences: 20 bp
– length of primers (p ∈ P): all 15 bp
– length of templates (t ∈ T): 80 bp (= 15 + 15 + 15 + 20 + 15)

4.2 Design of Sequences
We designed the data sequences and primer sequences with the following procedure.
1. Design 27 data sequences randomly.
2. Design 10 primer sequences by computer simulation, as follows, so as to prevent mishybridization between templates and primers.
Strategy of primer design. We must design P, because the data sequences are designed randomly and T is created automatically once P and the data sequences are fixed. We took the following three evaluation items into consideration in designing P; an evaluation value is calculated for each item.
1. GC content
2. Hamming distance
3. 3' end complementarity
P is designed by the Simulated Annealing method so as to optimize the evaluation function, which is calculated as a linear sum of the evaluation values.

GC content. It is necessary to make the Tm (melting temperature) of the primers uniform to achieve a high specificity in the PCR reaction. Therefore, each primer should share the same GC content. The evaluation value (GC_value) of the GC content of P is calculated as follows:

GC_value = max((GC_define − GC_p)²) + GC_max_number / |P|

where GC_define is the target value of the GC content of a primer and GC_p is the GC content of p (p ∈ P). GC_max_number is the number of primers in P attaining the worst value, and |P| is the number of elements of P. The smaller GC_value is, the better P is with respect to GC content.

Hamming distance. We propose this evaluation item in order to prevent mishybridization between a primer and a template, or between a primer and a primer. Given two sequences x = x1x2 . . . xl and y = y1y2 . . . yl (xi, yi ∈ {A, T, G, C}), H(x, y) indicates the Hamming distance between x and y, here taken as the number of positions i at which xi = yi. To evaluate two sequences x, y of different lengths (|x| ≤ |y|), we define HM(x, y). To compute HM(x, y), we compute all |y| − |x| + 1 values H(x, yi), where yi (1 ≤ i ≤ |y| − |x| + 1) is a subsequence of length |x| of y.
Additionally, in order to take into account the case where x mishybridizes to y with a sticky end at the 5' or 3' end of y, we also find the 2|x| − 2 values H(x', y') (|x| − 1 alignments at the 5' end and |x| − 1 at the 3' end), where x', y' are the subsequences of x and y, respectively, that form the duplex. HM(x, y) is the maximum over the |y| − |x| + 1 values H(x, yi) and the 2|x| − 2 values H(x', y'). The evaluation value (H value) of the Hamming distance of P is calculated as follows. For p ∈ P, t ∈ T,

H value = max(HM(p, t), HM(p, t̄)) + H max number/ALL combinations,

where t̄ denotes the complement of t. H max number indicates the number of pairs (p, t) that are evaluated as the worst value; ALL combinations indicates the number of all combinations of p and t.
Satoshi Kashiwamura et al.
3'-end complementarity We propose this evaluation item in order to prevent mispriming (a misextension reaction caused by mishybridization at the 3' end of a primer). Given two sequences x = x1 x2 . . . xl and y = y1 y2 . . . yl (xi, yi ∈ {A, T, G, C}), E(x, y) denotes the sum of the indices i such that xi = yi (e.g., for x = ATTGC, y = AAGGC, E(x, y) = 1 + 4 + 5 = 10). To evaluate the 3'-end complementarity between two sequences x, y of different lengths (|x| ≤ |y|), we define EM(x, y): the maximum of E(x, y') over the |y| − |x| + 1 subsequences y' of length |x| in y. The evaluation value (E value) of the 3'-end complementarity of P is calculated as follows. For p ∈ P, t ∈ T,

E value = max(EM(pn, t), EM(pn, t̄)) + E max number/ALL combinations,

where, for a sequence x, xn denotes the subsequence of length n (0 ≤ n ≤ |x|) at the 3' end of x (e.g., for x = ATGTAGCCATGG, x5 = CATGG), and E max number indicates the number of pairs (p, t) that are evaluated as the worst value.
Evaluation Function The evaluation value of P (fitness) is calculated by the following function; the smaller the fitness is, the better P is:

fitness = GC weight × GC value + H weight × H value + E weight × E value,

where each * weight is a positive-integer weight for the corresponding item, and each * value is rescaled to [0.0, 1.0] (scale conversion).
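The per-item scores above can be sketched at the string level. The following Python fragment is our illustration, not the authors' code: the helper names are ours, and the sticky-end terms of HM are omitted for brevity.

```python
# A sketch of the per-item evaluation values defined above (our own naming,
# following the paper's definitions; sticky-end alignments of HM omitted).

def gc_content(seq):
    """Fraction of G/C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)

def hamming(x, y):
    """H(x, y): number of positions i with x_i != y_i (|x| == |y|)."""
    return sum(a != b for a, b in zip(x, y))

def hm(x, y):
    """HM(x, y) restricted to full-overlap windows: the maximum H(x, y_i)
    over all length-|x| subsequences y_i of y."""
    return max(hamming(x, y[i:i + len(x)]) for i in range(len(y) - len(x) + 1))

def e_score(x, y):
    """E(x, y): sum of the 1-based indices i with x_i == y_i, e.g.
    E(ATTGC, AAGGC) = 1 + 4 + 5 = 10."""
    return sum(i for i, (a, b) in enumerate(zip(x, y), start=1) if a == b)
```

The overall fitness would then be a weighted sum of the three scaled values, minimized over candidate primer sets P by simulated annealing.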
5
Laboratory Experiments
5.1
Extracting Target Data Sequence Using PCR
Here, in order to verify that a target data sequence is extracted from NPMM by means of the above addressing method, we perform laboratory experiments on a small NPMM. Although we designed 10 primers and 27 templates in the previous section, here we select 4 of the 27 templates and deal with a small NPMM (T4-NPMM) consisting of template000, template001, template010, and template011. These sequences are shown in Table 1. Thus, we can obtain data000 by addressing [B0, C0], data001 by [B0, C1], data010 by [B1, C0], and data011 by [B1, C1]. We obtain these 4 data as described in Protocol 1 below. The results are shown in Fig. 3.
Protocol 1. Amplify the target data
Equipment and reagents
– KOD DASH DNA Polymerase (TOYOBO)
– 10× KOD DASH buffer (TOYOBO)
Table 1. Designed sequences of templates and primers

Name          Length  Sequence
A0            15mer   GCAAAGAGCCTGTGA
B0            15mer   CAGTGTAAGTTCGTG
B1            15mer   AACGGAAAGATGCCT
C0            15mer   TCCATGCGCTCTAAT
C1            15mer   TACCAAACCGAGGTC
Re            15mer   CATCAATGTCTGGCG
data000       20mer   GAGCATGTTCACTCTGGACG
data001       20mer   AGTAAGAGTCTAGCCTAGCG
data010       20mer   TTATAAACTCTCTTGACCCC
data011       20mer   GCTGGCATTACACGTCTCAG
template000   80mer   A0 B0 C0 data000 Re
template001   80mer   A0 B0 C1 data001 Re
template010   80mer   A0 B1 C0 data010 Re
template011   80mer   A0 B1 C1 data011 Re
– dNTP mix (2 mM of each dNTP)
– DNA thermal cycler (Biometra)
A. Primary PCR using Bj (0 ≤ j ≤ 1)
1. In a PCR reaction tube, add the following:
– distilled water 13.875 µl
– 10× PCR buffer 2.5 µl
– dNTP mix 2.5 µl
– T4-NPMM (each 10 pM) 1 µl
– B0 or B1 (5 µM) 2.5 µl
– Re (5 µM) 2.5 µl
– KOD DASH Polymerase 0.125 µl
– Total volume is 25 µl
The reaction of the primary PCR using B0 is named B0after; the one using B1 is named B1after.
2. Perform 23 cycles of PCR in an automated thermal cycler:
– denature (94°C for 10 sec)
– anneal (50°C for 30 sec)
– polymerize (72°C for 5 sec)
– perform the final polymerization step for an additional 60 sec
B. Secondary PCR using Ck (0 ≤ k ≤ 1)
1. Dilute B0after and B1after 164-fold.
2. Add 1 µl of each dilution to a PCR reaction tube containing the following:
– distilled water 13.875 µl
– 10× PCR buffer 2.5 µl
– dNTP mix 2.5 µl
– C0 or C1 (5 µM) 2.5 µl
– Re (5 µM) 2.5 µl
– KOD DASH Polymerase 0.125 µl
– Total volume is 25 µl
The reaction of B0after using C0 is named B0C0after, and the one using C1 is named B0C1after; the reaction of B1after using C0 is named B1C0after, and the one using C1 is named B1C1after.
3. Perform 23 cycles of PCR using the same cycle parameters as for the primary PCR (part A, step 2).
4. Run 5 µl of each of the 6 reactions on a 10% polyacrylamide gel, then visualize by ethidium bromide staining. Determine whether a strong band of the expected size is obtained for each reaction (Fig. 3).
Fig. 3. M: 100 bp ladder. M': DNA marker of length 65, 50, and 35 bp. Lane 1: B0after. Lane 2: B1after. Lane 3: B0C0after. Lane 4: B0C1after. Lane 5: B1C0after. Lane 6: B1C1after.
5.2
Detection of Amplified Sequence
In Fig. 3, we expect that each of lanes 3-6 consists of only one sequence, coding data000, data001, data010, and data011, respectively. To confirm this, we perform the following PCR experiments, because 50 bp is not an adequate length to read out the sequence with a sequencer. For diluted B0C0after, we perform PCR using Re and data000 primer, which is the length-15 sequence at the 5' end of data000 (shown in Table 2). In the same way, PCR is performed on diluted B0C0after with data001 primer, data010 primer, and data011 primer, each paired with Re. If, and only if, a sequence of length 35 bp is amplified using data000 primer, then B0C0after consists only of the sequence containing data000. Thus, to check the sequence in B0C0after, a total of 4 PCR reactions is needed. For B0C1after, B1C0after, and B1C1after, we performed the same operations, described in Protocol 2 below.

Table 2. Sequences of data primers

Name            Length  Sequence
data000 primer  15mer   GAGCATGTTCACTCT
data001 primer  15mer   AGTAAGAGTCTAGCC
data010 primer  15mer   TTATAAACTCTCTTG
data011 primer  15mer   GCTGGCATTACACGT
Protocol 2. Detection by using data primers
1. Dilute B0C0after, B0C1after, B1C0after, and B1C1after 164-fold.
2. Add 1 µl of each dilution to a PCR reaction tube containing the following (for each diluted mix, each data primer is used, so a total of 16 reactions is needed):
– distilled water 13.875 µl
– 10× PCR buffer 2.5 µl
– dNTP mix 2.5 µl
– data000 primer, data001 primer, data010 primer, or data011 primer (5 µM) 2.5 µl
– Re (5 µM) 2.5 µl
– KOD DASH Polymerase 0.125 µl
– Total volume is 25 µl
3. Perform 25 cycles of PCR using the same cycle parameters as for the primary PCR (part A, step 2 in Protocol 1).
4. Run 5 µl of each of the 16 reactions on a 10% polyacrylamide gel, then visualize by ethidium bromide staining (shown in Fig. 4).
Fig. 4. M: DNA marker of length 65, 50, and 35 bp. Index 1: data000 primer used. 2: data001 primer. 3: data010 primer. 4: data011 primer.
Discussion Figure 4 indicates that essentially only the sequence containing data000 exists in B0C0after. The desired results are obtained for the other solutions as well. Although some weak bands appear, they are primer dimers, because these weak bands also appear without templates (data not shown). From these results, we can conclude that a desired target data is extracted from T4-NPMM.
5.3
Amplification Using Concatenation Primer
Here, we perform PCR on T4-NPMM with a concatenation primer made up of two consecutive primers (in this experiment we use B0C0, which is B0 concatenated with C0). The purpose of this experiment is to see whether we can extract the target data with only one PCR procedure. Of course, whether that is possible depends on the annealing temperature (Ta). We perform the experiment as described in Protocol 3 below. We can regard this PCR as competitive PCR between template000 and template010, so the number of cycles in the PCR using B0C0 is not a critical parameter [10].
Protocol 3. Amplify the target data using B0C0
A. PCR on T4-NPMM using B0C0
1. In a PCR reaction tube, add the following:
– distilled water 13.875 µl
– 10× PCR buffer 2.5 µl
– dNTP mix 2.5 µl
– T4-NPMM 1 µl
– B0C0 (5 µM) 2.5 µl
– Re (5 µM) 2.5 µl
– KOD DASH Polymerase 0.125 µl
– Total volume is 25 µl
2. Perform 25 cycles of PCR at 4 kinds of Ta:
– denature (94°C for 10 sec)
– anneal (50, 58, 66, or 74°C for 30 sec)
– polymerize (72°C for 5 sec)
– perform the final polymerization step for an additional 60 sec
B. Detection of the sequence amplified with B0C0
1. Dilute the reactions at each annealing temperature 164-fold.
2. Add 1 µl of each dilution to a PCR reaction tube as in Protocol 2:
– distilled water 13.875 µl
– 10× PCR buffer 2.5 µl
– dNTP mix 2.5 µl
– data000 primer, data001 primer, data010 primer, or data011 primer (5 µM) 2.5 µl
– Re (5 µM) 2.5 µl
– KOD DASH Polymerase 0.125 µl
– Total volume is 25 µl
3. Perform PCR using the same cycle parameters as in Protocol 1, removing each solution from the thermal cycler every 2 cycles from the 17th to the 25th cycle.
4. Run 5 µl of all the reactions (80 samples) on a 10% polyacrylamide gel, then visualize by ethidium bromide staining.
Fig. 5. Results of the detection experiment, as in the previous subsection, for T4-NPMM amplified using B0C0 at Ta 66°C. M: DNA marker of length 65, 50, and 35 bp. Index 1: the solution removed from the thermal cycler at the 17th cycle. 2: 19th cycle. 3: 21st cycle. 4: 23rd cycle. 5: 25th cycle.
Discussion In the PCR using B0C0 at Ta 50, 58, or 66°C, some sequences are amplified; at annealing temperature 74°C, no sequence is amplified. Figure 5 shows the results of the data detection experiment on the PCR solution at 66°C. From the fact that the solutions amplified with data000 primer and data010 primer are visualized at almost the same cycle, the solution contains two sequences, including data000 and data010, respectively. This indicates that mispriming occurs between template010 and B0C0 in spite of a temperature as high as 66°C. Therefore, we believe that reducing the number of PCR steps would be difficult. (We have no data for Ta higher than 66°C and below 74°C, but this is of no concern if each primer is lengthened.)
6
Concluding Remarks and Future Work
In this paper, we proposed a DNA memory with high capacity, high data security, and high specificity of chemical reaction, and we showed the feasibility of NPMM through experiments. In the future, we should consider a design strategy for a set of primers that can be used independently of the data sequences. We should also investigate scaling up NPMM. Since NPMM has the potential for high reaction specificity, we believe that NPMM would work appropriately even if its size becomes larger. A scaled-up NPMM will be realized in the near future.
References
1. T. Head, M. Yamamura, and S. Gal: "Aqueous Computing: Writing on Molecules", Proceedings of CEC 99 (Congress on Evolutionary Computation), pp. 1006-1010 (1999)
2. T. Head, G. Rozenberg, R. S. Bradergroen, C. K. D. Breek, P. H. M. Lommerse and H. P. Spaink: "Computing with DNA by operating on plasmids", Biosystems, Vol. 57, pp. 875-882 (2000)
3. E. B. Baum: "Building an Associative Memory Vastly Larger Than the Brain", Science, Vol. 268, pp. 583-585 (1995)
4. J. H. Reif and T. H. LaBean: "Computationally Inspired Biotechnologies: Improved DNA Synthesis and Associative Search Using Error-Correcting Codes and Vector-Quantization", Sixth International Meeting on DNA Based Computers (DNA6), in Lecture Notes in Computer Science, pp. 145-172 (2001)
5. J. H. Reif, T. H. LaBean and M. Pirrung: "Experimental Construction of Very Large Scale DNA Databases with Associative Search Capability", Seventh International Meeting on DNA Based Computers (DNA7), in Lecture Notes in Computer Science, pp. 231-247 (2002)
6. M. Arita, A. Nishikawa, M. Hagiya, K. Komiya, H. Gouzu and K. Sakamoto: "Improving Sequence Design for DNA Computing", Proceedings of GECCO'00 (Genetic and Evolutionary Computation Conference), pp. 875-882 (2000)
7. M. Garzon, R. Deaton, L. F. Nino and E. Stevens: "Encoding Genomes for DNA Computing", Proceedings of the Third Annual Genetic Programming Conference, pp. 684-690 (1998)
8. F. Tanaka, M. Nakatsugawa, M. Yamamoto, T. Shiba and A. Ohuchi: "Developing Support System for Sequence Design in DNA Computing", Seventh International Meeting on DNA Based Computers (DNA7), in Lecture Notes in Computer Science, pp. 129-137 (2002)
9. M. J. McPherson, P. Quirke and G. R. Taylor: "PCR: A Practical Approach", IRL Press (1991)
10. M. J. McPherson, B. D. Hames and G. R. Taylor: "PCR 2: A Practical Approach", IRL Press (1995)
Binary Arithmetic for DNA Computers Rana Barua1 and Janardan Misra2 1
Division of Theoretical Statistics and Mathematics, Indian Statistical Institute, 203 B.T. Road, Calcutta 700 108, India. [email protected]
2 Texas Instruments India Ltd., Bangalore 560017, India. Present address: School of Computing, National University of Singapore. [email protected]
Abstract. We propose a (recursive) DNA algorithm for adding two binary numbers which requires O(log n) bio-steps using only O(n) different types of DNA strands, where n is the size of the binary string representing the larger of the two numbers. The salient feature of our technique is that the input strands and the output strands have exactly the same structure, which makes it fully procedural, unlike most methods proposed so far. Logical operations on binary numbers can easily be performed by our method, which can hence be used for cryptographic purposes.
1
Introduction
One of the earliest attempts to perform arithmetic operations (addition of two positive binary numbers) using DNA is by Guarneiri et al [11], utilizing the idea of encoding the bit values 0 and 1 differently as single-stranded DNAs, based upon their positions and the operand in which they appear. This enabled them to propagate the carry successfully as a horizontal chain reaction using intermediate place holders, because of the presence of appropriate complementary substrands, which annealed together. PCR then allowed one to insert the correct value of the carry and to propagate it further. Though their technique yields the correct result for the addition of two given binary numbers, it is highly non-procedural in nature, since the output strands are vastly different in structure from the input strands (which are themselves coded differently). A later attempt was by Vineet Gupta et al [14]. They performed logic and arithmetic operations using a fixed-bit encoding of the full corresponding truth tables. They construct the strands for bits in the first operand (level one) and, corresponding to each bit value (0 or 1), all possible bit values (00, 01, 10, 11)1 in the second operand (level two), such that in the next phase, when the first-operand strands are poured into the pot containing all possible second-operand
In fact, along with the usual nucleotides, they use others like uracil (U) and 2,6-diaminopurine (P) to achieve some additional complementary structures
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 124-132, 2003. © Springer-Verlag Berlin Heidelberg 2003
- strands (including the correct one), annealing results in a structure which can be interpreted according to the type of the operation. In the case of an arithmetic operation, in later stages of the computation, they add all possible intermediate results and successively propagate the carry from the lowest-weighted bit to the highest-weighted bit. Though the encoding works well for logic operations, it does not seem as easy for arithmetic operations, as the technique requires that all the possible intermediate results be coded and added manually, one by one, during processing. This is a labor-intensive and time-consuming job. Another later attempt is due to Z. Frank Qiu and Mi Lu [16], which uses a substitution operation to insert results (by encoding all possible outputs of the bit-by-bit operation along with the second operand) into the operand strands. Though they propose to extend their method to higher radices like octal, decimal, etc., the possible number of encodings of different intermediate results seems to be exponential. Moreover, the cleansing operation, which makes the output similar to the first input for reuse in further operations, is not an error-resistant operation. Ogihara and Ray [15] and Amos and Dunne [1] present methods to realize Boolean circuits (with bounded fan-in) using DNA strands in a constructive fashion. Here the problem is that of constructing large Boolean circuits for arithmetic operations (manually or by automating the process), rendering the technique of theoretical importance only. Other new suggestions to perform all basic arithmetic operations are by Atanasiu [4] using P systems, by Frisco [9] using the splicing operation under general H systems, and by Hug and Schuler [13]. In the present article, we describe a simple, new, and efficient recursive DNA computing technique for basic arithmetic operations like addition, multiplication, and subtraction of binary numbers, based upon their set representations. The salient features of our technique are the following.
1.
Our method is fully procedural, i.e., the structure of the output strands is exactly the same as that of the input strands. Thus the result can be reused further without any changes, making iterative as well as parallel operations feasible.
2. The number of different DNA strands required is at most of the order of the size of the binary numbers. This means the method avoids the problem of increasing volume.
3. The number of bio-steps required by the technique is, on average, O(log2 n) for addition and O((log2 n)^2) for multiplication, where n is the size of the binary numbers.
4. All logical operations on binary numbers can be performed very easily using our method.
2
Recursive DNA Arithmetic
2.1
Underlying Mathematical Model
Let α = αn . . . α1 and β = βn . . . β1 be two n-bit binary numbers, where αi, βi ∈ {0, 1} for all 1 ≤ i ≤ n and α1, β1 denote the least significant bits. Let X[α] = {i : αi = 1} and X[β] = {j : βj = 1}; i.e., they are the sets containing the positions where the binary representations of these numbers set bits to 1. For any set Z of integers and any integer i, we define Z + i = {z + i : z ∈ Z}, and let Z+ = Z + 1. For any two sets of integers X1, X2 we define X1 ⊕ X2 = {x : x ∈ X1 ∪ X2 but x ∉ X1 ∩ X2} (symmetric difference). We denote by Val(X) the binary number represented by a set X (of positive integers); e.g., Val(X[α]) = α, Val(X[β]) = β, and Val(φ) = 0. In terms of this symbolism we can state the abstract recursive procedures for addition and multiplication of two given positive binary numbers as follows.
Addition.
Add(α, β) = Val(RecursiveAdd(X[α], X[β])), where
RecursiveAdd(Y, Z) = Y, if Z = φ;
                   = Z, if Y = φ;
                   = RecursiveAdd(Y ⊕ Z, (Y ∩ Z)+), otherwise.
It is easy to see that this procedure terminates and that, for two binary numbers α, β, Add(α, β) represents α + β. This follows from the fact that α + β = (α ⊕ β) + 10 × (α ∧ β), where ⊕ denotes bitwise addition modulo 2 and ∧ bitwise multiplication. (Note that 10 above is the binary representation of 2.)
Multiplication. The multiplication procedure can be realized, using successive additions of left-shifted α's, according to the following formula:

α × β = Σ over 1 ≤ j ≤ n, βj = 1 of α × 2^(j−1),

where we also use α etc. to denote the integer whose binary representation is α. Since multiplication of a binary number by a power of 2, say 2^j, is obtained by a left shift of the number by j, we have X[2^j × α] = X[α] + j.
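The set recursion above can be checked directly in software. The following Python sketch is ours: sets of bit positions stand in for test tubes, mirroring RecursiveAdd and the shift-and-add multiplication.

```python
# A binary number is represented as its set of 1-bit positions (1-indexed),
# exactly as X[alpha] above.  Python's set ^ and & give symmetric difference
# and intersection.

def to_set(n):
    """X[n] = {i : bit i of n is 1}, positions counted from 1."""
    return {i + 1 for i in range(n.bit_length()) if (n >> i) & 1}

def val(s):
    """Val(X): the integer represented by a set of positions."""
    return sum(1 << (i - 1) for i in s)

def recursive_add(y, z):
    """RecursiveAdd: XOR of the sets, plus the carry set shifted left by one."""
    if not z:
        return y
    if not y:
        return z
    return recursive_add(y ^ z, {i + 1 for i in (y & z)})

def add(a, b):
    return val(recursive_add(to_set(a), to_set(b)))

def mul(a, b):
    """Shift-and-add: sum of left-shifted copies of a, one per set bit of b."""
    result = set()
    for j in to_set(b):
        result = recursive_add(result, {i + (j - 1) for i in to_set(a)})
    return val(result)
```

The recursion depth of `recursive_add` on random inputs is what the complexity analysis of section 2.4 bounds by about log2 n on average.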
Hence, we have Mul(α, β) = Add({Val(X[α] + (j − 1))} over j with βj = 1), where Add(α, β, γ) = Add(Add(α, β), γ), etc. The subtraction operation for integers can be performed as 2's-complement addition ([12] p. 23).
Division. Once we can perform the addition and subtraction operations, the division operation can be mapped onto these using any of the standard digital arithmetic techniques ([12] p. 250). For instance, we may consider non-restoring division ([12] p. 253), which requires an average of n additions or subtractions.
2.2
DNA Algorithm
Since the procedure described above is recursive in nature, as can easily be seen in the context of the currently available DNA tool-kit operations and other high-level operations suggested in [6], the most important operation to be realized is incrementing all the integers in sets like X[α] by one. Indeed, the coding of numbers and various other steps rest upon the ease with which this step can be realized. Keeping this point in mind, we propose the following DNA algorithm.
DNA Encoding of Binary Numbers
Note that each binary number is represented by the set of integer positions where its bits are set to 1. Thus each binary number is represented as a test tube (a multiset of strings over Γ = {A, C, G, T}) of DNA double strands encoding the integers (positions where a bit is set to 1) from 1 to n, such that the DNA strand for integer i is (cf. [5] for notation)
dsi = S0 (GAATTGC^5)^i GAATTC S1.
(Note that GAATTC is the restriction site for EcoRI.) Here S0, S1 may be any suitable 20 to 30 base-pair long DNA double strands not containing GAATTC as a substrand. (An alternative way of encoding position i, which minimizes errors due to sliding, is discussed later in section 2.5.) Thus the test tube T[α] representing the binary number α is T[α] = {dsi : i ∈ X[α]}. We first present the DNA-based implementation for addition.
1. Addition
Step0. Check whether T[α] or T[β] is empty. If yes, then the other tube contains the final result (which can be obtained by detecting the presence of all the different strands using gel electrophoresis or the extraction technique).
Else go to step 1. Initially T[α] and T[β] represent the test tubes encoding α and β, respectively.
Step1. Melt the double strands in test tubes T[α], T[β] to extract up-strands (using ↓S0) from T[α] and down-strands (using ↑S0) from T[β]. Now mix these two extracts so that complementary strands can anneal to form stable double strands. As can be seen, the resulting double strands in the tube are exactly those coming from T[α] ∩ T[β], and the single strands are those coming from T[α] − T[β] (up-strands containing ↑S0) and from T[β] − T[α] (down-strands with ↓S0). Using standard DNA toolkit operations, single strands can be separated from double strands, and then, using PCR, the single strands can be complemented using S0, S1 as primers. Denote by T[α] the tube containing these double strands obtained after PCR (i.e., T[α] ⊕ T[β]) and by T[β] the annealed double strands (i.e., T[α] ∩ T[β]).
Step2 (Increment by One). Add the restriction enzyme EcoRI to T[β] to cut all the double strands at their 3' end. This restriction enzyme activity leaves double strands with 5' hanging ends, of the form
S0 (GAATTGC^5)^i G ↓ AATT.
Now up-strands ↑AATTGC^5GAATTC and ligation enzyme are added to the test tube, which results in strands of the form
S0 (GAATTGC^5)^i GAATT ↑ GC^5GAATTC S1.
These strands can now be polymerased to form
S0 (GAATTGC^5)^(i+1) GAATTC S1.
Thus T[β] contains strands representing (X[α] ∩ X[β])+.
Step3. Go back to step0.
2. Multiplication
Step1. For each j ∈ X[β], j ≥ 2, construct test tubes Tj[α] similarly to Step 2 above, with the difference that for annealing we add ↑AATTGC^5(GAATTGC^5)^(j−2)GAATTC S1 instead of ↑AATTGC^5GAATTC S1 (add the latter when j = 2). For j = 1, take T1[α] = T[α].
Step2.
If the test tubes obtained in step1 are Tj1[α], Tj2[α], Tj3[α], . . ., do the following:
Step2.1. Perform the Addition operation (described above) concurrently on successive pairs of tubes (Tj1[α], Tj2[α]), (Tj3[α], Tj4[α]), . . . Let the results be kept in Tj11[α], Tj12[α], . . .
Step2.2. Repeat Step2.1 until a single tube T is obtained.
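The strand encoding and the increment of Step2 can be checked at the string level. The sketch below is our simplification: the literal tokens "S0"/"S1" stand in for the flanking regions, and it verifies that cut-and-ligate takes ds_i to ds_{i+1}.

```python
# The repeat unit GAATTGC^5 contains no EcoRI site; only the terminal GAATTC
# does, so EcoRI cuts each strand at exactly one place.

UNIT = "GAATTG" + "C" * 5     # GAATTGC^5
SITE = "GAATTC"               # EcoRI recognition site, cut as G/AATTC

def ds(i, s0="S0", s1="S1"):
    """ds_i = S0 (GAATTGC^5)^i GAATTC S1, the strand encoding position i."""
    return s0 + UNIT * i + SITE + s1

def increment(strand):
    """Step2: EcoRI cuts at the unique terminal site, leaving ...G with an
    AATT overhang; ligating the patch AATT + GC^5 + GAATTC appends one more
    repeat unit, i.e. ds_i becomes ds_{i+1}."""
    cut_at = strand.index(SITE)            # unique occurrence of GAATTC
    head = strand[:cut_at + 1]             # "S0 (unit)^i G"
    tail = strand[cut_at + len(SITE):]     # the S1 flank
    return head + "AATT" + "G" + "C" * 5 + SITE + tail
```

This mirrors why the coding works: G + AATT + GC^5 reassembles exactly one more copy of the repeat unit in front of the regenerated EcoRI site.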
3. Subtraction
The subtraction operation can be done utilizing ideas from conventional digital arithmetic, that is to say, by the 2's-complement method. To perform α − β we do the following:
Step0. By gel electrophoresis, determine whether α ≥ β or β ≥ α. Assume α ≥ β.
Step1. Construct T[α], T[β], and T, which consists of dsi for all i ∈ {1, . . . , n}.
Step2. Obtain T1 = T − T[β] as described above in the Addition operation, or by a set-extract operation as described in [6].
Step3. Perform the addition operation on T[α] and T1 and keep the result in T1.
Step4. Perform the addition operation on T1 and T[1] (the latter consisting of only one type of DNA strand, viz. S0 GAATTGC^5 GAATTC S1, encoding position 1).
Step5. Extract the DNA strands encoding n + 1. The residual test tube gives the desired result.
4. Logical operations on binary strings can easily be performed by our method.
To obtain the bitwise OR of α and β, form the test tubes T[α] and T[β] and mix them together. The resulting tube represents α OR β. To obtain the bitwise XOR, construct T[α] ⊕ T[β] as in Step1 of Addition. The resulting tube represents α XOR β. The bitwise AND is obtained by forming T[α] ∩ T[β] as in Step1 of Addition, using only set-extract operations. To obtain the NAND of two n-bit strings α and β, simply construct the test tubes T[α], T[β] and the tube T consisting of strands encoding all the integers 1, . . . , n. Then obtain the tube T' representing T[α] ∩ T[β] as in Step1 of Addition. Now obtain T'' = T − T' by set extraction as in [6]. T'' represents the desired result.
2.3
Use in Cryptography
Due to the simplicity of performing XOR (needing only a constant number of bio-steps), one can easily implement the Vernam one-time pad (cf. [17]). Let α (in binary) be the plaintext and κ the random key of the same length as α. The random key can be constructed by randomly synthesizing a strand consisting of two nucleotides, say A and G, standing for 0 and 1 respectively. To encrypt, the sender constructs the test tube T' = T[α] ⊕ T[κ] and transmits T'. On receiving T', the receiver decrypts by constructing T'' = T' ⊕ T[κ]. One can recover α from T'' by reading its contents. In view of the ease with which both encryption and decryption can be carried out (using only a few extractions), we feel that our method is much simpler than the one proposed in Gehani, LaBean and Reif [10].
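Over the set encoding, the one-time pad is just two symmetric differences. A minimal Python sketch of the scheme (ours, with sets of bit positions standing in for the test tubes):

```python
import secrets

def otp_encrypt(plain, key):
    """Sender: T' = T[alpha] XOR T[kappa], as a set symmetric difference."""
    return plain ^ key

def otp_decrypt(cipher, key):
    """Receiver: T'' = T' XOR T[kappa], recovering T[alpha]."""
    return cipher ^ key

n = 16
key = {i for i in range(1, n + 1) if secrets.randbelow(2)}  # random key strand
plain = {1, 3, 4, 7, 12}        # positions of the 1-bits of the plaintext
assert otp_decrypt(otp_encrypt(plain, key), key) == plain
```

Since XOR is an involution under a fixed key, decryption is the same operation as encryption, which is exactly why both take only a constant number of bio-steps.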
2.4
Complexity Analysis
Time Complexity
Addition. As each level of recursion in the addition operation involves a fixed number of bio-steps, the total number of steps depends on the number of recursion levels in the abstract model. We shall compute the expected number of bio-steps for two random n-bit numbers α, β. Since the probability that a bit will be set to 1 (or 0) at each position is 1/2, the expected number of 1's in a randomly chosen binary number of length n is n/2. Similarly, the probability that both numbers set the bit to 1 at any given position is (1/2) × (1/2) = 1/4. Therefore, the expected number of positions where both numbers set bits to 1 is n/4, which is the expected size of X[α] ∩ X[β] and consequently of (X[α] ∩ X[β])+. Also, the probability that the two numbers set bits differently at any position is (1/2)·(1/2) + (1/2)·(1/2) = 1/2, so the expected number of 1's in α ⊕ β is n/2. This is the same as the expected number of integers in X[α] ⊕ X[β]. Now, for the 2nd level of recursion, the probability that position number i is present in both of the sets from the 1st step is (1/2) × (1/4) = 1/8. Hence, the expected number of integers after set intersection and shifting by one is n/8, while the expected number of integers in the symmetric difference of the sets is n/2, since the probability that an integer is present in only one of the two sets is (1/2)·(1/4) + (1/2)·(3/4) = 1/2. By the same argument, after the ith level of recursion, the expected size of the set resulting from set intersection and shifting is n/2^(i+1). Therefore, the expected number of recursion levels is [log2 n] − 1. Thus the expected number of bio-steps needed in the addition operation is O(log2 n).
Multiplication. In the case of multiplication, the expected number of additions that have to be performed is log n − 2. Hence the expected number of bio-steps is O((log n)^2).
Subtraction.
Since subtraction involves only two additions and a constant number of bio-steps, the expected number of bio-steps in this case is also O(log n).
Logical Operations. Each logical operation needs a constant (at most 7 or 8) number of bio-steps and is independent of the size of the strings (except for preprocessing).
Volume Complexity
At no stage do we require any strand to be destroyed or filtered out, so the total number of strands remains more or less constant.
2.5
Errors
Errors in DNA computing experiments are of primary concern, and the issue has been looked into by many researchers [7,2,3,8]. We shall consider the possible errors that may take place in the Addition algorithm. The first source of error is the extract operation, which is not error-free. At the cost of more time and space, one may use a compound extract instead of a simple extract, as explained in [8]. This could reduce the error to a tolerable level. A more serious source of error is sliding, or partial annealing [18], which could take place because of the
Binary Arithmetic for DNA Computers
131
periodic nature of our coding. Partial annealing can be considerably reduced if one adopts the following coding:
dsi = S0 B0 S1C^5 B1 S1C^5 . . . Bi−1 S1C^5 Bi S1.
One can then carry out Step2 of the Addition algorithm, using only extracts, by the method of [6], concurrently for all strands, as follows. Heat the test tube and then extract the up-strands ↑S0 B0 S1C^5 B1 S1C^5 . . . Bi−1 S1C^5 Bi S1 using the down-strands ↓S0. Then add down-strands ↓Bi S1C^5 Bi+1 S1, for all i < n. Annealing takes place to form strands of the form
↑S0 B0 S1C^5 B1 S1C^5 . . . Bi−1 S1C^5 Bi S1 ↓C^5 Bi+1 S1.
Polymerase will then yield complete double strands of the form
S0 B0 S1C^5 B1 S1C^5 . . . Bi−1 S1C^5 Bi S1C^5 Bi+1 S1.
Consequently, every strand dsi in the test tube will be replaced by dsi+1. The remaining partially annealed or unannealed strands, if any, may be filtered out using extracts several times.
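The non-periodic coding can also be checked symbolically. In the sketch below (ours), the markers B_i are placeholder tokens rather than actual oligo sequences, and the increment is the polymerase extension described above.

```python
# Alternative coding of section 2.5: distinct markers B0..Bn break the
# periodicity that causes sliding.  Tokens, not real base sequences.

def ds_alt(i, s0="S0", s1="S1"):
    """ds_i = S0 B0 S1C^5 B1 S1C^5 ... B_{i-1} S1C^5 B_i S1."""
    body = "".join(f"B{j}" + s1 + "C" * 5 for j in range(i))
    return s0 + body + f"B{i}" + s1

def increment_alt(strand, i, s1="S1"):
    """Annealing the down-strand B_i S1 C^5 B_{i+1} S1 and running polymerase
    extends ds_i by one block, yielding ds_{i+1}."""
    return strand + "C" * 5 + f"B{i + 1}" + s1
```

Because every marker B_j occurs exactly once per strand, a primer can only anneal at its intended position, removing the sliding failure mode of the periodic coding.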
3
Conclusion
We have presented methods for carrying out arithmetic and logical operations which can easily be implemented in the DNA computing paradigm, though we require one test tube for each binary number. Moreover, we have discussed ways of reducing errors by changing our codings and using only compound extracts. This is, however, at the expense of more time and space. Since it is simple to perform XOR using only a constant number of bio-steps, the method has the potential to be used in a cryptographic implementation of the Vernam one-time pad. The practical use of DNA computers, however, depends on the efficiency of the bio-steps involved.
Acknowledgement: The first author would like to thank Zhongwei Li of Dupont for many useful discussions that led to significant improvements.
References
1. M. Amos and P. E. Dunne, DNA Simulation of Boolean Circuits. Tech Report CTAG-97009, Dept of Computer Science, University of Liverpool, Dec 1997.
2. M. Amos, S. Wilson, D. A. Hodgson, G. Owenson and A. Gibbons, Practical Implementation of DNA Computation. In: Proc 1st International Conference on Unconventional Models of Computation, Auckland, N.Z., Jan 1998, pp. 1-18.
3. Y. Aoi, T. Yoshinobu, K. Tanizawa, K. Kinoshita and H. Iwasaki, Ligation Errors in DNA Computing. In: Proc 4th DIMACS Workshop on DNA Based Computers, U Penn, 1998, pp. 181-187.
132
Rana Barua and Janardan Misra 4. A.Atanasiu, Arithmetic with Membrames. In: Proc of the Workshop on Mutiset Processing, Curtea de Arges, Romania, Aug 2000, pp 1-17. 5. S.Biswas, A Note on DNA Representation of Binary Strings. In: Computing with Bio-Molecules. Theory and Experiments, Ed G.Paun, 1998, pp 153-157. 6. D.Boneh, C.Dunworth, R.Lipton and J.Sgall, On Computational Power of DNA, Princeton CS Tech Report No. CSTR49995, 1995. 7. D.Boneh, C.Dunworth, J.Sgall and R.Lipton, Making DNA Computers Error Resistant. In: Proc 2nd DIMACS Workshop on DNA Based Computers, Princeton, 1996, pp 102-110. 8. K.Chen and E.Winfree, Error Correction in DNA Computing: Misclassification and Strand Loss. Proc of the 5th DIMACS Workshop on DNA Based Computers, MIT, Cambridge, 1999, pp 49-63. 9. P.Frisco, Parallel Arithmetic with Splicing. Romanian Journal of Information Science and Technology, 3, 2000, pp 113-128. 10. A.Gehani, T.H.LaBean and J.H.Reif, DNA-based Cryptography, In: Proc of the 5th DIMACS Workshop on DNA Based Computers, MIT, Cambridge, 1999. 11. F.Guarneiri, M.Fliss and C.Bancroft, Making DNA Add. Science 273, 1996, pp 220-223. 12. J.P.Hayes, Computer Architecture and Organization. McGraw-Hill International, Singapore, 2nd ed. 1988. 13. H.Hug and R.Schuler, DNA Based Parallel Computation of Simple Arithmetic. In: Proc of 7th DIMACS Workshop on DNA Based Computers, Tampa, 2001, pp 159-166. 14. V.Gupta, S.Parthasarathy and M.J.Zaki, Arithmetic and Logic Operations with DNA. In:Proc of 3rd DIMACS Workshop on DNA Based Computers, U Penn 1997, pp 212-220. 15. M.Ogihara and A.Ray, Simulating Boolean Circuits on a DNA Computer. Tech Report TR631, Department of C.Sc., University of Rochester, Aug 1996. 16. Z.F.Qiu and M.Lu, Arithmetic and Logic Operations with DNA Computers, Proc of 2nd IASTED International Conference on Parallel and Distributed Computing and Networks, Brisbane, 1998, pp 481-486. 17. 
D.R.Stinson, Crypytography: Theory and Practice, CRC Press, Boca Raton, 1995. 18. M.Yamamoto, J.Yamashita, T.Shiba, T.Hirayama, S.Takiya, K.Suzuki, M.Munekata and A.Ohuchi, A Study on the Hybridization Process in DNA Computing, In: Proc of the 5th DIMACS Workshop on DNA Based Computers, MIT, Cambridge, 1999, pp101-110.
Implementation of a Random Walk Method for Solving 3-SAT on Circular DNA Molecules

Hubert Hug (1) and Rainer Schuler (2)

(1) Universitäts-Kinderklinik Ulm, Prittwitzstrasse 43, D-89075 Ulm, Germany
(2) Abt. Theoretische Informatik, Universität Ulm, D-89069 Ulm, Germany
[email protected]
Abstract. In a DNA computation, a select operation is used to separate DNA molecules with different sequences. The implementation of the select operation requires specialized hardware and non-standard modification of DNA molecules, e.g., adding magnetic beads to a primer sequence, or other methods to separate DNA in solution. In this paper we consider DNA computations which use enzymatic reactions and secondary structure formation to perform computations. We show that it is possible to implement an efficient (though still exponential-time) probabilistic algorithm to solve instances of the satisfiability problem on circular single-stranded DNA.
1 Introduction
It is generally accepted that NP-complete problems cannot be solved efficiently. The best known probabilistic algorithms for 3-SAT have an expected running time of p(n) · (4/3)^n, where n is the number of variables and p is some polynomial [13,9]. Hence, the size of the instances which can be solved on a computer is bounded by some small constant. For example, if 10^10 operations can be performed in a reasonable amount of time, the satisfiability problem can be decided for every 3-CNF formula with up to 80 variables (since (4/3)^80 ≈ 10^10). For every additional variable the number of steps increases by a factor of 4/3. In contrast, the trivial approach to the 3-SAT problem (which of course is more general and can be applied to the satisfiability problem for arbitrary boolean formulas) considers all 2^n different truth assignments. In this case the number of variables which can be used in a formula is bounded by 34 (since 2^34 > 10^10). The more efficient heuristic thus allows us to more than double the number of variables of a formula, and this fact is independent of the current state of technology (see e.g., [3] for algorithms for other NP-complete problems). The advantage of DNA computing is the parallelism which can be exploited through the huge number of DNA molecules in a single probe. However, if we implement the trivial approach to the 3-SAT problem and consider all possible assignments, much of the advantage of the parallel processing is lost.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 133–142, 2003.
© Springer-Verlag Berlin Heidelberg 2003

For example, since the trivial approach requires 2^80 steps for 80 variables, we would need to perform 2^46 ≈ 7 · 10^13 steps in parallel to beat the efficient algorithm on a single
processor. Here we assume a work-efficient implementation and that at most 10^10 ≈ 2^34 steps can be performed sequentially on each processor. Even if a DNA computer can handle 2^70 DNA molecules in a single test tube, we still need more than 1000 sequential reactions. A different aspect of DNA computations that we consider here is the number of physical separation operations. These operations are necessary to implement a selection step during a computation, e.g., to select all DNA molecules containing a specific sequence. The implementation [7] of the efficient algorithm for the 3-SAT problem mentioned above [13] uses n^3 separate operations. In general, an implementation of an efficient algorithm for an NP-complete problem (compared to the trivial approach) requires, for each iteration step of the algorithm, the separation of DNA molecules into several test tubes. Here the number of iterations corresponds to the length of a solution, and the number of test tubes corresponds to the number of different extensions of a partial solution, e.g. the number of literals of a clause, or the number of edges leaving a node in a graph problem. The separation into different tubes is necessary since the extension step usually determines some property of the partial solutions (see e.g., [5,3]). In each separate operation there is some not too small probability that the actual solution may be lost. The probability depends on the reaction conditions and is error-prone in particular if the operation is done manually. Even a quadratic number of such operations might be too large. Separate operations can also be performed with specialized hardware, e.g. micro-flow reactors [11]. In this paper we consider the satisfiability problem, which has been solved with DNA molecules on solid surfaces for small examples [10,14]. Here the search space, i.e., all possible assignments, is coded in the molecules attached to the surface.
An alternative approach for solving a small instance of 3-SAT uses DNA hairpin formation [12]. Very recently the satisfiability problem has been solved for a 20-variable 3-CNF formula [4] after an exhaustive search of 2^20 possible assignments.
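The arithmetic behind the bounds quoted above can be checked directly; the following sketch assumes the 10^10-operation budget used in the text (floating-point powers are exact enough at this scale):

```python
def max_vars(base: float, budget: float) -> int:
    """Largest n such that base**n does not exceed the operation budget."""
    n = 0
    while base ** (n + 1) <= budget:
        n += 1
    return n

budget = 1e10  # operations feasible "in a reasonable amount of time"
assert max_vars(2, budget) == 33       # trivial 2^n search: at most 33 variables
assert max_vars(4 / 3, budget) == 80   # (4/3)^n random walk: up to 80 variables
```

The efficient algorithm more than doubles the feasible number of variables, as claimed.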
2 The Model
We consider computations performed in a test tube (i.e., in solution). Our goal is to use the more complex biochemical reactions to modify DNA molecules. In particular we want to avoid operations, e.g., extraction of marked molecules, which separate molecules according to the operations which have to be performed. In our model we use the following operations:

1. Add nucleic acid molecules (DNA or RNA) of specific sequences, with the following restrictions. The number of different molecules (sequences) should be small, e.g., linear in the input size. The total number of molecules is bounded by the size of the tube (2^70). If the number grows exponentially with the input size, then this gives a limit on the size of the instances.
2. Add enzymes (DNA polymerase, ligase, DNase, RNase, restriction enzymes).
3. Set reaction conditions (temperature, pH, salt concentrations).
4. Destroy enzymes and digest nucleic acid molecules.
5. Combine the contents of several test tubes.

A striking example that a computation is possible without separation steps during computation was given by Sakamoto et al. [12] to solve instances of the 3-SAT problem. Here, DNA molecules code for consistent assignments to the n variables of the formula. For a growing number of clauses, each DNA molecule contains a (short) subsequence indicating which variable of each clause should be set to 0 or 1 (to satisfy the clause). If a contradicting assignment is chosen, then the molecule will form an intramolecular hairpin, and the double-stranded (part of the) molecule will be cut by a restriction enzyme. Both the assembly of the DNA molecules (by hybridization) and the digestion by restriction enzymes can be done without separating the DNA molecules (according to sequence). Unfortunately, at least from a theoretical point of view, the algorithm implemented and verified by experiments is not efficient. The molecules coding for a solution contain a literal of each clause. If m is the number of clauses, then the search space, and hence the number of molecules, is 3^m. Typically the number of clauses is larger than the number of variables and can be as large as m = Θ(n^3). In the next section we give an implementation of the trivial 2^n-time algorithm for the 3-SAT problem using our model. The implementation is then modified to implement the more efficient algorithm from [13].
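The 3^m blow-up of the literal-per-clause encoding can be made concrete with a toy simulation; the hairpin digestion step is modelled simply as discarding contradictory selections, and the signed-integer representation is ours, not the paper's:

```python
from itertools import product

def surviving_molecules(clauses):
    """Each clause is a tuple of three literals (a positive or negative variable
    index). A molecule selects one literal per clause; molecules whose selections
    set some variable both to 0 and to 1 would form a hairpin and be digested."""
    return [choice for choice in product(*clauses)          # 3**m candidates
            if not any(-lit in choice for lit in choice)]   # keep consistent ones

# (x1 v x2 v x3) and (not-x1 v not-x2 v x3): 3^2 = 9 molecules, 7 survive
clauses = [(1, 2, 3), (-1, -2, 3)]
assert len(list(product(*clauses))) == 9
assert len(surviving_molecules(clauses)) == 7
```

The candidate pool grows as 3^m even though only the consistent molecules matter, which is the inefficiency criticized in the text.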
3 Implementation of the 2^n-Time Algorithm for the Satisfiability Problem of CNF Formulas
A formula F is in conjunctive normal form if it is the conjunction of a set of clauses. Each clause contains a set of literals, i.e., variables and negated variables. A variable can be set to 1 (true) or 0 (false), and any function which assigns a value to each variable is called a truth assignment. Given an assignment, a clause is said to be true if it contains a variable which is set to 1 or a negated variable which is set to 0. A formula in conjunctive normal form is true if every clause is true. A formula F is called satisfiable if there exists an assignment a such that F is true; in this case a is called a satisfying assignment. The satisfiability problem is to decide whether a given formula is satisfiable; the task of finding such an assignment is the corresponding search problem. A formula F is in 3-CNF if every clause contains 3 literals. The 3-SAT problem is to decide whether a formula in 3-CNF is satisfiable. The trivial algorithm to check whether a formula F is satisfiable considers all assignments. Let n denote the number of variables of F; then there are 2^n different assignments. For every assignment and every clause C of F we test whether C is true. If a satisfying assignment is found, then F is satisfiable; otherwise F is unsatisfiable. Note that given an assignment a it is easy to check for all clauses C whether C (and hence F) is true.
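Stated as code, the trivial algorithm is a double loop over assignments and clauses. This is a sketch; the signed-integer clause encoding (literal k > 0 means x_k, k < 0 its negation) is a common convention, not the paper's:

```python
from itertools import product

def satisfiable(n, clauses):
    """Try all 2**n assignments of n variables; return a satisfying one or None."""
    for assignment in product((False, True), repeat=n):
        def value(lit):
            v = assignment[abs(lit) - 1]
            return v if lit > 0 else not v
        # F is true under this assignment iff every clause has a true literal.
        if all(any(value(lit) for lit in clause) for clause in clauses):
            return assignment  # satisfying assignment found
    return None  # F is unsatisfiable

assert satisfiable(2, [(1, -2), (-1, 2)]) == (False, False)  # x1 <-> x2
assert satisfiable(1, [(1,), (-1,)]) is None                 # x1 and not x1
```

Checking one assignment costs time linear in the formula, so the entire cost is dominated by the 2^n outer loop, exactly the work the DNA implementation below performs in parallel.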
3.1 First Method
Our implementation on a DNA computer is in two steps. First we create a set of oligonucleotides which code for all assignments. The oligonucleotide sequences are selected such that cross-hybridization is minimal [8]. For each variable a we choose a sequence coding for variable a set to 1, denoted by a, and a sequence coding for variable a set to 0, denoted by ¬a. An assignment contains for each variable a either sequence a or sequence ¬a. Note that the reverse complementary sequence of a is marked with a line above it in the figures (see Figure 1). The second step is repeated for every clause. In this step the oligonucleotides which do not satisfy the clause are destroyed. This is done by selecting and protecting the assignments which satisfy the clause, i.e. those which contain some literal of the clause. The complexity of the second step is independent of the number of variables and clauses. In the second implementation the complexity grows linearly with the number of literals per clause (which is constant in the case of 3-SAT). The space complexity is polynomial in 2^n and the time complexity (number of iterations) is linear in m, the number of clauses.

Implementation. For each literal we assign a DNA sequence of approximately 20 bp. We synthesize oligonucleotides for every possible assignment. The 3' end of all oligonucleotides contains the same sequence (adaptor sequence) of approximately 50 bp. The oligonucleotides can be assembled from oligonucleotides coding for the literals. Adaptor oligonucleotides, each consisting of the reverse complementary sequence of the beginning of a literal and the reverse complementary sequence of the end of the previous literal, are put into solution together with the oligonucleotides coding for the literals. The oligonucleotides hybridize and are then ligated. In this way oligonucleotides coding for all possible assignments can be obtained [1]. For each clause, oligonucleotides are synthesized.
The sequence of these oligonucleotides contains the sequences of all the literals which occur in the clause and the reverse complementary 50 bp adaptor sequence at the 5' end. The oligonucleotides are hybridized with oligonucleotides containing the reverse complementary sequences. The double-stranded DNA is then amplified by the polymerase chain reaction (PCR) with primers containing 5' and 3' topoisomerase attachment sites (Invitrogen, USA). Then DNA sequences containing promoters for RNA polymerases (e.g. T7 promoter and T3 promoter, Invitrogen) are attached by the ligase activity of topoisomerase I (Invitrogen). The resulting DNAs are again amplified by PCR with promoter-element-specific primers. RNA consisting of the sequences for the clauses is now synthesized by in vitro transcription (Invitrogen). Now the clauses are ready for use in the DNA computer. The DNA oligonucleotides coding for the assignments are now hybridized to the RNA molecule coding for the first clause (Fig. 1). The temperature has to be above the melting temperature for the literal sequences. The 3' end of the DNA is elongated with reverse transcriptase. The RNA is then removed with RNase H, which digests RNA in a DNA-RNA heteroduplex [2]. If necessary, unhybridized RNA molecules could be removed by RNase A. A hybridization
temperature that allows for hairpin formation is now used. Only molecules that contain a literal and the reverse complement of the literal form a hairpin structure (Fig. 1). We then use exonuclease I from E. coli (Epicentre Technologies, USA; Biozym Diagnostik GmbH). Exonuclease I is a single-strand-specific exonuclease that digests DNA in the 3' to 5' direction. It requires Mg2+ and a free 3' hydroxyl terminus. After denaturation, only solutions for the clause under investigation remain intact. The solutions are separated from the pieces of attached clauses by hybridizing a short reverse complementary adaptor sequence (Fig. 1). The 3' overlapping single-stranded DNA sequence is again degraded with exonuclease I, and the short oligonucleotide is removed by denaturation. (The short oligonucleotide will be degraded in the cycle for the following clause.) Then the solutions are ready for hybridization with the next clause sequence. The readout is by DNA sequencing [1]. Instead of RNA, the clauses could also consist of DNA. Corresponding oligonucleotides would then code for the clause sequences and the adaptor sequence with the restriction endonuclease recognition site at the 3' end. Either the 5' or the 3' end is biotinylated. After hybridization of the oligonucleotides coding for the solution with the oligonucleotide coding for the clause, the free 3' ends are extended with a DNA polymerase. The double-stranded DNA is denatured, and the strands containing the clause sequences and the biotin are removed with a magnet [1].

3.2 Second Method
An alternative implementation uses each literal of a clause for hybridization to the solutions instead of the whole clause. Again, the literals consist of RNA (Fig. 2). In the i-th iteration of step two we add the molecules of clause i. These molecules consist of adaptor sequence A and the reverse complementary sequence of one of the literals of clause i. After hybridization, the 3' end of the DNA is elongated as described in Section 3.1. DNA molecules which contain the sequence of the attached literal will form an intramolecular hairpin (Fig. 2). The double-stranded portion (stem) of the hairpin protects against digestion by exonuclease I. Removal of the attached literal is as described in Section 3.1. We observe that in each step approximately 2/3 of the DNA coding for assignments that satisfy the clause might be lost, since in the worst case only one of the literals of the clause will be present in the assignment. After each cycle, the remaining DNA can be amplified by linear PCR (only one of the primers is added).
4 A Probabilistic Algorithm for 3-SAT
The best known (worst-case) probabilistic algorithms for the 3-SAT problem use the following strategy. First guess an assignment a by choosing a truth value of 0 or 1 for each variable independently with probability 1/2. Then repeat the following step 3n times, where n is the number of variables. Take any clause
that is not true, choose one of its literals (with probability 1/3), and flip its truth value in a. The probability that a satisfying assignment is found is (3/4)^n, i.e., the expected number of (independent) restarts of the algorithm needed to find a satisfying assignment is (4/3)^n [13]. We give an implementation of the method using intramolecular hairpin formation. We note that the particular secondary structure we predict seems possible but has to be verified experimentally. A different implementation was proposed in [7] which uses separation of molecules according to the operations to be performed. Our implementation is in two steps. In the first step we prepare a set of oligonucleotides coding for a random subset of the possible assignments. The number of assignments can in fact be smaller than (4/3)^n, provided that the number of copies is large; e.g., 1.2^n different assignments are sufficient [6]. The DNA coding for an assignment forms a circle (Fig. 3). Single-stranded circular DNA can be prepared from bacteriophage M13 [2], or the pBluescript system (Stratagene, USA) can be used. For each clause C = {l1, l2, l3} we prepare oligonucleotides containing the reverse complementary sequences of the negated literals (Fig. 3). Between the base pairs coding for the negated literals we place a recognition sequence for a restriction endonuclease (RE, e.g., BamHI). To be able to cut any of the three literals we prepare three different oligonucleotides for each clause by cyclically shifting the literals by 1, i.e., ¬l1 RE ¬l2 RE ¬l3, ¬l2 RE ¬l3 RE ¬l1, and ¬l3 RE ¬l1 RE ¬l2. We note that all three parts of these oligonucleotides can bind to the appropriate reverse complementary parts of the circular molecules coding for assignments that do not satisfy the clause C. We have to set the hybridization conditions (and possibly further enzymatic reactions) such that clauses which are satisfied by the assignment (and have at most 2 reverse complementary parts) do not hybridize.
One of the three literals is cut by a restriction enzyme, and (reverse complementary) sequences for all negated literals (of the clause) with appropriate overlapping parts are added. The 3' ends of these oligonucleotides are blocked so that they cannot be elongated by DNA polymerase (ThermoHybaid, Ulm, Germany). The oligonucleotide coding for the negated literal will hybridize. The DNA is then treated with DNA polymerase followed by T4 DNA ligase. The complex is denatured, and the result is a closed single-stranded DNA circle with one literal replaced by the negated literal (Fig. 3). The space complexity is of order (4/3)^n. The initial set of oligonucleotides can be implemented by a random assembly of literals li, li+1, where li is the oligonucleotide (sequence) coding for the i-th variable or the negation of the i-th variable. The second step (the modification of the molecules, Fig. 3) is iterated 3n times. In each iteration all clauses have to be considered. If we consider the clauses sequentially, the time complexity will be m · n.
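For reference, the random walk that the circular-DNA protocol emulates can be sketched in a few lines of conventional code (Schöning's algorithm [13]; the parameter defaults and the test formula are ours):

```python
import random

def schoening(n, clauses, restarts=200, seed=0):
    """Random walk for 3-SAT: guess a random assignment, then 3n times pick an
    unsatisfied clause and flip one of its literals at random; restart on failure."""
    rng = random.Random(seed)
    sat = lambda c, a: any((lit > 0) == a[abs(lit) - 1] for lit in c)
    for _ in range(restarts):
        a = [rng.random() < 0.5 for _ in range(n)]  # uniform initial assignment
        for _ in range(3 * n):
            unsat = [c for c in clauses if not sat(c, a)]
            if not unsat:
                return a  # satisfying assignment found
            lit = rng.choice(rng.choice(unsat))   # pick a clause, then a literal (1/3)
            a[abs(lit) - 1] = not a[abs(lit) - 1]  # flip its truth value
    return None  # after ~(4/3)^n restarts, F is probably unsatisfiable

clauses = [(1, 2, 3), (-1, 2, 4), (-2, -3, -4), (1, -2, 4)]
a = schoening(4, clauses)
assert a is not None
assert all(any((lit > 0) == a[abs(lit) - 1] for lit in c) for c in clauses)
```

The DNA implementation runs many such walks in parallel, one per circular molecule, with the restriction-cut-and-religate cycle playing the role of the flip.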
Figure 1: Steps for the selection of correct solutions. As an example, two solutions, consisting of DNA with the adaptor sequence, out of a set of 2^3 possible solutions are given. The clauses consist of in vitro transcribed RNA and hybridize via the adaptor sequence. One clause sequence is added in each cycle. The reactions are explained in the text. Letters describe literals. Lines above letters mark reverse complementary sequences of the literals. Thick bars represent sequences for literals, thinner bars adaptor sequences. All sequences are given in the 5' to 3' direction.
Figure 2: An alternative strategy for the selection of correct solutions.
Figure 3: Substitution of a literal d in a circular single-stranded DNA molecule containing an assignment by the negated literal ¬d. Letters a to h represent literals. The reverse complement of a literal x is marked with a line above it. The negation of x is denoted by ¬x.
It is possible to consider the oligonucleotides of all clauses simultaneously. In this case the time complexity is 3n. We note that circular molecules might get cut into several pieces if more than one clause is not satisfied by the assignment and the corresponding oligonucleotides hybridize simultaneously. However, we believe that the loss due to such events will be very small, since simultaneous hybridization of the oligonucleotides should be unlikely.
References

1. L.M. Adleman. Molecular computation of solutions to combinatorial problems. Science, 266:1021–1024, 1994.
2. F.M. Ausubel, R. Brent, R.E. Kingston, D.D. Moore, J.G. Seidman, J.A. Smith, and K. Struhl, editors. Current Protocols in Molecular Biology. Wiley and Sons, 2001.
3. E. Bach, A. Condon, E. Glaser, and C. Tanguay. DNA models and algorithms for NP-complete problems. In Proc. of 11th Conference on Computational Complexity, pages 290–299. IEEE Computer Society Press, 1996.
4. R.S. Braich, N. Chelyapov, C. Johnson, P.W.K. Rothemund, and L.M. Adleman. Solution of a 20-variable 3-SAT problem on a DNA computer. Science, 296:499–502, 2002.
5. K. Chen and V. Ramachandran. A space-efficient randomized DNA algorithm for k-SAT. In A. Condon, editor, Sixth International Workshop on DNA-Based Computers, volume 2054 of LNCS. Springer-Verlag, 2001.
6. E. Dantsin, A. Goerdt, E.A. Hirsch, R. Kannan, J. Kleinberg, C. Papadimitriou, P. Raghavan, and U. Schöning. Deterministic local search decides k-SAT in time (2 − 2/(k + 1))^n. TCS, in press, 2002.
7. S. Diaz, J.L. Esteban, and M. Ogihara. A DNA-based random walk method for solving k-SAT. In A. Condon, editor, Sixth International Workshop on DNA-Based Computers, volume 2054 of LNCS. Springer-Verlag, 2001.
8. A.G. Frutos, Q. Liu, A.J. Thiel, A.M. Sanner, A.E. Condon, L.M. Smith, and R.M. Corn. Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Res., 25(23):4748–4757, 1997.
9. T. Hofmeister, U. Schöning, R. Schuler, and O. Watanabe. A probabilistic 3-SAT algorithm further improved. In Proceedings 19th Symposium on Theoretical Aspects of Computer Science, volume 2285 of LNCS, pages 192–203. Springer-Verlag, 2002.
10. Q. Liu, L. Wang, A.G. Frutos, A.E. Condon, R.M. Corn, and L.M. Smith. DNA computing on surfaces. Nature, 403:175–179, 2000.
11. J.S. McCaskill. Optically programming DNA computing in microflow reactors. Biosystems, 59:125–138, 2001.
12. K. Sakamoto, H. Gouzu, K. Komiya, D. Kiga, S. Yokoyama, T. Yokomori, and M. Hagiya. Molecular computation by DNA hairpin formation. Science, 288:1223–1226, 2000.
13. U. Schöning. A probabilistic algorithm for k-SAT and constraint satisfaction problems. In Proc. 40th FOCS, pages 410–414. IEEE, 1999.
14. L. Wang, J.G. Hall, M. Lu, Q. Liu, and L.M. Smith. A DNA computing readout operation based on structure-specific cleavage. Nature Biotech., 19:1053–1059, 2001.
Version Space Learning with DNA Molecules

Hee-Woong Lim (1), Ji-Eun Yun (1), Hae-Man Jang (2), Young-Gyu Chai (2), Suk-In Yoo (1), and Byoung-Tak Zhang (1)

(1) Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea
(2) Department of Biochemistry and Molecular Biology, Han-Yang University, Ansan, Kyongki-do 425-791, Korea
{hwlim,jeyun,hmjang,ygchai,siyoo,btzhang}@bi.snu.ac.kr
Abstract. Version space is used in inductive concept learning to represent the hypothesis space where the goal concept is expressed as a conjunction of attribute values. The size of the version space increases exponentially with the number of attributes. We present an efficient method for representing the version space with DNA molecules and demonstrate its effectiveness by experimental results. Primitive operations to maintain a version space are derived and their DNA implementations are described. We also propose a novel method for robust decision-making that exploits the huge number of DNA molecules representing the version space.
1 Introduction
Learning can be formulated as a search for a hypothesis, in the space of possible hypotheses, that is consistent with the training examples. Version space learning was proposed as a method for representing the hypothesis space [7]. It maintains the general boundary and the specific boundary to represent the consistent hypotheses consisting of conjunctions of attribute values. However, Haussler [4] shows that the size of the boundaries can increase exponentially in some cases. Hirsh [6] shows that if the consistency problem, i.e. whether there exists a hypothesis in the hypothesis space that is consistent with the example, is tractable, the boundaries are not needed and new examples can be classified using only the positive and negative examples. In this paper, we present a DNA computing method that implements version space learning without maintaining boundary sets. We exploit the huge number of DNA molecules to maintain and search the version space. To use the massive parallelism of DNA molecules efficiently, the encoding scheme is important. Therefore, we present an efficient and reliable method to express the hypotheses in the version space. In the proposed encoding scheme, the number of necessary sequences increases linearly, not exponentially, with the number of attributes. We verify the reliability of this encoding scheme by bio-lab experiments.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 143–155, 2003.
© Springer-Verlag Berlin Heidelberg 2003

We also show that version space learning can be reduced to two primitive set
operations of intersection and difference. Experimental methods are described for these primitive operations as well as for predicting the classification of new examples, and some experimental results verifying these methods are presented. Our work is related to Sakakibara [9] and Hagiya et al. [3] in the sense that they try to learn a concept of a predefined form from training examples and adopt the general generate-all-solutions-and-search framework, as in Adleman [1]. Sakakibara suggested a method to express a k-term DNF with DNA molecules, to evaluate it, and to learn a k-term DNF consistent with the given examples. Hagiya et al. developed a method to evaluate µ-formulas and to learn a consistent µ-formula with whiplash PCR. This paper is organized as follows. In Section 2, it is shown that the process of maintaining a version space can be formulated as a set operation and that version space learning can be performed by two primitive set operations. Section 3 describes the method for version space learning with DNA molecules. In Section 4, the experimental results of these methods are presented. Finally, the conclusion and future work are given in Section 5.
2 Version Space Learning as a Set Operation
An excellent description of version space learning can be found in [7]. Here we describe the basic terminology for our purposes. Attributes are features that are used to describe an object or concept. A hypothesis h is a set of restrictions on the attributes. An instance x is a set of attribute values that describe it, one for each attribute. In this paper, we assume that every attribute takes binary values. For the purpose of illustration, let us consider a concept: "an office that has a recycling bin" [2]. If all offices can be described by the attributes department (cs or ee), status (faculty or staff), and floor (four or five), then "an office on the fourth floor belonging to a faculty member" can be expressed as {status=faculty, floor=four}, or {faculty, four} in abbreviated form. We define the version space VS as the set of hypotheses consistent with the training examples. In this paper, this does not include the concept of general and specific boundaries. VS_e denotes the version space for a single positive example e, i.e. the set of all hypotheses that classify the example e as positive. By the above definition, an instance x is classified as positive by a hypothesis h if and only if h ⊆ x. Therefore, the power set of an instance x is equivalent to the set of hypotheses that classify the instance x as positive. In the above example, an instance {cs, faculty, four} is classified as positive by the hypothesis {faculty, four}, and as negative by the hypothesis {faculty, five}. Version space learning can be viewed as a process that refines the version space by removing inconsistent hypotheses as the training examples are observed (Fig. 1). Thus, this process can be reduced to the following set operations. If a positive example is given, we select all hypotheses that classify the example as positive in the version space. In contrast, if a negative example is given, we eliminate all hypotheses that classify the example as positive in
Version Space Learning with DNA Molecules
145
the version space. Therefore, version space learning can be performed by intersection and difference operations.
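The classification rule h ⊆ x and the power-set characterization of VS_x can be made concrete in a few lines (attribute-value strings are from the office example; the set representation is ours):

```python
from itertools import combinations

def classifies_positive(h, x):
    # A hypothesis h (a set of required attribute values) classifies an
    # instance x as positive exactly when h is a subset of x.
    return h <= x

def power_set(x):
    # VS_x: every subset of x, i.e. every hypothesis classifying x as positive.
    items = sorted(x)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

x = frozenset({"cs", "faculty", "four"})
assert classifies_positive(frozenset({"faculty", "four"}), x)      # positive
assert not classifies_positive(frozenset({"faculty", "five"}), x)  # negative
assert len(power_set(x)) == 8                                      # 2^3 hypotheses
```

The empty set in the power set corresponds to the all-don't-care hypothesis, which classifies every instance as positive.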
VS = initial (whole) hypothesis space
repeat for every new example e:
    VS_e = Power_Set(e)
    if e is positive: VS = VS ∩ VS_e
    else:             VS = VS − VS_e

Fig. 1. Procedure for maintaining a version space
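The procedure of Fig. 1 can be simulated directly, representing a hypothesis as a frozenset of required attribute values (a don't-care is simply an absent attribute). The office example from Section 2 serves as test data; the representation is ours, not the DNA encoding:

```python
from itertools import combinations, product

ATTRIBUTES = [("cs", "ee"), ("faculty", "staff"), ("four", "five")]

def power_set(x):
    items = sorted(x)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}

# Initial version space: each attribute contributes one value or a don't-care.
vs = {frozenset(v for v in choice if v is not None)
      for choice in product(*(vals + (None,) for vals in ATTRIBUTES))}
assert len(vs) == 27  # 3 * 3 * 3 hypotheses

vs &= power_set({"cs", "faculty", "four"})  # positive example -> intersection
vs -= power_set({"cs", "faculty", "five"})  # negative example -> difference

assert len(vs) == 4
assert all("four" in h for h in vs)  # every survivor requires floor = four
```

After one positive and one negative example, only the hypotheses that insist on the attribute value distinguishing the two examples survive, exactly as the set formulation predicts.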
3 3.1
DNA Implementation Encoding
Basically, a hypothesis is represented as a single strand, composed by ligation of the sequences that correspond to its attribute values. When generating the initial hypothesis space, double strands with sticky ends are used to form the conjunctions of attribute values. The difference between the sticky ends of different attributes determines the order of the attribute values in the hypothesis; consequently, a hypothesis can take no more than one value of the same attribute. In addition to the attribute value strands, another double strand with a sticky end is used for each attribute to encode the "don't-care" symbol, which makes the intersection and difference operations easy. The process of creating the initial version space is as follows. First, we put all the double strands with sticky ends, which correspond to the attribute values, into a test tube and let them hybridize with each other. Next, ligation is performed to make the conjunctions of attribute values. Finally, after extracting the single strands that we want, the initial version space is complete. This extraction of single strands can be done by affinity separation with magnetic beads (Fig. 2). By this process we can generate all possible hypotheses, and all hypotheses have the same length. The advantage of this encoding method is that the number of sequences needed is the number of attribute values plus the number of attributes, and it increases linearly, not exponentially, with the number of attributes. Moreover, by concatenating a common strand to the first and the last attribute as a primer site, we can easily amplify the version space with PCR.
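The linear-versus-exponential count can be checked with a short calculation (function names are ours; each attribute needs one strand per value plus one don't-care strand):

```python
def sequences_needed(attribute_values):
    # One strand design per attribute value, plus one don't-care strand per attribute.
    return sum(len(vals) for vals in attribute_values) + len(attribute_values)

def hypothesis_count(attribute_values):
    # Each attribute contributes one of its values or a don't-care.
    n = 1
    for vals in attribute_values:
        n *= len(vals) + 1
    return n

office = [("cs", "ee"), ("faculty", "staff"), ("four", "five")]
assert sequences_needed(office) == 9   # cf. the nine kinds of beads in Section 3.2
assert hypothesis_count(office) == 27

# For k binary attributes: 3k strand designs versus 3^k hypotheses.
for k in (10, 20, 30):
    binary = [("0", "1")] * k
    assert sequences_needed(binary) == 3 * k
    assert hypothesis_count(binary) == 3 ** k
```

The pool of molecules still grows exponentially, but only a linear number of distinct sequences has to be designed and synthesized, which is the point of the encoding.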
146
Hee-Woong Lim et al.
Fig. 2. All attribute strands are put into a test tube, and they are hybridized and ligated. This generates all possible hypothesis strands. “?a ” denotes the “don’t-care symbol” for attribute a
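The linear growth claimed above can be checked by counting: each attribute contributes one strand per value plus one don't-care strand, while the number of assembled hypotheses is the product over attributes. A small sketch for the running example (attribute names taken from Section 2; the counting, not the chemistry, is what is illustrated):

```python
from math import prod

# Attribute domains from the running example.
domains = {'dept': ['cs', 'ee'], 'status': ['faculty', 'staff'],
           'floor': ['four', 'five']}

# One double strand per attribute value, plus one don't-care strand
# per attribute (sticky ends fix the attribute order).
num_strands = sum(len(v) + 1 for v in domains.values())

# Every choice of one strand per attribute position assembles into
# a distinct hypothesis strand.
num_hypotheses = prod(len(v) + 1 for v in domains.values())

print(num_strands, num_hypotheses)
```

Nine strands (six attribute values plus three don't-care strands) suffice to assemble all 27 possible hypotheses, matching the "attribute values + attributes" count in the text.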
In addition to the above encoding scheme, we can also use an Adleman-style [1] encoding scheme. Assume that an attribute value corresponds to a city in the Hamiltonian path problem, and that the conjunction of two attribute values corresponds to the road connecting two cities. Then we can impose an order on the attribute values and restrict their ligation by the presence or absence of an edge strand. Fig. 3 shows two possible directed assembly graphs. Graph (a) in Fig. 3 encodes the “don’t-care symbol” by the absence of an attribute value, so the length of the hypothesis strands is variable. Graph (b) in Fig. 3 encodes the “don’t-care symbol” with a separate DNA strand, like an attribute value; in this case, the length of the hypothesis strands is fixed.
Fig. 3. Two possible directed assembly graphs to encode the hypotheses.
However, the number of necessary strands is the number of nodes plus the number of edges in the graph, and it increases exponentially with the number of attribute values. So, this encoding scheme is infeasible.
Version Space Learning with DNA Molecules
147
Considering the number of strands needed and the implementation of the intersection and difference operations, we use the first encoding method.

3.2 Primitive Operations
Basically, affinity separation with magnetic beads is used to perform the set intersection and difference that are the primitive operations for maintaining the version space, and all separations are positive selections. We designed an experiment that does not need the power set of the examples explicitly, but uses only the individual attribute values separately. To do this, magnetic beads with DNA single strands are needed, and the number of bead types equals the number of attribute values including the “don’t-care symbols”. In the example described in Section 2, there are nine kinds of beads. Intersection In the case of a positive example, we should select the hypotheses that are composed of the example’s attribute values and “don’t-care symbols”. Therefore, affinity separation with beads that carry the attribute values or “don’t-care symbols” must be performed attribute by attribute. For example, if the positive input example is <cs, faculty, four>, the hypotheses that must be selected are: <cs, faculty, four>, <cs, faculty, ?>, <cs, ?, four>, <?, faculty, four>, <cs, ?, ?>, <?, faculty, ?>, <?, ?, four>, and <?, ?, ?>. To obtain these hypotheses, first an affinity separation with the beads “cs” and “?department” is performed simultaneously, and we select all the DNA strands that are hybridized with magnetic beads. We then repeat this with the beads “faculty” and “?status”, and finally with the beads “four” and “?floor” (Fig. 4).
Fig. 4. Intersection operation for a positive example {cs, faculty, four}. DNA strands that are hybridized with beads are selected step by step.
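The sequential bead selections of Fig. 4 amount to filtering the hypothesis pool one attribute at a time: a strand survives step i if its i-th position carries the example's value or the don't-care symbol. A minimal sketch, with hypotheses represented as tuples over (dept, status, floor) and '?' as don't-care (an illustrative convention, not the paper's notation):

```python
def intersect_by_beads(pool, example):
    """Sequential affinity separations: at step i, keep strands that
    hybridize to the bead for example[i] or to the '?' bead."""
    for i, value in enumerate(example):
        pool = {h for h in pool if h[i] in (value, '?')}
    return pool

pool = {('cs', 'faculty', 'four'), ('cs', 'staff', 'four'),
        ('?', 'faculty', '?'), ('ee', '?', '?')}
selected = intersect_by_beads(pool, ('cs', 'faculty', 'four'))
# keeps ('cs', 'faculty', 'four') and ('?', 'faculty', '?')
```

As in the text, the number of separation steps equals the number of attributes, independently of how many hypotheses are in the pool.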
When the number of attributes increases, the experimental process may be complicated because the affinity separation must be performed as many times as
the number of attributes. However, the number of experimental steps increases only linearly with the number of attributes. Difference Because a difference operation is equivalent to selecting the hypotheses that are not elements of the intersection set, one could regard the hypotheses that are not selected by the intersection operation as the result of the difference operation. However, because of the reversibility of the chemical processes involved in our experiments, there can be remaining molecules that are not selected even though they are elements of the intersection set. So, taking the remainder of the intersection as the difference can be a primary source of error. Therefore, the difference operation must also be performed through affinity separation, e.g. using the beads that are not used in the intersection. In the difference operation, beads for different attributes can be used simultaneously. For example, if the negative example is <cs, faculty, four>, we must select the hypotheses whose department is “ee”, status is “staff”, or floor is “five”. To obtain such hypotheses, the beads “ee”, “staff”, and “five” are used to perform affinity separation, and all DNA strands that are hybridized with one or more of the three beads are selected (Fig. 5).
Fig. 5. Difference operation for a negative example {cs, faculty, four}. The beads are used simultaneously and the DNA strands that are hybridized with beads are selected.
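The simultaneous selection of Fig. 5 keeps exactly the hypotheses that hybridize to at least one bead carrying a value absent from the negative example. A sketch under the same tuple-with-'?' representation used for illustration (the domain lists are from the running example):

```python
def difference_by_beads(pool, negative_example, domains):
    """One simultaneous separation: keep strands that hybridize to at
    least one bead whose value differs from the negative example's."""
    beads = [(i, v) for i, dom in enumerate(domains)
             for v in dom if v != negative_example[i]]
    return {h for h in pool if any(h[i] == v for i, v in beads)}

domains = [('cs', 'ee'), ('faculty', 'staff'), ('four', 'five')]
pool = {('cs', 'faculty', 'four'), ('cs', 'faculty', '?'),
        ('ee', 'faculty', 'four'), ('cs', 'staff', '?')}
# Negative example <cs, faculty, four> -> beads "ee", "staff", "five".
kept = difference_by_beads(pool, ('cs', 'faculty', 'four'), domains)
# keeps ('ee', 'faculty', 'four') and ('cs', 'staff', '?')
```

Note that a hypothesis such as <cs, faculty, ?> is correctly discarded: its don't-care position matches the negative example, so it has no concrete value to hybridize with any of the three beads.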
3.3 Classification
So far, we have presented the method to maintain the version space for observed examples. The question now is: how do we use this version space to classify new examples? There are various ways to classify new examples using the current version space [7]. On the one hand, the version space can classify a new example as positive if at least one hypothesis classifies it as positive, and as negative when there is none. On the other hand, the version space can use a majority-voting method; in this case, the new example is classified as the majority of the hypotheses in the version space decide. In this paper the latter method is implemented as follows. First, we divide the solution into two parts of equal volume. Then in one tube we perform the intersection operation with the new example, and in the other tube the difference operation. Now, assuming that the mass of each hypothesis strand is fixed and the distribution is uniform, we can classify the example according to the part that has more DNA molecules. That is, if the intersection tube shows more intensity, we classify the new example as positive, otherwise as negative. To compare the amounts of hypothesis molecules, we can use gel-electrophoresis or fluorescence: if one tube has more DNA molecules, its band in gel-electrophoresis will be thicker, or its fluorescence intensity will be stronger (Fig. 6).
Fig. 6. Classification process. The intersection and difference are performed respectively. The remaining molecules are compared by the fluorescence intensity of DNA molecules. Then, the majority part is selected.
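In silico, the majority vote of Fig. 6 compares the sizes of the intersection and difference parts: assuming equal strand masses and a uniform distribution, comparing hypothesis counts stands in for comparing band or fluorescence intensities. A hypothetical sketch, again using tuples with '?' as don't-care:

```python
from itertools import product

def power_set(example):
    """All generalizations of an example (concrete value or '?')."""
    return set(product(*[(v, '?') for v in example]))

def classify(vs, example):
    """Split the version space, intersect one half with the example's
    power set and subtract it from the other, then take the majority."""
    pos_part = vs & power_set(example)   # intersection tube
    neg_part = vs - power_set(example)   # difference tube
    return 'positive' if len(pos_part) > len(neg_part) else 'negative'

vs = power_set(('cs', 'faculty', 'four'))   # 8 hypotheses
print(classify(vs, ('cs', 'faculty', 'four')))
print(classify(vs, ('ee', 'staff', 'five')))
```

For this version space, the matching example wins the vote 8 to 0, while the opposite example is covered by only one hypothesis (the all-don't-care one) and loses 1 to 7.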
4 Experimental Results on Hypothesis Space Generation

4.1 Sequence Design and Generation of Initial Version Space
To solve the example in Section 2, we needed to design nine DNA sequences for the nine attribute values, and two DNA sequences for the two sticky ends. We used the sequence generator NACST/Seq [10] to design the DNA sequences for the experiment. During generation, we restricted the Tm and GC content so that all sequences have a similar hybridization probability and the distribution of hypotheses is uniform. Similarity and H-measure were considered to prevent cross-hybridization. The designed sequences are given in Table 1. We used the right half of sequence no. 7 and the left half of sequence no. 11 as sticky ends, and used the rest as attribute values, because the similarity between no. 7 and no. 11 was low. All 18 sequences used in the experiment are shown in Table 2.
Table 1. Sequences designed by the sequence generator [10]

No.  Sequence              Tm(°C)  GC%
 1   CTCCGTCGAATTAGCTCTAA  57.17   45
 2   AGTCAGTTGGTGACCGCAGA  61.13   55
 3   GCATATCAGGCGAGTAGGTG  61.85   55
 4   ACAAGGGCTCAGAACCAATG  60.07   50
 5   CAGTACTCGGTTTCCGCTAA  59.10   50
 6   CGTATGCGCATCCGTTTCAT  62.07   50
 7   TTCTTGTGTACAACCGCGGC  62.44   55
 8   ATCATGTAGGAACTGTCGCA  59.43   45
 9   ACTCCGTATCGGGTAGCTTT  60.00   50
10   GGAGTTGACACTATCGTCGT  58.82   50
11   ATAGCCTCGAGGGACGAATA  61.23   50
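The GC column of Table 1 can be checked directly, and a rough melting temperature can be estimated with the Wallace rule (Tm ≈ 2(A+T) + 4(G+C)). Note that NACST/Seq uses its own Tm model, so the Wallace estimate only approximates the table's Tm column; this sketch is for verification of the GC figures, not a reimplementation of the generator.

```python
def gc_percent(seq):
    """GC content in percent."""
    return 100.0 * sum(base in 'GC' for base in seq) / len(seq)

def wallace_tm(seq):
    """Wallace-rule estimate: 2 degrees per A/T, 4 degrees per G/C."""
    return sum(2 if base in 'AT' else 4 for base in seq)

seq1 = 'CTCCGTCGAATTAGCTCTAA'   # sequence no. 1 from Table 1
print(gc_percent(seq1))          # 45.0, matching the table
print(wallace_tm(seq1))          # close to the table's 57.17
```

For sequence no. 1 the GC content comes out at exactly 45%, as tabulated, and the Wallace estimate (58 °C) lands within a degree of the listed 57.17 °C.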
Table 2. DNA strands that are used in the experiment.

Primer   Sequence                                              Tm(°C)
cs       5'- CTCCG TCGAA TTAGC TCTAA ATAGC CTCGA -3'           65.2
         3'- GAGGC AGCTT AATCG AGATT             -5'           48.9
ee       5'- ACAAG GGCTC AGAAC CAATG ATAGC CTCGA -3'           69.3
         3'- TGTTC CCGAG TCTTG GTTAC             -5'           53.7
?dept    5'- ATCAT GTAGG AACTG TCGCA ATAGC CTCGA -3'           66.9
         3'- TAGTA CATCC TTGAC AGCGT             -5'           50.0
faculty  5'-       AGTCA GTTGG TGACC GCAGA CAACC GCGGC -3'     77.1
         3'- TATCG GAGCT TCAGT CAACC ACTGG CGTCT       -5'     70.3
staff    5'-       CAGTA CTCGG TTTCC GCTAA CAACC GCGGC -3'     73.7
         3'- TATCG GAGCT GTCAT GAGCC AAAGG CGATT       -5'     66.9
?status  5'-       ACTCC GTATC GGGTA GCTTT CAACC GCGGC -3'     74.0
         3'- TATCG GAGCT TGAGG CATAG CCCAT CGAAA       -5'     66.9
four     5'- GCATA TCAGG CGAGT AGGTG                   -3'     52.0
         3'- GTTGG CGCCG CGTAT AGTCC GCTCA TCCAC       -5'     76.5
five     5'- CGTAT GCGCA TCCGT TTCAT                   -3'     58.0
         3'- GTTGG CGCCG GCATA CGCGT AGGCA AAGTA       -5'     79.0
?floor   5'- GGAGT TGACA CTATC GTCGT                   -3'     48.5
         3'- GTTGG CGCCG CCTCA ACTGT GATAG CAGCA       -5'     75.2
Each sequence was synthesized as a single strand by Bioneer Corporation (Tae-Jeon, Korea), purified by PAGE, and 5'-phosphorylated.
a. Hybridization of molecules. First, 100 pmol of each of the 18 single strands were mixed in 10 µl. Initial denaturation was performed at 95 °C for 5 minutes, and the mixture was then cooled down to 16 °C at a rate of 1 °C per cycle in an iCycler thermal cycler (Bio-rad, USA).
b. Ligation of molecules. Ligation was performed with T4 DNA ligase at 16 °C overnight. The reaction buffer was 50 mM Tris-HCl (pH 7.8), 10 mM MgCl2, 5 mM DTT, 1 mM ATP, and 2.5 µg/ml BSA.
c. Native gel-electrophoresis (confirmation). The ligation mixture was analyzed by 3% native gel-electrophoresis. The running buffer consists of 40 mM Tris-acetate and 1 mM EDTA, pH 8.0 (TAE). The gel was run on a Bio-rad Model Power PAC 3000 electrophoresis unit at 60 W (6 V/cm) constant power. We confirmed the generation of the initial version space by gel-electrophoresis: the expected band of double-stranded 80 bp DNA (60 bp for three attribute values, 20 bp for two sticky ends) was observed (Fig. 7).
Fig. 7. Result of gel-electrophoresis for the generated version space. Lanes 1 and 2: all 18 single strands were mixed simultaneously to generate the initial version space. Lanes 3 and 4: the two single strands for each attribute value were first hybridized to form double strands, and the double strands were then mixed.
4.2 Primitive Operation
We also tested the affinity separation for the primitive operations, using PCR with different primers and gel-electrophoresis.
a. Preparation of probe sequences with amino linker. Nine probe sequences were synthesized by Koma Biotech (Seoul, Korea); each sequence is complementary to the sequence of one attribute value. Each sequence contains a primary amino group at the 5' end to couple with the magnetic bead. It was concentrated to 200 pmol by ethanol precipitation and resuspended in 100 µl of coupling buffer.
b. Preparation of magnetic probes. The magnetic bead system of CPG (USA) was used. It has a 15 Å aliphatic extension arm terminating in an active primary amine. These magnetic beads were coupled with the modified probe sequences, respectively.
c. Separation with magnetic probes. Affinity separation was performed with the magnetic beads “ee”, “staff”, and “five” sequentially in 100 µl of the initial version space. In other words, the hypothesis <ee, staff, five> was selected. In each separation, we performed hybridization for 2 hours at room temperature, removed the supernatant magnetically, and added 100 µl of pure water. Then we performed denaturation for 5 min at 95 °C and separated the supernatant.
d. Confirmation by PCR and gel-electrophoresis. The above supernatant product was amplified by PCR with different primer sets, and the products were confirmed by gel-electrophoresis. We used four different sets of primers: first, the correct primers “ee” and “five”; second, the incorrect primers “?dept” and “four”; third, the incorrect primers “?dept” and “?floor”; and last, no primers. Each PCR was performed with 50 µM of each dNTP (Takara, Korea) and 5 U of Taq polymerase (Bioneer, Korea) in 10 mM. The reaction was cycled 30 times in an iCycler thermal cycler (Bio-rad, USA) using the following temperatures and cycle times: 95 °C 30 s, 58 °C 2 min, and 72 °C 1 min. The PCR products were analyzed by 3% native gel-electrophoresis with Tris-HCl (pH 8.0), 50 mM KCl, and 2.5 mM MgCl2.
Fig. 8 shows that the amplification worked well in the case of the primer set “ee” and “five”. Though some amplification also occurred in the second and third cases, those bands are lighter than the first. This means that the separation product contains the hypotheses that begin with “ee” and end with “five” rather than others. The second attribute can be confirmed by PCR with the primer set “ee” and “staff”, or “staff” and “five”. We also tested the feasibility of majority voting. UV spectrophotometry and luminescence spectrometry can be used to compare the numbers of hypotheses, i.e. DNA molecules. We measured the signals of DNA solutions at different concentrations. For UV spectrophotometry, we used the initial version space without any modification. The photometric assay was performed using a UV-1601 spectrophotometer (SHIMADZU), and Fig. 9 shows the result. And
Fig. 8. Result of gel-electrophoresis of the PCR products. Lane 1: primers “ee” and “five”. Lane 2: primers “?dept ” and “four”. Lane 3: primers “?dept ” and “?f loor ”. Lane 4: no primers.
in the case of luminescence spectrometry, when generating the initial version space we attached FITC to the 5' end of the third attribute, “floor”. As a result, all hypotheses in the version space had FITC at their 5' end. The fluorescence signal was measured with an SLM-AMINCO Series 2 luminescence spectrometer (AMINCO) at a 520 nm excitation wavelength. Fig. 10 shows the result. Considering that the X-axis is on a log scale, these results show that the UV spectrophotometry and luminescence spectrometry values are proportional to the concentration of the DNA solution. Therefore, it is possible to compare the numbers of hypotheses in two DNA solutions.
5 Conclusion
We proposed an experimental method to implement version space learning with DNA molecules. An efficient encoding scheme for representing version spaces was presented, in which the number of necessary DNA sequences increases only linearly with the number of attribute values. We also showed that version space learning reduces to set operations on hypothesis sets, and defined two primitive operations for maintaining version spaces. Simple experimental methods to maintain a version space were proposed, along with a method to predict the class of a new example with the current version space by majority voting. The number of experimental steps needed to perform the primitive operations and to classify a new example increases only linearly with the number of attributes and the number of attribute values. We confirmed experimentally that hypothesis strands of the correct length were generated, and tested the magnetic bead separation method for the primitive operations in a simple experimental process. The feasibility of majority voting for prediction was also shown.
Fig. 9. UV spectrophotometry
Fig. 10. Luminescence spectrometry using FITC
Further experimental verification of the methods for the primitive operations for version space maintenance and for the prediction of new examples remains as future work; in particular, the full learning process must be verified experimentally. The success of the experiments depends on the accuracy of the affinity separation using magnetic beads. However, the errors of the experimental process may make version space learning robust to noisy training examples; some theoretical and experimental work on this possibility is still necessary. Another point to consider is the case in which an attribute can have more than two values. In this case, it must be specified how the generalization should be performed [5]. Finally, because all the experimental processes described in this paper need only affinity separation by magnetic beads, it seems very plausible that this learning process can be automated by a network of microreactors as described in [8].
Acknowledgement. This research was supported in part by the Ministry of Education & Human Resources Development under the BK21-IT program, by the Ministry of Commerce, Industry and Energy through the Molecular Evolutionary Computing (MEC) project, and by the RIACT project 04212000-0008. The RIACT at Seoul National University provided research facilities for this study.
References
1. L. M. Adleman: Computing with DNA, Scientific American, pages 34-41, August 1998
2. T. Dean, J. Allen, and Y. Aloimonos: Artificial Intelligence, Addison-Wesley, 1995
3. M. Hagiya, M. Arita, D. Kiga, K. Sakamoto, S. Yokoyama: Towards parallel evaluation and learning of Boolean µ-formulas with molecules, DNA Based Computers III, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 48, pages 57-72, 1999
4. D. Haussler: Quantifying inductive bias: AI learning algorithms and Valiant's learning framework, Artificial Intelligence, Vol. 36, pages 177-221, 1988
5. H. Hirsh: Generalizing version spaces, Machine Learning, Vol. 17(1), pages 5-45, Kluwer Academic Publishers, 1994
6. H. Hirsh, N. Mishra, L. Pitt: Version spaces without boundary sets, Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pages 491-496, AAAI Press/MIT Press, 1997
7. T. M. Mitchell: Machine Learning, McGraw-Hill, 1997
8. D. van Noort, F.-U. Gast, J. S. McCaskill: DNA computing in microreactors, Proceedings of the 7th International Meeting on DNA Based Computers, pages 128-137, 2001
9. Y. Sakakibara: Solving computational learning problems of Boolean formulae on DNA computers, DNA Computing, Springer-Verlag, Heidelberg, pages 220-230, 2000
10. S.-Y. Shin, D.-M. Kim, I.-H. Lee, and B.-T. Zhang: Evolutionary sequence generation for reliable DNA computing, Proc. of Congress on Evolutionary Computation 2002, pages 79-84, 2002
DNA Implementation of Theorem Proving with Resolution Refutation in Propositional Logic

In-Hee Lee1, Ji-Yoon Park2, Hae-Man Jang2, Young-Gyu Chai2, and Byoung-Tak Zhang1

1 Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea
2 Department of Biochemistry and Molecular Biology, Hanyang University, Ansan, Kyongki-do 425-791, Korea
{ihlee,jypark,hmjang,ygchai,btzhang}@bi.snu.ac.kr
Abstract. Theorem proving is a classical AI problem having a broad range of applications. Since its complexity grows exponentially with the size of the problem, many researchers have proposed methods to parallelize the theorem proving process. Here, we use the massive parallelism of molecular reactions to implement parallel theorem provers. In particular, we show that the resolution refutation proof procedure can be naturally and efficiently implemented by DNA hybridization. Novel DNA encoding schemes, i.e. linear encoding and hairpin encoding, are presented and their effectiveness is verified by biochemical experiments.
1 Introduction
DNA computing is famous for its massive parallelism. Since Adleman's first experiment [1], many researchers have utilized the parallel reactions of DNA molecules to solve hard computational problems [3,7,18]. Recently, several research groups have proposed DNA computing methods for logical reasoning [6,10,16,17]. Theorem proving is a method for logical reasoning and has a variety of applications, including diagnosis and decision making [8,11]. Resolution refutation is a general technique to prove a theorem given a set of axioms and rules. But theorem proving by resolution refutation has a difficulty in practice: if the goal becomes complex or the number of axioms gets large, the time for theorem proving grows exponentially. To overcome this drawback, parallel theorem provers have been proposed [5,9,15]. However, these parallel machines do not overcome the difficulties inherent in silicon-based technology. Wasiewicz et al. [17] describe an inference system using molecular computing; their inference system differs from ours in that it does not use a resolution refutation technique. A resolution method for Horn clause computation was suggested in [6,16], but was not used for theorem proving. Mihalache proposed an implementation of a Prolog interpreter with DNA molecules, which is an important practical application of resolution refutation theorem proving
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 156-167, 2003. © Springer-Verlag Berlin Heidelberg 2003
[10]. However, his suggestion is not physically feasible, because it requires too many experimental steps. Except for the inference system in [17], none of these has been implemented in real bio-lab experiments. In this paper, we describe resolution refutation theorem proving methods using DNA, verified by experiments. We develop two different encoding methods for logical formulas, i.e. linear and hairpin encodings. These make use of the DNA hybridization reaction in a natural way to perform resolution refutation. Our implementation requires only a constant number of lab steps. The feasibility of the methods is confirmed by lab experiments. The rest of the paper is organized as follows. A brief introduction to theorem proving and resolution refutation is given in Section 2. Sections 3 and 4 describe the linear and hairpin implementation methods and their experimental results, respectively. Conclusions are drawn in Section 5.
2 Theorem Proving with Resolution Refutation
Theorem proving is a method for automated reasoning [8,11]. In theorem proving, one must decide on methods for representing information and on inference rules for drawing conclusions [8]. Here, we confine ourselves to propositional logic and use resolution as the inference rule. In this section, we briefly describe propositional logic, the resolution principle, and resolution refutation. A propositional logic formula consists of Boolean variables and logical connectives. A Boolean variable is a variable that can take only T (true) or F (false) as its value. Among the basic logical connectives are ∧ (and, logical product), ∨ (or, logical sum), ¬ (not, negation), and → (implication). A Boolean variable or its negation is called a literal; more exactly, a Boolean variable with the ¬ connective is called a negative literal and one without it is called a positive literal. It is proven [2] that any n-ary connective can be defined using only ∧, ∨, and ¬. For example, A → B can be defined as ¬A ∨ B for any Boolean variables A, B. To prove theorems using resolution refutation, every formula must be expressed in clause form. A clause form in propositional logic is defined as follows:
(clause form) := (clause) ∧ (clause) ∧ · · · ∧ (clause)
(clause) := (literal) ∨ (literal) ∨ · · · ∨ (literal)
A clause with no literal is called an empty clause. Now the resolution principle can be defined. Let A and B be clauses, and v be a literal such that v ∈ A and ¬v ∈ B. Then, from A and B we can derive (A − {v}) ∨ (B − {¬v}). We say that we resolved A and B on v, and the product of the resolution is called a resolvent. In general, a resolution refutation for proving an arbitrary formula ω from a set of formulas ∆ proceeds as follows [11]:
1. Put the formulas in ∆ into clause form.
2. Add the negation of ω, in clause form, to the set of clauses and call it C.
Fig. 1. Theorem proving using resolution refutation.
3. Resolve these clauses together, producing a resolvent that logically follows from them.
4. If an empty clause is produced, a contradiction has occurred. Thus, it is proved that ω is consistent with ∆. Stop.
5. If no new resolvent can be produced, ω is proved not to be consistent with ∆. Stop.
6. Else, add the resolvent to C and go to step 3.
When an empty clause is derived, C is called the proof of the goal ω. Fig. 1 shows an example of the theorem proving process. The set of formulas ∆ = {P ∧ Q → R, S ∧ T → Q, S, T, P} is given, and we want to prove that R is consistent with ∆. After converting the formulas into clause form, we get the set of clauses {¬P ∨ ¬Q ∨ R, ¬S ∨ ¬T ∨ Q, S, T, P}. Then we add the negation of R, i.e. ¬R, to this set. Each box in Fig. 1 contains one clause, and the clauses and their resolvents are connected with arrows. Resolving ¬P ∨ ¬Q ∨ R and P on P results in ¬Q ∨ R. Similar steps can be continued until an empty clause is produced. The symbol nil denotes an empty clause. The above theorem proving process becomes complex as the number of formulas grows. In the case of the propositional calculus, one must decide which literal to resolve on. Therefore, if there are n different literals, there are n! different theorem proving processes, with n! = O(n^n) by Stirling's approximation; thus, the complexity of the proof grows exponentially, and it is impossible for large n to test one by one with a digital computer which one is a logically correct proof. To speed up finding proofs, two approaches have been developed. One approach is to use heuristics such as the breadth-first strategy, the set-of-support strategy, or the unit preference strategy [8]. The other approach is to parallelize the theorem proving process [5,9,15]. We took the second approach and chose to use the massive parallelism of DNA molecular reactions for implementing parallel theorem provers.
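The procedure above can be run directly on the example of Fig. 1 by representing clauses as frozensets of literals and saturating under resolution until an empty clause appears or no new resolvent can be formed. A minimal propositional sketch, with '~' marking negation (the brute-force saturation here mirrors steps 3-6, not any particular search strategy):

```python
def negate(lit):
    return lit[1:] if lit.startswith('~') else '~' + lit

def resolvents(a, b):
    """All resolvents of clauses a and b (frozensets of literals)."""
    return {(a - {lit}) | (b - {negate(lit)})
            for lit in a if negate(lit) in b}

def refutes(clauses, goal):
    """True iff the empty clause is derivable from the clause set
    plus the negated goal (steps 3-6 of the procedure)."""
    c = {frozenset(cl) for cl in clauses} | {frozenset({negate(goal)})}
    while True:
        new = {r for a in c for b in c for r in resolvents(a, b)}
        if frozenset() in new:
            return True
        if new <= c:          # no new resolvent: goal not provable
            return False
        c |= new

# Clause form of Fig. 1: {~P,~Q,R}, {~S,~T,Q}, {S}, {T}, {P}.
delta = [{'~P', '~Q', 'R'}, {'~S', '~T', 'Q'}, {'S'}, {'T'}, {'P'}]
print(refutes(delta, 'R'))    # True: R follows from delta
print(refutes(delta, '~S'))   # False: ~S contradicts S
```

Saturation terminates because only finitely many clauses exist over a finite literal set, but the blow-up of the clause set with n is exactly the exponential cost the text describes.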
Fig. 2. (a) Encoding for a Boolean variable and its negation (The arrows are from 5’ to 3’). (b) The general procedure for our experiments.
3 Linear Implementation of Resolution Refutation
We solved the theorem proving problem shown in Fig. 1 by biochemical experiments. We developed two versions of the implementation and performed the experiments separately. Each implementation consists of two steps. In the first step, we represent the formulas in clause form with DNA molecules. Because we restrict ourselves to propositional logic, we just need to encode each Boolean variable with a different DNA sequence. Our implementations differ from each other in the way a clause is built from these variable sequences. In both implementations, we encoded each variable with a different sequence of the same length; the negation of a variable is denoted by the Watson-Crick complement of the sequence encoding the variable. The encoding for a Boolean variable and its negation is shown in Fig. 2. In the second step, we implemented the resolution refutation steps with molecular reactions. This step varies depending on the representation method used in the previous step, but the general procedure is identical in our two implementations (see Fig. 2-(b)):
1. Mix DNA molecules corresponding to clauses.
2. Hybridize DNA molecules to perform resolution.
3. Ligate hybridization products to make it easy to find a proof.
4. Perform PCR to amplify ligation products. In this step, only a ligation product which is a valid proof can be amplified.
5. Perform gel electrophoresis to see whether we found a proof or not.

3.1 Representation of Clauses
A clause is designed as a single-stranded DNA molecule that is the concatenation of the sequences of the variables in the clause. We chose the order in which the literals in a clause appear so that a valid proof forms a linear double-stranded molecule after the hybridization step. As an example, the sequence for the clause ¬Q ∨ ¬P ∨ R is the concatenation of the sequences for ¬Q, ¬P, and R (see the topmost box in Fig. 3).
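The complementarity used here can be checked in a few lines: writing sequences 5'→3', the strand for ¬v is the reverse complement of the strand for v. This holds for the Fig. 2 example (A = ACGTTAGA, ¬A = TCTAACGT) and for the sequences of Table 1, where the Q region of the second clause is the reverse complement of the ¬Q region of the first:

```python
def revcomp(seq):
    """Reverse complement of a 5'->3' sequence, returned 5'->3'."""
    return seq[::-1].translate(str.maketrans('ACGT', 'TGCA'))

# Fig. 2: encoding of variable A and its negation.
print(revcomp('ACGTTAGA'))         # TCTAACGT

# Table 1: ~Q region of clause ~Q v ~P v R vs. Q region of Q v ~T v ~S.
print(revcomp('CGTACGTACGCTGAA'))  # TTCAGCGTACGTACG
```

Because complementation is an involution, applying `revcomp` twice returns the original strand, which is what lets a literal and its negation hybridize in either orientation.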
Fig. 3. Simplified process of linear implementation (the arrows are from 5' to 3').

3.2 Implementation of Resolution Refutation
Resolution of a variable between two clauses is represented by the hybridization of the two regions corresponding to that literal in each clause. For example, when a variable v is resolved from clauses A and B, the regions corresponding to v and ¬v in the two clauses hybridize and the other regions remain unchanged. Therefore, if the resolvent is an empty clause, no region will remain as single-stranded DNA, and to see whether an empty clause is produced one needs to verify whether there exists a molecule with no single-stranded region. We used ligation, PCR, and gel electrophoresis to find this molecule. During ligation, every clause sequence used to produce the empty clause is ligated into a long double-stranded molecule. In the next step, we amplify this ligation product by PCR with the goal sequence as a primer. Finally, the result is examined by gel electrophoresis. All of these experimental steps are summarized in Fig. 3.

3.3 Experimental Results
The sequences used in the experiment were designed by NACST using evolutionary optimization [14] and synthesized by Bioneer Co. (Tae-Jeon, Korea). The oligomer sequences are given in Table 1. The experiment consists of the following steps:
1. Purification of oligomers: Each oligomer was 5'-phosphorylated and purified by PAGE. Briefly, 1 nM of each oligomer was mixed and incubated for 1 h at 37 °C with 10 U T4 polynucleotide kinase (Life Technologies) in 70 mM Tris-HCl (pH 7.6) buffer containing 10 mM MgCl2, 100 mM KCl, 1 mM 2-mercaptoethanol and 1 mM ATP (Sigma, St. Louis, MO, USA), in a volume of 100 µl. The T4 kinase was inactivated by heating to 95 °C for 10 min.
Fig. 4. Electrophoretogram of the PCR product in the linear implementation. Lane 1: PCR products with S and ¬R as primers. Lane 2: PCR products with ¬S and R as primers. Lane M is a size marker.
2. Hybridization of oligomers: 100 pM of each oligomer was mixed. Initial denaturation was achieved at 95 °C for 10 min. During hybridization we lowered the temperature from 95 °C to 16 °C (1 °C/min) using an iCycler thermal cycler (Bio-rad, USA).
3. Ligation of hybrid molecules: Ligation was achieved with T4 DNA ligase at 16 °C overnight using the iCycler thermal cycler. The reaction buffer contains 50 mM Tris-HCl (pH 7.8), 10 mM MgCl2, 5 mM DTT, 1 mM ATP, and 2.5 µg/ml BSA.
4. PCR amplification for 'readout': The 50 µl PCR amplification contained 100 pM of primer and template. The reaction buffer consists of 10 mM Tris-HCl (pH 7.8) containing 50 mM KCl, 1.5 mM MgCl2, 0.1% Triton X-100, 0.2 mM dNTP, and 1 U DNA Taq polymerase (Korea Bio-technology, Korea). The PCR condition is given in Table 2.
5. Gel electrophoresis: Amplified PCR products were purified by electrophoresis on a 15% polyacrylamide gel (30% acrylamide [29:1 acrylamide/bis(acrylamide)]). The running buffer consists of 100 mM Tris-HCl (pH 8.3), 89 mM boric acid, and 2 mM EDTA. The sample buffer is Xylene Cyanol FF tracking dye. Gels were run on a Bio-rad Model Power PAC 3000 electrophoresis unit at 60 W (6 V/cm) with constant power.

Table 1. Sequences for linear implementation (from 5' to 3').

clause        sequence
¬Q ∨ ¬P ∨ R   CGTACGTACGCTGAA CTGCCTTGCGTTGAC TGCGTTCATTGTATG
Q ∨ ¬T ∨ ¬S   TTCAGCGTACGTACG TCAATTTGCGTCAAT TGGTCGCTACTGCTT
S             AAGCAGTAGCGACCA
T             ATTGACGCAAATTGA
P             GTCAACGCAAGGCAG
¬R            CATACAATGAACGCA
Table 2. The PCR condition for linear implementation.

cycle    denaturation (94 °C)   annealing (58 °C)   polymerization (72 °C)
1        4 min                  0.5 min             0.5 min
2-26     0.5 min                0.5 min             0.5 min
27       0.5 min                0.5 min             10 min
The electrophoretogram is given in Fig. 4. To form an empty clause, all 5 variables must be resolved. Every time one variable is resolved, a 15 bp double-stranded region is produced. Therefore, we can expect a 75 bp band to appear after gel electrophoresis. As can be seen in Fig. 4, an empty clause was produced. Thus, we found a proof for the given problem.

3.4 Discussion
There are some points to consider when expanding this implementation. First of all, we must rearrange the variables in each clause so that an empty clause forms a linear double-stranded molecule. This requirement poses a limitation on the type of clauses we can use: some kinds of clauses do not form a linear double-stranded molecule no matter how we rearrange the variables. Also, it is impossible to know in advance which sequence to use as a primer. There is also the possibility of false positives: even if S and T do not exist in Fig. 3, our implementation will still produce a band of the expected length. We can, however, reduce the possibility of false positives by including an exonuclease treatment step before the PCR step.
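The refutation that the 75 bp band reports can also be checked in software. The sketch below is not part of the paper's protocol; it is a minimal propositional-resolution saturation loop over the clause set of Table 1, with the variables P, Q, R, S, T numbered 1–5 (an assumed numbering) and negative integers denoting negated literals:

```python
from itertools import combinations

def resolve(c1, c2):
    """Return all resolvents of two clauses (sets of signed literals)."""
    out = []
    for lit in c1:
        if -lit in c2:
            out.append((c1 - {lit}) | (c2 - {-lit}))
    return out

def has_refutation(clauses):
    """Saturate under resolution; True iff the empty clause is derivable."""
    clauses = {frozenset(c) for c in clauses}
    while True:
        new = set()
        for a, b in combinations(clauses, 2):
            for r in resolve(a, b):
                if not r:          # empty clause: refutation found
                    return True
                new.add(frozenset(r))
        if new <= clauses:         # fixpoint reached, no empty clause
            return False
        clauses |= new

# Table 1 clause set: {~Q|~P|R, Q|~T|~S, S, T, P, ~R}
cls = [{-2, -1, 3}, {2, -5, -4}, {4}, {5}, {1}, {-3}]
print(has_refutation(cls))  # True: the empty clause is derivable
```

Since the literal universe is finite, the saturation loop terminates, and propositional resolution is refutation-complete, so a True result here corresponds to the proof the experiment detects.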
4 Hairpin Implementation of Resolution Refutation
As mentioned in the previous section, the linear implementation has several limitations. To overcome them, we introduce branched molecules and hairpin molecules to represent clauses. In this implementation, we can always use the negation of the goal as a primer. A similar idea was introduced by Uejima et al. [16] for Horn clause computation. However, their work was intended to perform Horn clause computation, not resolution refutation, and in their work the molecular form of the empty clause is different from ours.

4.1 Representation of Clauses
Each clause with n literals is represented by a branched molecule with n arms, except for clauses with a single literal. Each arm has a sticky end corresponding to one literal, and the sticky ends for the positive and negative literals of a variable are complementary. A clause with one literal that is not the goal is represented by a hairpin molecule with a sticky end. We encode the goal clause with a linear single-stranded molecule, as in the linear implementation described in the previous section. For example, if there were ¬Q ∨ ¬P ∨ R, S, and ¬R (the goal), the clause ¬Q ∨ ¬P ∨ R is represented by a 3-armed branched molecule, S by a hairpin molecule, and ¬R by a single-stranded molecule (see Fig. 5).

Fig. 5. The simplified process of hairpin implementation (the arrows are from 5' to 3').

Table 3. Sequences for hairpin implementation (in order from 5' to 3').

clause     sequence
P          TATTAAGACTTCTTGTAGTCT
¬P ∨ Q     TAATAAGGAA TCATGTTCCT
¬Q         CATGA

4.2 Implementation of Resolution Refutation
As in the linear implementation, resolution between two clauses is implemented as hybridization between two molecules. When an empty clause is derived, the resulting molecule will start with the goal sequence and end with its negation, since all clauses except the goal are either branched molecules or hairpin molecules. Therefore, at the PCR step, we used only the negation of the goal variable as a primer. To read the PCR product, we used gel electrophoresis. If a band is formed, we conclude that the goal is consistent with the given clauses; if not, we say that the goal is not consistent with them. Because each band corresponds to a proof, we can find several different proofs at one time.
Fig. 6. The form of molecules used in experiments for hairpin implementation. (The arrows are from 5' to 3'.)

Table 4. The PCR condition for hairpin implementation.

cycle     denaturation (98◦C)   annealing (58◦C)   polymerization (72◦C)
1         5 min                 1 min              1 min
2 ∼ 26    1 min                 1 min              3 min
27        1 min                 1 min              7 min
4.3 Experimental Results
To test our idea, we solved a very simple theorem proving problem: given P and P → Q, is Q consistent with them? Putting it in clause form, we get the clauses {P, ¬P ∨ Q, ¬Q}. The forms of the molecules representing these clauses are given in Fig. 6. As in the previous experiment, all sequences were designed by NACST [14] and synthesized by Bioneer Co. The sequences we used are given in Table 3. The experimental steps are the same as in the previous experiment. The main differences between the two experiments were the PCR condition and the type of gel used in gel electrophoresis. The PCR condition for this experiment is given in Table 4. In principle, we should have used ¬Q, the negation of the goal variable, as a primer; but in this case the corresponding sequence is too short to use as a primer. Therefore, we used the lower part of the molecule in Fig. 6-(b) as a primer. We used a 3% agarose gel at the gel electrophoresis step. The electrophoretogram of the hairpin implementation is given in Fig. 7. In Fig. 7, we can see bands of 23 bp and 46 bp, as expected from Fig. 6. From this result, we can say that our method can find a proof if one exists. If there is no such proof, we should fail to get a band of that length. To verify this, we performed the same experiments without one or more of the molecules in Fig. 6 (see Fig. 8). Only lane 12 in Fig. 8 contains all of them, so a band must appear in lane 12 only. But we can see short bands in lanes 2∼7. We think the short bands in lanes 3∼7 are formed by hybridization of two P's (see Fig. 6-(a)), and the longer band in lane 2 seems to be formed by hybridization of P and ¬P ∨ Q (see Fig. 6-(a) and Fig. 6-(b)). This longer band must, however, be shorter than the band in lane 12, for it does not contain the molecule ¬Q. To compare both lanes, we drew dashed lines
Fig. 7. The result of the ligation mixture after PCR and electrophoresis. Lanes 1-8: ligation mixture under various conditions. Lane M: a 25 bp size marker. The arrow indicates the 23 bp PCR product.
in Fig. 8. We find that the band in lane 2 is shorter than that in lane 12, as expected.

4.4 Discussion
In this experiment, because of self-complementary sequences (such as the hairpins) and groups of sequences with complementary subsequences (such as the branches), the PCR step is extremely difficult. If some of these complementary sequences hybridize during the PCR step due to their thermodynamic properties, such as melting temperature, amplification will stop at that point and we will get a false negative result. Also, as can be seen in Fig. 7, one proof can produce two bands of different lengths; we think this is due to the hybridization of two hairpin molecules. There are also cases in which our method cannot tell whether a proof has been found. For example, if the goal variable is resolved more than once, our method will report that a proof exists regardless of the existence of an empty clause. To solve this problem, we need a different detection method for the empty clause or a different encoding scheme. Also, since we did not restrict the form of the clauses, non-Horn clauses can form complicated self-loop structures. Even if we restrict ourselves to Horn clauses, two clauses that share two resolvable variables can also form a self-loop structure. For example, if there are ¬P ∨ Q and P ∨ ¬Q ∨ ¬R, after resolving P or Q the resolvent can form a self-loop structure. What is worse, as the number of variables grows, we must use longer sequences to encode each variable, and the possibility of self-hybridization will grow. To solve this problem, we need a procedure that removes such self-loops, or new representations for clauses.
Fig. 8. The verification of the detection method. Lane 1: without Fig. 6-(a). Lane 2: without Fig. 6-(c). Lane 3: without the upper part of Fig. 6-(b). Lane 4: without the lower part of Fig. 6-(b). Lane 5: Fig. 6-(a) and (c) only. Lane 6: Fig. 6-(a) and the lower part of (b) only. Lane 7: Fig. 6-(a) and the upper part of (b) only. Lane 8: Fig. 6-(b) only. Lane 9: the lower part of Fig. 6-(b) and (c) only. Lane 10: the upper part of Fig. 6-(b) and (c) only. Lane 11: the same as lane 8. Lane 12: with all of the molecules. Lane M: a 25 bp size marker.
When using branched molecules, branch migration may occur. However, as suggested in [12], inserting T's at the junction point can reduce the possibility of branch migration.
5 Conclusions
Using molecular reactions of DNA, we proved theorems in the propositional calculus. We presented methods for encoding clauses with DNA molecules and solved theorem proving problems with lab experiments. Our methods are distinguished from other work in several respects. First, they do not need additional operations except hybridization; only simple operations such as ligation and PCR are needed to verify the results. Second, the number of experimental steps does not vary with the problem size: our implementation methods require only hybridization, ligation, PCR, and gel electrophoresis, and these are all O(1) operations. Finally, taking the limit of the PCR operation (10,000 bp) into consideration, our methods can solve theorem proving problems with up to 660 literals (15mer per literal). As discussed above, there are several things to consider in both implementations, and we are working to improve them.
Acknowledgements. This research was supported in part by the Ministry of Education under the BK21-IT program and the Ministry of Commerce through the Molecular Evolutionary Computing (MEC) project. The RIACT at Seoul National University provided research facilities for this study.
References

1. Adleman, L., Molecular computation of solutions to combinatorial problems, Science, 266:1021–1024, 1994.
2. Fitting, M., First-Order Logic and Automated Theorem Proving, Springer-Verlag New York Inc., 1996.
3. Hagiya, M., Arita, M., Kiga, D., Sakamoto, K., and Yokoyama, S., Towards parallel evaluation and learning of Boolean µ-formulas with molecules, Preliminary Proceedings of the Third DIMACS Workshop on DNA Based Computers, 105–114, 1997.
4. Hagiya, M., From molecular computing to molecular programming, Lecture Notes in Computer Science, 2001.
5. Hasegawa, R., Parallel theorem-proving system: MGTP, Proceedings of Fifth Generation Computer Systems, 1994.
6. Kobayashi, S., Horn clause computation with DNA molecules, Journal of Combinatorial Optimization, 3:277–299, 1999.
7. Lipton, R.J., DNA solution of hard computational problems, Science, 268:542–545, 1995.
8. Luger, G.F. and Stubblefield, W.A., Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 2nd Ed., Benjamin/Cummings, 1993.
9. Lusk, E.L. and McCune, W.W., High-performance parallel theorem proving for shared-memory multiprocessors, http://wwwfp.mcs.anl.gov/~lusk/papers/roo/paper.html, 1998.
10. Mihalache, V., Prolog approach to DNA computing, Proceedings of the IEEE International Conference on Evolutionary Computation, IEEE Press, 249–254, 1997.
11. Nilsson, N.J., Artificial Intelligence: A New Synthesis, Morgan Kaufmann Publishers Inc., 1998.
12. Sa-Ardyen, P., Jonoska, N., and Seeman, N.C., Self-assembling DNA graphs, Preliminary Proceedings of the Eighth International Meeting on DNA Based Computers, 20–28, 2002.
13. Sakamoto, K., Gouzu, H., Komiya, K., Kiga, D., Yokoyama, S., Yokomori, T., and Hagiya, M., Molecular computation by DNA hairpin formation, Science, 288:1223–1226, 2000.
14. Shin, S.-Y., Kim, D., Lee, I.-H., and Zhang, B.-T., Evolutionary sequence generation for reliable DNA computing, 2002 IEEE World Congress on Evolutionary Computation, 2002 (accepted).
15. Suttner, C., SPTHEO – a parallel theorem prover, Journal of Automated Reasoning, 18(2):253–258, 1997.
16. Uejima, H., Hagiya, M., and Kobayashi, S., Horn clause computation by self-assembly of DNA molecules, Preliminary Proceedings of the Seventh International Meeting on DNA Based Computers, 63–71, 2001.
17. Wasiewicz, P., Janczak, T., Mulawka, J.J., and Plucienniczak, A., The inference based on molecular computing, International Journal of Cybernetics and Systems, 31(3):283–315, 2000.
18. Winfree, E., Algorithmic self-assembly of DNA, Ph.D. Thesis, California Institute of Technology, 1998.
Universal Biochip Readout of Directed Hamiltonian Path Problems

David Harlan Wood 1, Catherine L. Taylor Clelland 2, and Carter Bancroft 2

1 Computer and Information Science, University of Delaware, Newark, DE 19716, USA
[email protected]
2 Department of Physiology and Biophysics, Box 1218, Mount Sinai School of Medicine, One Gustave L. Levy Place, New York, NY 10029, USA
{Catherine.Clelland, Carter.Bancroft}@mssm.edu
Abstract. A universal design for a biochip that reads out DNA encoded graphs is enhanced by a readout technique that may resolve multiple solutions of Hamiltonian path problems. A single laboratory step is used. DNA encoded graphs are labeled with many quantum dot barcodes and then hybridized to the universal biochip. Optical readouts, one for each barcode, yield multiple partial readouts that may isolate individual paths. Computer heuristics then seek additional individual paths.
1 Introduction
The design of a universal biochip for readout of any DNA encoded graph with n or fewer nodes is presented. We emphasize reading out multiple solutions of Directed Hamiltonian path (DHP) problems. We do not solve DHP problems; we merely wish to read out DHP solutions. However, an innovative simultaneous quantum dot barcode labeling of DHP solutions with n² probes yields partial readouts of the biochip. These multiple partial readouts may fully determine some paths, even when multiple paths are present. Further processing by conventional computers seeks additional paths. Since the pioneering molecular computation of a DHP problem [1], careful design of execution conditions and DNA encodings has been employed to successfully solve DHP problems using DNA [2,3,4]. Detection, but not readout, of solutions of the DHP problem was originally performed by a series of DNA strand purifications followed by gel electrophoresis [1]. However, readout of DHP solutions has received less attention. DHP readout by biochip hybridization has been suggested [2,5]. However, DHP and other graph problems can have multiple solutions, and this presents an important difficulty: following hybridization to the biochip, all paths
The following partial support is gratefully acknowledged: NSF Grant No. 0130385, NSF Grant No. 9980092, and DARPA/NSF Grant No. 9725021
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 168–181, 2003. © Springer-Verlag Berlin Heidelberg 2003
become superimposed upon one another, leaving any one path indistinguishable from the others. Proposed modifications of the original biochip approach have, in principle, overcome these superposition difficulties. One modification uses biochip readout to direct further laboratory steps, including additional biochip readouts, to eventually isolate all of the individual Hamiltonian paths [5]. This method may require O(nk) laboratory steps to find and read out k paths, each of length n. We have designed a readout procedure using innovative multiple labeling and a universal biochip. We describe here a laboratory approach for the readout of DHP problems having multiple solutions. We show that resolution of multiple DHP solutions may be obtained from n² partial readouts, one for each of our multiple labels. Heuristics for computational determination of further multiple DHP solutions from our partial readouts are presented. These and further heuristics are under development [6].
2 Graphs and Their Adjacency Matrix Representations
A graph is a set of nodes, usually drawn as points, plus a set of edges, usually drawn as arrows joining pairs of nodes. The graph representation we use is the adjacency matrix of a graph. It should be noted that any graph with at most n nodes can be represented by an n × n adjacency matrix. This is one reason our universal biochip design is able to read out unrestricted graphs (up to size n) encoded in DNA.

2.1 The Adjacency Matrix of a Graph
If there are n vertices in a graph, then we form an n × n adjacency matrix of 0s and 1s. If there is a 1 in the ith row at the jth column, this means that the graph has a directed edge from the ith vertex to the jth vertex. A 0 in that position means there is no such edge. An example of an adjacency matrix representing a map of airline routes involving four cities is

              Arrive c1   Arrive c2   Arrive c3   Arrive c4
   Depart c1      0           1           1           0
   Depart c2      1           0           1           1
   Depart c3      1           0           0           1
   Depart c4      1           1           1           0          (1)
From the city in row 1, one must go to either city 2 or city 3. From city 2, one can return to city 1, or go on to either city 3 or city 4. From city 3, one must go to city 1 or city 4, where one can either return or continue to city 2 or city 1.
2.2 Superpositions of Adjacency Matrices
Sometimes we are given the superposition of the adjacency matrices of a collection of graphs. Technically, superposition is just the operation of elementwise logical OR applied to two or more adjacency matrices. (Logical OR is defined by 0 OR 0 = 0, 1 OR 0 = 0 OR 1 = 1, and 1 OR 1 = 1.) That is, we obtain a superposition matrix with a nonzero element in position i, j if and only if at least one of the graphs in the collection has a nonzero element in position i, j of its adjacency matrix. Note that the airline map in (1) might represent a superposition of three maps: one map for an airline with round robin service, and two more maps for associate airlines providing local shuttle services:
   0 1 0 0      0 0 1 0      0 1 0 0
   0 0 1 0      0 0 0 1      1 0 0 0
   0 0 0 1  ,   1 0 0 0  ,   0 0 0 1  .          (2)
   1 0 0 0      0 1 0 0      0 0 1 0
But notice that several other collections of adjacency matrices could also be superimposed to produce (1). It is clear that when multiple adjacency matrices are superimposed there is no general method for recovering the individual adjacency matrices from the superposition. However, in this paper we present a technique for the optical readout of an n × n universal biochip to be decomposed into n2 partial readouts that may decompose superpositions of adjacency matrices describing collections of directed Hamiltonian paths. Individual adjacency matrices from the collections may be isolated in this way.
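Superposition as described above is easy to state in code. The sketch below (not from the paper) takes elementwise logical OR of adjacency matrices; the three matrices are the round-robin map and the two local shuttle maps of display (2), and their superposition is the airline map (1):

```python
# Superposition of adjacency matrices = elementwise logical OR.
def superpose(*mats):
    n = len(mats[0])
    return [[int(any(m[i][j] for m in mats)) for j in range(n)]
            for i in range(n)]

round_robin   = [[0,1,0,0], [0,0,1,0], [0,0,0,1], [1,0,0,0]]  # 1->2->3->4->1
shuttle_13_24 = [[0,0,1,0], [0,0,0,1], [1,0,0,0], [0,1,0,0]]  # 1<->3, 2<->4
shuttle_12_34 = [[0,1,0,0], [1,0,0,0], [0,0,0,1], [0,0,1,0]]  # 1<->2, 3<->4

print(superpose(round_robin, shuttle_13_24, shuttle_12_34))
# [[0, 1, 1, 0], [1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 1, 0]]  -- matrix (1)
```

Note that `superpose` is many-to-one: several different collections of matrices OR together to the same result, which is exactly the recovery problem the partial readouts of Section 5 address.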
3 Universal Graph Biochip and Q-dot Barcode Labels
Let n be the maximum number of nodes in any graph in a collection of graphs. Suppose each input graph in the collection is encoded in a DNA strand containing encodings of its edges (each edge twice, for convenience). Suppose each edge is encoded as the encoding of its departure node followed by the encoding of its arrival node.

3.1 Standardized DNA Encoding of Graph Nodes
DNA sequences encoding each of the n nodes are to be agreed upon in advance, and these DNA sequences are designated c1 , c2 , c3 , . . . , cn . This establishes standard DNA sequences allowing any graph with n or fewer nodes to be encoded in DNA. These standardized sequences only need to be designed once. This means that our universal DNA biochip could be fabricated in quantity without advance knowledge of what graphs it may be used to read out. Only two assumptions are made about the DNA encodings of graphs to be read out.
1. Specific standard sequences are used to encode the graph edges.
2. Any additional DNA sequences must not hybridize to the standard sequences or their complements.

Thus, we are able to design our own DNA sequences to make them convenient for biochip readout for all graphs.

3.2 Standardized DNA Encoding of Graph Edges
DNA Encodings of Biochip Spots. A universal biochip for readout of all graphs (up to size n) uses standard sequences that need to be designed only once. Each of the n² spots on our biochip contains the DNA sequence complementary to the sequence resulting from appending the sequence encoding some node ci followed immediately by the sequence encoding some node cj. This is done for each pair i and j, with both i and j running between 1 and n. Thus the DNA in the spot in the ith row and the jth column on the biochip will preferentially hybridize to input strands that contain the sequence ci immediately followed by the sequence cj. Figure 1 below shows the layout of the universal chip in the case of n = 4.
              Arrive c1   Arrive c2   Arrive c3   Arrive c4
   Depart c1    c1 c1       c1 c2       c1 c3       c1 c4
   Depart c2    c2 c1       c2 c2       c2 c3       c2 c4
   Depart c3    c3 c1       c3 c2       c3 c3       c3 c4
   Depart c4    c4 c1       c4 c2       c4 c3       c4 c4

Fig. 1. Given four standard node sequences, c1, c2, c3, and c4, the 4×4 universal biochip would have sequences complementary to those shown.
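The spot layout of Fig. 1 can be generated mechanically. The sketch below is illustrative only: the node sequences are hypothetical placeholders (not the paper's designed, non-crosshybridizing sequences), and "complementary" is taken in the usual reverse-complement sense:

```python
# Build the n x n grid of spot probes: the spot at row i, column j
# carries the reverse complement of node_i + node_j.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Watson-Crick reverse complement of a DNA sequence."""
    return s.translate(COMP)[::-1]

nodes = ["ACGTACG", "TTGACCA", "GATCGAT", "CCATTGC"]  # c1..c4, made up

chip = [[revcomp(ci + cj) for cj in nodes] for ci in nodes]
# chip[i][j] hybridizes to any strand containing nodes[i] immediately
# followed by nodes[j], i.e. to the encoding of the edge c_{i+1} -> c_{j+1}.
```

Designing the real node sequences so that all n² concatenated pairs avoid cross-hybridization is the hard part, as the following paragraph notes; this sketch only shows the combinatorial layout.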
The appended sequences for the n² spots can be designed and tested for non-crosshybridization without regard for the particular graphs to be read out. One of many methods for obtaining non-crosshybridizing DNA sequences is found in [7]. Non-crosshybridization of DNA sequences for other biochips has been specifically investigated before [8,9]. However, our universal design makes stronger demands than prior investigations. Since we are limited to choosing only n node sequences, we are not able to use n² independently chosen DNA edge sequences. Preventing cross-hybridization is more challenging in our case because each of our node sequences is used as one-half of many edge sequences (see Fig. 1). Encoding Graphs in DNA Strands. A graph is constructed as a DNA strand that includes encodings of all of the edges of the graph. Although it is not strictly necessary, let us assume each edge occurs twice in the graph encoding. Labels (e.g. fluorescent) can be attached to the DNA strand, and the strand hybridized
to a universal biochip. Each spot on the biochip is complementary to one of the n² possible edges. The spots on the universal biochip are prearranged so that when the graph strand is hybridized to the biochip and the labels are observed, the adjacency matrix of the graph is displayed. An example is shown in Fig. 2.
              Arrive c1   Arrive c2   Arrive c3   Arrive c4
   Depart c1      .           *           *           .
   Depart c2      *           .           *           *
   Depart c3      *           .           .           *
   Depart c4      *           *           *           .

   (* : spot showing label fluorescence; . : blank spot)
Fig. 2. Appearance of biochip readout for the graph in (1).
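The strand construction that produces readouts like Fig. 2 can be sketched as follows (illustrative only; node sequences are hypothetical placeholders, and each edge is included twice, following the paper's convention):

```python
# Encode a graph as a DNA strand: concatenate, for every edge (a, b),
# the departure-node sequence followed by the arrival-node sequence,
# with each edge's encoding appearing twice.
nodes = {"c1": "ACGTACG", "c2": "TTGACCA", "c3": "GATCGAT", "c4": "CCATTGC"}

def encode_graph(edges):
    return "".join(nodes[a] + nodes[b] for (a, b) in edges for _ in range(2))

strand = encode_graph([("c1", "c2"), ("c2", "c1")])
# The strand contains the site nodes["c1"] + nodes["c2"], so it can
# hybridize to the (row 1, column 2) spot of the universal chip.
assert nodes["c1"] + nodes["c2"] in strand
```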
3.3 Fabrication of the Universal Biochip
Construction of numerous copies of the universal biochip will involve printing each DNA edge sequence onto glass slides using current protocols. Printing 10,000 spots per biochip is possible [10]. Specifically, our DNA sequences to be printed on the biochip can be commercially synthesized, and then printed onto the array in specific positions, using commercial microarray printers, such as the Affymetrix 477 arrayer. (However, the design of the n² complementary edge sequences permits constructing them by ligating n synthesized sequences in pairs.) The use of standard hybridization conditions [8] will allow DNA strands to hybridize to spots on the array that contain sequences complementary to edges encoded within each graph.

3.4 Quantum Dot Barcodes for Optical Readout
A fluorescent quantum dot (Q-dot) consists of a nanocrystal cadmium selenide core wrapped in a zinc sulphide shell [11]. When hit by a beam of light, the electrons in a Q-dot emit light at a specific wavelength (through quantum confinement). The wavelength is directly related to the size of the Q-dot, so by varying the dot size, Q-dots with unique, spectrally distinct wavelengths can be produced. A single wavelength of light can be used for simultaneous excitation of all Q-dots [11]. Because Q-dots emit at precise wavelengths, Q-dot emission detection is free of the problems arising from signal overlap or photobleaching [11]. For our purposes, it is important that Q-dots can be integrated into microbeads to form Q-dot barcodes [12] attached to DNA strands. Specifically, Q-dots emitting distinct wavelengths are mixed in known concentrations and incorporated into polystyrene microbeads. This yields beads emitting correspondingly different spectra following excitation. The combinatorial power obtained by varying both the wavelengths and intensities within the fluorescent emission
spectra of these microbeads yields large numbers of distinct Q-dot barcodes. It has been suggested that “a realistic scheme” could use 5-6 single frequencies, each with 6 intensity levels, yielding approximately 10,000 to 40,000 distinguishable barcodes [12]. For our applications, the Q-dot barcode microbeads can be attached to DNA strands. The carboxylic acid groups on the surface of Q-dot beads can be conjugated to streptavidin molecules, which can then be conjugated to biotinylated DNA strands. As described in the following section, the optical readout of multiple graphs can exploit the diversity of Q-dot beads that are bound to DNA strands, together forming Q-dot barcode labels. It is important to note that individual Q-dot barcode microbead images are much smaller than the spots on a typical biochip.
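The barcode count quoted from [12] is easy to sanity-check: with f distinct frequencies and m distinguishable intensity levels per frequency, there are m ** f combinations. (This simple count is our gloss on [12], not a formula from that paper.)

```python
# Combinatorial barcode count: m intensity levels on each of f frequencies.
def barcode_count(frequencies, levels):
    return levels ** frequencies

for f in (5, 6):
    print(f, barcode_count(f, 6))  # 5 -> 7776, 6 -> 46656
```

With 6 levels, 5 frequencies give 7,776 codes and 6 frequencies give 46,656, which brackets the "approximately 10,000 to 40,000" figure in the text.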
4 Graph Readout Using the Universal Biochip
In this section we consider three labeling techniques applicable to collections of graphs each of which is encoded in single stranded DNA. In Subsection 4.1 we use simple uniform labeling of graphs. In this case, all graphs in the collection are superimposed on the biochip. This can give good information if there are very few distinct graphs. In Subsection 4.2 we label only the graphs that contain a specific edge. In this case, only the graphs containing the specific edge are superimposed. This is most useful if only one, or a few, graphs contain the specific edge. The total information is limited, however, by the choice of the specific edge. In Subsection 4.3, we simultaneously label all possible edges in all of the graphs. Again, all graphs in the collection are superimposed, but if each edge label is distinct, we can isolate separate readouts (corresponding to Subsection 4.2) for each possible edge. The visually oriented reader may want to glance ahead to the examples shown in Fig. 4 and Fig. 5.

4.1 Labeling All Graphs in a Collection
Particularly if we have only one DNA encoded graph we wish to read out, it suffices to simply label all the input DNA strands. If there is more than one graph in the input, the effect is that all the graphs will be superimposed on the biochip. If there are very few distinct graphs, this may be adequate.

4.2 Labeling a Single Specific Edge
Let us concentrate on the labeling of one specific edge. Focus on only those graphs encoded in DNA strands which contain the encoding of the edge ci → cj . Now let us add a label consisting of DNA complementary to the encoding of the edge ci → cj . This label is attached to a unique Q-dot barcode microbead. This barcode label can hybridize to any input strand that encodes the edge ci → cj .
Fig. 3. A labeled graph hybridized to two typical spots on the universal biochip.

All it takes for a spot at the k, l position on the chip to light up is that some graph in the input mixture somewhere contains ci followed by cj (to bind to the ci → cj barcode label) and somewhere contains ck followed by cl (to bind to the k, l spot). Recall we are using just this one label, namely, on the DNA encoding the edge ci → cj. When we view the whole chip, we see the superposition of all the input graphs that somewhere contain ci followed by cj. The overall effect is that a label corresponding to a particular edge produces the superposition of all graphs containing that edge.

4.3 Labeling All Possible Edges Simultaneously
In the previous subsection a single specific edge was barcode labeled. In this subsection, all edges are labeled. We present a technique in which n² different Q-dot barcode labels are used, one label for each of the possible edges that can occur in a graph. We apply all of these labels simultaneously in one laboratory step. The labels are scarce enough that the DNA encoded graph strands bind to only one, or a few, labels. Labeling is followed by hybridization to the universal biochip. This sets the stage for Section 5, where a single optical readout is separated into n² images of the biochip, one image for each different barcode edge label. Figure 3 shows how our scheme using multiple labels interconnects with hybridization to the universal biochip. Two examples are shown in Fig. 3, namely, the biochip spots immobilizing DNA complementary to DNA encoding (a) the edge from c1 → c2 and (b) the edge from c2 → c1. At the top of Fig. 3, as in all the horizontal lines, we have the DNA encoding of a particular permutation graph.
(Notice that each edge of the permutation graph is encoded twice.) Watson-Crick complements are indicated by primes; for example, C1C2′ is the complement of C1C2. Each of the edges C1 → C2, C2 → C1, C3 → C4, and C4 → C3 is assigned its own unique Q-dot barcode. Recall that the barcoded labels are added so that most graph strands have only one label bound. The left of the figure shows the encoded graph, with various barcode labels, hybridizing to the spot in the first row and second column of the biochip. The right hand side of the figure corresponds to the spot in the second row and first column of the biochip. From Fig. 3 it is seen that, when encoded as a DNA strand, a graph literally carries a barcode label corresponding to a particular edge to every spot on the biochip that corresponds to some edge of that graph. Or, in the words of one of the reviewers of this manuscript, "each strand is multiply indexed: once by its position on the chip (which identifies one edge in the encoded Hamiltonian path (i, j)), and a second time by an attached label (which identifies a second edge in the encoded Hamiltonian path (k, l))."
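The reviewer's "double indexing" can be sketched abstractly (this model is ours, not the paper's): under the all-edges labeling, a strand encoding a path is observable at every (label, spot) pair drawn from the path's own edge set.

```python
# For a path with edge set E, the observable (label, spot) pairs are
# all ordered pairs of edges in E: the label identifies one edge, the
# spot position identifies the other.
def observations(path_edges):
    return {(label, spot) for label in path_edges for spot in path_edges}

p = [(1, 2), (2, 3), (3, 4)]  # a hypothetical path's edges
obs = observations(p)
print(len(obs))  # 9: three edges give 3 x 3 ordered (label, spot) pairs
```

A path with n edges thus contributes n² observation pairs, which is the redundancy the partial readouts of Section 5 exploit.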
4.4 Example Readouts Using Universal Biochip
Figure 4 schematically shows examples of biochip readouts using n² different Q-dot edge barcode labels (Q-dots enlarged about one thousand-fold for clarity).
Fig. 4. Part (A) defines a scheme assigning sixteen unique Q-dot barcodes to edges. Part (B) shows the readout of a simple graph. Notice that each lighted spot contains four colors, one for each edge in the graph. Part (C) shows the (superimposed) readout from two graphs. Part (D) shows the readout from an example collection of graphs defined by (3) below.
The readout in Fig. 4(D) comes from an example collection of eight (permutation) graphs, each given by a 4 × 4 permutation adjacency matrix (3).
This example collection of graphs will be used to illustrate the methods of the next section. To end this section, we summarize the most important concept of this paper: we label the input graphs with n² distinct edge barcode labels and allow the labeled inputs to hybridize on the universal chip; in the next section, an optical readout of the universal biochip is obtained and then separated into n² partial readouts that correspond to the n² possible graph edges.
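The concept just summarized can be modeled in a few lines (a sketch of ours, with hypothetical example graphs): the partial readout for label (i, j) is the superposition (here, the union of edge sets) of exactly those input graphs that contain the edge (i, j).

```python
# Model of the n^2-label scheme: one partial readout per possible edge.
def partial_readouts(graphs, n):
    """graphs: list of edge sets {(i, j), ...} with 0 <= i, j < n."""
    readouts = {}
    for label in ((i, j) for i in range(n) for j in range(n)):
        hits = [g for g in graphs if label in g]
        if hits:
            readouts[label] = set().union(*hits)
    return readouts

# Two hypothetical input graphs:
g1 = {(0, 1), (1, 2), (2, 3)}
g2 = {(0, 2), (2, 1), (1, 3)}
r = partial_readouts([g1, g2], 4)
# The edge (0, 1) occurs only in g1, so that partial readout isolates g1:
print(r[(0, 1)] == g1)  # True
```

When a label edge lies on only one input path, its partial readout is exactly that path, which is how individual Hamiltonian paths may be recovered from an otherwise superimposed readout.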
5 Optical Readout from n² Distinct Barcode Labels
In Subsection 4.3 we labeled a collection of DNA-encoded graphs with n² distinct Q-dot edge barcode labels. The n² barcodes correspond to the complements of every possible edge in the graph. In the laboratory, we can hybridize all labels to all inputs using an automated 384-well format. We then allow the labeled collection of DNA-encoded graphs to hybridize on the universal chip. After the hybrid probes of labeled inputs are hybridized to the biochip, hybridization is detected by the fluorescence emission of Q-dots. Standard microarray protocols can be used for hybridization of biochips [13]. Traditional fluorescence microscopy can then be employed for biochip readout, using a single wavelength to excite all Q-dot barcode microbeads. Images are then captured with an ImagePoint cooled CCD video camera (Photometrics) through a Labophot-2A fluorescence microscope (Nikon) [14].

5.1 Separation of n² Barcode Readouts
Next we extract from the optical readout the contributions from each one of the n² distinct Q-dot barcode labels. Thus, we obtain n² separate partial readouts of the biochip data. Our motivation is that the superposition of graphs is lessened in the partial readouts. Each partial readout corresponds to a ci → cj edge barcode for a particular i and j. Such a partial readout exhibits the superposition of only those input graphs containing ci → cj, as discussed in Subsection 4.2. We resolve the images from each of the n² barcode Q-dot beads distributed randomly over the surface of individual spots on the biochip. These images are
Universal Biochip Readout of Directed Hamiltonian Path Problems
177
about one thousand times smaller than the biochip spots. To our knowledge, no attempts have been made previously to resolve images arising from subsections of single spots on a biochip.

5.2 Readout Constraints on Scaleup for Large Biochips
Biochips with up to 10,000 spots can be fabricated with present technology [10]. Assuming a particular biochip has a value of n² = 10,000 (a 100 × 100 biochip), and each edge within the input graphs has a unique barcoded bead attached, there would be 10,000 labeled hybrid probes, each with a unique Q-dot barcode. Scaleup is constrained by the following consideration. The diameter of a given Q-dot barcode bead can be between 0.1 and 5 µm, depending on manufacturing design, and this size is independent of the number of Q-dots contained within the bead [12]. Q-dot beads are approximately one thousand times smaller in diameter than one biochip spot (Clelland and Bancroft, personal observation). Thus, Q-dot barcode bead images could be resolved over the surface of any spot for somewhat fewer than one million barcodes. In particular, 10,000 input+label hybrid probes should be distinguishable after hybridization to a biochip. Software that permits measurement of intensity, hue, and saturation from thousands of image objects (e.g., SigmaScan Pro, SPSS Science) can then be employed to scan each biochip spot and thus yield readout of individual barcodes. A simple dilution of input+label hybrid probes would allow simultaneous readout of more beads. Using this approach, we can say that one laboratory step is sufficient for readout when n² = 10,000. When n² > 10,000, the number of laboratory steps is n²/10,000. Our new approach for Hamiltonian path readout could replace the O(nk) laboratory steps of [5] with n²/10,000 laboratory steps. For example, suppose n = 100, and nk has its maximum value of n² − n = 9900. In this case, 9900 laboratory steps would be replaced by one. We know of no physical principles that would forbid scaling up universal graph readout to the level described here, and beyond.
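The scaleup arithmetic above can be checked directly (a worked sketch; the 10,000-barcodes-per-step capacity is the assumption stated in the text):

```python
import math

# Back-of-envelope check of the scaleup claim, assuming about 10,000
# distinct barcodes can be resolved per laboratory readout step.
BARCODES_PER_STEP = 10_000

def readout_steps(n):
    """Laboratory steps needed to read out all n*n edge barcodes."""
    return max(1, math.ceil(n * n / BARCODES_PER_STEP))

assert readout_steps(100) == 1          # n^2 = 10,000: one step suffices
assert readout_steps(200) == 4          # n^2 = 40,000: four steps
# The O(nk) approach of [5] needs up to n^2 - n = 9900 steps at n = 100.
assert 100 ** 2 - 100 == 9900
```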
6 Isolating Individual DHP from n² Partial Readouts
Up to this point we have presented designs and techniques for general, unrestricted graphs. We now specialize our results to permutation graphs, which are a modest generalization of directed Hamiltonian paths. A graph is said to be a permutation graph if its adjacency matrix has all 0 entries except one 1 in each row and one 1 in each column. Notice that the graphs in our example collection of graphs in (3) are all permutation graphs.

6.1 Displaying n² Partial Readouts
Consider a typical partial readout corresponding to the edge ci → cj . Denote this n × n readout matrix by Ri,j . This matrix is the superposition of all graphs
that contain the edge ci → cj . We can form a large array with Ri,j at the i, j position. For the example collection of graphs given in (3) the partitioned matrix of partial readouts is shown in Fig. 5.
Fig. 5. For the collection of permutation graphs in (3), the n² partial readouts, one for each edge, are here arranged according to the color scheme in Fig. 4(A).

Since the barcode corresponding to Ri,j is placed at position i, j, it follows that no information is lost by replacing the information in Fig. 5 with the partitioned matrix

[Equation (4): the partitioned 4 × 4 block matrix R, with the partial readout Ri,j as its i, j block; the original 16 × 16 0-1 display is not recoverable from the source.]

The partial readouts in (4) have their superpositions reduced compared to the same data before separation (shown in Fig. 4(D)). Indeed, some individual permutation matrices are apparent in (4):
R1,1 =
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

R2,4 =
0 1 0 0
0 0 0 1
1 0 0 0
0 0 1 0

R2,3 =
0 0 0 1
0 0 1 0
0 1 0 0
1 0 0 0
(5)
These are three of the eight permutations in the input collection (3).

6.2 Logical AND of Partial Readouts
For any two partial readout matrices, we define Ri,j AND Rk,l to be the elementwise logical AND, where 0 AND 0 = 0, 1 AND 0 = 0 AND 1 = 0, and 1 AND 1 = 1. Recall that for all i, j, the matrix Ri,j is the superposition of all graphs that contain the edge ci → cj. Using this fact, it can be shown that if some particular permutation matrix in a collection has a nonzero i, j element and also a nonzero k, l element, then the matrix Ri,j AND Rk,l must contain this permutation. In fact, it must contain the superposition of all permutation matrices in the collection that have both their i, j and their k, l elements nonzero. Of course, the AND of two partial readout matrices has no more nonzero elements than either of the two individual matrices. That is, for any i, j, k, l, the matrix Ri,j AND Rk,l will often have its superposition reduced compared to Ri,j and Rk,l individually. Proceeding with the example collection of permutation graphs from (3), we AND all pairs of the 16 partial readout matrices from (4). We display the maximum number of permutations that could be contained in each of the 256 resulting matrices in the following format:
[Table: for each of the 256 ordered pairs, the maximum number of permutations contained in Ri,j AND Rk,l, with rows and columns indexed R1,1 through R4,4; most entries are 1 or 2, and the largest is 6. The original 16 × 16 layout is not recoverable from the source.]
Collecting all the above 256 matrices Ri,j AND Rk,l that contain only one permutation, we recover exactly the eight distinct permutations that form the original collection that was given by (3).
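The AND-and-collect step can be sketched in code (an illustrative sketch, not the heuristics of [6]; matrices are 0/1 lists of lists and the helper names are hypothetical):

```python
# Sketch of Sect. 6.2: AND all pairs of partial readouts and collect the
# results that are single permutation matrices.

def partials(graphs, n):
    """(i, j) -> elementwise-OR superposition of all graphs with edge (i, j)."""
    return {(i, j): [[max(g[r][c] for g in graphs if g[i][j]) for c in range(n)]
                     for r in range(n)]
            for i in range(n) for j in range(n) if any(g[i][j] for g in graphs)}

def mat_and(A, B):
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def is_permutation(M):
    n = len(M)
    return (all(sum(row) == 1 for row in M)
            and all(sum(M[r][c] for r in range(n)) == 1 for c in range(n)))

def recover_permutations(R):
    """Permutation matrices isolated by pairwise ANDs, as tuples of tuples."""
    found = set()
    for A in R.values():
        for B in R.values():
            P = mat_and(A, B)
            if is_permutation(P):
                found.add(tuple(map(tuple, P)))
    return found

# Three permutations on 3 nodes; edges (0,1) and (1,0) are each shared by
# two graphs, so those partial readouts are genuine superpositions.
C  = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # cycle 1 -> 2 -> 3 -> 1
C2 = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]   # cycle 1 -> 3 -> 2 -> 1
X  = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]   # transposition of nodes 1 and 2
R = partials([C, C2, X], 3)
assert recover_permutations(R) == {tuple(map(tuple, g)) for g in (C, C2, X)}
```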
6.3 Further Observations
Reductions and heuristics are under development [6] for resolving superpositions of collections of permutation graphs. These are founded on three principles.

The Exclusion Principle. The partial readout Ri,j has its ith row and its jth column zero except at position i, j. Were this not the case, some permutation matrix would have more than one 1 in some row or column, which is forbidden.

A consequence of the Exclusion Principle is the following.

The Verification Principle. The adjacency matrix of any permutation in a collection has non-zero entries only at positions 1, p1 and 2, p2 and 3, p3 . . . and n, pn, where p1, p2, p3, . . . , pn is some ordered rearrangement of 1, 2, 3, . . . , n, only if R1,p1 AND R2,p2 AND R3,p3 . . . AND Rn,pn recreates the adjacency matrix of the permutation.

We stated another principle at the beginning of the previous subsection. We repeat it here.

The AND Principle. If some permutation matrix in a collection has nonzero elements in its i, j and k, l positions, then the matrix Ri,j AND Rk,l contains the superposition of all permutation matrices in the collection that also have nonzero elements in both their i, j and k, l positions.
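The Verification Principle lends itself to an executable check (a sketch only, with hypothetical helper names; `R` is a dict of partial readouts keyed by 0-indexed edges):

```python
# A candidate permutation p is accepted only if the AND of the partial
# readouts R[0, p_0], R[1, p_1], ..., R[n-1, p_{n-1}] recreates its
# adjacency matrix; a missing partial readout rejects it immediately,
# in the spirit of the Exclusion Principle.

def verify(perm, R):
    """perm: tuple with perm[i] = column of the single 1 in row i."""
    n = len(perm)
    acc = [[1] * n for _ in range(n)]
    for i, pi in enumerate(perm):
        Ri = R.get((i, pi))
        if Ri is None:          # edge (i, p_i) appears in no input graph
            return False
        acc = [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(acc, Ri)]
    adj = [[1 if perm[r] == c else 0 for c in range(n)] for r in range(n)]
    return acc == adj

# Collection containing only the identity: its partial readouts are all I.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
R = {(0, 0): I, (1, 1): I, (2, 2): I}
assert verify((0, 1, 2), R)        # the identity is verified
assert not verify((1, 0, 2), R)    # edge (0, 1) was never observed
```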
7 Summary
We have described a novel technique for reading out arbitrary graphs with up to n nodes using an n × n biochip incorporating standardized DNA sequences, making this biochip universal for all graphs of this size. Multiple graphs, when simultaneously present, limit readout by superimposing their individual biochip data. We dilute superposition by detecting n² different quantum dot barcode labels (one for each possible graph edge) within the spots on the universal biochip. For the special class of permutation graphs, computer-based heuristics isolated individual graphs from an example collection of graphs. Our only laboratory process is hybridization of an unknown collection of DNA-encoded graphs to quantum dot microbead barcode labels, followed by biochip hybridization and optical readout. Future scaleup to graphs with one hundred nodes seems possible using a single laboratory step because (a) pairwise ligation of one hundred standard sequences can yield ten thousand graph edge encodings, (b) printing ten thousand spots on a biochip is about the state of the art, (c) ten thousand distinct quantum dot barcode beads are on the horizon [11,12], and (d) ten thousand distinct barcode images are likely to be resolvable within each spot on a biochip.
8 References
[1] L.M. Adleman. Molecular computation of solutions to combinatorial problems. Science 266:1021-1024, 1994.
[2] J.A. Rose, R. Deaton, M. Garzon, R.C. Murphy, D.R. Franceschetti, and S.E. Stevens, Jr. The effect of uniform melting temperatures on the efficiency of DNA computing. In DNA Based Computers III: DIMACS Workshop, June 23-25, 1997, pages 35-42, 1997.
[3] M.H. Garzon, N. Jonoska, and S.A. Karl. The bounded complexity of DNA computing. Biosystems 52:63-72, 1999.
[4] C.M. Lee, S.W. Kim, S.M. Kim, and U. Sohn. DNA computing the Hamiltonian path problem. Molecules and Cells 9(5):464-469.
[5] D.H. Wood. A DNA computing algorithm for directed Hamiltonian paths. In J. Koza, W. Banzhaf, K. Chellapilla, K. Deb, M. Dorigo, D.B. Fogel, M.H. Garzon, D.E. Goldberg, H. Iba, and R. Riolo, editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, July 22-25, 1998, University of Wisconsin, Madison, Wisconsin, pages 731-734. Morgan Kaufmann, San Francisco, 1998.
[6] D.H. Wood. Determining superimposed permutation matrices from n² partial readouts. In preparation.
[7] R. Deaton, J. Chen, H. Bi, M. Garzon, H. Rubin, and D.H. Wood. A PCR-based protocol for in vitro selection of non-crosshybridizing oligonucleotides. These proceedings.
[8] A.A. Ben-Dor et al. Universal DNA tag systems: a combinatorial scheme. J. Comp. Biol. 7:503, 2000.
[9] J.A. Rose, R. Deaton, M. Hagiya, and A. Suyama. The fidelity of the tag-antitag system. Preliminary Proceedings of the 7th International Conference on DNA-Based Computing, 302-310, 2001. Also to appear in LNCS series, in press.
[10] E. Wurmbach, T. Yuen, B.J. Ebersole, and S.C. Sealfon. Gonadotropin-releasing hormone receptor-coupled gene network organization. J. Biol. Chem. 276(50):47195-47201, 2001.
[11] E. Klarreich. Biologists join the dots. Nature 413(6855):450-452, October 2001.
[12] M. Han, X. Gao, J.Z. Su, and S. Nie. Quantum-dot-tagged microbeads for multiplexed optical coding of biomolecules. Nature Biotechnology 19:631-635, July 2001.
[13] C.L. Taylor Clelland, D.M.P. Morrow, and C. Bancroft. Gene expression in prostate cancer: microarray analysis of tumor specimens and the LNCaP prostate tumor model series. American Journal of Human Genetics 67(4):81, 2000.
[14] C.L. Taylor Clelland, B. Levy, J.M. McKie, A.M.V. Duncan, K. Hirschhorn, and C. Bancroft. Cloning and characterization of human PREB: a gene that maps to a genomic region associated with trisomy 2p syndrome. Mammalian Genome 11(8):675-681, August 2000.
Algorithms for Testing That Sets of DNA Words Concatenate without Secondary Structure

Mirela Andronescu¹, Danielle Dees², Laura Slaybaugh³, Yinglei Zhao¹, Anne Condon¹, Barry Cohen⁴, and Steven Skiena⁴

¹ University of British Columbia, The Department of Computer Science
[email protected], [email protected], [email protected]
² Georgia Institute of Technology, College of Computing
[email protected]
³ Rose-Hulman Institute of Technology, Computer Science Department
[email protected]
⁴ State University of New York at Stony Brook, Computer Science Department
[email protected], [email protected]
Abstract. We present an efficient algorithm for determining whether all molecules in a combinatorial set of DNA or RNA strands are structure free, and thus available for bonding to their Watson-Crick complements. This work is motivated by the goal of testing whether strands used in DNA computations or as molecular bar-codes are structure free, where the strands are concatenations of short words. We also present an algorithm for determining whether all words in S*, for some finite set S of equi-length words, are structure free.
1 Introduction
In search-and-prune DNA computations, many long DNA strands are created from a small number of short strands, called words. For example, Braich et al. [1] use the word pairs
(TATTCTCACCCATAA, CTATTTATATCCACC), (ACACTATCAACATCA, ACACCTAACTAAACT), (CCTTTACCTCAATAA, CTACCCTATTCTACT), (CTCCCAAATAACATT, ATCTTTAAATACCCC), (AACTTCACCCCTATA, TCCATTTCTCCATAT), (TCATATCAACTCCAC, TTTCTTCCATCACAT)
to construct 2⁶ strands, namely those obtainable by taking one of the words from each pair and concatenating these in pair order. (Since there are two strands per pair and 6 pairs in total, the number of strands obtainable in this way is 2⁶.) Word sets are carefully designed using computational or information-theoretic methods, so that the resulting long strands behave well in the computation. One
This material is based upon work supported by the U.S. National Science Foundation under Grant No. 0130108, by the Natural Sciences and Engineering Research Council of Canada, and by the Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-01-2-0555.
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 182–195, 2003. c Springer-Verlag Berlin Heidelberg 2003
property of well-designed words is that the long strands formed from these words do not form secondary structures. For example, the design of Braich et al. uses a 3-letter alphabet, ensures that no sequence of length 8 appears more than once in any strand, and more. The success of the design rests (in part) on the hypothesis that words satisfying these design criteria form strands with no secondary structure. While this hypothesis seems plausible, to our knowledge there is currently no rigorous argument that supports the hypothesis, either for the design of Braich et al. or for any other word design currently employed in DNA computing. For this reason, an efficient algorithm that can test whether a given word set produces strands with no secondary structure would be valuable. With this motivation, we consider the following problem, which we call the structure freeness problem for combinatorial sets. We are given a list of t pre-designed pairs of words {w(i), w̄(i), 1 ≤ i ≤ t}. (Here the bar notation is not intended to represent Watson-Crick complementation.) All words have the same length l, and are strings over {A, C, G, T} representing DNA strands, with the start of the string corresponding to the 3' end of the strand. Determine whether all of the 2ᵗ strands in the set S denoted by the regular expression (w(1) + w̄(1))(w(2) + w̄(2)) · · · (w(t) + w̄(t)) are structure free, that is, are predicted to have no secondary structure according to the standard, pseudoknot-free, thermodynamic models. Zuker and Stiegler [14] developed an efficient algorithm, based on dynamic programming, for determining the optimal (lowest energy) secondary structure of an RNA molecule. Their method can also be applied to DNA molecules, given suitable thermodynamic parameters [13]. The running time of their algorithm is O(n⁴) on a strand of length n, and an improved algorithm of Lyngso et al. [10] runs in time O(n³). If the algorithm of Lyngso et al.
were applied to each of the strands in S independently, the running time would be O(2ᵗn³), where n = tl is the length of the strands in S. In Section 3 we describe an algorithm for the structure freeness problem for combinatorial sets that runs in time O(n³). The algorithm is a simple generalization of the algorithms of Zuker and Stiegler and of Lyngso et al. [14,10]. We also present experimental results that compare the performance of our algorithm with an exhaustive search approach to the structure freeness problem for combinatorial sets, and describe cases where we have found structures in previously reported word designs. Our algorithm easily generalizes to solve the following problem. Let Si be a set of strands, all having the same length li, for 1 ≤ i ≤ t, where the li are not required to be equal for different i. Are all strands in the set S = S1 × S2 × · · · × St structure free? The running time of our algorithm for this extended problem is O(maxi |Si|² n³), where here n = l1 + l2 + . . . + lt is the length of strands in S. The generalized algorithm can be used to verify structure freeness of strands in the combinatorial sets of Faulhammer et al. [5], and others. Our algorithm can also be used to verify that sets of molecular tags, or barcodes, such as those of Brenner et al. [3], are structure free. In this design, tags are constructed from the set of eight 4-mer words
{TTAC, AATC, TACT, ATCA, ACAT, TCTA, CTTT, CAAA}.
This set of words is constructed over a 3-letter alphabet (G's omitted), each word has exactly one "C", and each word differs from the others in three of the four bases. Call this set S. The tags constructed by Brenner et al. are all of the 8⁸ strands in the set S⁸. This example motivates the second problem studied in this paper. Suppose that in the future, Brenner et al. need more tags, and use strands from the set S⁹ or S¹⁰. Even if all strands in the set S⁸ are structure free, it might be possible that some strand in S⁹ or S¹⁰ has structure. Ideally, one would like to know that all words in S* are structure free. Here, S* is the set of strands obtained by concatenating zero or more copies of strands in S together. Note that S* contains an infinite number of strands if S is not empty. The set S* is often called the Kleene closure of S. Thus, we define the structure freeness problem for Kleene sets as follows. Given a set S of words, are all words in S* structure free? In Section 4 we show that there is a constant m (depending on S) such that if some word of S* has structure, then some word in Sᵐ has structure. This reduces the structure freeness problem for Kleene sets to the structure freeness problem for combinatorial sets. However, our bound on m is exponential in the size of S and the length of the words in S. Whether or not a better bound on m can be obtained is an interesting open question. Before getting to our algorithms in Sections 3 and 4, we provide background on thermodynamic models for calculating the free energy of RNA and DNA secondary structures in Section 2, along with an overview of the secondary structure prediction algorithm of Zuker and Stiegler [14].
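The combinatorial sets discussed in this introduction are easy to enumerate for small t (a sketch using Python's standard library; the word pairs are those of Braich et al. [1] quoted earlier):

```python
from itertools import product

# Enumerating a small combinatorial set directly: one word per pair,
# concatenated in pair order, gives 2^6 = 64 strands of 6 * 15 = 90 bases.
pairs = [
    ("TATTCTCACCCATAA", "CTATTTATATCCACC"),
    ("ACACTATCAACATCA", "ACACCTAACTAAACT"),
    ("CCTTTACCTCAATAA", "CTACCCTATTCTACT"),
    ("CTCCCAAATAACATT", "ATCTTTAAATACCCC"),
    ("AACTTCACCCCTATA", "TCCATTTCTCCATAT"),
    ("TCATATCAACTCCAC", "TTTCTTCCATCACAT"),
]
strands = ["".join(choice) for choice in product(*pairs)]
assert len(strands) == 2 ** 6                    # 64 strands
assert all(len(s) == 6 * 15 for s in strands)    # six 15-base words each
```

Of course, this exponential enumeration is exactly what the algorithm of Section 3 avoids.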
1.1 Related Work
Our algorithm for determining structure freeness of combinatorial sets was already developed by Cohen and Skiena [4], but applied to a different problem in that work: they were interested in determining which of the RNA sequences coding for a given protein P has the most stable secondary structure. Other important related work concerns algorithms for predicting the secondary structure of an RNA or DNA strand. In contrast with the Zuker-Stiegler algorithm, which is the basis for our work and returns the free energy of the most stable structure accessible to a given strand, the partition function approach of McCaskill [11] measures the propensity of a strand to fold in terms of the sum of the Gibbs factors (Z = exp(−∆G/RT)) of all folded structures. The quantity Z can be calculated in O(n³) time, using a dynamic programming algorithm. An extension of the approach used in our paper to use the partition function of McCaskill would provide more insight on the propensity of a strand to fold than does the "all or nothing" approximation of our current algorithm. We believe that the techniques described in our paper can be extended to the partition function approach, and view this as an important next step for this work.
Other thermodynamic models for DNA and RNA structure folding, which could be used as a basis for approaching the structure freeness problem for combinatorial sets, have been proposed by Hartemink et al. [7] and by Rose et al. [12]. In addition, Rose et al. [?] describe a thermodynamic model (based on a statistical zipper model) for estimating the folding propensity of a mixture of DNA strands.
2 Background and Notation
Under the appropriate chemical conditions, an RNA or DNA strand may fold upon itself by forming intramolecular bonds between pairs of its bases, as illustrated in Figure 1 below. In what follows, if s = s1 s2 . . . sn is an RNA sequence and 1 ≤ i < j ≤ n, then i.j denotes the base-pairing of si with sj. A secondary structure of s is a set of base pairs such that each base is paired at most once. More precisely, for all i.j and i′.j′ in the set, i = i′ if and only if j = j′. A pseudoknot in a secondary structure is a pair of base pairs i.j, i′.j′ in the structure with i < i′ < j < j′. Throughout this work, we restrict our attention to secondary structures that have no pseudoknots.
Fig. 1. Secondary structure of an RNA strand, showing hairpin loops, stacked pairs, bulges, internal loops, a multibranched loop, and external bases. The thick black line indicates the backbone and the thin lines indicate paired bases.

We can classify loops formed by the bonding of base pairs in a structure according to the number of base pairs that they contain. A hairpin loop contains exactly one base pair. An internal loop contains exactly two base pairs. A bulge is an internal loop with one base from each of its two base pairs adjacent. A stacked pair is a loop formed by two consecutive base pairs (i, j) and (i + 1, j − 1). A multibranched loop is a loop that contains more than two base pairs. An external base is a base not contained in any loop. One base pair in any given loop is closest to the ends of the RNA strand; this is known as the exterior or closing pair. All other pairs are interior. More precisely, the exterior pair is the one that maximizes j − i over all pairs i.j in the loop. Speaking qualitatively, bases that are bonded tend to stabilize the RNA, whereas unpaired bases form destabilizing loops. The so-called free energy of a
secondary structure measures (in kcal/mol) the stability of a secondary structure at a fixed temperature: the lower the free energy, the more stable the structure. The free energy of a folded RNA or DNA strand can be estimated as the sum of the free energies of its component loops, if the secondary structure contains no pseudoknots. Through thermodynamics experiments, it has been possible to estimate the free energy of common types of loops (see for example [13] for thermodynamic data for DNA).

2.1 Standard Free Energy Model and Notation
Computational methods for predicting the secondary structure of an RNA or DNA molecule are based on models of the free energy of loops. The parameters of these models are driven in part by current understanding of experimentally determined free energies, and in part by what can be incorporated into an efficient algorithm. The free energy of a loop depends on temperature; throughout we assume that the temperature is fixed. We next summarize the notation used to refer to the free energy of loops, along with some standard assumptions that are incorporated into loop free energy models. We refer to a model that satisfies all of our assumptions as a standard free energy model.

– eSs(i, j). This function gives the free energy of a stacked pair that consists of i.j and (i+1).(j−1). eSs(i, j) depends on the bases involved in the stack. We use the notation e-Stack(a, b, c, d), where a, b, c, d ∈ {A, C, G, T}, to denote the free energy of a stacked pair involving ab (with the 3' end at a) and cd (with the 3' end at c, so that the pairs are (a, d) and (b, c)). Thus, for a strand s = s1 s2 . . . sn, eSs(i, j) = e-Stack(si, si+1, sj−1, sj). Because stacked complementary base pairs are stabilizing, eSs values are negative if si is the Watson-Crick complement of sj and si+1 is the Watson-Crick complement of sj−1. (The values may be negative in other cases, such as when one of the base pairs is G-U, sometimes called a "wobble pair".)

– eHs(i, j). This function gives the free energy of a hairpin loop closed by i.j. We assume that for all but a finite number of small cases, eHs(i, j) depends only on the length of the loop, si and sj, and on the unpaired bases adjacent to si and sj on the loop. In particular, eHs(i, j) = e-Stack(si, si+1, sj−1, sj) + e-Length(j − i − 1), for some function e-Length. Moreover, we assume that the function e-Length(j) grows at most linearly in j.

– eLs(i, j, i′, j′). This function gives the free energy of an internal loop or bulge with exterior pair i.j and interior pair i′.j′. We assume that eLs(i, j, i′, j′) is given by e-Stack(si, si+1, sj−1, sj) + e-Stack(sj′, sj′+1, si′−1, si′) + e-Length(i′ − i − 1, j − j′ − 1), where we abuse notation slightly by using e-Length as a function with two parameters here, different from the function e-Length(·) used for hairpins, in this case growing at most linearly in i + j.

We omit the free energy of multiloops for brevity in this abstract. The free energy of a strand s with respect to a fixed secondary structure F is
the sum of the free energies of the loops of F. Sometimes, when the strand s is understood, it is convenient to refer simply to the free energy of the structure F. In this paper, we define the free energy of a strand s to be the minimum free energy of the strand, with respect to all structures F. (We note that this is a simplification; a better approach would be to define the free energy in terms of the sum of Gibbs factors [11].) We say that a strand s is structure free (at a given fixed temperature) if its free energy is greater than 0. If the free energy of s is less than 0, we say that the strand s has structure. We note that our algorithm can trivially be modified to test whether any strand in a combinatorial set has free energy below any other threshold, too.

2.2 An Algorithm for Secondary Structure Prediction
Let Ws(j) be the minimum free energy taken over all structures of the strand s1 s2 . . . sj. Our goal is to compute Ws(n), which is the free energy of strand s. Zuker and Stiegler [14] described a dynamic programming algorithm for computing Ws(j); their algorithm can easily be extended to produce a structure whose free energy is Ws(n). The algorithm is based on the following recurrences:
Ws(j) = 0, for j = 0;
Ws(j) = min( Ws(j − 1), min_{1 ≤ i < j} ( Vs(i, j) + Ws(i − 1) ) ), for j > 0.
Here, Vs(i, j) is the free energy of the optimal structure for si . . . sj, assuming i.j forms a base pair in that structure; it is also expressible as a recurrence:
Vs(i, j) = +∞, for i ≥ j;
Vs(i, j) = min( eHs(i, j), eSs(i, j) + Vs(i + 1, j − 1), VBIs(i, j), VMs(i, j) ), for i < j.
In turn, VBIs(i, j) is the free energy of the optimal structure for si . . . sj, assuming i.j closes a bulge or internal loop:

VBIs(i, j) = +∞, for j − i ≤ 1;
VBIs(i, j) = min_{i < i′ < j′ < j} ( eLs(i, j, i′, j′) + Vs(i′, j′) ), for j − i > 1.
VMs(i, j) is the free energy of the optimal structure for si . . . sj, assuming i.j closes a multibranched loop (details omitted). The running time of the resulting algorithm is O(n⁴). It can be improved to O(n³) using the method of Lyngso et al. [10].
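The shape of these recurrences can be illustrated with a toy dynamic program (a sketch only: it scores every Watson-Crick pair as −1 instead of using the eH/eS/eL/eM energy functions, so it is a Nussinov-style stand-in, not the Zuker-Stiegler algorithm; real predictions require the thermodynamic parameters discussed above):

```python
# Toy W/V-style dynamic program: minimize "energy" where each Watson-Crick
# pair contributes -1 and hairpin loops must enclose at least MIN_LOOP
# unpaired bases. A strand "has structure" here when the score is negative.

PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}
MIN_LOOP = 3  # minimum number of unpaired bases in a hairpin loop

def min_energy(s):
    n = len(s)
    INF = float("inf")
    V = [[INF] * n for _ in range(n)]   # best score for s[i..j] with i paired to j
    W = [[0.0] * n for _ in range(n)]   # best score for s[i..j], unconstrained
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            if (s[i], s[j]) in PAIRS and j - i > MIN_LOOP:
                V[i][j] = -1 + W[i + 1][j - 1]
            best = W[i][j - 1]                    # base j left unpaired
            for k in range(i, j):                 # base j paired with some k
                left = W[i][k - 1] if k > i else 0.0
                best = min(best, left + V[k][j])
            W[i][j] = best
    return W[0][n - 1] if n else 0.0

assert min_energy("AAAA") == 0.0        # nothing can pair: structure free
assert min_energy("GAAAAC") == -1.0     # one G-C pair closing a hairpin
```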
3 Structure Freeness for Combinatorial Sets
In this section, we describe an algorithm for testing that all strands in a combinatorial set are structure free. The input to our algorithm is a list of words w(1), w̄(1), w(2), w̄(2), . . . , w(t), w̄(t), where each word has length l. Let S = {z1 z2 . . . zt | zi ∈ {w(i), w̄(i)}}.
188
Mirela Andronescu et al.
Intuitively, the set S is the set of strands of length n = lt obtained by following a path in the graph of Figure 2 (first suggested by Lipton [9]) from the left end to the right end and concatenating the edge labels along the path.
Fig. 2. "Diamond Graph": paths in the graph correspond to strands of S.

While conceptually the algorithm is quite simple, the description uses a lot of notation that blurs the intuition. In Section 3.1 we first describe an algorithm for a very simple model of secondary structure formation, which captures the intuition. The details of our algorithm for the standard free energy model are presented in Section 3.2.

3.1 Algorithm for the No Repeated k-Strings Model
A simple model of RNA folding is the no repeated k-strings model (simplified from the staggered zipper model), which defines a sequence to have non-empty secondary structure if there is a string of length k that repeats without overlap anywhere in the sequence. This captures the notion that long runs of stacked pairs are a primary source of secondary structure; such runs correspond to occurrences of a given substring and its reverse complement in a sequence of S. Avoiding the reverse complement instead of a repetition does not introduce any important algorithmic complexity. Cohen and Skiena [4] prove, under the no repeated k-strings model, that it is NP-complete to verify that there is at least one structure-free sequence in a combinatorial set generated in a similar fashion to the diamond graph above. Here, we show that verifying that all are structure free can be done in polynomial time. The distinction is exactly that of proving a given CNF formula is satisfiable versus proving that it is a tautology. We can verify that all 2ᵗ sequences of length n = lt contain no repeated k-string in O(kn²) time. For each of the n starting positions in the sequence, there are at most two positions in the "diamond graph" which generate the sequence starting from this position (either in w(i) or in w̄(i)). Given any two word/position pairs, we now walk forward a total of k characters, comparing the equality of the symbols on each path at every character. At most two such paths are active in each walk, because the paths meet at in-degree-2 vertices of the diamond graph. The existence of a length-k shared walk represents a repeated k-string in at least one sequence, and terminates the algorithm. We next enhance this basic approach for the more general class of secondary structures of the standard models.
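For small t, the property can also be checked by brute force over all 2ᵗ sequences (a sketch with hypothetical helper names; the shared-walk algorithm described above achieves the same result in O(kn²) time without the exponential enumeration):

```python
from itertools import product

# Brute-force check of the no-repeated-k-strings condition over a small
# combinatorial set of word pairs.

def has_repeated_kstring(seq, k):
    """True if some length-k substring occurs twice without overlap."""
    first = {}
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in first and i - first[sub] >= k:   # non-overlapping repeat
            return True
        first.setdefault(sub, i)                   # remember earliest position
    return False

def all_structure_free(word_pairs, k):
    """True iff no sequence in the combinatorial set has a repeated k-string."""
    return not any(has_repeated_kstring("".join(choice), k)
                   for choice in product(*word_pairs))

pairs = [("ACG", "CGT"), ("GTA", "TAC")]
assert all_structure_free(pairs, 4)       # no 4-string can repeat in 6 bases
assert not all_structure_free(pairs, 2)   # "ACGTAC" repeats the 2-string "AC"
```

Checking against the earliest occurrence suffices, since the earliest occurrence maximizes the gap to the current position.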
Algorithms for Testing    189

3.2 Algorithm for the Free Energy Model
Let W_S(n) be the minimum of W_s(n) (as defined in Section 2.2), taken over all s ∈ S. We present in this section an algorithm to compute W_S(n). The algorithm runs in time O(n⁴). Using techniques of Lyngsø et al. [10], the running time can be improved to O(n³). In describing our algorithm, we follow the approach of Section 2.2. Let S_j be the set of prefixes of length j of strands in S, and let W_S(j) be the minimum of W_s(j) taken over all s ∈ S_j. That is, W_S(j) is the energy of the strand in S_j with the lowest-energy secondary structure:

    W_S(j) = 0,                        for j = 0,
    W_S(j) = min_{s ∈ S_j} W_s(j),     for 0 < j ≤ n.
Roughly speaking, we would like to be able to express W_S(j) in terms of W_S(j − 1). It turns out to be more convenient to introduce some new functions to work with in our recurrences. We use word#(j) to refer to the index of the word to which base j of a string in S belongs; thus word#(j) = ⌈j/l⌉, where l is the word length. Let S(T, j) be the subset of S in which the word containing the jth base is w(word#(j)). Intuitively, S(T, j) corresponds to the set of paths that go through the top of the diamond containing the jth base in Figure 2. Then

    S(T, j) = { z_1 z_2 … z_t | z_i ∈ {w(i), w̄(i)} for i ≠ word#(j), and z_{word#(j)} = w(word#(j)) }.

Similarly, let S(B, j) be the subset of S in which the word containing the jth base is w̄(word#(j)). Intuitively, S(B, j) corresponds to the set of paths that go through the bottom of the diamond containing the jth base in Figure 2. Let W_S(b_j, j) be the minimum of W_s(j) for all s ∈ S(b_j, j). Formally:

    W_S(b_j, j) = min_{s ∈ S(b_j, j)} W_s(j),   for b_j ∈ {T, B}.

Also, W_S(j) = min_{b_j ∈ {T, B}} W_S(b_j, j), so if we can calculate the W_S(b_j, j), we are done. It turns out to be easier to obtain a recurrence for W_S(b_j, j) than for W_S(j). Similarly, we use generalizations of the functions V_s and VBI_s in our recurrences. For example, we use V_S(b_i, b_j, i, j) to denote min_{s ∈ S(b_i, b_j, i, j)} V_s(i, j), where S(T, T, i, j) is the subset of S corresponding to the set of paths that go through the top of the diamond containing the ith base and the top of the diamond containing the jth base; S(b_i, b_j, i, j) is defined similarly for the other values of b_i and b_j. The base case is W_S(b_0, 0) = 0, for b_0 ∈ {T, B}. For j > 0 we have

    W_S(b_j, j) = min( min_{b_{j−1} ∈ X(j−1)} W_S(b_{j−1}, j − 1),
                       min_{(b_i, b_{i−1}) ∈ X(i, j)} ( V_S(b_i, b_j, i, j) + W_S(b_{i−1}, i − 1) ) )
where X(j − 1) and X(i, j) are given by the tables of Figure 3. Due to space limitations, we do not describe further recurrences here.
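As a reading aid, the case analysis of Figure 3 can be written as two lookup functions; the names `X_prev` and `X_pair` are ours, and each {∗} entry is expanded to both T and B:

```python
def X_prev(bj, same_word):
    """X(j-1): admissible choices for b_{j-1}, given whether base j-1 lies
    in the same word of the diamond graph as base j (Figure 3, left table)."""
    return {bj} if same_word else {"T", "B"}  # {*} expands to {T, B}

def X_pair(bj, i_prev_same, i_j_same):
    """X(i, j): admissible (b_i, b_{i-1}) pairs (Figure 3, right table).
    i_prev_same: word#(i) == word#(i-1); i_j_same: word#(i) == word#(j)."""
    both = ("T", "B")
    if i_prev_same and i_j_same:
        return {(bj, bj)}
    if i_prev_same:
        return {("T", "T"), ("B", "B")}
    if i_j_same:
        return {(bj, s) for s in both}               # {(b_j, *)}
    return {(a, s) for a in both for s in both}      # {(*, *)}
```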
190    Mirela Andronescu et al.
word#(j) == word#(j−1)    X(j−1)
yes                       {b_j}
no                        {∗}

word#(i) == word#(i−1)    word#(i) == word#(j)    X(i, j)
yes                       yes                     {(b_j, b_j)}
yes                       no                      {(T, T), (B, B)}
no                        yes                     {(b_j, ∗)}
no                        no                      {(∗, ∗)}
Fig. 3. Tables for X(j − 1) and X(i, j). In the entries of the right-hand column of each table, each ∗ may be replaced by a T or a B.

3.3 Experimental Results
We have implemented our algorithm, which we refer to as comfold in what follows. Our implementation uses Turner's free energy parameters, as incorporated in the mfold algorithm [15]. In the current version (1.0 alpha), there are still some small differences in the calculated energy values compared with Zuker's mfold or the Vienna package's RNAfold. Another difference is that comfold does not return suboptimal energies, as mfold does.

We compared the running time of comfold with that of a simple (exponential-time) exhaustive search algorithm, which we call exhaust_v. The exhaust_v algorithm uses the Vienna package's library to fold each of the possible combinations of words, computes the energy for each, and returns the combination that folds with the smallest energy. Figure 4 shows the running time of comfold compared to exhaust_v, plotted on a logarithmic scale. Our results were obtained on a dual Pentium III 1 GHz machine with 1 GB of RAM. We measured the CPU time in seconds (y) when running comfold and exhaust_v, for different numbers of groups (x) of randomly generated strands. For the left graph we used 2 words of length 10 in each group, and for the right graph we used 3 words of length 10 in each group. The graphs indicate that the running time of exhaust_v grows exponentially, while that of comfold grows polynomially. In the current version of comfold, exhaust_v is faster for a small number of groups; however, it is outperformed by comfold when the number of groups reaches 13 and 9 for the 2-words-per-group and 3-words-per-group sets, respectively. Our results also show that for a set with 4 strands in a group, comfold is faster than exhaust_v starting with 8 groups. We have also tested several carefully designed combinatorial sets for structure freeness, using our algorithms. The free energy results reported in Table 1 are for both comfold and exhaust_v.
We note that the results reported here use the RNA free energy values at 37 degrees Celsius, rather than the DNA values. Whenever we found that a set had structure, we took the combination with the smallest minimum free energy and also calculated that strand's DNA free energy using Michael Zuker's DNA folding program. The results are reported in Table 1, together with the CPU time in seconds for both comfold and exhaust_v.
Fig. 4. Performance of comfold compared to exhaust_v. We measured the CPU time (seconds) for both comfold and exhaust_v, for a variable number of groups, of 2 words each (left) and 3 words each (right).

Combinatorial set (source)   RNA mfe   DNA mfe   time comfold   time exhaust_v
Braich et al. [1]                0         0          36.50             0.53
Braich et al. [2]                0         0       2,826.63       165,317.00
Brenner et al. [3]           -0.10         0       1,027.25         6,851.59
Faulhammer et al. [5]        -2.90     -2.20         444.01            63.95
Frutos et al. [6]           -18.40    -11.80       1,062.31             7.31

Table 1. Results from testing some cited combinatorial sets for structure freeness.
Our results show that the combinatorial set of Braich et al. [1] described in the introduction is indeed structure free, as is the set for solving the 20-variable 3-SAT problem on a DNA computer [2]. In contrast, one strand in the set of Brenner et al. did have a very slight negative energy (−0.10 kcal/mol), namely the strand TCTAATCATACTAATCCTTTTACTATCATTAC with corresponding structure ..((((.(((.(((.....))).))).)))) (here parentheses denote matching pairs, and dots denote unpaired bases). However, we obtained a positive DNA energy for this sequence. We also found a strand with negative free energy in the reported sets of Faulhammer et al. [5]: CTCTTACTCAATTCTTCTACCATATCAACATCTTAATAACATCCTCCACTTCACACTTAATTAAAATCTTCCCTCTTTACACCTTACTTTCCATATACAAGTACATTCTCCCTACTCCTTCATAATCTTATATTCTCAATATAATCACATACTTCTCCAACATTCCTTATCCCACACACATTTTAAATTTCACAA, which has an energy of −2.90 kcal/mol for RNA and an energy of −2.20 kcal/mol for DNA. The sets of Frutos et al. [6] with two pairs of word labels contain the strand AACGGCATGAAGGCAATTCGCCTTCATGCGTT, with an RNA free energy of −18.40 kcal/mol and a DNA free energy of −11.80 kcal/mol.
We note that the tests reported here were done at a folding temperature of 37 degrees Celsius. The tests can be run at other folding temperatures, depending on the design requirements of the user.
4 Structure Freeness for Kleene Sets
In this section we provide an algorithm for the following problem: given a finite set S of words of equal length, are all words in S∗ structure free? We call this the structure freeness problem for Kleene sets. We show that there is a constant m, which depends on S, such that if there is a word with structure in S∗, then there is a word with structure in S^m. The constant m is bounded by exp(poly(|S|, l)), where l is the maximum length of the words in S. With the constant m in hand, the structure freeness problem for Kleene sets is solvable using the method of Section 3. We prove the existence of m as follows.

1. There is a strand with structure in S∗ if and only if there is a strand in S∗ with a so-called energy bounded structure. We define energy bounded structures in detail in Section 4.1, but the intuition is as follows. If there is a strand with structure in S∗ that has a very long hairpin loop, we can remove whole words from the unpaired bases of the loop to obtain a shorter strand in S∗ which also has structure, and the energy of the shorter hairpin is bounded. Other parts of the structure that contribute a high positive free energy to the structure can similarly be replaced by a hairpin with bounded energy.

2. The set of strands of S∗ that have energy bounded structures is context free. Moreover, the size of the context free grammar that generates this set is bounded by a polynomial in |S| and the maximum length l of the words in S. We note that for a given strand with an energy bounded structure, the optimal structure for this strand may not be an energy bounded structure.

From this, in Theorem 1 we apply standard results on context free languages to conclude that if there is a strand in S∗ that has structure, then there is such a strand of length at most exp(poly(|S|, l)).

4.1 Energy Bounded Structures
A secondary structure F is E-energy bounded for a strand s if, with respect to s, no substructure of F has negative energy, and no substructure of F is expensive, i.e. has energy above some threshold E. Here, a substructure of F is any proper subset of F that, for some (i, j) in F, contains all base pairs (i′, j′) of F with i ≤ i′ ≤ j′ ≤ j, and contains no other base pairs. We call this the substructure of F that is bounded by (i, j). We say that a strand s has an E-energy bounded structure if the free energy of some E-energy bounded structure is less than 0.
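To make the definition concrete, here is a sketch with the energy function left abstract (any model-dependent evaluator could be plugged in; the helper names are ours):

```python
def substructure(F, i, j):
    """The substructure of F bounded by (i, j): all pairs (i2, j2) in F
    with i <= i2 <= j2 <= j.  F is a set of base pairs (i, j), i < j."""
    return {(i2, j2) for (i2, j2) in F if i <= i2 <= j2 <= j}

def is_energy_bounded(F, energy, E):
    """True iff no substructure of F has negative energy and none is
    'expensive' (energy above threshold E).  `energy` maps a set of base
    pairs to a free energy value; it is model-dependent and assumed given."""
    for (i, j) in F:
        e = energy(substructure(F, i, j))
        if e < 0 or e > E:
            return False
    return True
```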
Lemma 1. Let S be a set of strands. There exists a constant E that grows at most linearly in the length l of words in S such that the following is true: there is a strand with structure in S∗ if and only if there is a strand with an E-energy bounded structure in S∗.

One direction of the proof is easy: regardless of the value of E, if there is a strand with an E-energy bounded structure in S∗, then trivially this strand is a strand with structure in S∗. The proof of the other direction is based on the fact that for suitable E, expensive substructures can be “cut” out of a structure and replaced by E-energy bounded structures.

4.2 The Language of Strands with Energy Bounded Structures Is Context Free
There is a context free grammar that generates the language of strands with energy bounded structures.

Lemma 2. The language of strands with E-energy bounded structures, when the energy is measured according to the standard energy assumptions, is context free. There is a context free grammar generating this language which has a number of nonterminals linear in E, and in which each production has bounded length (independent of E).

Proof. The proof provides a context free grammar for the language of strands with energy bounded structures. The terminals of the grammar are {A, C, G, T}, and the nonterminals generate structures with fixed energy values. First, we consider a simple case, namely how to generate energy bounded hairpins. Specifically, we focus on developing a context free grammar that generates the set of all possible strings that can form a hairpin loop (with no dangling ends) that is closed by the base pair (t, t̄) and has energy e. Call this set H(t, e). We will use the nonterminal H_{t,e} to generate the set H(t, e). Since H(t, e) is finite, we could simply have one production from H_{t,e} that directly generates each string in H(t, e). However, according to the standard energy model, other than a few special cases (such as tri-loops or tetra-loops), the free energy of a hairpin formed by pairing the outside bases of the string t a u_1 u_2 … u_j b t̄ (where t, a, b, and the u_i, 1 ≤ i ≤ j, are in {A, C, G, T}) can be expressed as the sum of two terms, namely a stacking energy term, e-Stack(t, a, b, t̄), and a length term, e-Length(j), where j + 2 is the number of unpaired bases in the hairpin. By taking advantage of this, fewer rules are needed. Specifically, from H_{t,e}, we add rules of the form

    H_{t,e} → t a U_{j−2} b t̄

for each a, b ∈ {A, C, G, T} for which there exists a j such that e − e-Stack(t, a, b, t̄) = e-Length(j). Here, U_{j−2} is a nonterminal that generates all sequences of j − 2 bases: for each t ∈ {A, C, G, T} and 1 < i ≤ j, we add the rules U_i → t U_{i−1} and U_1 → t.
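A hypothetical sketch of this rule generation, with stand-in `e_stack` and `e_length` functions in place of the standard energy model's terms (the real parameters come from the Turner model):

```python
BASES = "ACGT"

def hairpin_rules(t, e, e_stack, e_length, max_j):
    """Enumerate productions H_{t,e} -> t a U_{j-2} b t~ for all a, b for
    which some j satisfies e - e_stack(t, a, b) == e_length(j).
    e_stack and e_length are stand-ins for the model's energy terms;
    '~' + t stands in for the complement t-bar."""
    rules = []
    for a in BASES:
        for b in BASES:
            rest = e - e_stack(t, a, b)
            for j in range(2, max_j + 1):
                if e_length(j) == rest:
                    rules.append((f"H_{t},{e}", (t, a, f"U_{j-2}", b, "~" + t)))
    return rules
```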
The total number of rules generated from H_{t,e} is a constant independent of e. In addition, the number of nonterminals U_i is approximately e-Length⁻¹(e). Generalizing from the case of hairpins, we need other new nonterminals in our grammar that generate different types of structures. Mirroring the notation used in the recurrence relations of Section 2.2, we use the following nonterminals (details omitted):

– W_e: generates the set of strings that can form an E-energy bounded structure with energy e.
– V_{t,e}: generates the set of strings that can form an E-energy bounded structure that is closed by the base pair (t, t̄) and has energy e.
– I_{t,e}: generates the set of strings that can form an E-energy bounded structure with energy e, the outermost loop of which is an interior loop (other than a stacked pair) that is closed by the base pair (t, t̄).
– M_{t,e}: generates the set of strings that can form an E-energy bounded structure with energy e, the outermost loop of which is a multi-loop that is closed by the base pair (t, t̄).

All of the productions above have bounded length (there is a maximum of nine symbols on the right-hand side of any rule).

Lemma 3. Let S be a set of strands, each of length l. Let L be the language of strands in S∗ with E-energy bounded structures. Then L is context free, and there is a context free grammar that generates L for which the number of nonterminals is O(E(l|S|)³) and each rule has bounded length (independent of E, l, and S).

The proof of this lemma follows from the classical result of automata theory that the intersection of a context free language and a regular language is context free (details omitted). The main theorem of this section follows from the previous lemmas, together with classical results from automata theory.

Theorem 1. There is a constant m, which depends on S, such that if there is a word with structure in S∗, then there is a word with structure in S^m. The constant m is bounded by exp(poly(|S|, l)), where l is the length of words in S.
5 Conclusions
We present an efficient algorithm for determining structure freeness of combinatorial sets. The algorithm should prove useful in verifying the quality of word sets designed for DNA and RNA computations. An important direction for future work is to use partition functions to cover the range of alternative structures. Another direction for future work is to study whether our algorithm can be generalized to determine structure freeness of more general sets of strands, such as those used in Adleman’s Hamiltonian graph experiment.
References

1. Braich, R. S., Johnson, C., Rothemund, P. W. K., Hwang, D., Chelyapov, N., and Adleman, L. M. Solution of a satisfiability problem on a gel-based DNA computer. Proceedings of the 6th International Conference on DNA Computation, Springer-Verlag LNCS, vol. 2054, 2000, pages 27-41.
2. Braich, R. S., Chelyapov, N., Johnson, C., Rothemund, P. W. K., and Adleman, L. Solution of a 20-variable 3-SAT problem on a DNA computer. Science 296, 2002, 499-502.
3. Brenner, S., Williams, R. S., Vermaas, E. H., Storck, T., Moon, K., McCollum, C., Mao, J.-I., Luo, S., Kirchner, J. J., Eletr, S., DuBridge, R. B., Burcham, T., and Albrecht, G. In vitro cloning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs. Proceedings of the National Academy of Sciences of the USA 97(4), February 2000, 1665-1670.
4. Cohen, B. and Skiena, S. Designing RNA sequences: natural and artificial selection. Proc. 6th Int. Conf. Computational Molecular Biology (RECOMB '02).
5. Faulhammer, D., Cukras, A. R., Lipton, R. J., and Landweber, L. F. Molecular computation: RNA solutions to chess problems. Proc. Natl. Acad. Sci. USA 97, 1385-1389.
6. Frutos, A. G., Liu, Q., Thiel, A. J., Sanner, A. M. W., Condon, A. E., Smith, L. M., and Corn, R. M. Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Research 25, 4748 (1997).
7. Hartemink, A. J. and Gifford, D. K. Thermodynamic simulation of deoxyoligonucleotide hybridization. Prel. Proc. 3rd DIMACS Workshop on DNA Based Computers, June 23-27, 1997, U. Penn., pages 15-25.
8. Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M., and Schuster, P. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 1994, 167-188.
9. Lipton, R. DNA solution of hard computational problems. Science 268 (1995), 542-545.
10. Lyngsø, R. B., Zuker, M., and Pedersen, C. N. S. Internal loops in RNA secondary structure prediction. Proc. Third International Conference in Computational Molecular Biology, pages 260-267, April 1999.
11. McCaskill, J. S. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1990, 1105-1119.
12. Rose, J. A., Deaton, R., Franceschetti, D. R., Garzon, M., and Stevens, S. E., Jr. A statistical mechanical treatment of error in the annealing biostep of DNA computation. Special program in DNA and Molecular Computing at the Genetic and Evolutionary Computation Conference (GECCO-99), Orlando, FL, July 13-17, 1999, Morgan Kaufmann.
13. SantaLucia, J., Jr. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. USA 95, 1998, 1460.
14. Zuker, M. and Stiegler, P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research 9 (1981), 133-148.
15. Zuker, M., Mathews, D. H., and Turner, D. H. Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In RNA Biochemistry and Biotechnology, J. Barciszewski and B. F. C. Clark, eds., NATO ASI Series, Kluwer Academic Publishers, Dordrecht, NL, 1999, pages 11-43.
A PCR-based Protocol for In Vitro Selection of Non-crosshybridizing Oligonucleotides

Russell Deaton¹, Junghuei Chen², Hong Bi², Max Garzon³, Harvey Rubin⁴, and David Harlan Wood⁵

¹ Computer Science and Engineering, University of Arkansas, Fayetteville, AR, USA 72701, [email protected], http://www.ai.uark.edu
² Chemistry and Biochemistry, University of Delaware, Newark, DE, USA 19716, [email protected], [email protected]
³ Computer Science, University of Memphis, Memphis, TN, USA 38152-3240, [email protected]
⁴ School of Medicine, University of Pennsylvania, Philadelphia, PA, USA 19104, [email protected]
⁵ Computer and Information Science, University of Delaware, Newark, DE, USA 19716, [email protected]
Abstract. DNA computing often requires oligonucleotides that do not produce erroneous cross-hybridizations. By using in vitro evolution, huge libraries of non-crosshybridizing oligonucleotides might be evolved in the test tube. As a first step, a fitness function that corresponds to non-crosshybridization has to be implemented in an experimental protocol. Therefore, a modified version of PCR that selects non-crosshybridizing oligonucleotides was designed and tested. Experiments confirmed that the PCR-based protocol did amplify maximally mismatched oligonucleotides selectively over those that were more closely matched. In addition, a reaction temperature window was identified in which discrimination between matched and mismatched oligonucleotides might be obtained. These results are a first step toward practical manufacture of very large libraries of non-crosshybridizing oligonucleotides in the test tube.
1 Introduction
Template-matching hybridization reactions between DNA oligonucleotides are the foundation of DNA computing (DNAC). For example, in the Adleman-Lipton paradigm of DNAC [1,2], hybridizations between DNA oligonucleotides produce a search of the solution space that is random, parallel, and exhaustive. In DNA hairpin extension [3], the state transitions of autonomous molecular computations are encoded in consecutive hybridizations. In DNA tiling structures [4], hybridizations direct the self-assembly process. Therefore, a critical step in DNAC is to construct an appropriate encoding of the problem in DNA oligonucleotide sequences such that hybridization implements the desired solution or algorithm. This has been termed the DNA word design problem, which involves several constraints. Oligonucleotides should hybridize only as designed; otherwise, unwanted cross-hybridizations can introduce errors, such as false positives and negatives, and degrade efficiency and scaling by wasting oligonucleotides in hybridizations that do not contribute to the solution. In addition, the set of words, or library, should be large enough to represent the problem and implement a solution. Picking non-crosshybridizing sequences is straightforward for small problems. On the other hand, for DNAC to take full advantage of the massive number of sequences available, large libraries are necessary. As the size of the required library grows, however, the constraints, i.e. size and non-crosshybridization, increasingly conflict, which makes it difficult to generate libraries that are both good and very large. Most attempts at word design for computation have used combinatorial methods implemented in software design tools [5,6].

An alternative to design of oligonucleotides by computer is to manufacture them in the test tube with in vitro evolution, as was originally proposed in [7]. In a sense, the manufacture of the libraries is a DNA computation that solves the DNA word design problem, and is well suited for in vitro techniques. Large numbers of random nucleic acid sequences are handled routinely in systematic evolution of ligands by exponential enrichment (SELEX) [8]. The polymerase chain reaction (PCR) provides a tool for selection of non-crosshybridizing sequences. Complicated hybridization models do not have to be explicitly programmed, but are present ab initio in the test tube. The responsiveness of DNA to mutagenesis can be exploited to search the sequence space.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 196-204, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Therefore, the vast number of possible sequences and the massive parallelism of DNAC could generate huge libraries that are selected in vitro under the same conditions that will be used for computing. While the manufacture of the libraries with in vitro evolution might eventually use both mutation and crossover for randomization of the sequences, in this paper the focus is on a new PCR-based protocol to select maximally mismatched oligonucleotides. In the first section, the design of the protocol is explained. Next, an experiment to verify that the protocol actually selects maximally mismatched oligonucleotides, and to determine the temperature range in which libraries might be manufactured, is described. In addition, the method used to design the DNA sequences for the experiment is summarized. The results of the experiments confirmed that the protocol preferentially amplified maximally mismatched oligonucleotides over a range of temperatures that is realistic for library manufacture. Therefore, an experimental protocol that selects maximally mismatched pairs of oligonucleotides has been developed and confirmed. This protocol is a possible basis for the in vitro manufacture of huge, non-crosshybridizing libraries of oligonucleotides.
Fig. 1. PCR with adjustable mutation and selection.
2 The Protocol
The libraries are evolved using a simple protocol (Figure 1), a variation of PCR with adjustable selection and mutation. All strands in the initial population are of the same length, and have a segment in the middle that is initially fabricated subject to randomness. Thus, the starting population for the selection protocol would be a huge collection of randomly manufactured DNA oligonucleotides with PCR primer sequences attached at each end. After a rapid quenching step that
freezes pairs of oligonucleotides into mismatched configurations, PCR is done at a low temperature that melts duplexes with a high degree of mismatch, but not duplexes that are closer to being Watson-Crick complements. Therefore, the protocol, as designed, selectively amplifies oligonucleotides that are present in mismatched duplex configurations and thus have lower thermal stability. This is the property that is confirmed in this paper. A key design feature is that all the strands have universal, unchanging end sequences, labeled P1 and P2. These are carefully designed primer sites that dependably bind to their complementary sequences if and only if they are in correct alignment. Naturally, one benefit is that the whole library can be readily amplified using PCR. Optionally, this could be followed by restriction enzyme cleavage to remove the primer sites. Of course, P1 and P2 cannot also occur within the middle segment, which is a limitation of the protocol. Duplex configurations with a range of mismatches are generated by first heating the mixture to a high temperature and then rapidly cooling it to room temperature or below. This quenching during annealing should increase the likelihood of very mismatched configurations forming, thus making them available for selection during the subsequent polymerase extension. When the protocol is iterated, each extension would be preceded by a quenching to freeze mismatched configurations for selection.
3 Experimental Design
As an initial step in the manufacture of the libraries, the selection properties of the PCR protocol had to be verified. This means confirming the ability of the protocol to preferentially amplify maximally mismatched oligonucleotides rather than oligonucleotides that are closer to being Watson-Crick complements. In addition, the selection property of the protocol had to be confirmed over a range of temperatures, as did the efficiency of the polymerase at lower temperatures. To do this, DNA oligonucleotides were designed according to the templates shown in Figure 2; the actual sequences are shown in Table 1. Each template had primers P1 and P2 at the ends. The middle regions correspond to four different degrees and types of mismatch: one middle region was perfectly Watson-Crick complementary (T1), another was completely mismatched (T4), and the third and fourth contained different types of mismatches in a Watson-Crick duplex, namely two isolated mismatches (T2) and a region of 3 contiguous mismatches (T3), respectively. The sequence designs were generated with a new software tool [9] that selects non-crosshybridizing sequences from an initial random pool. Hybridizations between oligonucleotides were determined from the nearest-neighbor model of duplex thermal stability [10] by computing the best local alignment between two oligonucleotides, using a variation of the Smith-Waterman dynamic programming algorithm [11] with free energy as the scoring function.
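The alignment step can be illustrated with a plain Smith-Waterman local alignment; the match/mismatch/gap scores below are simple stand-ins for the nearest-neighbor free energy scoring that the actual design tool uses:

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between s1 and s2 (Smith-Waterman).

    The design tool described in the text uses a variant whose scoring
    function is the nearest-neighbor duplex free energy; here simple
    per-character scores stand in for it."""
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clamped at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if s1[i-1] == s2[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best
```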
Fig. 2. Template configurations (T1-T4) for PCR experiments; each pairs the top strand (P1-M1-P2′) with one of four bottom strands (P1′-Mx-P2). X indicates a mismatch; ′ indicates a reverse Watson-Crick complement.
Template              Sequence
T1  (P1-M1-P2′)       tcttcataagtgatgcccgtaaaataccctcccccctagagaaaaaacaccccttcgatg
T1′ (P1′-M1′-P2)      catcgaaggggtgttttttctctaggggggagggtattttacgggcatcacttatgaaga
T2  (P1′-M2-P2)       catcgaaggggtgttttttctcttggggggagggtcttttacgggcatcacttatgaaga
T3  (P1′-M3-P2)       catcgaaggggtgttttttctctaggtcagagggtattttacgggcatcacttatgaaga
T4  (P1′-M4-P2)       catcgaaggggtgttttttcacctatagtttgatatagatacgggcatcacttatgaaga

Table 1. Template sequences used in PCR protocol experiments. T1 was the template that was annealed to the others (T1′, T2-T4). Each template was composed of 3 individually designed 20-mers. P1 and P2 were primers for PCR. The middle sequences were designed for various degrees of matching with M1. M1′ was the reverse Watson-Crick complement of M1. M2 introduced isolated T-T and C-T mismatches, M3 introduced a 3 bp loop in the middle, and M4 was designed to be mismatched with M1. ′ indicates a reverse Watson-Crick complement.
4 Methods and Materials
Oligonucleotides were purchased from Integrated DNA Technologies. They were purified from denaturing polyacrylamide gels after synthesis. Primer P1 was ³²P-labeled at the 5′-end using T4 DNA kinase and [γ-³²P]ATP. The ds DNAs (T1, T2, T3, and T4) were formed by annealing each pair of complementary single-stranded oligonucleotides. For PCR, 8 ng of ³²P-labeled primer P1, 8 ng of primer P2, and 60 ng of ds DNA templates (082-38'1, 082-351, 082-391, and 082-3101, individually) were incubated in a PCR buffer of 50 mM KCl, 10 mM Tris-HCl, 0.1% Triton X-100, 2.5 mM MgCl₂, and 0.4 mM of the 4 dNTPs, with 3 U Taq DNA Polymerase, in a total volume of 10 µl, at the designated temperatures (37°C, 40°C, 43°C, 46°C, 48°C, 50°C, 56°C, 62°C, 68°C, and 72°C). The reaction was incubated for 60 minutes. The primer extension was done for just one cycle; no further denaturing and re-annealing was done. The primer extension products were loaded onto a 12% (w/v) denaturing polyacrylamide gel (8 M urea) with 1X TBE buffer. The gel was run for 1 hour at 60°C at 400 V, and the results were documented by autoradiography.
5 Results
The results of the experiments are shown in Figures 3 and 4. In these figures, the fully extended product (60-mer) would appear in the bands at the top, and the primers appear at the bottom. The smears in between are unsuccessful extensions. Emphasis should be placed on the topmost band (60-mer) of extension products. Figure 3 shows the ability of PCR to selectively amplify templates with different degrees of matching. At the lower temperatures, 52°C and 58°C, the maximally mismatched template (T4) was amplified preferentially over the other templates, which are closer to being exact Watson-Crick complements. At 64°C, amplification becomes more significant for templates T1-T3. At 74°C, amplification was reduced in all cases, probably from exceeding the melting temperature of the primers. In addition, there were some slight differences between templates T1-T3, reflecting the degree of stability lost to incorporated mismatches. The perfectly matched template (T1) was most stable, followed, in decreasing order of stability, by two isolated mismatches (T2), a small loop of mismatches in the middle (T3), and the complete mismatch (T4). Thus, PCR can select maximally mismatched oligonucleotides from more closely matched oligonucleotides, and preferentially amplify them. In Figure 4, a comparison of primer extension for the completely matched template (T1) versus the completely mismatched template (T4) over a wider range of temperatures is shown. From 37°C to 43°C, the mismatched template was amplified without any significant amplification of the matched template. Thus, a potential range of temperatures was identified over which maximum selectivity might be obtained for manufacture of the libraries.
Fig. 3. A denaturing gel comparing the primer extension products of four different templates, as diagrammed on top of each panel, at various temperatures. The primer is P1. The temperatures are 52°C, 58°C, 64°C, 70°C, and 74°C from left to right in each panel. By focusing on the topmost band in each gel, the degree of full product extension (60-mer) can be gauged. At lower temperatures, the maximally mismatched template was preferentially amplified over the perfect Watson-Crick and lower-degree-of-mismatch templates.
6
Discussion
These experiments represent a first step in the in vitro manufacture of huge libraries of non-crosshybridizing oligonucleotides. They indicate that the PCR protocol is capable of selectively amplifying maximally mismatched hybrid pairs over pairs with perfect matching or lesser degrees of mismatching. Importantly for the manufacturing protocol, a range of temperatures (37◦C to 43◦C) was identified in which the preferential selection occurs. Even though the possibility of PCR selection is confirmed, serious issues still remain before libraries can be routinely manufactured. For example, methods will have to be developed to estimate the size of the non-crosshybridizing library in vitro, i.e., the number of distinct sequences. Currently, the ultimate size of the libraries that can be manufactured with this protocol is unclear. Certainly, the size will depend upon the initial, starting set of random oligonucleotides: the larger and more diverse this set is, the larger the non-crosshybridizing library should be. In addition, the size will be affected by the efficiency of quenching in generating configurations appropriate for selection, i.e., maximally mismatched ones.
A PCR-based Protocol for In Vitro Selection
203
Fig. 4. A denaturing gel comparing the primer extension products of two templates, perfectly matched and maximally mismatched, as diagrammed on top of each panel, at temperatures ranging from left to right from 37◦C to 72◦C (37, 40, 43, 46, 48, 50, 56, 62, 68, 72). P1 was the primer. The topmost band of fully extended products (60-mers) shows preferential amplification of the maximally mismatched template over the perfect Watson-Crick template.
Moreover, the effect of multiple rounds of PCR on the effectiveness of the protocol has to be tested. This is important because the selection step actually creates a Watson-Crick copy of the most fit strands, which are then less likely to be selected in subsequent rounds. It may be that one round of selection is enough, or that by denaturing and quenching the annealing reaction, mismatched structures can be trapped for subsequent amplification. Additional issues are labeling, sorting, and cataloging of library sequences.
7
Conclusion
As noted by Richard Lipton [12], in order for DNAC to realize its potential, and possibly beat conventional, electronic computers, DNA computers have to exploit three-dimensional space, random interactions between molecules, and the sheer vastness of the DNA sequence space. In addition, challenges associated with scaling and errors have to be overcome. The in vitro evolution of libraries
might result in non-crosshybridizing oligonucleotides for three-dimensional DNA computing, as well as surface-based approaches, that minimize errors and maximize hybridization efficiency. Randomization is used to initiate the libraries and to search the space of sequences. The goal of the evolution is to manufacture as large a library as possible, using as much of the DNA sequence space as possible. Thus, the libraries represent an enabling resource for scaling reliable DNA computations. The PCR-based protocol tested in this paper potentially provides a mechanism for the selection of non-crosshybridizing DNA sequences from a starting population and thus is the first step in the in vitro manufacture of non-crosshybridizing libraries.
8
Acknowledgments
This work was supported by NSF Award EIA-0130385.
References
1. Adleman, L.M.: Molecular computation of solutions to combinatorial problems. Science 266 (1994) 1021–1024
2. Lipton, R.J.: DNA solution of hard computational problems. Science 268 (1995) 542–545
3. Sakamoto, K., Gouzu, H., Komiya, K., Kiga, D., Yokoyama, S., Yokomori, T., Hagiya, M.: Molecular computation by DNA hairpin formation. Science 288 (2000) 1223–1226
4. Winfree, E., Liu, F., Wenzler, L.A., Seeman, N.C.: Design and self-assembly of two-dimensional DNA crystals. Nature 394 (1998) 539–544
5. Baum, E.B.: DNA sequences useful for computation. In Landweber, L.F., Baum, E.B., eds.: Proceedings of the Second Annual Meeting on DNA Based Computers. Volume 44, Providence, RI, DIMACS, American Mathematical Society (1998) 122–127. DIMACS Workshop, Princeton, NJ, June 10–12, 1996
6. Marathe, A., Condon, A.E., Corn, R.M.: On combinatorial DNA word design. In Winfree, E., Gifford, D.K., eds.: DNA Based Computers V, Providence, RI, DIMACS, American Mathematical Society (1999) 75–90. DIMACS Workshop, Massachusetts Institute of Technology, Cambridge, MA, June 14–16, 1999
7. Deaton, R., Murphy, R.C., Rose, J.A., Garzon, M., Franceschetti, D.R., Stevens Jr., S.E.: A DNA based implementation of an evolutionary search for good encodings for DNA computation. In: Proceedings of the 1997 IEEE International Conference on Evolutionary Computation, IEEE (1997) 267–272. Indianapolis, IN, April 13–16
8. Osborne, S.E., Ellington, A.D.: Nucleic acid selection and the challenge of combinatorial chemistry. Chemical Reviews 97 (1997) 349–370
9. Deaton, R., Chen, J., Bi, H., Rose, J.: A software tool for generating non-crosshybridizing libraries of DNA oligonucleotides. In this volume (2002)
10. SantaLucia, Jr., J.: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. 95 (1998) 1460–1465
11. Smith, T.F., Waterman, M.S.: The identification of common molecular subsequences. J. Mol. Biol. 147 (1981) 195–197
12. Lipton, R.: DNA computing: does it compute? Plenary address at the Seventh International Meeting on DNA Based Computers (2001)
On Template Method for DNA Sequence Design

Satoshi Kobayashi¹, Tomohiro Kondo¹, and Masanori Arita²

¹ Department of Computer Science, University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo 182-8585, Japan. [email protected]
² PRESTO, JST and Computational Biology Research Center, AIST, 2-41-6 Aomi, Koto-ku, Tokyo 135-0064, Japan. [email protected]
Abstract. The design of DNA sequences is one of the most practical and important research topics in DNA computing. Although the design of DNA sequences depends on the protocol of the biological experiments, it is highly desirable to establish a method for the systematic design of DNA sequences that can be applied to various design constraints. Recently, Arita proposed an interesting design method, called the template method, for DNA word design. The current paper discusses some extensions of the template method: we propose a variant of the template method and the use of multiple templates, and show their effectiveness. Furthermore, we also present some theoretical properties of templates.
1
Introduction
The design of DNA sequences is one of the most practical and important research topics in DNA computing ([Adl94]). In order to obtain successful results in biological experiments, we have to find a good DNA coding of the target computational problem. Furthermore, the design of DNA sequences is an important problem not only in DNA computing but also in other biotechnologies, such as the design of DNA chips. Although the design of DNA sequences depends on the protocol of the biological experiments, it is highly desirable to establish a method for the systematic design of DNA sequences that can be applied to various design constraints. Most previous works introduced some variant of the Hamming distance between sequences and proposed methods to minimize the similarity between sequences based on that measure ([GND97]). Typical and well-known approaches include combinatorial word design, random generation, genetic algorithms, etc. ([BKS00][DGR98][FLT97]). Recently, [AK02] proposed an interesting design method, called the template method, for DNA word design. The essence of the method is to use a DNA
The first author was supported in part by the Japan Society for the Promotion of Science, Grant-in-Aid for Exploratory Research No. 13878054, 2001.
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 205–214, 2003. © Springer-Verlag Berlin Heidelberg 2003
sequence representation by a pair of binary words, called template and code, respectively. The method proposes to design sequences in two steps. In the first step, the positions of [GC] and [AT] are fixed for all sequences. In the second step, the positions of [AG] and [CT] are determined for each sequence using the theory of error-correcting codes. Since the [GC] positions are fixed for all sequences, the GC content of each sequence is equal, which enables us to design DNA sequences with approximately the same melting temperature. A similar idea appeared in [FLT97], but [AK02] developed the idea further to obtain more powerful results. This paper contains some results that further extend the template method. We first propose a variant of the template method, in which we fix, not [GC] positions, but [AG] positions for all sequences. Although this AG-template itself does not satisfy the GC content constraint, we will show that it is useful to apply the theory of constant weight codes in order to satisfy that constraint. Next, we propose to use multiple templates instead of a single template in order to increase the number of DNA sequences that can be designed. Finally, we will derive some theoretical properties of templates.
2
Template Method
In this paper, we focus mainly on sets of words over the alphabet Σdna = {A, C, G, T} or Σ01 = {0, 1}. Words over Σ01 are used by the template method ([AK02]) for designing DNA sequences over Σdna. Let x = x1 · · · xn be a word over an alphabet Σ with xi ∈ Σ (i = 1, ..., n). By lg(x), we denote the length n of x. By x̃, we denote the proper subword of x of length n − 2. In case n < 2, x̃ is defined to be the empty word λ. The reverse of x, denoted by x^R, is the word x^R = xn · · · x1. If Σ = Σdna, the complement of x, denoted by x̄, is the word obtained by replacing each A in x by T and vice versa, and each C in x by G and vice versa. In case Σ = Σ01, x̄ is the word obtained by replacing each 0 in x by 1 and vice versa. Let y = y1 · · · ym be a word over Σ with yi ∈ Σ (i = 1, ..., m). In case n = m, the Hamming distance H(x, y) between x and y is the number of indices i such that xi ≠ yi. For a finite set S of words of the same length, H(S) is the minimum Hamming distance among all pairs of distinct elements of S. In case n ≤ m, we define: H_M(x, y) = min{ H(x, y′) | y′ is a subword of y of length n }. In case n > m, H_M(x, y) is defined to be n. The template method is intended to find a set of DNA sequences satisfying the following constraints. Let us consider the case when we are designing a set S of words over Σdna of the same length. 1. Hamming distance — For any pair of distinct words x, y in S, H(x, y) should be at least a given integer d.
2. Hamming distance with reverse-complement — For any pair of (possibly equal) words x, y in S, H(x, ȳ^R) should be at least a given integer d. 3. Hamming distance between concatenated words — For any (possibly equal) words x, y, z in S, H_M(x, yz), H_M(x, ȳ^R z̄^R), H_M(x, y z̄^R), and H_M(x, ȳ^R z) should be at least a given integer d. 4. GC content — The number of occurrences of G and C in a word x is called the GC content of x. This constraint requires that every word in S have the same GC content. GC content is an important indicator of the melting temperature of short oligonucleotides. We define: R(S) = min{ H_M(x, ȳ^R), H_M(x, yz), H_M(x, ȳ^R z̄^R), H_M(x, y z̄^R), H_M(x, ȳ^R z) | x, y, z ∈ S }, ||S|| = min{ H(S), R(S) }. Then, the problem above can be formulated as follows: Problem 1. For given positive integers d and n, design a set S of n words over Σdna such that ||S|| ≥ d and the GC content of each word in S is the same. In order to solve this problem, the template method ([AK02]) uses a representation of DNA sequences by pairs of binary words. Let us consider two homomorphisms φ, ψ : Σdna* → Σ01* such that φ(G) = φ(C) = 1, φ(A) = φ(T) = 0 and ψ(A) = ψ(G) = 1, ψ(C) = ψ(T) = 0. For a word w over Σdna, the binary representation of w is defined to be (φ(w), ψ(w)), where φ(w) and ψ(w) are called the GC-component and the AG-component of w, respectively¹. Note that the binary representation of w̄^R is (φ(w)^R, z̄^R), where z = ψ(w). The idea of the template method is summarized as follows. In the template method, the GC-component of every word in S is fixed, and this unique GC-component is called the GC-template of S². The GC-template of S is carefully chosen so that R(S) ≥ d holds for a given d. On the other hand, the AG-components of the words in S are designed so that H(S) ≥ d holds, for which we can use the theory of error-correcting codes. Furthermore, since the GC-component of every word in S is fixed, the GC content constraint is also satisfied.
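The distance measures used above can be sketched in Python. The function names below (hamming, h_m, complement, reverse_complement) are ours, not the paper's; the logic follows the definitions of Section 2 directly.

```python
# Hamming distance H and the sliding-window variant H_M from Section 2,
# plus the Watson-Crick complement over the DNA alphabet.

def hamming(x: str, y: str) -> int:
    """H(x, y): number of positions where two equal-length words differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def h_m(x: str, y: str) -> int:
    """H_M(x, y): minimum of H(x, y') over all length-|x| subwords y' of y.
    Defined to be |x| when |x| > |y|."""
    n, m = len(x), len(y)
    if n > m:
        return n
    return min(hamming(x, y[i:i + n]) for i in range(m - n + 1))

DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(w: str) -> str:
    """Watson-Crick complement of a DNA word (A<->T, C<->G)."""
    return w.translate(DNA_COMPLEMENT)

def reverse_complement(w: str) -> str:
    """The word written ȳ^R in the text: complement, then reverse."""
    return complement(w)[::-1]
```

For example, hamming("0101", "0011") is 2, and h_m("01", "0110") is 0 because "01" occurs as a subword of "0110".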
In order to find a GC-template of S such that R(S) ≥ d, it is useful to introduce the following notation for a binary word x: ||x|| = min{ H(x, x^R), H_M(x, xx), H_M(x, x^R x^R), H_M(x, x x^R), H_M(x, x^R x) }.
¹ In [AK02], the bitwise negation of φ(w) is called the template of w, and ψ(w) is called the code of w.
² In [AK02], the bitwise negation of the GC-template is simply called the template.
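The binary representation (φ(w), ψ(w)) determines a DNA word uniquely, which is what lets a fixed component act as a template while the other component carries the code. A small sketch, with the decode table read off directly from the definitions of φ and ψ (function names are ours):

```python
# GC-component phi marks G/C positions; AG-component psi marks A/G positions.
# The pair (phi(w), psi(w)) identifies each letter, so a DNA word can be
# reassembled from a (GC-component, AG-component) pair.

PHI = {"G": "1", "C": "1", "A": "0", "T": "0"}
PSI = {"A": "1", "G": "1", "C": "0", "T": "0"}
DECODE = {("1", "1"): "G", ("1", "0"): "C", ("0", "1"): "A", ("0", "0"): "T"}

def phi(w: str) -> str:          # GC-component
    return "".join(PHI[c] for c in w)

def psi(w: str) -> str:          # AG-component
    return "".join(PSI[c] for c in w)

def assemble(gc: str, ag: str) -> str:
    """Rebuild the DNA word from its (GC-component, AG-component) pair."""
    assert len(gc) == len(ag)
    return "".join(DECODE[(g, a)] for g, a in zip(gc, ag))

def gc_content(w: str) -> int:
    """Number of G/C letters, i.e. the weight of the GC-component."""
    return sum(c in "GC" for c in w)
```

Since gc_content equals the number of 1's in the GC-component, every word built on a common GC-template automatically satisfies the GC content constraint.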
Then, for finding a GC-template of S such that R(S) ≥ d, it suffices to find a binary word (GC-template) x such that ||x|| ≥ d. In [AK02], some theoretical analysis of GC-templates was presented. Furthermore, for lengths l ≤ 30, all of the best GC-templates were searched for exhaustively and are presented in the appendix of [AK02].
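Such an exhaustive search can be sketched as follows. A caveat: the exact treatment of self-overlaps in the H_M terms of ||x|| is not fully recoverable from this text, so the sketch below takes, for each concatenation of {x, x^R}, the minimum Hamming distance over all windows at proper shifts (0 < k < |x|), together with H(x, x^R). This is one plausible reading, to be treated as illustrative only; it does reproduce the value ||110100|| = 2 used as an example later in the paper.

```python
# Exhaustive-search sketch for GC-templates in the spirit of [AK02].
# gc_template_norm is our hedged reading of ||x||; best_templates scans all
# binary words of a given length (feasible only for small l).
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def gc_template_norm(x: str) -> int:
    n = len(x)
    xr = x[::-1]
    vals = [hamming(x, xr)]                      # the H(x, x^R) term
    for w in (x + x, xr + xr, x + xr, xr + x):   # the concatenation terms
        vals.extend(hamming(x, w[k:k + n]) for k in range(1, n))
    return min(vals)

def best_templates(l: int):
    """Maximum norm value at length l, and all words attaining it."""
    words = ["".join(b) for b in product("01", repeat=l)]
    best = max(gc_template_norm(w) for w in words)
    return best, [w for w in words if gc_template_norm(w) == best]
```

Note that palindromes (H(x, x^R) = 0) and periodic words such as 010101 (a shifted window of xx reproduces x) score 0, which matches the intuition that a template must disagree with every nontrivial alignment of itself.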
3
DNA Sequence Design Using AG-templates
In this section, we propose another natural approach, in which the AG-component of every word is fixed and this unique AG-component, called the AG-template, is used to satisfy R(S) ≥ d. For this purpose, we introduce the following notation for a binary word x: ⟨x⟩ = min{ H(x, x̄^R), H_M(x, xx), H_M(x, x̄^R x̄^R), H_M(x, x x̄^R), H_M(x, x̄^R x) }. We can easily derive the following fact:

Fact 1. For any binary word x such that ⟨x⟩ ≥ d and any set S of words over Σdna using x as an AG-template, R(S) ≥ d holds.

For each length l ≤ 30, we exhaustively searched for the best AG-templates x, i.e., those giving the maximum value of ⟨x⟩. The results are summarized in Table 1. The integers in round brackets give the number of AG-templates x attaining the maximum ⟨x⟩ value.

Table 1. Maximum ⟨x⟩ values of AG-templates

length  3      4      5       6       7       8       9       10      11      12
⟨x⟩     1 (4)  1 (8)  1 (24)  2 (14)  2 (32)  2 (92)  3 (44)  4 (4)   4 (20)  4 (358)

length  13      14     15     16       17       18         19       20       21       22
⟨x⟩     5 (20)  5 (8)  6 (4)  6 (232)  6 (956)  6 (11564)  7 (252)  8 (200)  8 (408)  8 (23510)

length  23       24       25        26         27        28       29       30
⟨x⟩     9 (848)  10 (24)  10 (208)  10 (27836)  11 (180)  12 (12)  12 (52)  12 (23056)

Compared to the best values of ||x|| in Table 2, AG-templates seem to give better values. In fact, at lengths l = 3, 4, 5, 9, 10, 13, 14, 15, 24, 25, 27, 28, 29, the maximum values of ⟨x⟩ are larger than those of ||x||. This feature of AG-templates gives us a chance to design a set of DNA sequences with a larger number of mismatches. Furthermore, even in the cases where the maximum values of ⟨x⟩ and ||x|| are the same, the number of templates attaining the maximum ⟨x⟩ value is greater than the number attaining the maximum ||x|| value. This superiority of AG-templates helps us design a set of DNA sequences more flexibly.
Table 2. Maximum ||x|| values of GC-templates

length  3  4  5  6      7       8       9        10       11      12
||x||   –  –  –  2 (2)  2 (24)  2 (52)  2 (180)  2 (446)  4 (12)  4 (74)

length  13       14       15       16      17       18       19      20      21      22
||x||   4 (436)  4 (604)  4 (704)  6 (48)  6 (268)  6 (704)  7 (20)  8 (12)  8 (76)  8 (3738)

length  23      24          25       26        27         28       29      30
||x||   9 (12)  8 (196894)  9 (352)  10 (1748)  10 (19920)  11 (36)  11 (4)  12 (704)
Although the use of a GC-template immediately satisfies the GC content constraint, the use of an AG-template further requires us to carefully design the GC-component of each word in order to satisfy the GC content constraint on S. To make the GC content of each word equal, we need to design a set of binary words each of which has the same number of occurrences of 1's. In order to solve this problem, it is quite useful to apply the well-developed theory of constant weight codes [BSS90].
Table 3 summarizes the new lower bounds on A*(l, d), defined as the maximum size |S| of a set S of words of length l such that ||S|| ≥ d. The values were obtained by using AG-templates and the constant weight codes listed in [BSS90].

Table 3. New lower bounds obtained by AG-templates

(l, d)     (10,4)  (15,6)  (24,10)  (25,10)  (28,12)
A*(l, d)   36      70      96       130      435
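The constant-weight step can be illustrated with a naive greedy stand-in for the published code tables (the real constructions use [BSS90]; the greedy search below is only a sketch and generally finds far fewer codewords than the tables):

```python
# Greedy constant-weight code: collect weight-w binary words of length l with
# pairwise Hamming distance >= d. With an AG-template fixed, using these words
# as GC-components gives DNA words that all share the same GC content.
from itertools import combinations

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def greedy_constant_weight_code(l: int, w: int, d: int):
    code = []
    for support in combinations(range(l), w):    # supports of weight-w words
        cand = "".join("1" if i in support else "0" for i in range(l))
        if all(hamming(cand, c) >= d for c in code):
            code.append(cand)
    return code

code = greedy_constant_weight_code(l=8, w=4, d=4)
```

Every codeword has exactly w ones, so every assembled DNA word has GC content w regardless of which codeword is chosen.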
4
Using Multiple Templates
In this section, we discuss the effectiveness of using multiple templates. Let T be a finite set of binary words of length n to be used as templates. Then, we define:
R(T) = min{ H(x, y^R), H_M(x, yz), H_M(x, y^R z^R), H_M(x, y z^R), H_M(x, y^R z) | x, y, z ∈ T },
R̄(T) = min{ H(x, ȳ^R), H_M(x, yz), H_M(x, ȳ^R z̄^R), H_M(x, y z̄^R), H_M(x, ȳ^R z) | x, y, z ∈ T },
||T|| = min{ H(T), R(T) },
⟨T⟩ = min{ H(T), R̄(T) }.
210
Satoshi Kobayashi, Tomohiro Kondo, and Masanori Arita
It is straightforward to see the following facts:

Fact 2. For any set T of binary words such that ||T|| ≥ d and any set S of words over Σdna using T as a set of GC-templates, R(S) ≥ d holds.

Fact 3. For any set T of binary words such that ⟨T⟩ ≥ d and any set S of words over Σdna using T as a set of AG-templates, R(S) ≥ d holds.

Remark: In case we use T as multiple GC-templates, T should be a constant weight code in order to satisfy the GC content constraint. On the other hand, in case we use T as multiple AG-templates, T need not be a constant weight code.

It is clear that for any set of multiple templates T with ||T|| ≥ d (⟨T⟩ ≥ d) and any element x ∈ T, ||x|| ≥ d (⟨x⟩ ≥ d) holds. Therefore, the existence of multiple templates T such that ||T|| ≥ d (⟨T⟩ ≥ d) can be checked by exhaustively searching all sets of words x with ||x|| ≥ d (⟨x⟩ ≥ d). In case we fail to find such a T, we can conclude that the maximum ||T|| (⟨T⟩) value is less than d. In this way, we can obtain some information about upper bounds on the ||T|| (⟨T⟩) values. In Table 4, we summarize the obtained results on the maximum ||T|| and ⟨T⟩ values of multiple templates, where the number of words in T is 2. In the representation [x, y], x and y are lower and upper bounds, respectively. Most of the lower bounds were obtained during the process of finding upper bounds in the manner described in the above paragraph. The suffix g indicates that the lower bound was obtained by applying genetic algorithms.
length ||T || T length ||T || T length ||T || T
3 – – 13 4 4 23 [7g ,8] 8
4 – – 14 4 [3,4] 24 8 [7,9]
5 – 1 15 4 [4,5] 25 8 [8,9]
6 – 1 16 [4,5] 5 26 9 [9,10]
7 – 2 17 5 5 27 9 [9,10]
8 2 2 18 6 6 28 [8,10] [8,11]
9 2 2 19 6g 6 29 [9g ,10] [9,11]
10 2 2 20 [6g ,7] [6,7] 30 [10,11] [10,12]
11 2 3 21 [6,7] [6,7]
12 4 4 22 [7,8] [7,8]
Table 4. Maximum ||T || and T values of multiple Templates T of size 2
This time, we could not find any superiority of the AG-template method over the GC-template method. However, in the case of multiple GC-templates, for each length l = 8, 9, 10, 12, 13, 14, 18, 24, there is no decrease in the number of mismatches compared to the single-template case. In particular, at lengths l = 12, 18, 24, l/3 mismatches are guaranteed.
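A candidate set T can be screened along the lines of Facts 2 and 3. As before, the exact alignment convention in the concatenation terms is an assumption on our part (we exclude only zero-shift windows), so treat the sketch as illustrative:

```python
# Screening a candidate multiple-template set T: H(T) is the minimum pairwise
# Hamming distance, and r_set collects the concatenation terms over all
# ordered triples x, y, z in T, using proper-shift windows.
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def h_set(T):
    """H(T): minimum Hamming distance among distinct members of T."""
    return min(hamming(x, y) for x in T for y in T if x != y)

def r_set(T):
    vals = []
    for x, y, z in product(T, repeat=3):
        n = len(x)
        yr, zr = y[::-1], z[::-1]
        vals.append(hamming(x, yr))
        for w in (y + z, yr + zr, y + zr, yr + z):
            vals.extend(hamming(x, w[k:k + n]) for k in range(1, n))
    return min(vals)

def norm_set(T):
    """A hedged analogue of ||T||: min of H(T) and the concatenation terms."""
    return min(h_set(T), r_set(T))
```

A set passes for threshold d when norm_set(T) >= d; by the observation above, every member must then individually be a valid single template.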
Table 5 summarizes the new lower bounds on A*(l, d) obtained by using multiple GC-templates of size 2 and the error-correcting codes listed in the appendix of [MS77]. Here, we list only the results related to sets of words with at least l/3 mismatches.

Table 5. New lower bounds obtained by multiple GC-templates

(l, d)     (12,4)  (18,6)  (24,8)  (26,9)  (27,9)
A*(l, d)   288     1024    8192    1024    1024
5
Some Properties of Templates
In this section, we focus on GC-templates x with x = x̄^R and AG-templates y with y = y^R. There exist some interesting relationships between templates with these properties. Table 6 summarizes the maximum ⟨x⟩ values of AG-templates x with the property x = x^R. The integers in round brackets give the number of templates x attaining the maximum ⟨x⟩ value. As shown in this table, the maximum numbers of mismatches under the constraint x = x^R are not much worse than those in Table 1. All of the values for lengths l ≥ 20 retain at least l/3 mismatches. In particular, at l = 6, 7, 8, 12, 17, 18, 21, 22, the values are equal to those in Table 1. Therefore, templates with these properties are investigated in detail in the rest of this section. Note that GC-templates x with the constraint x = x̄^R must be of even length, since x = x̄^R holds only if x is of even length. Interestingly enough, for l ≤ 30, we found that the best GC-templates x with the constraint x = x̄^R have the same best values as the best AG-templates with the constraint x = x^R.

Table 6. Maximum ⟨x⟩ values of AG-templates with x = x^R

length  3  4  5      6      7      8      9       10      11     12
⟨x⟩     –  –  1 (4)  2 (2)  2 (8)  2 (4)  2 (16)  2 (18)  3 (8)  4 (6)

length  13     14      15      16      17     18      19      20     21     22
⟨x⟩     4 (4)  4 (20)  4 (64)  4 (90)  6 (4)  6 (16)  6 (64)  7 (2)  8 (4)  8 (10)

length  23      24       25      26      27       28       29        30
⟨x⟩     8 (56)  8 (198)  9 (32)  9 (16)  10 (24)  10 (38)  10 (388)  11 (8)
Next we give an interesting relationship between GC-templates and AG-templates.
For binary words x, y, z, we define the binary word x(y, z) to be the word obtained by substituting y for each occurrence of 1 in x and z for each occurrence of 0 in x. Then, we have the following lemmas.

Lemma 1. Let x be a binary word of length l_x such that ||x|| = d_x and x = x̄^R, and let y be a binary word of length l_y such that ⟨y⟩ = d_y. Then ||y(x, x^R)|| ≥ min{ d_x l_y, d_y l_x } holds.

Proof. Let us first introduce a definition. For a word w, a word z is called a subword of w with shift k iff w′zw′′ = w and lg(w′) = k for some w′ and w′′. Let X = y(x, x^R), and let W be either XX, X^R X^R, X X^R, or X^R X. Let k be an integer with 0 < k < lg(X), and let Z be a subword of W of length lg(X) with shift k.

Let us consider the case of k ≢ 0 (mod l_x). We divide X and Z into l_y segments, each of which is of length l_x. Then, since ||x|| = d_x, it is straightforward to see that for any i (1 ≤ i ≤ l_y), the ith segment of X and that of Z differ in at least d_x positions. Therefore, we have H(X, Z) ≥ d_x l_y.

Let us consider the case of k ≡ 0 (mod l_x). Note that X^R = y^R(x^R, (x^R)^R) = y^R(x^R, x) = ȳ^R(x, x^R). Since ⟨y⟩ = d_y and x = x̄^R, it is straightforward to see that at least d_y segments of X and Z must be completely flipped. Therefore, we have H(X, Z) ≥ d_y l_x.

Finally, we consider the case of Z = X^R. Also in this case, in a similar manner as in the above paragraph, we have H(X, Z) ≥ d_y l_x. This completes the proof. (Q.E.D.)

Lemma 2. Let y be a binary word of length l_y such that ⟨y⟩ = d_y and y = y^R, and let x be a binary word of length l_x such that ⟨x⟩ = d_x. Then ⟨x(y, ȳ^R)⟩ ≥ min{ d_x l_y, d_y l_x } holds.

Proof. In a similar way as in the proof of Lemma 1. (Q.E.D.)
Using these lemmas, we can construct longer templates from shorter ones. For example, let us consider the AG-template y = 110 and the GC-template x = 110100, for which ⟨y⟩ = 1, ||x|| = 2, and x = x̄^R hold. Then, by Lemma 1, we have z = y(x, x^R) = 110100110100001011 and ||z|| ≥ min{2 · 3, 1 · 6} = 6. Since the length of z is 18, z can serve as a GC-template that guarantees l/3 mismatches. In this way, using these lemmas, we can construct GC-templates of length l = 36, 54, 60, 66, 72, ... with at least l/3 mismatches, and AG-templates of length l = 36, 51, 54, 60, 63, 66, 69, 72, ... with at least l/3 mismatches.

Let us examine the behavior of the following functions:
µ(l) =def max{ ||x||/l | x ∈ Σ01^l },   π(l) =def max{ ⟨x⟩/l | x ∈ Σ01^l }.
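The substitution operation behind the lemmas is mechanical to implement. We read x(y, z) so that the first argument replaces each 1 of the outer word and the second replaces each 0; this ordering is an interpretive choice on our part, but it reproduces the worked example z = y(x, x^R) above exactly:

```python
# The substitution operation x(y, z) of Section 5, plus the binary complement.
# compose(outer, one_block, zero_block): each 1 of `outer` becomes `one_block`,
# each 0 becomes `zero_block` (an assumed ordering, checked against the text's
# example below).

def compose(outer: str, one_block: str, zero_block: str) -> str:
    return "".join(one_block if b == "1" else zero_block for b in outer)

def complement01(w: str) -> str:
    return "".join("1" if b == "0" else "0" for b in w)

# The worked example: AG-template y = 110 and GC-template x = 110100.
x, y = "110100", "110"
assert x == complement01(x)[::-1]   # x equals its reversed complement
z = compose(y, x, x[::-1])          # y(x, x^R): the 18-bit template z
```

Running this yields z = 110100110100001011, the template given in the text.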
From Table 1 and Table 2, we can see that both functions take values near 1/3. In fact, by the same discussion as in Lemma 2.5 of [AK02], we can derive the following theorem.

Theorem 1. For any l > 2, µ(l) ≤ 1/2 and π(l) ≤ l/(2(l−2)) hold.
Furthermore, using Lemma 1 and Lemma 2, it is not difficult to see the following theorem.

Theorem 2. For infinitely many l's, µ(l) ≥ 11/30 and π(l) ≥ 11/30 hold.

Proof. Let us assume that there exist binary words x and y such that ||x|| ≥ d, ⟨y⟩ ≥ d, lg(x) = lg(y) = l, x = x̄^R, and y = y^R. We will prove that for any i ≥ 0, there exist binary words X and Y such that ||X|| ≥ d·l^i, ⟨Y⟩ ≥ d·l^i, lg(X) = lg(Y) = l^(i+1), X = X̄^R, and Y = Y^R. In case of i = 0, the claim follows immediately. Assume that the claim holds for i < k and consider the case of i = k. By the induction hypothesis, there exist X and Y satisfying ||X|| ≥ d·l^(k−1), ⟨Y⟩ ≥ d·l^(k−1), lg(X) = lg(Y) = l^k, X = X̄^R, and Y = Y^R.

Let X′ = y(X, X^R) and Y′ = y(Y, Ȳ^R). Then, by the above lemmas, we have ||X′|| ≥ d·l^k and ⟨Y′⟩ ≥ d·l^k. Furthermore, the complement of X′^R is the complement of y^R(X^R, X), which equals y^R(X̄^R, X̄) = y^R(X, X^R) = y(X, X^R) = X′, and Y′^R = y^R(Y^R, Ȳ) = y^R(Y, Ȳ^R) = y(Y, Ȳ^R) = Y′. Therefore, the claim holds.

From Table 6, we can select the parameters l = 30 and d = 11, which completes the proof. (Q.E.D.)
6
Conclusions
This paper presented some results that further extend the template method. We first proposed a variant of the template method, in which we fix, not the [GC] positions, but the [AG] positions for all sequences. We saw that AG-templates perform better at some lengths. Although an AG-template by itself does not satisfy the GC content constraint, we showed that the theory of constant weight codes can be used to satisfy that constraint. Next, we proposed the multiple templates method, which was shown to have the potential to increase the number of DNA sequences that can be designed. Finally, we derived some theoretical properties of templates.

However, the following problems remain to be studied in future work. First, structured mis-hybridization of DNA strands, such as bulge loops, is not considered in this method. The stability of such structures should be dealt with using energy-based methods. Second, the validity of the template method should be checked by biological experiments. In particular, it should be checked whether l/3 mismatches for sequences of length l is enough or not. Furthermore, most of the important properties of templates have not been revealed yet. In particular, the reason why the best templates of length l have approximately l/3 mismatches is not clear. Finally, the proposed method does not guarantee optimality in the number of sequences designed. Further theoretical analysis should be done in a more general framework.
References
[Adl94] L. Adleman, Molecular Computation of Solutions to Combinatorial Problems. Science 266, pp. 1021–1024, 1994.
[AK02] M. Arita and S. Kobayashi, DNA Sequence Design Using Templates, New Generation Computing, 20, pp. 263–277, 2002.
[BKS00] A. Ben-Dor, R. Karp, B. Schwikowski, and Z. Yakhini, Universal DNA Tag Systems: A Combinatorial Design Scheme, Proc. of the 4th Annual International Conference on Computational Molecular Biology (RECOMB2000), pp. 65–75, 2000.
[BSS90] A.E. Brouwer, J.B. Shearer, N.J.A. Sloane, and W.D. Smith, A New Table of Constant Weight Codes, IEEE Trans. on Information Theory, 36, pp. 1334–1380, 1990.
[DGR98] R. Deaton, M. Garzon, J.A. Rose, D.R. Franceschetti, R.C. Murphy, and S.E. Stevens Jr., Reliability and Efficiency of a DNA Based Computation, Physical Review Letters, 80, pp. 417–420, 1998.
[FLT97] A.G. Frutos, Q. Liu, A.J. Thiel, A.M.W. Sanner, A.E. Condon, L.M. Smith, and R.M. Corn, Demonstration of a Word Design Strategy for DNA Computing on Surfaces, Nucleic Acids Research, Vol. 25, No. 23, pp. 4748–4757, 1997.
[GND97] M. Garzon, P. Neathery, R. Deaton, R.C. Murphy, D.R. Franceschetti, and S.E. Stevens Jr., A New Metric for DNA Computing, In Proc. of 2nd Annual Genetic Programming Conference, Morgan Kaufmann, pp. 472–478, 1997.
[MS77] F.J. MacWilliams and N.J.A. Sloane, The Theory of Error-Correcting Codes, North-Holland, 1977.
From RNA Secondary Structure to Coding Theory: A Combinatorial Approach Christine E. Heitsch, Anne E. Condon, and Holger H. Hoos Department of Computer Science University of British Columbia 201-2366 Main Mall Vancouver, B. C. V6T 1Z4 {heitsch, condon, hoos}@cs.ubc.ca
Abstract. We use combinatorial analysis to transform a special case of the computational problem of designing RNA base sequences with a given minimal free energy secondary structure into a coding theory question. The function of RNA molecules is largely determined by their molecular form, which in turn is significantly related to the base pairings of the secondary structure. Hence, this is crucial initial work in the design of RNA molecules with desired three-dimensional structures and specific functional properties. The biological importance of RNA only continues to grow with the discoveries of many different RNA molecules having vital functions other than mediating the production of proteins from DNA. Furthermore, RNA has the same potential as DNA in terms of nanotechnology and biomolecular computing.
1
Introduction
Beyond their essential roles in living organisms, the synthetic importance of nucleotide molecules with particular functions continues to expand. For example, biomolecular computations utilize DNA and RNA molecules with specially designed structural properties [1, 8]. Other potential applications of RNA design include “nanorobotics and the rational synthesis of periodic matter,” as has been the goal for DNA of Seeman’s laboratory [9]. As such, the analysis and design of nucleotide structures is an important problem at the rapidly developing intersection of the biological, mathematical, and computational sciences. Although RNA molecules have been the focus of this work, the same principles apply to the design of DNA base sequences with desired structural arrangements. A significant initial step in the engineering of RNA molecules with desired functional properties would be solving the RNA secondary structure design problem of finding, when possible, a base sequence which folds to a given target RNA secondary structure. Previous work on RNA structure algorithms
This material is based upon work supported by the U.S. National Science Foundation under Grant No. 0130108, by the National Sciences and Engineering Research Council of Canada, and by GenTel Corp.
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 215–228, 2003. © Springer-Verlag Berlin Heidelberg 2003
has mainly focused on the reverse problem of predicting base pairings from a primary nucleotide sequence, under certain structural and thermodynamic assumptions. Although efficient algorithms have been developed for this prediction question, there is no known efficient deterministic procedure for RNA secondary structure design. The Vienna RNA Package of Schuster et al. [4] provides a simple stochastic local search algorithm which works well for the design of small secondary structures. Seeman [5] used a heuristic approach based on sequence symmetry minimization for a design problem closely related to the one studied in this paper. Here we focus on a significantly restricted version of the RNA secondary structure design question, with the ultimate goal of an efficient algorithmic solution for well-characterized subcases of the general problem. The special case considered in this paper is already nontrivial to resolve, and retains enough characteristics of the full RNA secondary structure design problem to be a very useful first step. A precise problem statement is provided in Section 4, and an outline of our current methodology can be found in Section 5. Section 2 gives an overview of the standard RNA thermodynamic model, while Section 3 illustrates its abstract mathematical interpretation. The example of Figure 1, however, captures both our choice of restrictions and our algorithmic approach, as well as the essential difficulty confronting any solution strategy. The basic assumption underlying current understanding of RNA secondary structure is that base sequences fold to minimize free energy. Under this hypothesis, the fundamental problem with RNA secondary structure design is ensuring the desired minimal free energy configuration of the constructed sequence. Intuitively, a sequence will fold to a configuration which minimizes loop costs while maximizing the beneficial stacked pairs. 
Thus, to preclude alternate configurations we must ensure that improvements in loop energies are offset by the penalty of lost base pairs. Our simple design strategy isolates loop stretches from helical segments, enabling the clear understanding of helix “mismatches” found in Section 6 and an efficient analysis of the energy trade-offs among various possible loop arrangements as given in Section 7. Essentially, each base sequence and corresponding complement assigned to a stem must be as different as necessary from all others. Theorem 2 of Section 8 gives a constructive bound on the helix “quality” guaranteeing a unique minimal secondary structure among a subset of all possibilities for the corresponding sequence. Thus, our major advancement in this work is a quantification of “as different as necessary” for a particular subset of secondary structures and a certain class of primary base sequences. A secondary contribution is the theoretical framework surrounding our main result, which may facilitate further insights in the investigation of the nucleotide structure design problem.
2 RNA Secondary Structure and the Free Energy Model
Like DNA, the primary RNA structure is an oriented linear sequence of four nucleotides, denoted A, U, C, and G, with chemically distinct ends referred to as 5′ and 3′. These nucleotides may form the usual Watson-Crick, or so-called
From RNA Secondary Structure to Coding Theory
(Panels (a)–(c) of Fig. 1 are structure diagrams; panel (d) is the following table of loop energies.)

Loops   1(b)    1(c)
L1     -1.70   -1.70
L2      4.10    4.10
L3      4.40    0.80
L4     -0.40    2.90
L5      4.40    4.50
Fig. 1. (a) The simple secondary structure of S. cerevisiae Phe-tRNA at 37°C. The destabilizing effects of single-stranded regions, or “loops,” are counterbalanced by the beneficial negative free energies of successive stacked base pairs, or “stems.” (b) A basic design method attempts to replicate the structure by wrapping a single strand around its geometric interpretation, assigning A’s to loop segments and the unrelated Watson-Crick base pairs, C−G and G−C, to helical stretches. (c) Without careful construction, the sequence exploits symmetries to reduce loop costs and folds to an alternate minimal energy configuration with fewer “expensive” hairpin loops. (d) This table lists the different energies for the corresponding loops Li for the structures in 1(b) and 1(c) respectively. All foldings courtesy of mfold version 3.1 by Zuker and Turner [10, 2], available online via http://www.bioinfo.rpi.edu.
canonical, base pairings (namely A − U, U − A, G − C, and C − G) as well as other non-canonical matches. Self-bonding forces the single RNA strand into a potentially complicated arrangement of stabilizing helical regions, or “stems,” connected by single-stranded regions, or “loops.” The RNA secondary structure is characterized as the set of base pairs, including the “wobble” pairing of G and U, inducing these geometric structural arrangements.

Definition 1. Let R = b1 b2 . . . bn be an RNA sequence of length n. For 1 ≤ i < j ≤ n, let i · j denote the pairing of base bi with bj. A secondary structure of R is a set P of base pairs such that, for all i · j, i′ · j′ ∈ P, i = i′ if and only if j = j′.

A basic premise is that RNA molecules will assume foldings which minimize the overall free energy. There currently exist well-regarded and widely-used algorithms such as Zuker’s mfold [10, 2] which implement an efficient recursive calculation of the minimum free energy configuration under this model. Experimentally derived parameters are used in these computations, such as the RNA thermodynamic values provided by the Turner group [6] and the corresponding ones for DNA calculated by the SantaLucia group [3]. However, the computational efficiency of the current recursive methods is obtained at the expense of a class of foldings, called pseudoknots, which can be conceptualized as “switchbacks” in the RNA secondary structure. Without pseudoknots, all base pairings can be considered as “inside” the planar representation of an RNA secondary structure. Under this exclusion, the free energy can be efficiently calculated by decomposition into an independent sum over the loops and stacked pairs of the interior.

Definition 2. An RNA secondary structure P includes a pseudoknot if there exist two base pairs i · j, i′ · j′ ∈ P with i < i′ < j < j′.

Definition 3. A base bk or base pair k · l ∈ P is accessible from a base pair i · j ∈ P if i < k (< l) < j and if there is no other base pair i′ · j′ ∈ P such that i < i′ < k (< l) < j′ < j.

Definition 4. The loop closed by a base pair i · j ∈ P, denoted L(i · j), is the collection of bases and base pairs accessible from i · j. Note that L(i · j) does not include the closing base pair i · j.

According to the above terminology, a stacked pair is formed by a closing base pair i · j whose loop L(i · j) contains exactly the base pair (i + 1) · (j − 1). In succession, stacked pairs form a helical segment, or stem, which stabilizes the secondary structure. For the purposes of this work, we will generally reserve the term “loop” for destabilizing components containing unpaired bases. Loops are distinguished according to whether they contain 0, 1, or more base pairs. Let the term k-loop refer to a loop having k − 1 accessible base pairs, totaling k base pairs including the closing one. Intuitively, k different stems radiate out from a k-loop; the
central loop, labeled L4 , in Figure 1(b) is a 4-loop. A 1-loop, such as L2 in both Figure 1(b) and 1(c), is also commonly known as a hairpin. In general, a 2-loop is called an internal loop, as in the case of L4 in Figure 1(c), except for bulges which have all unpaired bases occurring on only one side. For k ≥ 3, a k-loop is simply called a multiloop. Typically, the energy of a 1-loop or 2-loop is the sum of several terms. In our model, the relevant values are the entropic term, which depends only on the number of unpaired bases in the loop, and the beneficial stacking interaction between a base pair and the adjacent single-stranded nucleotide. In general, these single-stranded stacking energies, also known as the terminal mismatch energies, depend on the orientation of the closing base pair so the values for C − G and G − C are not necessarily symmetric. The standard affine linear energy function for the entropic term of k-loop energies when k > 2 is chosen primarily for computational convenience since so little is known experimentally about the stability of multiloops. The external loop of an RNA secondary structure is the set of bases and base pairs without a closing base pair. The loop L1 in Figures 1(b) and 1(c) is an example of an external loop. For arbitrary RNA secondary structures, it will be denoted Le . The current model assumes that the external loop has no conformal constraints, and hence no associated entropic costs. Thus, it must be treated distinctly from all other loops.
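The loop taxonomy above can be made concrete with a short sketch. The following Python code is illustrative only, not from the paper; the pair representation (a set of (i, j) index pairs with i < j) and all function names are our own assumptions.

```python
# Illustrative sketch: decompose a pseudoknot-free secondary structure,
# given as base pairs (i, j) with i < j, and classify each loop L(i.j).
# Names and representation are assumptions, not the authors' code.

def has_pseudoknot(pairs):
    """Definition 2: some pairs i.j, i'.j' with i < i' < j < j'."""
    ps = sorted(pairs)
    return any(
        i < i2 < j < j2
        for a, (i, j) in enumerate(ps)
        for (i2, j2) in ps[a + 1:]
    )

def classify_loops(pairs):
    """Label each closing pair by the kind of loop it closes."""
    assert not has_pseudoknot(pairs)
    partner = {}
    for i, j in pairs:
        partner[i], partner[j] = j, i
    labels = {}
    for i, j in pairs:
        accessible, unpaired = [], 0
        k = i + 1
        while k < j:
            if k in partner:                  # opening of a nested helix
                accessible.append((k, partner[k]))
                k = partner[k] + 1            # skip to just past it
            else:
                unpaired += 1
                k += 1
        k_loop = len(accessible) + 1          # count the closing pair too
        if k_loop == 1:
            labels[(i, j)] = "hairpin"
        elif k_loop == 2 and unpaired == 0:
            labels[(i, j)] = "stacked pair"
        elif k_loop == 2:
            p, q = accessible[0]
            one_sided = p == i + 1 or q == j - 1
            labels[(i, j)] = "bulge" if one_sided else "internal loop"
        else:
            labels[(i, j)] = "multiloop"
    return labels
```

For the toy structure {(1, 10), (2, 9), (4, 8)}, for example, the pair (1, 10) closes a stacked pair, (2, 9) a bulge, and (4, 8) a hairpin.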
3 Plane Trees and RNA Foldings
Much of the essential arrangement of loops and stems in an RNA secondary structure is captured by a special type of graphical object known as a plane tree. Specifically, as observed in the previous section, the exclusion of pseudoknots induces an interior/exterior orientation to an RNA secondary structure. We utilize this fact to abstract a given set of base pairs to their geometric “skeleton.” Helical segments are associated with edges, and loops with vertices. We preserve information about the length of the stems as a weight on the corresponding edge. Hence, the basic arrangement of an RNA secondary structure may be described by a weighted plane tree, such as in Figure 1(b). Definition 5. [7] A plane tree is a rooted tree whose subtrees at any vertex are linearly ordered. The ordering is sufficient to distinguish any vertex of a plane tree, and so labels are unnecessary. The unique root vertex of a plane tree corresponds to the distinct external loop of an RNA secondary structure. According to common graph terminology, a vertex is the “child” of the connected vertex one edge closer to the root. Vertices with no children are called leaves and, for our purposes, correspond to the hairpin loops of an RNA secondary structure. Definition 6. [7] A plane tree vertex with n children has degree n.
In our association of plane trees and RNA secondary structures, the degree of a vertex corresponds to the number of base pairs in the loop, excluding the closing pair. Thus a k-loop has k − 1 accessible base pairs and corresponds to a vertex of degree k − 1 in the associated plane tree. We say that a plane tree corresponds to an RNA secondary structure if the arrangement of vertices and edges is the same as that of loops and stems. With the additional information of the loop segment lengths, a weighted plane tree completely specifies the desired configuration of base pairs.
4 Restricted RNA Secondary Structure Design
The question of designing RNA base sequences with desired secondary foldings has important computational, as well as biological, implications. The general problem may be precisely stated as: Given the specification of an RNA secondary structure, return a primary nucleotide sequence whose minimal energy configuration has the desired base pairings under the current free energy model. We consider here a special case capturing many essential aspects of this difficult problem. Specifically, we impose restrictions on the input structure, output sequences, and potential reconfigurations. To begin, in keeping with most RNA prediction algorithms, we exclude pseudoknots. This permits an input configuration abstractly described by a plane tree T. We will also allow specification of the loop segment lengths and edge weights / stem lengths. These input parameters are subject to the requirement that, for any output strand, the lowest free energy structure among those corresponding to the desired configuration T must have the given loop structures. We call a set of base pairs satisfying these input constraints a restricted structure. Additionally, our constructed sequences must satisfy the loop-protecting property: all intended loop segments should remain unpaired in any alternate configuration. This requirement is enforced by restricting to a three letter alphabet {A, C, G} and assigning A exactly to the unpaired segments. The current thermodynamic model predicts no base pair energetic interactions between A and C or G. Hence, under this restriction, the number of base pairs in any alternate configuration cannot exceed the original count. As the final restriction, our design must be sufficiently good to preclude any alternate minimal energy configurations from a particular subclass of structures. In our loop-protecting RNA model, there can be no interaction between the intended A loop segments and the C − G, G − C base pairs forming the helical stretches.
However, various {C, G} segments may align in the minimal energy configuration even though they are not exact or intended complements. Hence, a helix from the target structure is said to be partially conserved in another pseudoknot-free configuration when a {C, G} nucleotide segment forms base pairs with exactly one other segment of the strand; a helix is fully conserved when it pairs with its intended complement. The set of alternate configurations with partially or fully conserved helices will be called helix-preserving for a given strand and target structure.
Thus, given as input a restricted RNA secondary structure, we investigate efficient algorithmic methods for generating a loop-protecting sequence which will not fold into any helix-preserving alternate configuration.
5 A Constraint Satisfaction Solution Strategy
Subject to our current restrictions, our solution must encode the given minimal free energy secondary structure configuration into a primary nucleotide sequence. Accordingly, for an RNA molecule with n stems, we need to produce n strings over the alphabet {C, G}, s1, . . . , sn. They and their Watson-Crick complements, s̄1, . . . , s̄n, would then be appropriately arranged into a single linear strand interspersed by A stretches of the desired length. Thus, our output will have the form R = (l0, h1, l1, h2, . . . , l2n−1, h2n, l2n) where hi ∈ {s1, . . . , sn, s̄1, . . . , s̄n} ⊂ {C, G}+ and lj ∈ {A}∗. We let RH = (h1, h2, . . . , h2n) be the intended helical segments of R, while the loop regions are denoted RL = (l0, l1, . . . , l2n−1, l2n). We accept as input a plane tree T with edges ej and weights wj for 1 ≤ j ≤ n as well as the specification of the loop segments RL = (l0, l1, . . . , l2n−1, l2n) where li ∈ {A}∗. In keeping with known thermodynamic constraints on RNA secondary structure, we have two restrictions on the possible lengths of the loop segments. If li is the single-stranded segment intended to form a 1-loop, then |li| ≥ 3. Likewise, there cannot exist i and j such that li and lj form a 2-loop and |li| = 0 = |lj|. According to the free energy model, for a weighted plane tree T representing the given restricted RNA secondary structure with loop segments RL, we can calculate the lowest free energy of strand R in that abstract configuration, E(R, T). We will divide this energy value into two components – one involving all the loop segments RL, denoted EL(R, T), and the other, EH(R, T), for the energies associated with RH. By our input assumption, EL(R, T) depends only on the lengths of the li and the single-stranded stacking interactions with the base pairs in the loops, since the lowest value of E(R, T) corresponds to the foldings with the given loop segments RL.
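The output form R = (l0, h1, l1, . . . , h2n, l2n) can be illustrated with a short sketch. The stems, loop lengths, and function names below are hypothetical examples for demonstration, not part of the paper's construction.

```python
# Illustrative sketch: assemble the output strand
# R = l0 h1 l1 ... l_{2n-1} h_{2n} l_{2n} from a stem assignment
# (h_1, ..., h_{2n}) over {C, G} and loop lengths for the A segments.
# The example stem below is made up for demonstration.

COMPLEMENT = {"C": "G", "G": "C"}

def wc_complement(s):
    """Watson-Crick complement, reversed to read 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(s))

def assemble_strand(helix_segments, loop_lengths):
    """Interleave 2n helix segments with 2n + 1 runs of A's."""
    assert len(loop_lengths) == len(helix_segments) + 1
    parts = ["A" * loop_lengths[0]]
    for h, l in zip(helix_segments, loop_lengths[1:]):
        parts.append(h)
        parts.append("A" * l)
    return "".join(parts)

# One stem s1 used as (h1, h2) = (s1, s1-bar): a single hairpin
# with a 4-base A loop between the two sides of the helix.
s1 = "CCGCG"
R = assemble_strand([s1, wc_complement(s1)], [0, 4, 0])
# R == "CCGCG" + "AAAA" + "CGCGG"
```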
It may be, however, that the energy E(R, T) is not minimal over all other helix-preserving configurations T′. If not, there would exist at least one T′ such that the lowest free energy of strand R in a configuration corresponding to T′, E(R, T′), is lower than E(R, T). To preclude such an occurrence, we must ensure that any improvement in the energies of the structures involving the predetermined loop segments RL is offset by the loss of beneficial stacked pairs in the remaining components. More specifically, we bound from below the E(R, T′) value by the sum of two components determined by our loop-protecting strand. Thus, we have EL(R, T′) which represents the lowest free energy associated with the loop structures / vertices of configuration T′ which include all the bases from RL (and possibly some unpaired bases from the ends of RH). Likewise, EH(R, T′) is then defined to be the lowest free energies for the bases associated with the edges of T′, which includes most of RH, and corresponds to the helices
which are conserved, partially or fully, when strand R is in configuration T′. Hence, we will refer to EH(R, T′) as the “helix energy” component of E(R, T′). Consequently, configuration T will be the minimal energy secondary structure of a loop-protecting strand R among helix-preserving foldings as long as, for all alternate T′ with the same number of vertices and edges:

E(R, T) − E(R, T′) ≤ [EH(R, T) − EH(R, T′)] + [EL(R, T) − EL(R, T′)] ≤ 0
Thus, an alternate minimal energy configuration of a primary base sequence will be prevented by ensuring that any benefit from improving the loop arrangements does not outweigh the cost in free energy terms for the mismatched helical segments. Towards this end, we will introduce a notion of “quality” with respect to a set of nucleotide strings which is a measure of their mutual differences, in a thermodynamic sense. An RNA strand has a q-quality design if it is loop-protecting and the quality of its helical segments is at least q. Understanding the interplay between loop arrangements and the loss of stacked base pairs in helical mismatches is essential to the design of a strand with a specific minimal energy configuration. Given a bound on loop energy improvements, the design problem reduces to finding base sequences for stems which satisfy these constraints, thereby precluding any beneficial trade-off for an alternate configuration. The value of this reduction is that the latter problem is amenable to solution using RNA word design techniques. Hence, given a specified input from a subclass of RNA secondary structures, we provide a means of calculating a lower bound on the quality, as a function of the input, which is sufficient to preclude a large number of alternate configurations. Further analysis will be necessary to extend the method beyond the subset of helix-preserving configurations to all possible alternate secondary structures.
6 Helix Mismatches in Alternate Configurations
Plane trees naturally fall into distinct equivalence classes according to the number of edges. Since a tree always has one more vertex than edge, we can also partition plane trees according to the number of vertices. Thus, let Tn be the set of plane trees having n edges and n + 1 vertices. For an RNA strand R designed to have n helices in configuration T, we are concerned about the free energy minimality of other configurations T′ ∈ Tn. This corresponds to restricting our attention to the mismatches in helix-preserving alternate structures. We will use two other equivalent plane tree representations. The first is a string over the set {1, 2, . . . , n} such that each number appears exactly twice and there are no subsequences of the form ijij. (This restriction corresponds to the pseudoknot exclusion in RNA secondary structures.) The second follows easily from the first by replacing each pair of numbers by the endpoints of an arc, pictured as n nonintersecting semi-circles whose 2n endpoints all lie below them on the same line. See Figure 2 for an example. Because each number appears exactly twice, there is no ambiguity in the assignment of arcs to numbered pairs.
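The no-ijij condition on the string representation is equivalent to the two occurrences of each number nesting like balanced parentheses, which gives a simple stack-based check. A hedged sketch (function name is our own):

```python
# A possible check (names assumed) that a sequence over {1, ..., n} is
# a valid plane-tree specification: each number appears exactly twice
# and there is no crossing subsequence i j i j, i.e. the two
# occurrences of each number nest like balanced parentheses.

from collections import Counter

def is_plane_tree_string(seq):
    counts = Counter(seq)
    if any(c != 2 for c in counts.values()):
        return False
    stack = []
    for x in seq:
        if stack and stack[-1] == x:
            stack.pop()          # second occurrence closes its arc
        else:
            stack.append(x)      # first occurrence opens an arc
    return not stack             # crossings leave unclosed arcs behind

# The trees of Figure 2 pass; the crossing pattern 1 2 1 2 does not.
```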
Definition 7. Given T, T′ ∈ Tn, there are m mismatches between T and T′ if there are m arcs from the arc specification of T whose endpoints align with nonequal numbers from the string over {1, 2, . . . , n} for T′. Note that two different plane trees with n edges can have at most n mismatches and must have at least two. However, although mismatches are symmetric, they are not necessarily additive since the mismatches between T, T′ and T′, T′′ may “propagate.” Hence, let Tn,i(T) be the set of trees T′ ∈ Tn having up to i mismatches between T and T′.
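Definition 7 suggests a direct computation: take the arcs of T and compare the numbers at their endpoint positions in the string for T′. A small illustrative sketch, with names of our own choosing:

```python
# Sketch of Definition 7 (names assumed): count the arcs of T whose two
# endpoint positions carry unequal numbers in the string for T'.

def arc_positions(seq):
    """Map each number to the pair of positions where it occurs."""
    pos = {}
    for p, x in enumerate(seq):
        pos.setdefault(x, []).append(p)
    return pos.values()

def count_mismatches(t, t_prime):
    assert len(t) == len(t_prime)
    return sum(
        1 for (p, q) in arc_positions(t) if t_prime[p] != t_prime[q]
    )
```

On the two strings of Figure 2 this yields two mismatches, in agreement with the figure caption, and the count is the same in either direction.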
(a) Fig. 1(b) helices:  1 2 2 3 3 4 4 1
(b) Fig. 1(c) helices:  1 2 2 3 4 4 3 1
(c) The mismatches:     1 2 2 3 3 4 4 1
Fig. 2. The first two figures illustrate the arc and string specification for the plane trees representing the corresponding RNA secondary structures from Fig. 1. They have two mismatches as shown in 2(c).

For a strand R = l0 h1 l1 . . . l2n−1 h2n l2n, we need a better understanding of the difference in helix energy, EH(R, T) − EH(R, T′), between the desired configuration T and a potential reconfiguration T′. We can further refine the helix energy calculations according to the plane tree edges corresponding to the helical segments of R. The helix energy associated with an edge is the minimum free energy of the corresponding two nucleotide segments 5′−hi−3′, 5′−hj−3′, denoted Ee(R, T) or Ee′(R, T′), for edges e ∈ T or e′ ∈ T′ respectively. When hi = h̄j, as is always the case for Ee(R, T), this is just the sum of the stacked pair energies. In the “mismatched” case, when hi and hj are not Watson-Crick complements, it is still possible to calculate the free energy as the minimum over all possible partial alignments. (Alternately, the energy could be estimated by using some (generalized) Hamming distance between the strings times the minimum energy of a stacked base pair.) We note that when hi = h̄j for an edge e ∈ T as well as for e′ ∈ T′, then Ee(R, T) = Ee′(R, T′). In this case, the energy component for the helix hi, h̄j does not enter into the difference EH(R, T) − EH(R, T′). Thus, we need only consider the energies for helices which are only partially conserved in T′ in order to calculate the difference in helix energies for an alternate configuration.
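The parenthetical Hamming-distance estimate can be sketched as follows. This is a deliberately crude stand-in for the true minimum over partial alignments; the per-stack energy E_STACK is a placeholder constant, not a measured thermodynamic parameter.

```python
# Hedged sketch of the estimate mentioned above: approximate the energy
# of two helical segments h_i, h_j that are not exact Watson-Crick
# complements by counting complementary positions of h_i against the
# reversed h_j and weighting by a per-stack energy. E_STACK = -3.0 is a
# placeholder value, not a Turner parameter.

E_STACK = -3.0                      # hypothetical bonus per stacked pair
PAIRS = {("C", "G"), ("G", "C")}

def helix_energy_estimate(hi, hj):
    """Crude H(i, j): complementary matches of hi against reversed hj."""
    matches = sum(
        1 for a, b in zip(hi, reversed(hj)) if (a, b) in PAIRS
    )
    return matches * E_STACK
```

A fully complementary pair of segments recovers the full bonus, while each mismatched position forfeits one E_STACK unit, which is the trade-off the design must make unprofitable.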
7 Bounding Possible Loop Energies
In order to analyze the potential benefit of a configuration other than the one for which an RNA strand was designed, we must be able to bound from below
the loop energies for a class of structures. Recall that the minimum free energy calculation is a sum of the independent loop energies. Hence, for a given strand R = l0 h1 l1 . . . l2n−1 h2n l2n and an alternate plane tree configuration T′ with n + 1 vertices v′, the lower bound on the loop component of the free energy is the sum EL(R, T′) = Σ_{v′ ∈ T′} Ev′(R, T′), where RL = (l0, l1, . . . , l2n−1, l2n) and Ev′(R, T′) is a lower bound on the energy of the loop corresponding to vertex v′. Furthermore, the calculation of lower bounds on loop free energies in our model is a function of the number of single-stranded bases as well as the number and composition of the base pairs, with special cases for 1-loops, 2-loops, and the external loop. Thus, to calculate each Ev′(R, T′), except for the root node, we only need to know the number of base pairs and single-stranded nucleotides. But a vertex of degree i represents a loop containing i + 1 base pairs, including the closing one. Additionally, there are i + 1 associated single-stranded regions, containing lj1 , . . . , lji+1, so that a lower bound on the total length is easily calculated. Since the situation is similar, although slightly more complicated, for the root node/external loop, we have that Ev′(R, T′) can be calculated in time O(i) for a vertex of degree i. However, we are interested in the bounds on the possible energies among all helix-preserving configurations. The necessary minimum value may be calculated by adapting the standard dynamic programming method for RNA secondary structure prediction.

Definition 8. Let M(R, T) be the minimum over all lower bounds EL(R, T′) for T′ ∈ Tn and T′ ≠ T.

Theorem 1. There is an efficient algorithm to compute M(R, T) under the current energy model.
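The per-vertex lower bound described above can be sketched with an affine multiloop term and lookup tables for small loops. All numeric parameters below are placeholders for illustration, not the Turner values, and the favorable single-stranded stacking terms are omitted for simplicity.

```python
# Sketch of a per-vertex loop bound, computable in O(1) from the degree
# and unpaired-base count (O(i) overall once the i + 1 segment lengths
# are summed). MULTI_* coefficients and the small-loop tables are
# invented placeholders, not measured parameters.

MULTI_INIT, MULTI_BRANCH, MULTI_UNPAIRED = 4.6, 0.4, 0.1
HAIRPIN_BY_SIZE = {3: 5.4, 4: 5.6, 5: 5.7}      # illustrative values
INTERNAL_BY_SIZE = {1: 3.9, 2: 4.1, 3: 4.5}

def loop_lower_bound(degree, unpaired):
    """Lower-bound the energy of a k-loop, where k = degree + 1."""
    k = degree + 1
    if k == 1:                                   # hairpin
        return HAIRPIN_BY_SIZE.get(unpaired, 6.0)
    if k == 2:                                   # internal loop / bulge
        return INTERNAL_BY_SIZE.get(unpaired, 5.0)
    # multiloop: the standard affine form, one term per branch and base
    return MULTI_INIT + k * MULTI_BRANCH + unpaired * MULTI_UNPAIRED

def loop_energy_bound(tree_vertices):
    """Sum the independent per-vertex bounds for a configuration T'."""
    return sum(loop_lower_bound(d, u) for d, u in tree_vertices)
```

The independent sum mirrors the loop decomposition of the energy model; the true EL(R, T′) would additionally subtract terminal-mismatch stacking bonuses and handle the external loop as a special case.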
8 The Quality of an RNA Encoding
For a strand R designed to be in configuration T, the maximum difference in loop energies over all possible helix-preserving alternate configurations is EL(R, T) − M(R, T). In order for T to be the minimum free energy configuration in the class Tn, it must be that this improvement is offset by the loss of stacked pairs. Hence, the helical segments must be of a certain “quality.”

Definition 9. For a strand R and two plane tree configurations T and T′, let Q(R, T, T′) = EH(R, T) − EH(R, T′).

If the value of Q(R, T, T′) is negative, then the configuration T is a more beneficial one for strand R, in terms of (partially conserved) helix energies, than the arrangement T′. And vice versa for a positive value of Q(R, T, T′), since a more negative free energy is optimal. In order to produce the necessary helical segments to solve our RNA secondary structure design problem, we must be able to generate sufficient strings and assign them to the edges of the input structure T. Suppose that S =
{s1, . . . , sk} is a set of distinct strings over the alphabet {C, G} with k ≤ n. Let S̄ be the set of Watson-Crick complements of S; s ∈ S if and only if s̄ ∈ S̄. Recall that the edges ej of T have weights wj for 1 ≤ j ≤ n and that T may be specified by a string over {1, . . . , n}, T = (f1, . . . , f2n). We identify an edge ej with the two instances j = fi = fi′ for 1 ≤ i, i′ ≤ 2n. We say that α(S, T) = (h1, . . . , h2n) = RH is a stem assignment of S to T if for every edge ej, for i and i′ such that j = fi = fi′, there exists 1 ≤ l ≤ k such that hi = sl, hi′ = s̄l, and |sl| = wj. Let A be the set of all stem assignments of S to T.

Definition 10. The quality of S with respect to T up to i mismatches is

Qi(S, T) = min_{α ∈ A} max_{T′ ∈ Tn,i(T)} Q(α(S, T), T, T′)
For the moment, we are concerned with the quality of S over all possible helix-preserving alternate structures T′ ∈ Tn,n(T) = Tn. Because we have restricted to loop-protecting RNA strands over {C, G, A}, a stem assignment α of S to T will typically have the maximum number of stacked base pairs, and hence the most negative value possible for EH(R, T). The closer another arrangement T′ comes to preserving this value, the less negative the quantity Q(α(S, T), T, T′) will be. Thus, by maximizing the value of Q(α(S, T), T, T′) over a set of potential configurations, we obtain a measure of the minimum loss in free energy per misaligned helix. We can then optimize this value by choosing the best possible stem assignment – the one which minimizes it. For a secondary structure specification, we need to determine a lower bound on the quality of the set of code strings S which would prevent the corresponding primary sequence from alternate helix-preserving foldings.

Theorem 2. Let T ∈ Tn be the specification of a restricted RNA secondary structure with stem lengths w1, . . . , wn and loop segments RL = (l0, . . . , l2n) with li ∈ {A}∗. Suppose that S is a set of strings over {C, G} of quality Qn(S, T) = −(EL(R, T) − M(R, T)). Then there exists a stem assignment α(S, T) = RH such that the helix-preserving minimum free energy configuration of R has the plane tree structure T.

Proof. Let α be a stem assignment which minimizes Qn(S, T). Now, for another arbitrary helix-preserving configuration T′ ∈ Tn of R with RH = α(S, T):

E(R, T) − E(R, T′) ≤ [EH(R, T) − EH(R, T′)] + [EL(R, T) − EL(R, T′)]
                   ≤ [EH(R, T) − EH(R, T′)] + [EL(R, T) − M(R, T)]
                   = Q(R, T, T′) + [EL(R, T) − M(R, T)]
                   = Q(R, T, T′) + (−Qn(S, T))
                   ≤ Qn(S, T) + (−Qn(S, T))
                   = 0

Hence, T has free energy no greater than that of any other T′, which proves the theorem.
9 Methods for Optimizing the Quality Calculation
Theorem 2 guarantees the existence of a strand R which folds to the desired minimal free energy configuration T provided we can generate a set of strings S with a certain quality Qn(S, T). Determining whether a candidate set of strings has the necessary quality is a nontrivial task, however, since it involves considering all possible helix-preserving alternate configurations for any suitable stem assignment. Since the number of plane trees with n vertices is the nth Catalan number, Cn = (1/(n + 1))·(2n choose n), simply maximizing over T′ ∈ Tn would be an exponential calculation. We can approximate this aspect of the quality calculation, though, by restricting the number of alternate configurations which we have to consider. Recall that the set Tn,i(T) gives the trees T′ ∈ Tn having up to i mismatches with T. Hence, we can restrict to calculating the quality of S with respect to T only up to m mismatches, Qm(S, T), where m depends on our ability to approximate Q(α(S, T), T, T′) for T′ ∈ Tn \ Tn,m(T). Specifically, we consider pairs (hi, hi′) from RH = (h1, . . . , h2n) where hi and hi′ do not correspond to two sides of the same edge in T. As we did with Ee′(R, T′) for helices partially or fully conserved in T′, we can calculate the energy of these two helical segments hi and hi′, denoted H(i, i′). Then we know that the calculation of EH(R, T′), for any T′ ∈ Tn \ Tn,m(T) having greater than m mismatches with T, must include the sum Σ_{j=1}^{m} H(ij, ij′), where each ij and ij′ occurs in at most one term since each helix hij can pair with exactly one other hij′.

Then we can formulate the following constraint based solution to our restricted secondary structure design problem. Again, suppose that S = {s1, . . . , sk} is a set of distinct strings over the alphabet {C, G} with k ≤ n and S̄ the set of Watson-Crick complements of S. Assume that α(S, T) = (h1, . . . , h2n) = RH is a stem assignment of S to T where, for every edge ej identified with integer j = fi = fi′ from the string representation T = (f1, . . . , f2n), there exists 1 ≤ l ≤ k such that hi = sl, hi′ = s̄l, and |sl| = wj. Let Constraints(S, α, T) be the following set of constraints on S, with respect to T and α. There is one constraint in Constraints(S, α, T) for each T′ ∈ Tn,m(T). Let I′ be the set of pairs of indices (i, i′) such that hi and hi′ correspond to a mismatched edge in T′. We know that 2 ≤ |I′| ≤ m. Let I be the corresponding set of matched index pairs. Thus if (i, i′) ∈ I′ then there exists in I either (i, j) or (j, i) and either (j′, i′) or (i′, j′) where the helices hi, hj and hj′, hi′ are correctly matched in T. The constraint for T′ is then:

[ Σ_{(k,l) ∈ I} H(k, l) − Σ_{(i,i′) ∈ I′} H(i, i′) ] + [EL(R, T) − EL(R, T′)] ≤ 0
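The Catalan growth that motivates restricting to Tn,m(T) is easy to tabulate. A quick sketch:

```python
# The number of plane trees with n edges is the nth Catalan number
# C_n = (1/(n+1)) * binomial(2n, n), so naive maximization over all of
# T_n is exponential in n.

from math import comb

def catalan(n):
    return comb(2 * n, n) // (n + 1)

# C_1 through C_7: 1, 2, 5, 14, 42, 132, 429
```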
Additionally, constraints are needed to handle trees T′ ∈ Tn \ Tn,m(T). Let J′ be a set of m pairs of indices (j, j′) where 1 ≤ j ≤ j′ ≤ 2n, each j and j′ occurs in at most one pair in J′, and the helical segments hj and hj′ are not matched in T. Let J be the corresponding multiset of matched index pairs, where for (j, j′) ∈ J′ there exists in J either (i, j) or (j, i) and either (j′, i′) or (i′, j′) where the helices hi, hj and hj′, hi′ are correctly matched in T. We note that for some correctly matched hi, hj, if both i and j each appear in a mismatched pair in J′, then the pair i, j occurs twice in J. Then for each possible J′ we have the following constraint:

[ Σ_{(k,l) ∈ J} H(k, l) − 2 Σ_{(j,j′) ∈ J′} H(j, j′) ] + [EL(R, T) − M(R, T)] ≤ 0
Theorem 3. Let T ∈ Tn be the specification of a restricted RNA secondary structure with stem lengths w1, . . . , wn and loop segments RL = (l0, . . . , l2n) with li ∈ {A}∗. Suppose that S is a set of strings over {C, G} with a stem assignment α(S, T) = (h1, . . . , h2n) = RH. Suppose all constraints in Constraints(S, α, T) are satisfied. Then the helix-preserving minimum free energy configuration of R has the plane tree structure T.

Finally, in terms of optimizing over such stem assignments, we note that one possible approximation strategy would require that |S| = n and naively assign strings and their Watson-Crick complements solely on the basis of equality between string length and edge weight. Although these methods for increasing the algorithmic efficiency may force a higher quality value than strictly necessary, they achieve a significant improvement in the running time of an implementation.
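The naive length-based assignment mentioned in the closing remark can be sketched directly; the greedy matching below and its names are our own illustration, not the authors' procedure.

```python
# Illustrative sketch of the naive strategy: with |S| = n, assign
# strings to edges purely by matching string length to edge weight.
# Function and variable names are assumptions.

def naive_stem_assignment(strings, weights):
    """Greedily pair each edge weight w_j with an unused string of
    length w_j; returns None if the lengths cannot be matched up."""
    unused = list(strings)
    assignment = []
    for w in weights:
        for s in unused:
            if len(s) == w:
                assignment.append(s)
                unused.remove(s)
                break
        else:
            return None          # no string of the required length left
    return assignment
```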
10 Conclusions and Future Work
In this paper, we studied the problem of designing RNA sequences with a given secondary structure, under a standard model of free energy minimization. For a restricted case, we derived conditions on the base sequences assigned to the local helical structures (stems) of the desired structure that can be satisfied using word design strategies. Hence, we have effectively reduced this special case of the RNA secondary structure design problem to a code design question. In future work, we will relax the restrictions imposed for these initial results on the secondary structure inputs, RNA sequence outputs, and possible refoldings. In particular, we will study cases which allow arbitrary types of bases in any loop and more general stem composition. We will also consider possible alternate structures that are not helix-preserving. We expect that these extensions may require the use of “capping strategies,” which additionally stabilize loops by restricting the initial and terminal bases of the helix segments. We will also investigate efficient algorithmic solutions to the RNA word design questions arising from RNA structure design problems. While known DNA word design methods and results from coding theory provide a good starting point for this endeavor, we anticipate that additional specific techniques will be
Christine E. Heitsch, Anne E. Condon, and Holger H. Hoos
needed in cases where the target structure is difficult to stabilize (e.g., because of very short helices or highly asymmetric bulges). Finally, based on this work and its future extensions, we expect to obtain a much better understanding of the class of RNA secondary structures that can be designed easily and efficiently.
Stochastic Local Search Algorithms for DNA Word Design
Dan C. Tulpan, Holger H. Hoos, and Anne E. Condon
Department of Computer Science, University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada
{dctulpan,hoos,condon}@cs.ubc.ca
http://www.cs.ubc.ca/labs/beta
Abstract. We present results on the performance of a stochastic local search algorithm for the design of DNA codes, namely sets of equal-length words over the nucleotide alphabet {A, C, G, T} that satisfy certain combinatorial constraints. Using empirical analysis of the algorithm, we gain insight into good design principles. We report several cases in which our algorithm finds word sets that match or exceed the best previously known constructions.
Keywords: DNA Word Design, Combinatorics, Stochastic Local Search, Coding Theory, Hamming Distance, GC Content
1 Introduction
The design of DNA code words, or sets of short DNA strands that satisfy combinatorial constraints, is motivated by the tasks of storing information in DNA strands used for computation or as molecular bar-codes in chemical libraries [3,4,8,20]. Good word design is important in order to minimise errors due to nonspecific hybridisation between distinct words and their complements, to achieve a higher information density, and to obtain large sets of words for large-scale applications. For the types of combinatorial constraints typically desired, there are no known efficient algorithms for the design of DNA word sets. Techniques from coding theory have been applied to the design of DNA word sets [4,10]. While valuable, this approach is hampered by the complexity of the combinatorial constraints on the word sets, which are often hard to reason about theoretically. For these reasons, heuristic approaches such as stochastic local search offer much promise in the design of word sets.
This material is based upon work supported by the U.S. National Science Foundation under Grant No. 0130108, and by the Natural Sciences and Engineering Research Council of Canada.
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 229–241, 2003. © Springer-Verlag Berlin Heidelberg 2003
Stochastic local search (SLS) algorithms make heavy use of randomised decisions while searching for solutions to a given problem. They play an increasingly important role in solving hard combinatorial problems from various domains of Artificial Intelligence and Operations Research, such as satisfiability, constraint satisfaction, planning, and scheduling. Over the past few years there has been considerable success in developing stochastic local search algorithms as well as randomised systematic search methods for solving these problems, and to date, stochastic search algorithms are amongst the best known techniques for many problem domains. Detailed empirical studies are crucial for the analysis and development of such high-performance stochastic search techniques. Stochastic search methods have been used successfully for decades in the construction of good binary codes (see for example [7]). Typically, the focus of this work is on finding codes of size greater than the best previously known bound, and a detailed empirical analysis of the search algorithms is not presented. Stochastic search techniques have also been applied to the design of DNA word sets (see Section 3). However, some algorithmic details are not specified in these papers. In addition, while small sets of code words produced by the algorithms have been presented (and the papers make other contributions independent of the word design algorithms), little or no analysis of algorithm performance is provided. Consequently, it is not possible to extract general insights on the design of stochastic algorithms for code design or to compare these approaches in detail with other algorithms. Our goal is to understand which algorithmic principles are most effective in the application of stochastic local search methods to the design of DNA or RNA word sets (and, more generally, codes over other alphabets, particularly the binary alphabet).
Towards this end, we describe a simple stochastic local search algorithm for design of DNA codes, and analyze its performance using an empirical methodology based on run-time distributions [15]. In this study, we have chosen to design word sets that satisfy one or more of the following constraints: Hamming distance (HD), GC content (GC), and reverse complement Hamming distance (RC). We define these constraints precisely in Section 2. Our reason for considering these constraints is that there are already some constructions for word sets satisfying these constraints, obtained using both theoretical and experimental methods, with which we can compare our results. (In follow-up work, we will apply our methods to other, more realistic constraints.) Our algorithm, described in detail in Section 4, takes as input the desired word length and set size, along with a specification of which constraints the set should satisfy, and attempts to find a set that meets these requirements. The algorithm performs stochastic local search in a space of DNA word sets of fixed size that may violate the given constraints, using an underlying search strategy that is based on a combination of randomised iterative improvement and conflict-directed random walk. The basic algorithm is initialised with a randomly selected set of DNA words. Then, repeatedly, a conflict, that is, a pair of words that violates a constraint, is selected and resolved by modifying one of the respective
words. The algorithm terminates if a set of DNA words that satisfies all given constraints is found, or if a specified number of iterations have been completed. The performance of this algorithm is primarily controlled by a so-called noise parameter that determines the probability of greedy vs. random conflict resolution. Interestingly, optimal settings of this parameter appear to be consistent for different problem instances (word set sizes) and constraints. Our empirical results, reported in Section 5, show that the run-time distributions that characterise our algorithm’s performance on hard word design problems often have a “fat” right tail. This indicates a stagnation of the underlying search process that severely compromises performance. As a first approach to overcome this stagnation effect, we extended our algorithm with a mechanism for diversifying the search by occasional random replacement of a small fraction of the current set of DNA words. Empirically, this modification eliminates the stagnation behavior and leads to substantial improvements in the performance of our algorithm. We compared the sizes of the word sets obtainable by our algorithm with previously known word sets, starting with the already extensively studied case of word sets that satisfy the Hamming distance constraint only. Out of 42 tests, our algorithm was able to find a word set whose size matches the best known theoretical construction in 16 cases. We also did several comparisons with word sets that satisfy at least two of our three constraints, for which again previous results were available. Out of a total of 42 comparisons with previous results that satisfy Hamming distance and Reverse Complement constraints, we found word sets that improved on previous constructions in all but one case.
2 Problem Description
The DNA code design problem that we consider is: given a target k and word length n, find a set of k DNA words, each of length n, satisfying certain combinatorial constraints. A DNA word of length n is simply a string of length n over the alphabet {A, C, G, T}, and naturally corresponds to a DNA strand, with the left end of the string corresponding to the 5' end of the DNA strand. The constraints that we consider are:
– Hamming Distance Constraint (HD): for all pairs of distinct words w1, w2 in the set, H(w1, w2) ≥ d. Here, H(w1, w2) denotes the Hamming distance between words w1 and w2, namely the number of positions i at which the ith letter in w1 differs from the ith letter in w2.
– GC Content Constraint (GC): a fixed percentage of the nucleotides within each word is either G or C. Throughout, we assume that this percentage is 50%.
– Reverse Complement Hamming Distance Constraint (RC): for all pairs of DNA words w1 and w2 in the set, where w1 may equal w2, H(w1, wcc(w2)) ≥ d. Here, wcc(w) denotes the Watson-Crick complement of DNA word w, obtained by reversing w, then replacing each A in w by T and vice versa, and replacing each C in w by G and vice versa.
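The three constraint definitions above translate directly into simple predicates. The following is an illustrative Python sketch under the definitions just given; the function names and the combined `satisfies` check are ours, not the authors' implementation.

```python
def hamming(w1, w2):
    """Number of positions at which w1 and w2 differ (words of equal length)."""
    return sum(a != b for a, b in zip(w1, w2))

def wcc(w):
    """Watson-Crick complement: reverse w, then swap A<->T and C<->G."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(w))

def satisfies(words, d, gc_fraction=0.5):
    """Check the HD, GC and RC constraints for a set of equal-length words."""
    n = len(words[0])
    # GC content constraint: a fixed fraction of each word must be G or C.
    if any(w.count("G") + w.count("C") != int(gc_fraction * n) for w in words):
        return False
    for i, w1 in enumerate(words):
        for j, w2 in enumerate(words):
            if i < j and hamming(w1, words[j]) < d:
                return False          # HD: distinct pairs only
            if i <= j and hamming(w1, wcc(w2)) < d:
                return False          # RC: w1 may equal w2
    return True
```

Note that the RC constraint is checked even for a word against itself, exactly as the definition allows w1 = w2.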
We note that there are many alternative formulations of the problem of designing DNA words that better capture the chemical motivation of ensuring that the DNA words behave well in a laboratory experiment (see for example [11]). For example, most formulations insist that the Hamming distance between shifted copies of words is also above a threshold. Even better than designing for Hamming constraints is to take thermodynamic constraints directly into account, so that the designed words have uniformly high melting temperatures and the melting temperatures of mismatched pairs of words are relatively low [19]. However, we believe that the simple constraints above are best suited to the goals of the work presented here, for the following reasons. First, in order to understand whether our algorithmic techniques can find large sets of words, it is very useful to be able to compare the set sizes output by our algorithm with previous constructions, and this is possible here. Second, there are cases where one does not need to take shifts into account in a word design, for example when spacers or word labels are used between words [10] or when relatively short words are concatenated to form word labels [4]. In these cases, designs according to the three constraints above are already directly useful. Third, it may be that in order to design sets of words that satisfy more realistic thermodynamic constraints, a good initial starting point is a set designed using the simpler combinatorial constraints above (see for example [2]). Finally, it is our hope that once we have identified good stochastic search principles for DNA word design for the above constraints, these principles will also be effective in designing word sets that satisfy other constraints; this is a research direction that we are actively pursuing.
The huge number of candidate sets that must be explored in order to find a large set of words suggests the use of non-exhaustive, heuristic search algorithms for solving problems of this type.
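To quantify how large this search space is, one can count the candidate words and sets directly. The sketch below is our own illustration (not from the paper): under the 50% GC constraint there are C(n, n/2) · 2^n words of length n, since the n/2 GC positions can be chosen freely and every position then has two admissible letters.

```python
from math import comb

def num_words_with_half_gc(n):
    """DNA words of length n with exactly n/2 letters in {G, C}:
    choose the GC positions, then 2 choices (G or C) at each of them
    and 2 choices (A or T) elsewhere, i.e. C(n, n/2) * 2**n."""
    assert n % 2 == 0
    return comb(n, n // 2) * 2 ** n

def num_candidate_sets(n, k):
    """Number of k-element sets of such words."""
    return comb(num_words_with_half_gc(n), k)
```

Already for word length n = 8 there are 17 920 admissible words, and the number of 100-word candidate sets exceeds 10^200, far beyond exhaustive search.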
3 Related Work
Deaton et al. [5,6] and Zhang and Shin [21] describe genetic algorithms for finding DNA codes that satisfy much stronger constraints than the HD and RC constraints, in which the number of mismatches must exceed a threshold even when the words are aligned with shift. However, they do not provide a detailed analysis of the performance of their algorithms. Hartemink et al. [12] used a computer algorithm for designing word sets that satisfy yet other constraints, in which a large pool (several billion) of strands was screened in order to determine whether they meet the constraints. Several other researchers have used computer algorithms to generate word sets (see for example [3]), but provide no details on the algorithms. Some DNA word design programs are publicly available. The DNASequenceGenerator program [9] designs DNA sequences that satisfy certain subword distance constraints and, in addition, have melting temperature or GC content within prescribed ranges. The program can generate DNA sequences de novo, or integrate partially specified words or existing words into the set. The PERMUTE program was used to design the sequences of Faulhammer et al. [8] for their RNA-based 10-variable computation.
4 A Stochastic Local Search Algorithm
Our basic stochastic local search algorithm performs a randomised iterative improvement search in the space of DNA word sets; this overall approach is similar to WalkSAT, one of the best known algorithms for the propositional satisfiability problem [18,13], and Casanova, one of the best-performing algorithms for the winner determination problem in combinatorial auctions [14]. All word sets constructed during the search process have exactly the target number of words. Additionally, when using the GC constraint, we ensure that all words in the given candidate set always have the prescribed GC content. The other two combinatorial constraints considered in this paper, the HD and RC constraints, are binary predicates over DNA words that have to be satisfied between any pair of words from a given DNA set. The function our algorithm tries to minimise is the number of pairs that violate the given constraint(s). Figure 1 gives a general outline of our basic algorithm. In the following, we describe this algorithm in more detail. The initial set of words is determined by a simple randomised process that generates any DNA word of length n with equal probability. If the GC content constraint is used, we check each word that is generated and only accept words with the specified GC content. This works well for 50% GC content, and typical designs use a GC content close to 50%, but it could easily be made more efficient for less balanced GC contents if needed. Our implementation also provides the possibility to initialise the search with a given set of DNA words (such sets could be obtained using arbitrary other methods); if such a set has fewer than k words, it is expanded with randomly generated words so that a set of k words is always obtained. Note that the initial word set may contain multiple copies of the same word. In each step of the search process (i.e.
one execution of the inner for loop from Figure 1), first, a pair of words violating one of the Hamming distance constraints is selected uniformly at random. Then, for each of these words, all possible single-base modifications are considered. As an example of a single-base modification, take the code word ACTT of length 4. A new code word GCTT can be obtained by replacing the letter A in the first code word with the letter G. When the GC content constraint is used, this is restricted to modifications that maintain the specified GC content. For a pair of words of length n without the GC content constraint, this yields 6n new words. (Some of these might be identical.) With a fixed probability θ, one of these modifications is accepted uniformly at random, regardless of the number of constraint violations that will result from it. In the remaining cases, each modification is assigned a score, defined as the net decrease in the number of constraint violations caused by it, and a modification with maximal score is accepted. (If there are multiple such modifications, one of
procedure Stochastic Local Search for DNA Word Design
  input: number of words (k), word length (n), set of constraints (C)
  output: set S of k words that fully or partially satisfies C
  for i := 1 to maxTries do
    S := initial set of words
    Ŝ := S
    for j := 1 to maxSteps do
      if S satisfies all constraints then return S end if
      randomly select words w1, w2 ∈ S that violate one of the constraints
      M1 := all words obtained from w1 by substituting one base
      M2 := all words obtained from w2 by substituting one base
      with probability θ do
        select word w from M1 ∪ M2 at random
      otherwise
        select word w from M1 ∪ M2 such that the number of constraint violations is maximally decreased
      end with probability
      if w ∈ M1 then replace w1 by w in S else replace w2 by w in S end if
      if S has no more constraint violations than Ŝ then Ŝ := S end if
    end for
  end for
  return Ŝ
end Stochastic Local Search for DNA Word Design
Fig. 1. Outline of a general stochastic local search procedure for DNA word design.
them is chosen uniformly at random.) Note that using this scheme, in each step of the algorithm, exactly one base in one word is modified. The parameter θ, also called the noise parameter, controls the greediness of the search process; for high values of θ, constraint violations are not resolved efficiently, while for low values of θ, the search has more difficulty escaping from local optima of the underlying search space. Throughout the run of the algorithm, the best candidate solution encountered so far, i.e., the DNA word set with the fewest constraint violations, is memorised. Note that even if the algorithm terminates without finding a valid set of size k, a valid subset can always be obtained by iteratively selecting pairs of words that violate a Hamming distance constraint and removing one of the two words
involved in that conflict from the set. Hence, a word set of size k with t constraint violations can always be reduced to a valid set of size at least k − t. Hamming distances between words and/or their reverse complements are not recomputed in each iteration of the algorithm; instead, they are computed once after generating the initial set, and updated after each search step. This can be done very efficiently, since any modification of a single word can only affect the Hamming distances between this word and the k − 1 remaining words in the set. The outer loop of our algorithm can perform multiple independent runs of the underlying search process. In conjunction with randomised search initialisation, multiple independent runs can yield better performance; essentially, this is the case if there is a risk for the search process to stagnate, i.e., to get stuck in a region of the underlying search space that it is very unlikely to escape from. In this context, the parameter maxSteps, which controls the number of steps after which the search is aborted and possibly restarted from a new initial word set, can have an important impact on the performance of the algorithm; this will become apparent in our experimental results presented in Section 5. The simple SLS algorithm presented above can easily be enhanced with an additional diversification mechanism that is based on occasional random replacement of small subsets of code words. In this extended algorithm, after performing a search step, i.e., after each iteration of the inner loop of Figure 1, we check whether more than nsteps steps have been performed since the last improvement over the best candidate solution found so far. Whenever this is the case, we take it as an indication of search stagnation and perform a diversification step by replacing a fixed fraction fr of the code words with words that are newly generated at random (this is done exactly as during search initialisation).
After this, we continue as in the simple SLS algorithm; we generally make sure that between any two diversification steps at least nsteps simple search steps are performed. The values nsteps and fr are additional parameters of the extended algorithm.
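The basic procedure outlined in Figure 1 can be rendered as a compact, runnable sketch. The code below is our own illustration, restricted to the HD constraint only (no GC or RC checks, no diversification, no incremental distance updates); all function and parameter names (`theta`, `max_steps`, etc.) are ours, not from the paper.

```python
import random

BASES = "ACGT"

def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))

def violations(words, d):
    """All pairs (i, j), i < j, whose Hamming distance is below d."""
    return [(i, j) for i in range(len(words)) for j in range(i + 1, len(words))
            if hamming(words[i], words[j]) < d]

def neighbours(word):
    """All words obtained from `word` by substituting one base (3n of them)."""
    return [word[:p] + b + word[p + 1:]
            for p in range(len(word)) for b in BASES if b != word[p]]

def sls_word_design(k, n, d, theta=0.2, max_steps=10000, seed=0):
    rng = random.Random(seed)
    words = ["".join(rng.choice(BASES) for _ in range(n)) for _ in range(k)]
    best = list(words)
    for _ in range(max_steps):
        conflicts = violations(words, d)
        if not conflicts:
            return words                       # valid set of size k found
        i, j = rng.choice(conflicts)           # random violated pair
        moves = [(i, w) for w in neighbours(words[i])] + \
                [(j, w) for w in neighbours(words[j])]
        if rng.random() < theta:               # noise: random walk step
            pos, w = rng.choice(moves)
        else:                                  # greedy: fewest violations
            def resulting_violations(move):
                pos, w = move
                trial = words[:pos] + [w] + words[pos + 1:]
                return len(violations(trial, d))
            pos, w = min(moves, key=resulting_violations)
        words[pos] = w
        if len(violations(words, d)) <= len(violations(best, d)):
            best = list(words)                 # memorise best set so far
    return best                                # may still violate constraints

```

For clarity, this sketch recomputes all pairwise violations when scoring each candidate move; as the text above notes, a real implementation would update distances incrementally, since modifying one word only affects its distances to the other k − 1 words.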
5 Results
To evaluate the performance of our SLS algorithm (and its variants), we performed two types of computational experiments. Detailed analyses of the run-time distributions of our algorithm on individual problem instances were used to study the behavior of the algorithm and the impact of parameter settings. For these empirical analyses, the methodology of [15] for measuring and analyzing run-time distributions (RTDs) of Las Vegas algorithms was used. Run-time was measured in terms of search steps, and absolute CPU time per search step was measured to obtain a cost model of these search steps. The other type of experiment used the optimised parameter settings obtained from the detailed analyses to obtain DNA word sets of maximal size for various word lengths and combinatorial constraints. Introducing noise into our algorithmic approach, i.e., using probabilistic moves when making decisions, provides robustness to our algorithm and allows it to escape from local minima encountered during search. Thorough experimentation
Fig. 2. RTDs for SLS algorithm. Left side: HD and GC constraints (different set sizes k ∈ {100, 120, 140}); right side: HD, GC, RC constraints (SLS with and without random replacement).
(not reported here due to restricted space) shows that the setting of the noise parameter, θ, has a substantial effect on the performance of our algorithm: the time required for solving a given problem instance can easily vary by more than an order of magnitude depending on the noise parameter setting used. Somewhat surprisingly, empirically optimal settings of θ are consistently close to 0.2 for different problem instances and sizes; consequently, this value was used for all experimental results reported here. The absolute CPU time for each search step was measured on a PC with two 1 GHz Pentium III CPUs, 512 MB cache and 1 GB RAM running Red Hat Linux 7.1 (kernel 2.4.9). The obtained values range between 0.0006 CPU seconds for HD and GC constraints and 0.0079 CPU seconds for HD and RC constraints, both using set size k = 100, word length n = 8, and Hamming distance d = 4.

5.1 RTD Analysis
To study and characterise the behavior of our algorithm, we measured RTDs from 1000 runs (i.e. maxTries = 1000) of the algorithm applied to individual problem instances, using extremely high settings of the cutoff parameter, maxSteps, to ensure that a solution was found in each run without using random restarts. For each run, we recorded the number of search steps (i.e. executions of the inner for loop of Figure 1) required for finding a solution. From this data, the RTD gives the probability of success as a function of the number of search steps performed. As can be seen in Figure 2 (left side), for given constraints, n, and d, the time required for obtaining a word set of size k with a fixed probability p increases with k for all p. Furthermore, for high p, this increase is much more dramatic than for low p, as can be seen in the “fat” right tails of the RTDs. This indicates that our simple SLS algorithm, even when using optimised noise settings, can suffer from stagnation behavior that compromises its performance for long run-times. The easiest way to overcome this stagnation effect is to restart the algorithm after a fixed number maxSteps of search steps. Optimal values of maxSteps can be easily
determined from the empirical RTD data (see [15]). When solving a problem instance for the first time, however, this method is not applicable (unless the RTD can be estimated a priori), and the actual performance depends crucially on the value used for maxSteps. It is therefore desirable to find different mechanisms for ameliorating or eliminating the observed search stagnation. The random replacement variant of our simple SLS algorithm described at the end of the previous section was designed to achieve this goal, and in many cases it substantially reduces the stagnation effect (Figure 2 shows a typical example), leading to a significant improvement in the robustness and performance of our algorithm. However, as can be shown by comparing the empirical RTDs with appropriately fitted exponential distributions (see Figure 2, right side), even this modification does not completely eliminate the stagnation behavior.

5.2 Quality of Word Sets Obtained by Our Algorithm
Using the improvements and parameter settings obtained from the detailed analysis described above, we used our SLS algorithm to compute word sets for a large number of DNA code design problems. The corresponding results can be summarised as follows (detailed results are tabulated in the Appendix):
HD constraint only. There is a significant body of work on the construction of word sets over an alphabet of size 4 that satisfy the Hamming distance constraint only, with the best bounds summarised by Bogdanova et al. [1]. We used our algorithm to design word sets in which the word lengths ranged from 4 to 12, with the minimum Hamming distance between pairs of words ranging from 4 to 10. Out of 42 tests, our algorithm was able to find a word set whose size matches the best known theoretical construction in 16 cases.
HD+RC constraints. In this case, we considered word lengths ranging from 4 to 12, and minimum Hamming and reverse complement Hamming distances between pairs of words ranging from 4 to 10. Out of 42 tests, our algorithm was able to find a word set whose size matched or exceeded the theoretical results of Marathe et al. [17] in 41 cases.
HD+GC constraints. We obtained word sets satisfying the Hamming distance and the 50% GC content constraints for word lengths ranging from 4 up to 20, and minimum Hamming distances between pairs of words ranging from 2 to 10. Compared to the results reported by Corn et al. [16], we obtained larger sets in all but two cases, namely word length 8 with Hamming distance 4, and word length 12 with Hamming distance 6.
HD+RC+GC constraints. We obtained word sets for word lengths ranging from 4 to 12, and minimum Hamming and reverse complement Hamming distances between pairs of words ranging from 4 to 10.
There was no body of previous results that we could compare with, except in one case, namely for words of length 8 with 50% GC content that satisfy the Hamming and reverse complement constraints with at least 4 mismatches between pairs of words. For this case, Frutos et al. [10] constructed a set of 108 words of length 8. We were not able to obtain any set of 108 DNA words satisfying these constraints using
our algorithm when initialised with a random initial set. The biggest set found had 92 code words. However, when we initialised our algorithm with the set of 108 words obtained by Frutos et al., along with one additional randomly chosen word, we were able to obtain a valid set of size 109. Continuing this incremental improvement procedure, we were able to construct sets containing 112 words in less than one day of CPU time. Interestingly, this set has the same general “template-map” structure as the set of Frutos et al.
6 Conclusions
We have presented a new stochastic local search algorithm for DNA word design, along with empirical results that characterise its performance and indicate its ability to find high-quality sets of DNA words satisfying various combinations of combinatorial constraints. In future work, we will examine ways to further improve our algorithm. One possibility is to consider a more diverse neighbourhood instead of the rather small neighbourhood that we currently explore (which is based on single-base modifications). We conjecture that, particularly when the GC constraint is imposed, the current small neighbourhood structure limits the effectiveness of the algorithm. Another possibility is to consider more complex stochastic local search strategies, which we expect to achieve improved performance and likely lead to larger word sets. In another direction of future work, we will test our methods on different word design constraints, including those based on thermodynamic models. While modified or additional constraints can be accommodated quite easily by relatively minor modifications of our current SLS algorithm and its variants, different algorithmic strategies might be more efficient for solving these word design problems. Finally, it would be interesting to see whether better theoretical design principles can be extracted from the word sets that we have now obtained empirically. In the construction of classical codes, theory and experiment are closely linked, and we expect that the same can be true for the construction of DNA codes.
References 1. Galina T. Bogdanova, Andries E. Brouwer, Stoian N. Kapralov & Patric R. J. Ostergard, Error-Correcting Codes over an Alphabet of Four Elements, Andries Brouwer ([email protected]) 2. A. Ben-Dor, R. Karp, B. Schwikowski, and Z. Yakhini, “Universal DNA tag systems: a combinatorial design scheme,” Proc. RECOMB 2000, ACM, pages 65-75. 3. R.S. Braich, C. Johnson, P.W.K. Rothemund, D. Hwang, N. Chelyapov, and L.M. Adleman, “Solution of a satisfiability problem on a gel-based DNA computer,” Preliminary Proc. Sixth International Meeting on DNA Based Computers, Leiden, The Netherlands, June, 2000. 4. S. Brenner and R. A. Lerner, “Encoded combinatorial chemistry,” Proc. Natl. Acad. Sci. USA, Vol 89, pages 5381-5383, June 1992.
5. R. Deaton, R. C. Murphy, M. Garzon, D. R. Franceschetti, and S. E. Stevens, Jr., “Good encodings for DNA-based solutions to combinatorial problems,” Proc. DNA Based Computers II, DIMACS Workshop June 10-12, 1996, L. F. Landweber and E. B. Baum, Editors, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 44, 1999, pages 247-258.
6. R. Deaton, M. Garzon, R. C. Murphy, J. A. Rose, D. R. Franceschetti, and S. E. Stevens, Jr., “Genetic search of reliable encodings for DNA-based computation,” in J. R. Koza, D. E. Goldberg, D. B. Fogel, and R. L. Riolo (editors), Proceedings of the First Annual Conference on Genetic Programming, 1996.
7. A. A. El Gamal, L. A. Hemachandra, I. Shperling, and V. K. Wei, “Using simulated annealing to design good codes,” IEEE Transactions on Information Theory, Vol. IT-33, No. 1, January 1987.
8. D. Faulhammer, A. R. Cukras, R. J. Lipton, and L. F. Landweber, “Molecular computation: RNA solutions to chess problems,” Proc. Natl. Acad. Sci. USA, 97:1385-1389.
9. U. Feldkamp, W. Banzhaf, and H. Rauhe, “A DNA sequence compiler,” poster presented at the 6th International Meeting on DNA Based Computers, Leiden, June 2000. See also http://ls11-www.cs.uni-dortmund.de/molcomp/Publications/publications.html (visited November 11, 2000).
10. A. G. Frutos, Q. Liu, A. J. Thiel, A. M. W. Sanner, A. E. Condon, L. M. Smith, and R. M. Corn, “Demonstration of a word design strategy for DNA computing on surfaces,” Nucleic Acids Research, Vol. 25, No. 23, December 1997, pages 4748-4757.
11. M. Garzon, R. J. Deaton, J. A. Rose, and D. R. Franceschetti, “Soft molecular computing,” Preliminary Proc. Fifth International Meeting on DNA Based Computers, June 14-15, MIT, 1999, pages 89-98.
12. A. J. Hartemink, D. K. Gifford, and J. Khodor, “Automated constraint-based nucleotide sequence selection for DNA computation,” 4th Annual DIMACS Workshop on DNA-Based Computers, Philadelphia, Pennsylvania, June 1998.
13. H. H. Hoos, Stochastic Local Search - Methods, Models, Applications, infix-Verlag, Sankt Augustin, Germany, ISBN 3-89601-215-0, 1999.
14. H. H. Hoos and C. Boutilier, “Solving combinatorial auctions using stochastic local search,” in Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 22-29, Austin, TX, 2000.
15. H. H. Hoos and T. Stützle, “Evaluating Las Vegas algorithms - pitfalls and remedies,” in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98), 1998, pages 238-245.
16. M. Li, H-J. Lee, A. E. Condon, and R. M. Corn, “DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays,” Langmuir, 18, 2002, pages 805-812.
17. A. Marathe, A. Condon, and R. Corn, “On combinatorial DNA word design,” J. Computational Biology, 8:3, 2001, pages 201-220.
18. D. McAllester, H. Kautz, and B. Selman, “Evidence for invariants in local search,” Proc. AAAI-97, pages 321-326, Providence, RI, 1997.
19. J. A. Rose, R. Deaton, D. R. Franceschetti, M. Garzon, and S. E. Stevens, Jr., “A statistical mechanical treatment of error in the annealing biostep of DNA computation,” Special program in DNA and Molecular Computing at the Genetic and Evolutionary Computation Conference (GECCO-99), Orlando, FL, July 13-17, 1999, Morgan Kaufmann.
240
Dan C. Tulpan, Holger H. Hoos, and Anne E. Condon
Appendix

In this appendix, we report the sizes of word sets obtained by the SLS algorithm presented in this paper. These data illustrate the performance of our algorithm in detail and will be useful as a baseline for future work. Most of the word sets are publicly available at http://www.cs.ubc.ca/∼dctulpan/papers/dna8/tables. The data on our word sets are grouped into four tables showing the results for different constraint combinations, as follows: HD (Table 1), HD+GC (Table 2), HD+RC (Table 3), and HD+RC+GC constraints (Table 4). In these tables, entries in bold face match or exceed the best previously known lower bounds. All other values are smaller than the known lower bounds for the corresponding (n, d) values. Dashes (“–”) indicate cases of limited or no interest for DNA word design, for which we did not run our algorithm. Along with each maximal word set size found by our algorithm, we also report the corresponding median number of steps our algorithm spent to find that set (in square brackets, [. . .], typically specified in thousands of search steps). Each table entry is based on between 5 and 20 runs of our algorithm; the precise number of runs varied with problem size and solution quality; median run times are taken over successful runs only.
[Table 1 data could not be recovered from the extraction; entries give the maximal word set sizes found, with median numbers of search steps in square brackets, for word lengths n = 4–12 and Hamming distance parameters d = 4–12.]

Table 1. Empirical bounds on HD quaternary codes.
[Table 2 data could not be recovered from the extraction; entries give maximal word set sizes, with median numbers of search steps in square brackets, for n-d combinations from 4-2 to 20-10.]

Table 2. Empirical bounds on (HD,GC) quaternary codes for different n-d combinations.
[Table 3 data could not be recovered from the extraction; entries give maximal word set sizes, with median numbers of search steps in square brackets, for word lengths n = 4–12 and Hamming distance parameters d = 4–12.]

Table 3. Empirical bounds on (HD,RC) quaternary codes.
[Table 4 data could not be recovered from the extraction; entries give maximal word set sizes, with median numbers of search steps in square brackets.]

Table 4. Empirical bounds on (HD,RC,GC) quaternary codes. For one (n, d) entry we found a better bound, namely 112 code words, as described in Section 5.2.
NACST/Seq: A Sequence Design System with Multiobjective Optimization

Dongmin Kim, Soo-Yong Shin, In-Hee Lee, and Byoung-Tak Zhang

Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
Seoul 151-742, Korea
{dmkim,syshin,ihlee,btzhang}@bi.snu.ac.kr
http://bi.snu.ac.kr/
Abstract. The importance of DNA sequence design for reliable DNA computing is well recognized. In this paper, we describe NACST/Seq, a DNA sequence optimization system based on a multiobjective genetic algorithm. It uses the concept of Pareto optimization to flexibly reflect many realistic characteristics of DNA sequences in real biochemical experiments. This feature allows the system to recommend multiple candidate sets, as well as to generate DNA sequences that better fit a specific DNA computing algorithm. We also describe a DNA sequence analyzer that can examine and visualize the properties of given DNA sequences.
1
Introduction
By using biomolecules as basic computing or storage media, DNA computing gains massive parallelism and some useful features such as self-assembly. However, the chemical characteristics of the materials introduce drawbacks into the computing process. Deaton and Garzon, for example, identified various types of errors that lead to false positives in Adleman’s original technique [1,2]. To address these drawbacks, they also gave a theoretical bound on the size of problems that can be solved reliably [3], introduced a new measure of hybridization likelihood based on Hamming distance, and proposed a theory of error-preventing codes for DNA computing [4]. Since then, many researchers have proposed various algorithms and methods for reliable DNA sequence design. These methods fall into two different approaches. One is the deterministic approach. Marathe et al. [5] proposed a dynamic programming approach based on Hamming distance and free energy. Frutos et al. [6] proposed a template-map strategy to select a huge number of dissimilar sequences. Hartemink et al. [7] implemented the program “SCAN” to generate sequences for programmed mutagenesis using an exhaustive search method. Feldkamp et al. proposed another sequence construction system, “DNASequenceGenerator” [8], using a directed graph.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 242–251, 2003.
© Springer-Verlag Berlin Heidelberg 2003
A second approach to sequence design is to use evolutionary algorithms. Arita et al. [9] developed a DNA sequence design system using a genetic algorithm and a random generate-and-test algorithm. Tanaka et al. [10] listed some useful sequence fitness criteria and then generated sequences by simulated annealing with these criteria. Ruben et al. developed “PUNCH” [12], which employs a kind of genetic algorithm for sequence optimization. And Deaton et al. proposed a DNA-based evolutionary search method [13]. The review above allows us to consider DNA sequence design as a numerical optimization problem, given well-defined fitness measures. Based on previous work [14,11], we formulate DNA sequence design as a multiobjective optimization problem that is then solved by a genetic algorithm. This is implemented as a component (NACST/Seq) of the DNA computing simulation system NACST (Nucleic Acid Computing Simulation Toolkit). NACST/Seq is especially useful in its ability to generate reliable codes and to give the user the flexibility of choosing optimal codes. The latter feature is due to the Pareto-optimal set of candidate solutions produced by the multiobjective evolutionary algorithm. In addition, for the analysis and visualization of DNA sequence properties, we also developed NACST/Report, another component of NACST. Section 2 describes the algorithm of NACST/Seq. Section 3 presents the software architecture of NACST. Sequence generation examples are shown in Section 4, and future work is discussed in Section 5.
2
Sequence Design by Multiobjective Evolutionary Optimization
Table 1. Objectives used by NACST/Seq.

Objective    Description
Similarity   similarity between two sequences
H-measure    degree of unexpected hybridization between two sequences
3'-end       “H-measure” at the 3'-end of a sequence
GC Ratio     degree of difference from the target G, C proportion
Continuity   degree of successive occurrence of the same base
Hairpin      likelihood of forming secondary structure
Tm           degree of difference from the target melting temperature

As mentioned before, DNA sequence design can be considered as a numerical optimization problem. Moreover, it involves simultaneous optimization of multiple objectives. Most current sequence generation systems use a classical multiobjective optimization method (e.g. objective weighting) or a single-objective optimization method. But in many cases there may not exist such a solution
as the best with respect to all objectives involved. Therefore, it may be useful for a multiobjective optimization algorithm to recommend a set of solutions in which each solution is superior to all the rest in one or more objectives. The fitness terms for sequence optimization are described in Table 1. These terms were originally summarized by Tanaka [10]; we refined them into more detailed numerical formulae for NACST/Seq [11]. With these seven objectives, sequence optimization can be written formally as follows. Let Λ = {A, C, G, T}, let Λ* denote all possible sequences, let A (⊂ Λ*) be the generated sequence pool, and let x ∈ A. With

f_i ∈ { f_GCratio, f_Tm, f_Hairpin, f_3'-end, f_Continuity, f_H-measure, f_Similarity },

the problem is to

minimize F(x) = (f_1(x), f_2(x), . . . , f_n(x)).    (1)
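Two of the simpler objective terms can be illustrated in code. The Python definitions below (GC-ratio deviation, and continuity measured as the longest run of one repeated base) are simplified stand-ins for exposition only; the exact numerical formulae used by NACST/Seq are given in [11].

```python
def f_gc_ratio(seq: str, target: float = 0.5) -> float:
    """Deviation of the sequence's G+C fraction from a target fraction."""
    gc = sum(1 for b in seq if b in "GC") / len(seq)
    return abs(gc - target)

def f_continuity(seq: str) -> int:
    """Length of the longest run of a single repeated base (penalizes
    successive occurrences of the same base)."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

print(f_gc_ratio("ATGCATGC"))  # 0.0: exactly half G/C
print(f_continuity("AAATGC"))  # 3: a run of three A's
```

As in equation (1), lower values are better, so both terms are written as quantities to be minimized.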
NACST/Seq finds the sequence pools A such that, for every other pool A', there exists an objective i (i = 1, 2, . . . , n) with f_i(A) < f_i(A'). The set of such pools A in each generation step is called the first Pareto front of that generation. Fig. 1 shows the multiobjective genetic algorithm implemented in NACST/Seq. The optimization algorithm is based on NSGA (nondominated sorting genetic algorithm) [15]. The original NSGA differs from simple genetic algorithms only in the selection operator, but we additionally customized the crossover and mutation operators to reflect the modified population architecture of NACST/Seq. In simple genetic algorithms, a population represents the feasible space, and each individual is usually expressed as a bit string. In NACST/Seq, however, each individual represents a pool of sequences, and the evolutionary operators must be applied to each sequence for the whole population to evolve. Therefore, the crossover and mutation operators are divided into two steps: one operates at the individual level (step 1), the other at the sequence level (step 2). To improve performance, we employ an elite strategy, as shown in the last step in Fig. 1, and remove the sharing parameter used in the original nondominated sorting procedure by making the selection operator work on the rank of the Pareto fronts. More detailed explanations can be found in [16].
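The first Pareto front at the heart of this selection scheme can be computed generically as follows. This is an illustrative sketch under minimization, not the NACST/Seq implementation itself:

```python
def dominates(f, g):
    """f dominates g if f is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(a <= b for a, b in zip(f, g)) and any(a < b for a, b in zip(f, g))

def first_pareto_front(fitnesses):
    """Indices of the individuals that no other individual dominates."""
    return [i for i, f in enumerate(fitnesses)
            if not any(dominates(g, f) for j, g in enumerate(fitnesses) if j != i)]

# Three individuals, two objectives: the third is dominated by the first.
print(first_pareto_front([(1, 3), (2, 1), (2, 4)]))  # [0, 1]
```

Ranking the population by repeatedly extracting and removing the current front yields the nondominated sorting that replaces NSGA's sharing parameter here.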
3
NACST in Action
NACST consists of four independent components: NACST/Data, NACST/Seq, NACST/Report, and NACST/Sim. NACST/Data allows the user to import, export, and edit the generated sequences. NACST/Report shows statistics of the sequences and plots graphs from these data. NACST/Sim carries out DNA computing in silico. In this paper, we focus on NACST/Seq, with a brief description of NACST/Report.

Before introducing NACST/Seq, we state its design requirements. First, the fitness measures should be subdivided into fitness terms fine enough to capture the various characteristics of DNA sequences. Second, the sequence generation algorithm should be adaptable to various combinations of fitness terms. Third, users should be able to choose and combine objectives flexibly. Finally, the system should present the generation results with information sufficient to help users make decisions.

[Fig. 1 flowchart: initialize the population; merge the current population with the global Pareto set and calculate the fitness of each individual; classify the population by repeatedly identifying the current nondominated front; select the fitter individuals and apply crossover and mutation until the population is filled; merge the current first Pareto front with the global elites; repeat until the population has evolved sufficiently.]

Fig. 1. Multiobjective genetic algorithm implemented in NACST/Seq.

3.1
NACST/Seq
The sequence generation part of NACST, NACST/Seq, is implemented in C++ on the Linux platform and adopts a plug-in architecture that makes it possible to develop each fitness plug-in separately and ensures future extensibility. In other words, newly defined fitness plug-ins can be added to NACST/Seq without altering existing code, and applied to the sequence generation process at run time without recompiling the whole system.

The sequence generation steps are shown in Fig. 2. The first step is to select the generation option. In this step, the user can choose among “generate new sequences”, “generate sequences and add them to an existing sequence pool”, and “add a sequence to a pool manually”. The first option generates new sequences and new pools, the second takes existing sequences into account when generating new ones, and the last lets the user add an existing sequence to a pool by hand.

Fig. 2. Sequence generation process in NACST/Seq.

The sequence structure, normal or hairpin, can be selected in the second option window. The “normal” option prevents the generation of self-complementary sequences, while the “hairpin” option does the opposite, because some DNA computing procedures intentionally require the formation of secondary structure [17]. Then the general sequence option window appears, in which the number of sequences and the length of each sequence are adjusted. Next, the fitness option window provides functionality for combining and weighting the objectives. If a selected fitness term needs additional arguments, the user can call up sub-windows for tuning them. For example, “Melting Temperature” requires a choice between the GC-ratio and NN (nearest-neighbor) methods; in addition, the oligo and Na+ concentrations must be provided if the user selects the NN method. In this step, many properties of the generated sequences can be decided. Finally, the options for the genetic algorithm are determined. These include the generation number, the population size, and the crossover and mutation rates. After
execution of the sequence generation, the main window shows the resulting sequences with their melting temperatures and GC ratios.

3.2
NACST/Report
Another application of NACST is the analysis of sequence pools. Fig. 3 displays the analysis functions of NACST/Report.
Fig. 3. Functions in NACST/Report.
After the generation process described in the previous section, the results are saved in a file. NACST/Report then loads these results (Fig. 3-1). In fact, NACST/Report can load any sequence pool saved in its format and analyze various aspects of the loaded sequence pools (Figs. 3-2, -3, -4). In window 2 (Fig. 3-2), one can compare all sequence pools by the fitness values measured with each objective used in the optimization procedure. Window 3 (Fig. 3-3) provides a graphical representation of which of two selected pools has the better fitness value for each sequence. Finally, one can investigate the properties of a single pool in window 4 (Fig. 3-4). For instance, it can highlight
the positions of a specific subsequence in a pool, find all subsequences complementary to a user's input sequence, and mark every run of the same base longer than a threshold. These features can be put into practice, for example, in predicting and analyzing sequence properties for PCR experiments.
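These pool-analysis checks amount to simple string searches. A generic sketch follows; the function names are illustrative, not NACST/Report's actual interface:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_subsequence(seq, pattern):
    """All start positions of `pattern` in `seq` (overlaps included)."""
    return [i for i in range(len(seq) - len(pattern) + 1)
            if seq[i:i + len(pattern)] == pattern]

def complementary_positions(seq, query):
    """Positions where the reverse complement of `query` occurs in `seq`,
    i.e. sites that could hybridize with `query`."""
    rc = query.translate(COMPLEMENT)[::-1]
    return find_subsequence(seq, rc)

def long_runs(seq, threshold):
    """(start, length) of each maximal run of one base longer than `threshold`."""
    runs, i = [], 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i > threshold:
            runs.append((i, j - i))
        i = j
    return runs

print(complementary_positions("TTACGTT", "ACGT"))  # [2] (ACGT is self-complementary)
print(long_runs("AAAATGCCC", 2))                   # [(0, 4), (6, 3)]
```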
4
Working Examples
To demonstrate the working of the algorithm, we investigated some examples. First, we scored the sequences from [1] using NACST/Seq. As expected, the “good codes” get better (lower) scores. More details about this test can be found in [11].

Table 2. Fitness of the codes in Deaton's paper.

Sequence (5'→3')        H-measure  3'-end  Similarity  Continuity  Hairpin  GC%   Tm
Good codes
CTTGTGACCGCTTCTGGGGA       82        21       34          16          3      60   62.83
CATTGGCGGCGCGTAGGCTT       50        49       15           0          3      65   71.06
ATAGAGTGGATAGTTCTGGG       19        14       41           0          6      45   54.90
GATGGTGCTTAGAGAAGTGG       34         9       20           0          0      50   57.50
TGTATCTCGTTTTAACATCC       46         3        5          16         11      35   50.59
GAAAAAGGACCAAAAGAGAG       19         2        2          41          0      40   54.89
TTGTAAGCCTACTGCGTGAC      (row partially lost in extraction: 0, 6, 60, 58.87)
Bad codes
ATCAGCTGGATTCATCTGAA      179        95      177           0          9      40   55.94
ATCAACAGAAATCCGCGGAA      139        67       74           9          6      45   61.64
ATCAGCTGAGGTCTGGTGAG      102        10       87           0         12      55   59.78
GTCCGCTGTATTCTCGTGAT       48        11       32           0          0      50   59.24
TTCAACTGTTTTCAGCTGTG       51        14       38          16         10      40   52.69
TTCACCTTTATTGAGCCGAA       29        12       20           9          9      40   53.83
TTCAGCCGATTTGCGGAGAA      (row partially lost in extraction: 9, 19, 50, 61.06)
We then generated the vertex sequences for the traveling salesman problem (see [11]) using all objectives. The genetic algorithm parameters were: population size 200, generation number 1000, crossover rate 0.97 for both steps 1 and 2, and mutation rate 0.3. Fig. 4 shows the evolution of the fitness values. As the generations go on, the algorithm finds individuals more suitable for each objective. In particular, for the hairpin, melting temperature, and GC ratio objectives, optimal individuals are found in early steps, i.e. each of these fitness values reaches zero. But with respect to the average value of all objectives, which is usually regarded as the measure in single-objective optimization methods, NACST/Seq shows relatively weak optimizing power. We conjecture that the cause of this phenomenon is conflict between objectives: since one objective trades off against others, an individual optimized for one objective loses fitness in the other objectives, reducing the improvement in the average value of the objectives. To confirm this connection between objectives, we repeated the generation process with selected objectives considered as conflicting ones.
[Figures 4–7: plot data (runs exp01–exp03) not recoverable from the extraction.]

Fig. 4. Fitness values over generation.
Fig. 5. Continuity vs. H-measure.
Fig. 6. H-measure vs. Similarity.
Fig. 7. 3'-end vs. Similarity.
Fig. 5 shows that there is only a weak relation between continuity and H-measure, so the algorithm works efficiently there. Fig. 6 and Fig. 7 depict the other cases; the results confirm our expectation that there are conflicts between objectives. Because similarity is strongly inversely related to H-measure and 3'-end, NACST/Seq explores multiple solutions trading off these objectives instead of needlessly searching for a single global one. This fact implies that classical approaches to the problem suffer some difficulties (e.g. biased optimization). Table 3 shows the results of analyzing the TSP vertex generation using window 2 in Fig. 3. NACST/Report can also evaluate a sequence pool designed by other sequence generators or by human experts. The values listed in the last row of Table 3 come from [11]; they were obtained with a single-objective evolutionary algorithm using the sum of the objectives. As shown in Table 3, NACST/Seq finds many alternative sequence pools better than that of the single-objective optimization method.
Table 3. Fitness of the generated vertex sequences for TSP.

No.  GC%   Tm       Continuity  Hairpin  H-measure  Similarity  3'-end
NACST/Seq
0     5     4.7827      86         32       2033      1982.5      291
1    35     0           45         24       2107      1903.5      255
2    35    11.6674       0         22       2081      1948.5      225
3    65    13.6196      61          0       2160      1852.5      282
4    55    24.5350     196          3       1223      2658.5      116
5    95    24.7548     131         22       2182      1683.0      304
6    70    26.2218     111         15       1435      2549.0       75
7    35     7.3607      43          3       1848      2054.0      204
8    30     0.4267      36          6       1957      1980.0      217
9    45    12.8784      18          7       1929      2021.5      182
Single-objective evolutionary algorithm with sum of fitness values
0    35     4.6080      61         16       1383      2575.0       80

5
Discussion
In this paper, we described the evolutionary sequence generator NACST/Seq and the analyzer NACST/Report, implemented as components of the DNA computing simulator NACST (Nucleic Acid Computing Simulation Toolkit). We formulated sequence design as a multiobjective optimization problem and used a nondominated sorting procedure to generate multiple candidate sequence pools. NACST/Seq can generate promising DNA sequences and lets the user choose the sequences most suitable for specific DNA experiments. NACST/Report provides analysis and visualization of sequence properties, which allows the user to investigate sequences in silico before real biochemical experiments. Work in progress includes a simulation system (NACST/Sim) for DNA computing that considers the thermodynamics of DNA sequences. We also plan to verify the practical usefulness of the sequences generated by NACST/Seq in real biochemical experiments. Additionally, the objective functions will be refined based on a physical model of DNA fidelity.

Acknowledgement. This research was supported in part by the Ministry of Education under the BK21-IT Program and by the Ministry of Commerce through the Molecular Evolutionary Computing (MEC) Project. The RIACT at Seoul National University provided research facilities for this study.
References 1. R. Deaton, R. C. Murphy, M. Garzon, D. R. Franceschetti, and S. E. Stevens Jr., “Good encodings for DNA-based solutions to combinatorial problems,” in Proceedings of the Second Annual Meeting on DNA Based Computers, 1996.
2. R. Deaton, R. C. Murphy, M. Garzon, D. R. Franceschetti, and S. E. Stevens Jr., “Genetic search of reliable encodings for DNA based computation,” Late-Breaking Papers at the First Genetic Programming Conference, pp. 9-15, 1996.
3. R. Deaton, M. Garzon, J. A. Rose, D. R. Franceschetti, R. C. Murphy, and S. E. Stevens Jr., “Reliability and efficiency of a DNA-based computation,” Physical Review Letters, vol. 80, no. 2, pp. 417-420, 1998.
4. M. Garzon, P. Neathery, R. Deaton, R. C. Murphy, D. R. Franceschetti, and S. E. Stevens Jr., “A new metric for DNA computing,” in Proceedings of Genetic Programming 1997, The MIT Press, pp. 472-478, 1997.
5. A. Marathe, A. E. Condon, and R. M. Corn, “On combinatorial DNA word design,” in Proceedings of 5th DIMACS Workshop on DNA Based Computers, pp. 75-89, 1999.
6. A. G. Frutos, A. J. Thiel, A. E. Condon, L. M. Smith, and R. M. Corn, “DNA computing at surfaces: 4 base mismatch word design,” in Proceedings of 3rd DIMACS Workshop on DNA Based Computers, p. 238, 1997.
7. A. J. Hartemink, D. K. Gifford, and J. Khodor, “Automated constraint-based nucleotide sequence selection for DNA computation,” in Proceedings of 4th DIMACS Workshop on DNA Based Computers, pp. 227-235, 1998.
8. U. Feldkamp, S. Saghafi, and H. Rauhe, “DNASequenceGenerator - A program for the construction of DNA sequences,” in Preliminary Proceedings of 7th International Workshop on DNA-Based Computers, pp. 179-188, 2001.
9. M. Arita, A. Nishikawa, M. Hagiya, K. Komiya, H. Gouzu, and K. Sakamoto, “Improving sequence design for DNA computing,” in Proceedings of Genetic and Evolutionary Computation Conference 2000, pp. 875-882, 2000.
10. F. Tanaka, M. Nakatsugawa, M. Yamamoto, T. Shiba, and A. Ohuchi, “Developing support system for sequence design in DNA computing,” in Preliminary Proceedings of 7th International Workshop on DNA-Based Computers, pp. 340-349, 2001.
11. S.-Y. Shin, D. Kim, I.-H. Lee, and B.-T.
Zhang, “Evolutionary sequence generation for reliable DNA computing,” in Congress on Evolutionary Computation 2002, 2002. (Accepted)
12. A. J. Ruben, S. J. Freeland, and L. Landweber, “PUNCH: an evolutionary algorithm for optimizing bit set selection,” in Preliminary Proceedings of 7th International Workshop on DNA-Based Computers, pp. 260-270, 2001.
13. R. Deaton, R. C. Murphy, J. A. Rose, M. Garzon, D. R. Franceschetti, and S. E. Stevens Jr., “A DNA Based Implementation of an Evolutionary Search for Good Encodings for DNA Computation,” in Proceedings of the 1997 IEEE International Conference on Evolutionary Computation, pp. 267-272, 1997.
14. B.-T. Zhang and S.-Y. Shin, “Code optimization for DNA computing of maximal cliques,” in Advances in Soft Computing - Engineering Design and Manufacturing, Springer, 1999.
15. N. Srinivas and K. Deb, “Multiobjective optimization using nondominated sorting in genetic algorithms,” Evolutionary Computation, vol. 2, no. 3, pp. 221-248, 1994.
16. S.-Y. Shin, D. Kim, I.-H. Lee, and B.-T. Zhang, “Multiobjective evolutionary algorithms to design error-preventing DNA sequences,” Parallel Problem Solving from Nature VII, 2002. (Submitted)
17. M. Hagiya, M. Arita, D. Kiga, K. Sakamoto, and S. Yokoyama, “Towards parallel evaluation and learning of Boolean µ-formulas with molecules,” DNA Based Computers III, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 48, pp. 57-72, 1999.
A Software Tool for Generating Non-crosshybridizing Libraries of DNA Oligonucleotides

Russell Deaton (1), Junghuei Chen (2), Hong Bi (2), and John A. Rose (3)

(1) Computer Science and Engineering, University of Arkansas, Fayetteville, AR, USA 72701, [email protected]
(2) Chemistry and Biochemistry, University of Delaware, Newark, DE, USA 19716, [email protected], [email protected]
(3) Department of Computer Science, University of Tokyo, Bioinformatics Project, Tokyo, JP, [email protected]
Abstract. Under an all or nothing hybridization model, the problem of finding a library of non-crosshybridizing DNA oligonucleotides is shown to be equivalent to finding an independent set of vertices in a graph. Individual oligonucleotides or Watson-Crick pairs are represented as vertices. Indicating a hybridization, an edge is placed between vertices (oligonucleotides or pairs) if the minimum free energy of hybridization, according to the nearest-neighbor model of duplex thermal stability, is less than some threshold value. Using this equivalence, an algorithm is implemented to find maximal libraries. Sequence designs were generated for a test of a modified PCR protocol. The results indicated that the designed structures formed as planned, and that there was little to no secondary structure present in the single-strands. In addition, simulations to find libraries of 10-mers and 20-mers were done, and the base composition of the non-crosshybridizing libraries was found to be 2/3 A-T and 1/3 G-C under high salt conditions, and closer to uniform for lower salt concentrations.
1
Introduction
A key operation for DNA computing (DNAC) is the template-matching hybridization reaction between oligonucleotides. To implement a successful DNA computation, the hybridizations should occur as designed. Otherwise, unplanned hybridizations (i.e. crosshybridizations) can occur, with several negative effects on a DNA computation, including false positives and negatives. Therefore, a first step in DNAC, which has been termed the DNA word design problem, is to find sequences that minimize crosshybridizations.

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 252–261, 2003.
© Springer-Verlag Berlin Heidelberg 2003

DNA word designs for computation must satisfy several requirements. First, the selected oligonucleotides should hybridize only as designed. Second, the set of words, or library, should be large enough to represent the problem and implement a solution. For small libraries, it is relatively easy to pick random sequences and check that they do not crosshybridize. As the size of the library grows, however, the constraints increasingly conflict. In other words, as additional oligonucleotides are added to the library, the subspace of non-crosshybridizing sequences is depleted, and it becomes increasingly likely that new library additions will result in crosshybridizations. At the DNA7 conference [1], DNA word design received increased emphasis [2,3,4]. In previous work on DNA word design [5,6], the measure of crosshybridization potential has ranged from Hamming and related distances [7,8,9] to duplex melting temperature [10,11] and thermal stability [12,13]. Most of these approaches produce adequate designs for small collections of oligonucleotides, though perhaps no better than a random encoding strategy [12]. In order for DNAC to solve large problems, however, large libraries of non-crosshybridizing DNA oligonucleotides are potentially required. The ongoing goal of the current work is to use computer simulation to study the characteristics of very large collections of many different DNA oligonucleotides.
Therefore, a DNA word design tool was implemented with the following features:
- the ability to simulate and generate large sets of non-crosshybridizing oligonucleotides efficiently,
- a basis in the nearest-neighbor model of DNA thermal stability,
- the capability to check sequences and their reverse complements,
- options for different reaction conditions, including temperature and salt and strand concentrations, and
- output of the free energies of hybridization, as well as melting temperatures and alignments of the most energetically stable duplexes.

The outline of the paper is as follows. In the first section, the equivalence of the DNA word design problem (DWD), under an all-or-nothing model of hybridization, and the independent set problem (ISET) from graph theory is shown. This equivalence suggests applying efficient algorithms for finding maximal independent sets to finding non-crosshybridizing libraries of DNA oligonucleotides. Next, a software tool is described that implements the suggested algorithms. Some experimental results are given to support the validity of the designs produced, and simulation results suggest the base composition of a non-crosshybridizing library under different reaction conditions. Finally, the method and results are discussed, and conclusions are given.
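The nearest-neighbor stability computation underlying several of these features can be sketched as follows. The stacking parameters below are approximate values in the spirit of the unified nearest-neighbor set, included for demonstration only; the actual tool also applies duplex initiation terms and salt/strand-concentration corrections, which are omitted here.

```python
# Illustrative nearest-neighbor stacking free energies (kcal/mol at 37 C);
# approximate values, not an authoritative parameter set.
NN_DG37 = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
    "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.81,
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def duplex_dg(seq: str) -> float:
    """Approximate free energy of `seq` bound to its exact Watson-Crick
    complement: the sum of stacking terms over adjacent base pairs."""
    total = 0.0
    for i in range(len(seq) - 1):
        step = seq[i:i + 2]
        if step not in NN_DG37:
            # a step and its reverse complement have the same stacking energy
            step = step.translate(COMPLEMENT)[::-1]
        total += NN_DG37[step]
    return total

print(round(duplex_dg("ACGTACGT"), 2))  # -10.68; more negative = more stable
```

Thresholding such pairwise free energies is what produces the all-or-nothing hybridization model of the next section.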
2
DWD Equivalence to ISET
The problem of finding a maximum-sized library of non-crosshybridizing DNA words, the DNA word design problem, may be expressed as follows:
Russell Deaton et al.
Definition 1 (DNA Word Design (DWD)) Given a set of DNA oligonucleotides T, a hybridization energy Jij = Jji ∈ Z− ∀ i, j ∈ T, a positive integer K ≤ |T|, and a threshold B ∈ Z−, does T contain a subset T′ ⊆ T such that |T′| ≥ K, and Jij ≥ B ∀ i, j ∈ T′?

In the above formulation, the constraint Jij ≥ B reflects a threshold value for hybridization energy, or an all-or-nothing hybridization model. The set T could be composed of individual oligonucleotides, or of pairs of Watson-Crick complements. Under restriction, the DWD problem is equivalent to finding a maximum independent set of vertices in a graph, which is NP-complete [14].

Definition 2 (Independent Set (ISET)) Given a graph G = (V, E) and a positive integer L ≤ |V|, does G contain a subset V′ ⊆ V such that |V′| ≥ L, and such that no two vertices in V′ are joined by an edge in E?

The equivalence is shown by letting T = V; Jij = −1 if (i, j) ∈ E and Jij = 0 otherwise; K = L; B = 0; and T′ = V′. There exists a greedy algorithm for finding maximal independent sets [15] (i.e., sets not properly contained in any other independent set), as opposed to maximum independent sets (i.e., maximal independent sets of largest size), whose computation is NP-complete. These algorithms were adapted to the DNA word problem to find maximal non-crosshybridizing sets of oligonucleotides, or libraries of oligonucleotides that are not properly contained in any other library. Let T represent the non-crosshybridizing library, and let N(T) indicate all those oligonucleotides that hybridize with a member of the library. Then, the algorithm [16] for an initial set of oligonucleotides of size m is shown in Figure 1.
begin
  T ← ∅
  for i = 1 to m do
    if i ∉ N(T) then T ← T ∪ {i}
end
Fig. 1. Greedy, sequential algorithm adapted from maximal independent set problems to find maximal non-crosshybridizing libraries of oligonucleotides.
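In code, the greedy sweep of Figure 1 amounts to a single pass over the candidate set; `hybridizes` here is a hypothetical predicate standing in for the free-energy threshold test described in Section 3, and the toy version below flags only exact reverse complements.

```python
def select_library(candidates, hybridizes):
    """Greedy maximal-independent-set sweep (Fig. 1).

    candidates: iterable of oligonucleotide sequences.
    hybridizes(a, b): True if a and b are predicted to crosshybridize;
    a placeholder for the thermodynamic threshold test of Section 3.
    """
    library = []
    for oligo in candidates:
        # Keep the oligo only if it does not hybridize with any member
        # already in the library, i.e. it is not in N(T).
        if all(not hybridizes(oligo, member) for member in library):
            library.append(oligo)
    return library

# Toy predicate: treat two sequences as crosshybridizing when one is
# the exact reverse complement of the other (illustration only).
def toy_hybridizes(a, b):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return a == "".join(comp[c] for c in reversed(b))

lib = select_library(["ACGT", "TTTT", "AAAA", "GGGG"], toy_hybridizes)
# AAAA is rejected because it is the reverse complement of TTTT.
```

Because the sweep is order-dependent, different orderings of the candidates can yield different (but always maximal) libraries.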
In the implementation, large random sets of oligonucleotides and their Watson-Crick complements are generated. Then, oligonucleotides are chosen in order, and
A Software Tool for Generating Non-crosshybridizing Libraries
added to the library if they are still available. All oligonucleotides that have a minimum energy of hybridization with the added sequence, or its complement, below some threshold are eliminated from further consideration. By repeating this process, a non-crosshybridizing library can be selected from the original random population. A similar procedure was used in [17], though without a full thermodynamic model. Thus, by casting the DNA word design problem as an independent set problem, not only is it shown to be NP-complete, but an efficient algorithm is also suggested for finding maximal libraries of non-crosshybridizing oligonucleotides.
3
Thermodynamic Calculations
The program uses the nearest-neighbor model of duplex thermal stability [18] to determine hybridization energies between oligonucleotides. The unified set of thermodynamic parameters from SantaLucia [18] is used to calculate the total free energy of hybridization (∆G◦) according to the formula

    ∆G◦ = Σ_i n_i ∆G◦_i + ∆G◦_GC(init) + ∆G◦_AT(init) − 0.114 N ln[Na+],    (1)
where the ∆G◦_i are the standard free energy changes for the nearest-neighbor pairs in the existing set of data, n_i is the number of occurrences of nearest-neighbor i, and ∆G◦_GC(init) and ∆G◦_AT(init) are initiation terms for G·C and A·T, respectively. A correction term for self-complementary duplexes is omitted from Eq. 1; this symmetry term is not included in the calculation since the chance of random oligonucleotides being self-complementary is small. In addition, a salt correction term [18] is applied, where N is the total number of phosphates in the duplex divided by 2 (approximately the length), and [Na+] is the salt concentration. Then, the algorithm of Figure 1 is implemented. Two oligonucleotides are determined to hybridize if their minimum free energy of duplex formation is less than a user-defined threshold. The minimum free energy of hybridization is computed using a variant of the Smith-Waterman dynamic programming algorithm [19] for finding local alignments. Dynamic programming was previously used in [7] to compute free energy values. The scoring function used to construct the matrix of energy values is

    ∆G◦[i][j] = min { ∆G◦[i][j−1] + g,  ∆G◦[i−1][j] + g,  ∆G◦[i−1][j−1] + ∆G◦_ij,  0 },    (2)

where ∆G◦[i][j] is the value of the free energy for the current duplex, g is a penalty applied to dangling ends, loops, bulges, and mismatches not in the parameter set, and ∆G◦_ij is the value of the current nearest-neighbor ij. The minimum energy value in the matrix is identified, which corresponds to the minimum free energy of hybridization over the local alignments of the two oligonucleotides.
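A sketch of the recursion in Eq. 2 follows, with min in place of Smith-Waterman's max and a ceiling of 0 to keep alignments local. The per-pair scoring function is a deliberate simplification of our own: the actual tool scores nearest-neighbor stacks with the unified parameters of [18], and the second sequence is assumed written in the orientation where positionwise complementarity corresponds to duplex formation.

```python
def min_hybridization_energy(s, t, pair_energy, g=0.5):
    """Minimum (most negative) free energy over local alignments of s
    against t, per the recursion of Eq. 2. g penalizes gaps, dangling
    ends, and other motifs outside the parameter set."""
    n, m = len(s), len(t)
    dg = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dg[i][j] = min(
                dg[i][j - 1] + g,                              # gap in s
                dg[i - 1][j] + g,                              # gap in t
                dg[i - 1][j - 1] + pair_energy(s[i - 1], t[j - 1]),
                0.0,                                           # local restart
            )
            best = min(best, dg[i][j])
    return best

WC = {"A": "T", "T": "A", "G": "C", "C": "G"}

def toy_pair_energy(a, b):
    # Placeholder: -1 kcal/mol per Watson-Crick pair, +1 otherwise.
    return -1.0 if WC[a] == b else 1.0

e = min_hybridization_energy("ACGT", "TGCA", toy_pair_energy)  # -4.0
```

Backtracking pointers (omitted here) recover the alignment itself, as the text describes for the real tool.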
In addition, values of enthalpy are recorded for melting temperature calculations according to

    T_M = T∆H◦/(∆H◦ − ∆G◦ + RT ln C_T) + 16.6 log10([Na+]/(1.0 + 0.7[Na+])) + 3.83,    (3)

where R is the gas constant, C_T is the total oligonucleotide concentration for non-self-complementary strands, ∆H◦ is the enthalpy change of hybridization, and the remaining terms are a salt correction [20]. Finally, by backtracking through the matrix according to recorded pointers, the alignment corresponding to the minimum free energy can be output to a file.
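Eq. 3 can be sketched directly; the default temperature of 23◦C matches the simulation conditions stated with Table 1, and the example inputs below are illustrative values, not data from the paper.

```python
import math

R = 1.987e-3  # gas constant, kcal/(mol K)

def melting_temperature(dH, dG, C_T, Na, T=296.15):
    """Eq. 3 as printed: two-state melting temperature plus the salt
    correction of [20]. dH and dG are in kcal/mol (dG evaluated at
    temperature T, in kelvin); C_T is total strand concentration (M);
    Na is [Na+] (M). The leading term is in kelvin, and any conversion
    to Celsius is left to the caller, since the paper leaves units
    implicit."""
    tm = T * dH / (dH - dG + R * T * math.log(C_T))
    tm += 16.6 * math.log10(Na / (1.0 + 0.7 * Na)) + 3.83
    return tm

# Illustrative duplex: dH = -150 kcal/mol, dG = -29 kcal/mol,
# 1 uM strands, 1 M NaCl.
tm = melting_temperature(-150.0, -29.0, 1e-6, 1.0)
```

Note that at 1 M NaCl the salt term (16.6 log10(1/1.7) ≈ −3.83) nearly cancels the +3.83 constant, so the two-state term dominates under the paper's standard conditions.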
4
Other Software Features
The program allows the creation of a random set of oligonucleotide sequences of specific length. The set of random sequences can be created with or without Watson-Crick complements, and according to user-specified probabilities for the individual bases. In addition, sequences can be input from a file, and then, optionally, their Watson-Crick complements generated. Other outputs of the program include the library sequences, matrices of their free energies of hybridization and melting temperatures, and parameters of the run. In addition, the software can compute the matrices of all pairwise free energies of hybridization and melting temperatures for a set of randomly generated oligonucleotides or those input from a file.
5
Results
The first application of the software was to design a set of template molecules to test a PCR protocol for selecting maximally mismatched DNA oligonucleotides [21]. The templates are shown in Figure 2 [21], and the individual sequences that compose the templates in Table 1. In addition, the melting temperatures and free energies of hybridization to the indicated oligonucleotide, according to the new software and an online thermodynamic simulator, HYTHER [22], are shown. There was approximate agreement between this tool and HYTHER. In Figure 3, a native gel with the template molecules is shown. Lanes 1 and 2 contain single strands of a 5′ and 3′ template. Usually, different single strands will run at different speeds in a gel because of secondary structure. In this case, the single strands ran approximately equally, indicating little secondary structure formation, as designed. In lanes 3-5, bands for the double-stranded templates indicate that the designed structures formed as planned, with little or no crosshybridization. Thus, the maximally mismatched template with a large internal loop ran slowest, followed by the template with a small internal loop. In addition, with a threshold of -6 kcal/mol, multiple 10-mer and 20-mer libraries were generated from a random starting population with a uniform distribution of bases at 23◦C and 33◦C, salt concentrations of 1 and 0.1 M, and uniform strand concentrations of 1 × 10−6 and 1 × 10−4 M. Then, the base
[Figure 2 appeared here: diagrams of templates T1-T4 assembled from segments P1, P2, M1-M4 and their reverse complements P1′, P2′, M1′, with X marking mismatch positions; the drawing is not reproducible in text.]

Fig. 2. Template configurations (T1-T4) for PCR experiments. X indicates a mismatch; ′ indicates the reverse Watson-Crick complement.

Hybridization   Sequence               ∆G(1)   ∆G(2)   Tm(1)  Tm(2)
P1/P1′          TCTTCATAAGTGATGCCCGT   -29.11  -29.82   69.1   71.3
P2/P2′          GAAAAAACACCCCTTCGATG   -28.7   -29.5    68.4   69.3
M1/M1′          AAAATACCCTCCCCCCTAGA   -28.5   -29.2    69.8   72.3
M2/M1′          TCTTGGGGGGAGGGTCTTTT   -19.9   -22.8    61.2   62.7
M3/M1′          TCTAGGTCAGAGGGTATTTT   -13.3   -16.8    37.1   48.2
M4/M1′          TCCTATAGTTTGATATAGAT    -1.2    -0.5     0      0

Table 1. Oligonucleotides that composed the template sequences used in the PCR protocol experiments. The first column indicates the hybridization to which the thermodynamic data refer; a (1) indicates this tool, and a (2) indicates HYTHER's data. ′ indicates the reverse Watson-Crick complement. The units for energy are kcal/mol and for temperature ◦C. The simulation conditions were 23◦C, 1 M NaCl, and 1 µM DNA concentration.
Fig. 3. A native gel of the four different templates formed by annealing two single-stranded DNAs. Lanes 1 and 2 are the top and bottom strands alone, respectively. Lanes 3 to 6 are the four templates, as indicated by the diagram on the right. M is the molecular weight marker.
AACAATCTTTTCAAGCTAAC GTTAGAGAGTAAATGTTAGG TGTCTCAACGATTACCCCCG TCAAATGAAGCTATTTTGTA ATGACCATATTAGGTAGTAG GTTCCGAAATAACAGAATCG CAGCGTCTTCCCTTAAGTAC CTTTCCCAGTAGAATTACAA GTTGGAATCACCTCTATGAT GTTTTTTGATATTTTAGTCG AGACTTTATGGATACCATTC CCTTTTTTTCGTATCGCTCC ACATTTTTCTACATCCACAT TTTATCATTATTACACTATC ACTAGACCAAGAAATTTAGA TTTCCTAATACTGCTTATAT TATGCTAGGTAAAAAATAAG CCTAAAGAACTCTTATTATT AGGAGAATCTTACTTCTACG AATGTATGAGTTTATTCTAA
TGTTTCTATCTAGGCGTGAT TACCGTAGTAAACTGTCTAC TACATGACGHAAGCCAAGGG CTTTGGATTTATCTTCGACA GATCCTATATCTTAATGCAC AACGAACCTTCTAGAGTATG GCACAATTAGGCACTAACCC GGACCCTGTATAACATACAA CATAAAAAGTTAATAAGTTA ATCAGTTGTTGTTAAATTAC ATTTTAAGACTATCTCTTAG CATACTTTGTAAGTAATTAT AGTAACTTCAACCATAGGCC GTATTAATTTCCATCTAAAA GGTCTCTGTACTTTCTGACT AGGTTTAATTAGTCAAATAG CTTCTCTATATAATATTTCA AGACATAATTTTATATACTC TCTTATAGATCCCGTACTGA TCATTCATATACAAGTTATC
Table 2. Non-crosshybridizing library of 40 Watson-Crick pairs. Only one sequence of the pair is shown. The simulation conditions were 23◦ C, 1 M NaCl, and 1 µM DNA concentration.
composition of the non-crosshybridizing libraries was analyzed, and it was found that the A-T composition of the libraries was approximately 2/3, and the G-C composition approximately 1/3 for all cases except 0.1 M NaCl. In that case, the distribution was closer to uniform, but still weighted toward A-T’s. An example of a library is shown in Table 2.
6
Discussion
Of course, the algorithm implemented computes a maximal non-crosshybridizing library, not the largest possible, or maximum, library. Nevertheless, the algorithm is fairly efficient and has generated a library of 3953 non-crosshybridizing Watson-Crick pairs of length 20 bp. Parallel algorithms exist for computing maximal independent sets [16, 23], and these are currently being investigated in order to parallelize the computation and generate even larger libraries. In the thermodynamics, only the minimum free energy of hybridization between two oligonucleotides is computed, not the full partition function. Though the dynamic programming algorithm can be used to compute the partition function [24], in order to handle the very large libraries envisioned here, a decision was made not to pay the extra computational cost of doing so. The program, however, could readily be adapted to compute the partition function. The reasoning behind limiting the calculation to the minimum free energy was that if it was
sufficiently small, then the probability of hybridization, in general, would also be small. Of course, there could be pathological cases in which many binding modes of approximately equal energy cause a significant probability of crosshybridization, and the current software would not account for that case. Other compromises were the threshold for hybridization and the penalty term for mismatches, dangling ends, bulges, and loops. The threshold for hybridization is set by the user. By observing melting temperatures, a threshold corresponding to GGGG/CCCC, which is about -6 kcal/mol, was found to be reasonable at room temperature. Nevertheless, the size of the library generated is highly dependent on the threshold. Therefore, it was left as a user-provided parameter. Instead of a complete model of loops and bulges, and in order to simplify the thermodynamic model and calculation, each loop, bulge, or mismatch not in the set of data was penalized with a constant, user-defined value. This probably accounts for the differences between the thermodynamic results and those of HYTHER in Table 1. By studying the alignments produced by the tool, it was found that the local dynamic programming method produced single duplex regions containing very few internal loops, bulges, or mismatches. Thus, the duplexes generated by the tool were consistent with a modified staggered zipper model [12] of duplex thermal stability. In addition, under conditions of 1 M salt and for the larger libraries, the frequencies of occurrence of the Watson-Crick nearest-neighbor pairs were inversely proportional to their thermodynamic stability. This reflects the general structure of the words, in which less stable A-T rich regions are separated by individual or short G-C regions. This was to be expected from the threshold criterion, which allowed matches of lower stability as long as they were broken up by mismatch regions.
Most of the compromises made were for the sake of efficiency and simplicity of implementation. The goal of the tool is not a complete thermodynamic simulator, for which other tools exist [22, 12], but to supply a speedy design tool for large libraries of DNA words for computation. Nevertheless, as indicated by the experimental results and the comparison with HYTHER, the tool provides reasonable values. Moreover, the tool neither maximizes nor makes uniform the hybridization energies of the Watson-Crick complements, nor accounts for crosshybridization in ligation regions. The non-crosshybridizing libraries that are selected represent a starting pool of oligonucleotides from which these additional constraints would have to be satisfied. By adjusting the threshold criterion, however, the relative probability of cross versus Watson-Crick hybridizations can be determined by computing a ratio of Boltzmann factors, which, for the libraries simulated to this point, heavily favors the Watson-Crick hybridizations. In addition, the work reported here is in support of an effort to manufacture non-crosshybridizing libraries in vitro [21]. Thus, the simulated libraries were constrained to match those generated in the test tube.
7
Conclusion
Frequently, a set of mutually non-interacting (non-crosshybridizing) DNA oligonucleotides is required for specific applications, such as DNA words for nanostructures or DNA tags for universal oligonucleotide arrays. A software tool for generating non-crosshybridizing oligonucleotides has been developed and tested. The minimum free energy for duplex formation between two given oligonucleotides is calculated using a unified set of nearest-neighbor thermodynamic parameters [18] and a dynamic programming algorithm that computes the minimum energy over all possible local alignments of the two oligonucleotides. The libraries are selected from an initial random population by applying a greedy algorithm, inspired by the computation of maximal independent sets, to a graph in which oligonucleotides are vertices and hybridizations are edges. The tool was used to design sequences for an experiment testing a PCR protocol for the selection of maximally mismatched oligonucleotides, and no indications of crosshybridization or secondary structure were observed. The tool was also used to generate non-crosshybridizing libraries of 10-mers and 20-mers. Under high salt conditions, it was found that these libraries had a base composition of approximately 2/3 A/T and 1/3 G/C, which is consistent with increased thermal stability at increasing salt concentrations. As the salt concentration decreased, the distribution became more uniform.
8
Acknowledgments
This work was supported by NSF Award EIA-0130385.
References

[1] Jonoska, N., Seeman, N.C., eds.: Preliminary Proceedings of the 7th International Meeting on DNA Based Computers, Tampa, FL, University of South Florida (2001). June 10-13, 2001.
[2] Hussini, S., Kari, L., Konstantinidis, S.: Coding properties of DNA languages. In [1], 107-118.
[3] Feldkamp, U., Saghafi, S., Rauhe, H.: DNASequenceGenerator - a program for the construction of DNA sequences. In [1], 179-188.
[4] Hinze, T., Hatnik, U., Sturm, M.: An object oriented simulation of real occurring molecular biological processes for DNA computing and its experimental verification. In [1], 13-22.
[5] Brenner, S.: Methods for sorting polynucleotides using oligonucleotide tags. U.S. patent number 5,604,097 (1997)
[6] Li, M., Lee, H.J., Condon, A.E., Corn, R.M.: DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir 18 (2002) 805-812
[7] Marathe, A., Condon, A.E., Corn, R.M.: On combinatorial DNA word design. In Winfree, E., Gifford, D.K., eds.: DNA Based Computers V, Providence, RI, DIMACS, American Mathematical Society (1999) 75-90. DIMACS Workshop, Massachusetts Institute of Technology, Cambridge, MA, June 14-16, 1999.
[8] Deaton, R., Garzon, M., Rose, J.A., Franceschetti, D.R., Murphy, R.C., Stevens Jr., S.E.: Reliability and efficiency of a DNA based computation. Phys. Rev. Lett. 80 (1998) 417-420
[9] Garzon, M., Deaton, R., Neathery, P., Murphy, R.C., Stevens Jr., S.E., Franceschetti, D.R.: A new metric for DNA computing. In: Genetic Programming 1997: Proceedings of the Second Annual Conference, AAAI (1997) 479-490. Stanford University, July 13-16, 1997.
[10] Hartemink, A.J., Gifford, D.K.: Thermodynamic simulation of deoxyoligonucleotide hybridization for DNA. In [26], 25-39.
[11] Ben-Dor, A., Karp, R., Schwikowski, B., Yakhini, Z.: Universal DNA tag systems: A combinatorial design scheme. J. Comput. Biol. 7 (2000) 503
[12] Rose, J.A., Deaton, R.J., Hagiya, M., Suyama, A.: The fidelity of the tag-antitag system. In [1], 302-310.
[13] Rose, J.A., Deaton, R.J., Franceschetti, D.R., Garzon, M., Stevens, Jr., S.E.: A statistical mechanical treatment of error in the annealing biostep of DNA computation. In: Proceedings of the Genetic and Evolutionary Computation Conference, Volume 2, AAAI, Morgan Kaufmann, San Francisco (1999) 1829-1834. Orlando, FL, July 1999.
[14] Garey, M.R., Johnson, D.S.: Computers and Intractability. Freeman, New York (1979)
[15] Erdős, P.: On the graph-theorem of Turán. Math. Lapok. 21 (1970) 249-251
[16] Karp, R.M., Wigderson, A.: A fast parallel algorithm for the maximal independent set problem. Journal of the Association for Computing Machinery 32 (1985) 762-773
[17] Yoshida, H., Suyama, A.: Solution of 3-SAT by breadth first search. In Winfree, E., Gifford, D.K., eds.: DNA Based Computers V, Providence, RI, DIMACS, American Mathematical Society (1999) 9-22. DIMACS Workshop, Massachusetts Institute of Technology, Cambridge, MA, June 14-16, 1999.
[18] SantaLucia, Jr., J.: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. 95 (1998) 1460-1465
[19] Smith, T.F., Waterman, M.S.: The identification of common molecular subsequences. J. Mol. Biol. 147 (1981) 195-197
[20] Wetmur, J.G.: Physical chemistry of nucleic acid hybridization. In [26], 1-25.
[21] Deaton, R., Chen, J., Bi, H., Garzon, M., Rubin, H., Wood, D.: A PCR-based protocol for in vitro selection of non-crosshybridizing oligonucleotides. In this volume.
[22] HYTHER. http://ozone.chem.wayne.edu (2002)
[23] Luby, M.: A simple parallel algorithm for the maximal independent set problem. SIAM Journal on Computing 15 (1986) 1036-1053
[24] McCaskill, J.S.: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29 (1990) 1105-1119
[25] Landweber, L.F., Baum, E.B., eds.: DNA Based Computers II. Volume 44, Providence, RI, DIMACS, American Mathematical Society (1998). DIMACS Workshop, Princeton, NJ, June 10-12, 1996.
[26] Rubin, H., Wood, D.H., eds.: DNA Based Computers III, Providence, RI, DIMACS, American Mathematical Society (1999). DIMACS Workshop, Philadelphia, PA, June 23-27, 1997.
Splicing Systems: Regularity and Below

Tom Head¹, Dennis Pixton¹, and Elizabeth Goode²

¹ Department of Mathematical Sciences, Binghamton University, Binghamton, New York, USA
² Mathematics Department, Towson State University, Towson, Maryland, USA
Abstract. The motivation for the development of splicing theory is recalled. Attention is restricted to finite splicing systems, which are those having only finitely many rules and finitely many initial strings. Languages generated by such systems are necessarily regular, but not all regular languages can be so generated. The splicing systems that arose originally, as models of enzymatic actions, have two special properties called reflexivity and symmetry. We announce the Pixton-Goode procedure for deciding whether a given regular language can be generated by a finite reflexive splicing system. Although the correctness of the algorithm is not demonstrated here, two propositions that serve as major tools in the demonstration are stated. One of these is a powerful pumping lemma. The concept of the syntactic monoid of a language provides sharp conceptual clarity in this area. We believe that there may be yet unrealized results to be found that interweave splicing theory with subclasses of the class of regular languages and we invite others to join in these investigations.
1
The Original Motivation for the Splicing Concept
The splicing concept was developed in the 1980s [12] following the first author's study of the first edition of B. Lewin's beautiful book Genes [17]. The sequential feature of the biological macromolecules was used to treat these molecules as material realizations of abstract character strings. The nucleic acids, proteins, and many additional polymers admit such string models. However, the detailed nature of the splicing concept arose from considerations of the cut & paste activity made possible through the action of restriction enzymes on double-stranded DNA molecules (dsDNA). There are currently more than 200 different restriction enzymes commercially available. These enzymes cut dsDNA at one covalent bond of each of the two sugar-phosphate backbones, within sub-segments having specific sequences. Such cuts sever the molecule, leaving two freshly cut ends that have the potential, in the presence of a ligase enzyme, to be joined with appropriately matching ends of the same or other DNA molecules. For a reader who is not familiar with the derivation of the abstract model of splicing from the biochemical processes that the model idealizes, we recommend reading
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 262-268, 2003. © Springer-Verlag Berlin Heidelberg 2003
the explanation given in [12], [15], or [20]. The original formalism for splicing systems [12] was rigidly derived from the biochemical processes being simulated. For thinking about models of molecular processes, there is still value in the original formalism. However, for proving formal theorems at a less restricted level of generality, the formal definition of splicing used here, which is essentially Gh. Paun's, has become standard.

Let A be a finite set to be used as an alphabet, and let A∗ be the set of all strings over A. By a language we mean a subset of A∗. A splicing rule is an element r = (u, v′, u′, v) of the product set (A∗)^4. A splicing rule r acts on a language L producing the language r(L) = {xuvy in A∗ : L contains strings xuv′q and pu′vy for some q, p in A∗}. For each set R of splicing rules we extend the definition of r(L) by defining R(L) = ∪{r(L) : r in R}. A rule r respects the language L if r(L) is contained in L, and a set R of rules respects L if R(L) is contained in L. By the radius of a splicing rule (u, v′, u′, v) we mean the maximum of the lengths of the strings u, v′, u′, v.

Definitions. A splicing scheme is a pair σ = (A, R), where A is a finite alphabet and R is a finite set of splicing rules. For each language L and each non-negative integer n, we define σ^n(L) inductively: σ^0(L) = L and, for each non-negative integer k, σ^(k+1)(L) = σ^k(L) ∪ R(σ^k(L)). We then define σ∗(L) = ∪{σ^n(L) : n ≥ 0}. A splicing system is a pair (σ, I), where σ is a splicing scheme and I is a finite initial language contained in A∗. The language generated by (σ, I) is L(σ, I) = σ∗(I). A language L is a splicing language if L = L(σ, I) for some splicing system (σ, I). A rule set R is reflexive if, for each rule (u, v′, u′, v) in R, the rules (u, v′, u, v′) and (u′, v, u′, v) are also in R. A rule set R is symmetric if, for each rule (u, v′, u′, v) in R, the rule (u′, v, u, v′) is in R.
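A minimal sketch of these definitions: `apply_rule` computes r(L) for a finite language, and `splice_closure` iterates the scheme toward σ∗(I), truncated at a length bound since the generated language is generally infinite. The example system (initial strings b and baa with the single rule (baa, λ, b, aa), λ written as the empty string) is our own illustration; within the bound it generates exactly the language b(aa)∗.

```python
def apply_rule(rule, language):
    """One application of r = (u, vp, up, v) to a finite language:
    r(L) = { x+u+v+y : x+u+vp+q in L and p+up+v+y in L }.
    (vp and up stand for v' and u'.)"""
    u, vp, up, v = rule
    lefts, rights = set(), set()
    for w in language:
        site1, site2 = u + vp, up + v
        i = w.find(site1)
        while i != -1:                          # w = x + u + vp + q
            lefts.add(w[:i] + u)                # keep x + u
            i = w.find(site1, i + 1)
        i = w.find(site2)
        while i != -1:                          # w = p + up + v + y
            rights.add(v + w[i + len(site2):])  # keep v + y
            i = w.find(site2, i + 1)
    return {l + r for l in lefts for r in rights}

def splice_closure(rules, initial, max_len):
    """sigma*(I) restricted to strings of length <= max_len."""
    lang = set(initial)
    while True:
        new = set()
        for r in rules:
            new |= {w for w in apply_rule(r, lang) if len(w) <= max_len}
        if new <= lang:
            return lang
        lang |= new

# Illustrative system generating b(aa)* up to the length bound.
lang = splice_closure([("baa", "", "b", "aa")], {"b", "baa"}, max_len=9)
```

Here the site "baa" can only occur at the start of a string of the form b a^(2n), which is what forces the even parity of the generated strings.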
When R is reflexive or symmetric we say the same of any scheme or system having R as its rule set. Reflexivity and symmetry are inherent features of splicing systems as defined originally in [12]. In fact, splicing systems that model the cut & paste action of restriction enzymes and a ligase are necessarily reflexive and symmetric as is easily confirmed by envisioning the activity of the enzymes and DNA molecules in solution. Consequently, from a modeling perspective, the most important splicing systems are those that are reflexive and symmetric. The motive for the introduction of the formal splicing concept was the establishment of a passageway between formal language theory and the biomolecular sciences. The most secure prediction was that formal representations of enzymatic actions would provide a novel stimulation for the development of language theory. This prediction has been confirmed in the later chapters of [20] and by continuing developments in progress by many theoretical computer scientists. The less secure prediction was that the development of formal theory would eventually yield results of value to biomolecular scientists. One might hope, for example, that the demonstration of the regularity of splicing languages will eventually be represented in software that accepts a list of enzymes and the sequence data for a list of DNA molecules and decides whether a specified DNA molecule
could arise through the action of the specified enzymes on the DNA molecules in the given list. The long-range hope has been that splicing theory would be the initial step in the development of a much broader approach to the modeling of important enzymatic processes using language theory.
2
The Regularity of Splicing Languages
Can finite splicing systems generate only a very restricted class of languages? This was the first question asked. Splicing theory, as defined both here and originally, is concerned with sets of strings over an alphabet, not multi-sets. One of the earliest results on splicing systems [9] showed that if splicing theory were interpreted to deal with multi-sets, then the action of each Turing machine could be simulated by the action of an appropriate splicing system. This result undermined confidence that splicing languages are always regular. Fortunately it was quickly announced [6, 7] that splicing languages are always regular. A later proof [22] gave an explicit construction followed by an induction on an insightfully specified inductive set. A slight reformulation of this proof appears in Chapter 6 of [20]. It had been noted early that not all regular languages are splicing languages [10]. That the regular languages (aa)∗ and a∗ba∗ba∗ are not splicing languages is easily confirmed. So, which regular languages are splicing languages? We would like to have a beautiful theorem that identifies the splicing languages with some crucial previously known class of regular languages, or at least some closely related class. As yet we have no such characterization. It is easily confirmed that every strictly locally testable (SLT) language is a splicing language [12]. (See [19] and [8] for the definition of SLT languages.) It is also known from examples [12] that even splicing languages that arise as explicit models of DNA behavior may fail to be SLT. The language b(aa)∗ is an abstract example of a splicing language that is neither SLT nor even aperiodic. (See [21] for the definition of aperiodic.) With no crisp characterization of the class of splicing languages as yet found, concern turned to the search for an algorithm for deciding whether a given regular language can be generated by a splicing system.
There is, of course, an easily described procedure that is guaranteed to discover that a regular language L is a splicing language if L is a splicing language: for each positive integer n, for each set R of rules of radius ≤ n, and for each subset I of L consisting of strings of length ≤ n, decide whether L(σ, I) = L, where σ = (A, R). Since both L and each such L(σ, I) are regular, all these steps can be carried out. The procedure terminates when a system (σ, I) with L(σ, I) = L is found, but fails to terminate when L is not a splicing language. From this triviality, however, it follows that an algorithm will become available immediately if, for each regular language L, a bound N(L) can be calculated for which it can be asserted that L cannot be a splicing language unless it is generated by a splicing system having rules of radius ≤ N(L) and initial strings of length ≤ N(L). We announce here such an algorithm, but we give only a skeleton of hints as to its justification. A complete treatment will soon be available from the latter two authors of the present article.
3
The Pixton-Goode Algorithm
Let L be a regular language and let M = (Q, I, F) be the minimal deterministic automaton recognizing L, where Q, I, and F are the sets of states, initial states, and final states of M, respectively. We denote by qx the state entered when M is in state q and the string x is read. A procedure for deciding whether L is a reflexive splicing language will be outlined after providing justifying comments for three observations: (a) we can decide whether a given splicing rule respects the regular language L; (b) we can adequately specify the set of all splicing rules that preserve L; (c) we can compute an upper bound for the radii of the required splicing rules. The reflexivity condition is required only to obtain (c). E. Goode proved in [11] that the regular language a∗ba∗ba∗ ∪ a∗ba∗ ∪ a∗ is a splicing language and also that it cannot be generated by any reflexive splicing system. Thus the reflexivity condition used in justifying (c) is significant. However, since the underlying molecular cut & paste activities modeled by splicing inevitably yield reflexive (and also symmetric) systems, this restriction does not seem severe.

(a) Observe that the rule r = (u, u′, v′, v) respects L = L(M) = L(Q, I, F) if and only if, for each ordered pair of states p, q of M, whenever L(Q, {puu′}, F) and L(Q, {qv′v}, F) are not empty, L(Q, {pu}, F) contains {vx : x in L(Q, {qv′v}, F)}.

(b) The syntactic congruence relation C in A∗ is defined by setting uCv if and only if, for every pair of strings x and y in A∗, either xuy and xvy are both in L or neither is in L. Since L is regular, the number of C-congruence classes is a positive integer, denoted here n(L). Suppose that a rule (u, u′, v′, v) respects L and that uCu″. It follows that (u″, u′, v′, v) also respects L: whenever a pair xu″u′y and wv′vz is in L, then by the definition of C, so is the pair xuu′y and wv′vz.
Then xuvz is in L and, by the definition of C, so is xu″vz, which confirms that (u″, u′, v′, v) respects L. This argument works in each of the four locations. Consequently, every rule in the set {(w, x, y, z) : wCu, xCu′, yCv′, zCv} respects L if and only if any single rule in the set respects L. (This observation, which establishes a provocative link between syntactic monoids and splicing, has been recorded independently in [11] and in [3], where it appears as Proposition 9.3.) From each of the n(L)^4 quadruples of syntactic classes determined by L in A∗, we choose one rule and test it as in (a) to determine whether it respects L. Each congruence class is itself a regular language; consequently, for each non-negative integer k, we can list all the strings of length at most k in the class. This allows us, for each non-negative integer k, to list, in a conceptually coherent manner, all rules of radius ≤ k that preserve L.

(c) It is sufficient to consider splicing rules of radius not greater than N = 2(n(L)^2 + 1).

Since assertion (c) requires much detailed work, its full justification must await the forthcoming article by the latter two authors. Here we state only the
Tom Head, Dennis Pixton, and Elizabeth Goode
two major intermediate results from which the justification is constructed. The first of these is a Lemma that plays a crucial role in intricate string calculations:

Two-Sided Pumping Lemma. Let L be a regular language over an alphabet A. For each string w in A∗ of length greater than n(L)², there is a factorization w = xyz with y non-null such that, for every non-negative integer k and every quadruple of strings p, q, s, t in A∗:

pxq is in L if and only if pxyᵏq is in L; and
szt is in L if and only if syᵏzt is in L.
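The bound in the Lemma is stated in terms of n(L), which is concretely computable: for a minimal complete DFA, two strings are syntactically congruent exactly when they induce the same transformation of the state set, so n(L) is the size of the DFA's transition monoid. A minimal sketch in Python — the DFA encoding below is our own illustration, not part of the paper:

```python
def syntactic_class_count(n_states, alphabet, delta):
    """n(L): the number of syntactic-congruence classes of A* determined
    by L, computed as the size of the transition monoid of a minimal
    complete DFA.  States are 0..n_states-1; delta[q][a] is the successor
    of state q on letter a."""
    identity = tuple(range(n_states))      # action of the empty string
    seen = {identity}
    frontier = [identity]
    while frontier:
        f = frontier.pop()
        for a in alphabet:
            # action of the string w.a, given that f is the action of w
            g = tuple(delta[f[q]][a] for q in range(n_states))
            if g not in seen:
                seen.add(g)
                frontier.append(g)
    return len(seen)

# Minimal DFA for the even-length strings over {a}: two classes
# (the even-length and the odd-length strings), so n(L) = 2.
even = [{'a': 1}, {'a': 0}]
```

For the language a∗b, whose minimal DFA needs an accepting state and a sink, the monoid has four elements; the count grows quickly, which is why the bound 2(n(L)² + 1) below, though effective, is large.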
The second required tool is a Proposition that was proved earlier [11], where it played a crucial role in answering previous splicing questions:

Proposition. A regular language L is a reflexive splicing language if and only if there is a finite reflexive set, R, of splicing rules for which R(L) is contained in L and L\R(L) is finite.

If L is a splicing language L(σ, I) with σ = (A, R), then the only strings in L that can fail to lie in R(L) are those in I, and consequently L\R(L) is finite. Thus necessity is trivial. When L\R(L) is finite, one might hope L = L(σ, L\R(L)) with σ = (A, R). Although this is not in general the case, by using the assumed reflexivity of R, the finite sets R and L\R(L) can be finitely enlarged to produce sets R′ and I′, respectively, for which L = L(σ′, I′) with σ′ = (A, R′), as demonstrated in [11]. With each regular language L and each positive integer k we associate the following reflexive set of splicing rules: Tk = {(u, u′, v′, v) : the radius of (u, u′, v′, v) is ≤ k and L is preserved by each of the three rules (u, u′, v′, v), (u, u′, u, u′) and (v′, v, v′, v)}.

Theorem. A regular language L is a reflexive splicing language if and only if L\Tk(L) is finite, where k = 2(n(L)² + 1).

Recall that n(L) is the number of syntactic congruence classes of A∗ determined by L and that, from (a) and (b) above, Tk(L) is algorithmically constructible. Consequently the Theorem assures that, since the finiteness of L\Tk(L) can be decided, it can be decided whether L is a reflexive splicing language.
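The final decision step — testing whether the regular set L\Tk(L) is finite — rests on a classical fact: a DFA accepts an infinite language if and only if some useful state (reachable from the start state, and from which a final state is reachable) lies on a cycle. A hedged sketch of that finiteness test, again under a DFA encoding of our own:

```python
def regular_language_is_finite(n_states, alphabet, delta, start, finals):
    """Decide finiteness of the language of a complete DFA: the language
    is infinite iff a cycle passes through a 'useful' state, i.e. one
    reachable from `start` from which some final state is reachable."""
    # states reachable from the start state
    reach, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            if delta[q][a] not in reach:
                reach.add(delta[q][a])
                stack.append(delta[q][a])
    # states co-reachable to a final state (walk reversed edges)
    rev = {q: set() for q in range(n_states)}
    for q in range(n_states):
        for a in alphabet:
            rev[delta[q][a]].add(q)
    coreach, stack = set(finals), list(finals)
    while stack:
        q = stack.pop()
        for p in rev[q]:
            if p not in coreach:
                coreach.add(p)
                stack.append(p)
    useful = reach & coreach
    # iterative DFS cycle detection restricted to useful states
    color = {q: 0 for q in useful}        # 0 = new, 1 = on stack, 2 = done
    for root in useful:
        if color[root]:
            continue
        color[root] = 1
        stack = [(root, iter([delta[root][a] for a in alphabet]))]
        while stack:
            q, it = stack[-1]
            t = next(it, None)
            if t is None:
                color[q] = 2
                stack.pop()
            elif t in useful:
                if color[t] == 1:
                    return False          # cycle through a useful state
                if color[t] == 0:
                    color[t] = 1
                    stack.append((t, iter([delta[t][a] for a in alphabet])))
    return True
```

Applied to a DFA for L\Tk(L) (obtainable by standard product and complement constructions), this yields the decidability claimed in the Theorem.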
4
Room at the Bottom
An extensive literature exists relating various extensions of the splicing system concept to universal computational schemes, as expounded in [20]. Such extensions were motivated by the desire to find additional new models for biomolecular computation following L. Adleman's wet-lab computation [1]. Many splicing theorists have regarded finite splicing systems as an impoverished level of the
Splicing Systems: Regularity and Below
theory. However, when the motive is to model enzymatic processes, then it is a joy when one finds that extremely simple systems are adequate models. The study of sub-classes of the regular languages is an intensely algebraic theory that is well developed [19], [21], [23], [2]. The syntactic congruence is a fundamental tool in this literature and it has greatly clarified the work we have reported here. Can more extensive interactions be found between this literature and the study of restricted types of finite splicing systems? We hope so and we recommend that the interested reader join in this search. In [18] the class of simple splicing systems was introduced and studied. This work motivated a detailed re-investigation in [13] of the null-context splicing systems, which were introduced originally in [12]. The null-context level has also been examined recently in relation to the naturally occurring DNA restructuring carried out by ciliates [16]. Progressively less simple splicing systems have been defined and studied in [14] and in [11], which also includes the solution of the open problem proposed in [14]. We believe there is more room for worthwhile research at the bottom of splicing theory. Late Breaking News. On arrival at DNA-8, the first author was delighted to be given by G. Mauri a copy of [3], which contains one of the most provocative observations made in the present article: the connection between syntactic monoids and splicing. More recently, C. De Felice has forwarded [5] to us. This mutual interest in finite splicing is especially encouraging. The reader who finds the present article of interest will surely wish to see these 'BFMZ' references and the additional references they contain, such as [4]. Acknowledgments. The first author is exceedingly grateful for the invitation from Masami Hagiya to speak at the 8th Workshop on DNA Computers.
All three authors profited at various intervals during the previous decade from the support of their research through the NSF grants CCR-9201345 and CCR-9509831 and by a subcontract, through Duke University, of the DARPA/NSF CCR-9725021 research program headed by John Reif. This support is gratefully acknowledged.
References 1. L. Adleman, Molecular computation of solutions of combinatorial problems, Science 266(1994)1021-1024. 2. J. Almeida, Finite Semigroups and Universal Algebra, World Scientific, Singapore (1994). 3. P. Bonizzoni, C. De Felice, G. Mauri, R. Zizza, On the power of linear and circular splicing systems, (submitted). 4. P. Bonizzoni, C. De Felice, G. Mauri, R. Zizza, Decision problems for linear and circular splicing, DLT2002 volume of LNCS (to appear). 5. P. Bonizzoni, C. De Felice, G. Mauri, R. Zizza, The structure of reflexive splicing languages via Schutzenberger constants, (submitted). 6. K. Culik II, T. Harju, The regularity of splicing systems and DNA, in: Proc. ICALP '89, LNCS 372(1989)222-233. 7. K. Culik II, T. Harju, Splicing semigroups of dominoes and DNA, Discrete Appl. Math. 31(1991)261-277.
8. A. DeLuca, A. Restivo, A characterization of strictly locally testable languages and its application to subsemigroups of a free semigroup, Inform. and Control 44(1980)300-319. 9. K.L. Denninghoff, R. Gatterdam, On the undecidability of splicing systems, Inter. J. Computer Math. 27(1989)133-145. 10. R.W. Gatterdam, Splicing systems and regularity, Inter. J. Computer Math. 31(1989)63-67. 11. T.E. Goode Laun, Constants and Splicing Systems, Dissertation, Binghamton University, Binghamton, New York (1999). 12. T. Head, Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors, Bull. Math. Biology 49(1987)737-759. 13. T. Head, Splicing representations of strictly locally testable languages, Discrete Appl. Math. 87(1998)139-147. 14. T. Head, Splicing languages with one-sided context, in: Gh. Paun, Ed., Biomolecular Computing - Theory and Experiment, Springer-Verlag (1998)269-282. 15. T. Head, Gh. Paun, D. Pixton, Language theory and molecular genetics: generative mechanisms suggested by DNA recombinations, Chapter 7, Vol. 2 of: G. Rozenberg & A. Salomaa, Eds., Handbook of Formal Languages, Springer, Berlin (1997)295-360. 16. J. Kari, L. Kari, Context-free recombination, in: C. Martin-Vide & V. Mitrana, Eds., Where Mathematics, Computer Science, Linguistics and Biology Meet, Kluwer Academic Pub., Dordrecht (2001). 17. B. Lewin, Genes, (1st. ed.) Wiley, New York (1983). 18. A. Mateescu, Gh. Paun, G. Rozenberg, A. Salomaa, Simple splicing systems, Discrete Appl. Math. 84(1998)145-163. 19. R. McNaughton, S. Papert, Counter-free Automata, MIT Press, Cambridge, MA (1971). 20. Gh. Paun, G. Rozenberg, A. Salomaa, DNA Computing - New Computing Paradigms, Springer-Verlag, Berlin (1998). 21. J.E. Pin, Varieties of Formal Languages, Plenum Pub. Co. (1986). 22. D. Pixton, Regularity of splicing systems, Discrete Appl. Math. 69(1996)101-124. 23. H. Straubing, Finite Automata, Formal Logic, and Circuit Complexity, Birkhauser, Boston, MA (1994).
On the Computational Power of Insertion-Deletion Systems

Akihiro Takahara¹ and Takashi Yokomori²

¹ IBM Japan, 1-14 Nisshin-cho, Kawasaki-ku, Kawasaki 210-8050, JAPAN
[email protected]
² Department of Mathematics, School of Education, Waseda University, 1-6-1 Nishiwaseda, Shinjuku-ku, Tokyo 169-8050, JAPAN, and CREST, JST (Japan Science and Technology Corporation)
[email protected]
Abstract. Gene insertion and deletion are basic phenomena found in DNA processing or RNA editing in molecular biology. The genetic mechanism and development based on these evolutionary transformations have been formulated as a formal system with two operations of insertion and deletion, called insertion-deletion systems ([1], [2]). We investigate the generative power of insertion-deletion systems (InsDel systems), and show that the family INS_1^1 DEL_1^1 is equal to the family of recursively enumerable languages. This gives a positive answer to an open problem posed in [2], where it was conjectured that the answer would be negative.
1
Introduction
Formal language theory enjoys rich fruits in analyzing formal computing systems in its half-century history, and a remarkable amount of knowledge has accumulated on the computational capability of a variety of language generative mechanisms called grammars or systems. On the other hand, unlike most of the language generative devices investigated nowadays, which are based on the operation of rewriting, two language operations, originally motivated mainly by linguistics, have come to the particular attention of researchers in DNA computing theory: the insertion operation and the deletion operation, because these two have been recognized to be of interest in relation to some phenomena in genetics. Intuitively, for a given string xuvx′, an insertion operation with context (u, w, v) produces a new string xuwvx′. (Thus, a new string is obtained by inserting w between u and v in the given string.) Conversely, by a deletion operation with (u, w, v) one obtains xuvx′ from xuwvx′. Theoretically, both insertion and deletion operations may be performed by using biomolecular techniques. For example, consider a single-stranded DNA sequence α of the form xuvyz, where x, y, u, v, z are all strings. Figure 1 illustrates an implementation process of an insertion operation (u, w, v) applied to the string α, producing a new string β, where the single-stranded DNA sequence ū is the Watson-Crick complement of the string u, and z̄ plays the role of a primer. In a similar (but reverse) way, one can also implement the deletion operation (see, e.g., page 189 in [3]).

M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 269-280, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Fig. 1. Molecular implementation of insertion operation (u, w, v) ([3]): annealing, polymerase extension, melting, and cutting via a restriction enzyme. (Figure omitted.)

This paper will focus on the theoretical issues on the computational capability of rewriting systems based on insertion and deletion operations, called insertion-deletion systems, and will present Turing computability characterizations of those systems which settle open problems posed in [2].
2
Preliminaries
An insertion-deletion system (InsDel system, for short) is a quadruple γ = (V, T, A, R), where V is an alphabet, T (⊆ V) is a terminal alphabet, A is a finite set of strings over V (called axioms), and R is a finite set of insertion/deletion rules. An insertion rule is of the form (u, λ/α, v), and a deletion rule is of the form (u, β/λ, v), where u, v ∈ V∗ and α, β ∈ V+. Intuitively, an insertion rule (u, λ/α, v) offers us a rewriting rule uv → uαv, while a deletion rule (u, β/λ, v) provides us with uβv → uv. For x, y ∈ V∗, a binary relation x =⇒ y on V∗ is defined by either (1) x = x1uvx2 and y = x1uαvx2 for some insertion rule (u, λ/α, v) ∈ R and some x1, x2 ∈ V∗, or (2) x = x1uβvx2 and y = x1uvx2 for some deletion rule (u, β/λ, v) ∈ R and some x1, x2 ∈ V∗. As usual, =⇒∗ denotes the reflexive and transitive closure of =⇒. Then, the language L(γ) generated by γ is defined as follows: L(γ) = {w ∈ T∗ | there exists x ∈ A such that x =⇒∗ w}. An InsDel system γ = (V, T, A, R) is said to be of weight (m, n; p, q) iff m = max{lg(α) | (u, λ/α, v) ∈ R}, n = max{lg(u) | (u, λ/α, v) ∈ R or (v, λ/α, u) ∈ R}, p = max{lg(β) | (u, β/λ, v) ∈ R}, q = max{lg(u) | (u, β/λ, v) ∈ R or (v, β/λ, u) ∈ R}, where lg(x) denotes the length of x. Further, we define the total weight of γ as the sum m + n + p + q.
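As one concrete (non-authoritative) reading of these definitions, the one-step relation =⇒ and a length-bounded approximation of L(γ) can be sketched as follows; the encoding of rules as string triples and the pruning bound are our own choices, not part of the formal definition:

```python
def one_step(s, ins_rules, del_rules):
    """All y with s ==> y.  An insertion rule (u, a, v) stands for
    (u, lambda/a, v); a deletion rule (u, b, v) stands for (u, b/lambda, v)."""
    out = set()
    for u, a, v in ins_rules:
        for i in range(len(s) + 1):
            if s[:i].endswith(u) and s[i:].startswith(v):
                out.add(s[:i] + a + s[i:])
    for u, b, v in del_rules:
        site = u + b + v
        for i in range(len(s) - len(site) + 1):
            if s[i:i + len(site)] == site:
                out.add(s[:i + len(u)] + s[i + len(u) + len(b):])
    return out

def bounded_language(axioms, terminals, ins_rules, del_rules, max_len):
    """Terminal strings derivable from the axioms, pruning sentential
    forms longer than max_len (L(gamma) itself may of course be infinite)."""
    seen, frontier = set(axioms), list(axioms)
    while frontier:
        s = frontier.pop()
        for t in one_step(s, ins_rules, del_rules):
            if len(t) <= max_len and t not in seen:
                seen.add(t)
                frontier.append(t)
    return {s for s in seen if all(c in terminals for c in s)}

# gamma = ({a,b}, {a,b}, {ab}, {(a, lambda/ab, b)}): an insertion-only
# system generating {a^n b^n | n >= 1}
anbn = bounded_language({"ab"}, "ab", [("a", "ab", "b")], [], 6)
```

The example system has weight (2, 1; 0, 0) — m = lg(ab) = 2, n = 1 from the one-letter contexts, and p = q = 0 since there are no deletion rules — and, with max_len = 6, the search returns {ab, aabb, aaabbb}.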
By INS_m^n DEL_p^q we denote the family of all languages generated by InsDel systems of weight (m′, n′; p′, q′) satisfying m′ ≤ m, n′ ≤ n, p′ ≤ p and q′ ≤ q. The following results about the families INS_m^n DEL_p^q are of great interest. (RE denotes the family of recursively enumerable languages.)
Theorem 1. (1) INS_1^2 DEL_1^1 = INS_2^1 DEL_2^0 = INS_1^2 DEL_2^0 = RE. ([2]) (2) INS_1^1 DEL_2^0 = RE. ([3])

In what follows, without loss of generality we may assume that the rules in P of a type-0 grammar G = (N, T, P, S) are of one of the following forms ([3]):
type 1 : X → α1α2, for α1, α2 ∈ N ∪ T such that X ≠ α1, X ≠ α2, α1 ≠ α2,
type 2 : X → λ,
type 3 : XY → XZ, for X, Y, Z ∈ N such that X ≠ Y, X ≠ Z, Y ≠ Z.
3
Main Result
We are now in a position to present the main result on the generative power of InsDel systems of weight (1, 1; 1, 1).

Theorem 2. INS_1^1 DEL_1^1 = RE.

Proof. By Church's thesis it holds that INS_1^1 DEL_1^1 ⊆ RE, so that we have only to prove the inclusion RE ⊆ INS_1^1 DEL_1^1. Consider a language L ⊆ T∗, L ∈ RE, generated by a type-0 grammar G = (N, T, P, S) in the Penttonen normal form, where without loss of generality we may assume that P satisfies the property mentioned previously. Also, we assume the rules of P to be labeled uniquely. We construct the InsDel system γ = (V, T, A, R), where V = N ∪ T ∪ {[r], (r), ⟨r⟩ | r is the label of a rule in P} ∪ {B, E}, A = {BSE}, and the set R is constructed as follows.

Group 1 : For each rule r : X → α1α2 ∈ P of type 1, with α1, α2 ∈ N ∪ T, we consider the following set of insertion/deletion rules:
(r.1) (β, λ/[r], X), for β ∈ N ∪ T ∪ {B},
(r.2) (X, λ/(r), β), for β ∈ N ∪ T ∪ {E},
(r.3) ([r], X/λ, (r)),
(r.4) ([r], λ/α1, (r)),
(r.5) (α1, λ/α2, (r)),
(r.6) (β, [r]/λ, α1), for β ∈ N ∪ T ∪ {B},
(r.7) (α2, (r)/λ, β), for β ∈ N ∪ T ∪ {E}.
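To see how Group 1 is meant to operate, the following sketch replays the intended simulation of a hypothetical type-1 rule r : X → YZ on the string Bw1Xw2E with w1 = w2 = a; the single-letter encodings L for [r] and R for (r), and the helper functions, are our own:

```python
def apply_ins(s, u, a, v, i):
    """Apply insertion rule (u, lambda/a, v) at position i, checking contexts."""
    assert s[:i].endswith(u) and s[i:].startswith(v)
    return s[:i] + a + s[i:]

def apply_del(s, u, b, v, i):
    """Apply deletion rule (u, b/lambda, v) to the occurrence of ubv at i."""
    assert s[i:i + len(u + b + v)] == u + b + v
    return s[:i + len(u)] + s[i + len(u) + len(b):]

# Simulation of a hypothetical rule r : X -> YZ starting from BaXaE;
# 'L' encodes [r] and 'R' encodes (r).
s = "BaXaE"
s = apply_ins(s, "a", "L", "X", 2)   # (r.1): BaLXaE
s = apply_ins(s, "X", "R", "a", 4)   # (r.2): BaLXRaE
s = apply_del(s, "L", "X", "R", 2)   # (r.3): BaLRaE
s = apply_ins(s, "L", "Y", "R", 3)   # (r.4): BaLYRaE
s = apply_ins(s, "Y", "Z", "R", 4)   # (r.5): BaLYZRaE
s = apply_del(s, "a", "L", "Y", 1)   # (r.6): BaYZRaE
s = apply_del(s, "Z", "R", "a", 3)   # (r.7): BaYZaE
```

Each insertion has a one-letter body and a one-letter context, and each deletion removes one letter between one-letter contexts, so every rule fits within weight (1, 1; 1, 1).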
Group 2 : For each rule X → λ ∈ P of type 2, we introduce the deletion rule (β1, X/λ, β2), for β1 ∈ N ∪ T ∪ {B} and β2 ∈ N ∪ T ∪ {E}.

Group 3 : For each rule r : XY → XZ ∈ P of type 3, with X, Y, Z ∈ N, we consider the following insertion/deletion rules:
(r.1) (β, λ/[r], X), for β ∈ N ∪ T ∪ {B},
(r.2) (X, λ/(r), Y),
(r.3) (Y, λ/⟨r⟩, β), for β ∈ N ∪ T ∪ {E},
(r.4) ([r], X/λ, (r)),
(r.5) ((r), Y/λ, ⟨r⟩),
(r.6) ([r], (r)/λ, ⟨r⟩),
(r.7) ([r], λ/X, ⟨r⟩),
(r.8) (X, λ/Z, ⟨r⟩),
(r.9) (β, [r]/λ, X), for β ∈ N ∪ T ∪ {B},
(r.10) (Z, ⟨r⟩/λ, β), for β ∈ N ∪ T ∪ {E}.
(Note that rules (r.1) and (r.9) operate in reverse of each other.)

Group 4 : We also consider the deletion rules (λ, B/λ, λ), (λ, E/λ, λ).

(Note that from the uniqueness of each label r for a rule in P, r can be regarded as another notation for a triple [X, α1, α2] if r is X → α1α2, or for a triple [X, Y, Z] if r is XY → XZ.)

We then show the equality L(G) = L(γ).

(L(G) ⊆ L(γ)) : Consider a string BwE in γ, where w is any sentential form in G. (Thus, initially we have w = S.) Suppose that S =⇒∗ w ∈ T∗. If BSE =⇒∗ BwE in γ, then using rules of Group 4, one can always derive w from BwE. So, it suffices to show that any one-step derivation of G can be simulated by some steps of derivations in γ. Since a one-step derivation of G by a rule of type 2, X → λ, is directly simulated by a rule of Group 2 in γ, we have only to prove the other two cases of one-step derivations. In order to simulate a one-step derivation from w1Xw2 in G by a rule of type 1, r : X → α1α2, we begin by applying to Bw1Xw2E a rule either (r.1) or (r.2) from Group 1, and can eventually derive Bw1α1α2w2E, whose simulation process is illustrated in (a) of Figure 2. For the other case of a one-step derivation in G by a rule of type 3, XY → XZ, we can simulate it by beginning with Bw1XYw2E and applying a certain sequence of rules, eventually deriving Bw1XZw2E, which is illustrated in (b) of Figure 2.
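The construction can be exercised mechanically on the smallest possible example: a hypothetical grammar with the single type-1 rule r : S → ab, so that L(G) = {ab}. The sketch below builds Groups 1 and 4 for this grammar (again encoding [r] as L and (r) as R, with a search bound of our own choosing) and checks that the terminal strings derivable from the axiom BSE are exactly {ab}:

```python
def one_step(s, ins_rules, del_rules):
    """All y with s ==> y, for insertion rules (u, lambda/a, v) and
    deletion rules (u, b/lambda, v) encoded as string triples."""
    out = set()
    for u, a, v in ins_rules:
        for i in range(len(s) + 1):
            if s[:i].endswith(u) and s[i:].startswith(v):
                out.add(s[:i] + a + s[i:])
    for u, b, v in del_rules:
        site = u + b + v
        for i in range(len(s) - len(site) + 1):
            if s[i:i + len(site)] == site:
                out.add(s[:i + len(u)] + s[i + len(u) + len(b):])
    return out

# Construction for r : S -> ab (N = {S}, T = {a, b}); 'L' is [r], 'R' is (r).
NT = "Sab"
ins = ([(c, "L", "S") for c in NT + "B"]            # (r.1)
       + [("S", "R", c) for c in NT + "E"]          # (r.2)
       + [("L", "a", "R"), ("a", "b", "R")])        # (r.4), (r.5)
dels = ([("L", "S", "R")]                           # (r.3)
        + [(c, "L", "a") for c in NT + "B"]         # (r.6)
        + [("b", "R", c) for c in NT + "E"]         # (r.7)
        + [("", "B", ""), ("", "E", "")])           # Group 4

# Bounded exhaustive search from the axiom BSE (length bound 7 suffices
# here: the longest string on the intended path, BLabRE, has length 6).
seen, frontier = {"BSE"}, ["BSE"]
while frontier:
    s = frontier.pop()
    for t in one_step(s, ins, dels):
        if len(t) <= 7 and t not in seen:
            seen.add(t)
            frontier.append(t)
terminal = {s for s in seen if set(s) <= set("ab")}
```

Besides the terminal language {ab}, the valid string BabE — the simulation target Bw1α1α2w2E with w1 = w2 = λ — is reachable, as the ⊆ direction requires.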
(L(G) ⊇ L(γ)) : For terminology, a string BwE in γ is valid if w is a sentential form of G. Further, a derivation in γ is valid if it only derives valid strings in γ. In particular, it is called a valid derivation of r if the derivation is carried out by only using rules (r.i) corresponding to r. It suffices to show that γ can generate only strings in L(G). First, no extra terminal string can be derived by applying rules from Group 4 to any string at an intermediate stage (1) through (8) of (a) or (1) through (17) of (b) in Figure 2. Second, the deletion rules (β1, X/λ, β2) from Group 2 exactly correspond to erasing rules in P, and they require the left context β1 and the right context β2, which implies that there are six chances to apply them: to α2 of (7) and α1 of (8) in (a) of Figure 2; and to Y of (1), X of (3), X of (16) and Z of (17) in (b) of Figure 2. After one application of a rule from Group 2, the resulting string in each case can derive at best a valid string Bw′E, where w′ is obtained from w by an erasing rule in G. Thus, we will prove that only valid derivations are generated at best in γ when using rules from Groups 1 and 3. (Note that all rules in P satisfy the property mentioned previously.) For terminology, a valid derivation of r in γ is basic if there is no duplication of the use of rules from Groups 1 and 3 in it. Further, a string BwE in γ is potentially valid if w is in (N ∪ T)∗. Consider any valid string BwE in γ, where initially we have w = S. [Important Notes] (i) In order to derive a potentially valid string in γ from BwE, one has to start with one of the rules (r.1) or (r.2) and end by using rules (r.6) and (r.7) from Group 1, or start with one of (r′.1), (r′.2) or (r′.3) and end by using (r′.9) and (r′.10) from Group 3, where r and r′ are distinct. (ii) No matter how complicated the structure of a derivation for a potentially valid string is, it must be well-nested in the sense mentioned above.
(iii) Thus, we may concentrate on checking the validity of strings in one basic derivation of r′ derived from another basic derivation of r in γ, for any given r, r′ of type 1 or type 3.

Suppose rules (r.1) and (r.2) from Group 1 are applied to Bw1Xw2E. Then, we have a derivational diagram illustrating all of the basic valid derivations in simulating X → α1α2, shown by (a) of Figure 2. Further applications of any rule to either w1 or w2 in each intermediate string of (1) through (8) can be regarded as an independent trial of another simulation process.

Claim (A) Only valid strings are derived from (a) of Figure 2.

Note that to get started on a new simulation process with a rule r′ of G, there are five rules available in γ: (r′.1) and (r′.2) from Group 1; (r′.1), (r′.2) and (r′.3) from Group 3.
Case (a1): Suppose that a rule (r′.1) : (β, λ/[r′], X) from Group 1 is chosen to apply. There are five cases to be considered:
(a1-1) A nonterminal X of (2) is applied to. Then, we have
(2) : Bw1X(r)w2E =⇒ Bw1[r′]X(r)w2E.
Since r ≠ r′, we immediately get stuck.
(a1-2) A nonterminal α2 of (6) is applied to. Then, we have
(6) : Bw1[r]α1α2(r)w2E =⇒ Bw1[r]α1[r′]α2(r)w2E =⇒∗ Bw1α1[r′]α2w2E.
The last string is nothing but the first step to start another simulation process for r′ : α2 → α1′α2′.
(a1-3) A nonterminal α2 of (7) is applied to. Then, we have
(7) : Bw1[r]α1α2w2E =⇒ Bw1[r]α1[r′]α2w2E
=⇒ Bw1α1[r′]α2w2E or =⇒∗ Bw1[r]α1α1′α2′w2E.
The former of the last is the same as (a1-2), while the latter is the case when a new simulation process for r′ : α2 → α1′α2′ has been pre-processed before r.
(a1-4) A nonterminal α1 of (8) is applied to. Then, we have
(8) : Bw1α1α2(r)w2E =⇒ Bw1[r′]α1α2(r)w2E
=⇒ Bw1[r′]α1α2w2E or =⇒∗ Bw1α1′α2′α2(r)w2E.
The former of the last is nothing but the first step to start another simulation process for r′ : α1 → α1′α2′, while the latter is the case when a new simulation process for r′ has been pre-processed before r.
(a1-5) A nonterminal α2 of (8) is applied to. Then, we have
(8) : Bw1α1α2(r)w2E =⇒ Bw1α1[r′]α2(r)w2E =⇒ Bw1α1[r′]α2w2E.
This is the same phase as (a1-2).
Case (a2): Suppose that a rule (r′.2) : (X, λ/(r′), β) from Group 1 is chosen to apply. There are five cases to be considered:
(a2-1) A nonterminal X of (1) is applied to. Then, we have
(1) : Bw1[r]Xw2E =⇒ Bw1[r]X(r′)w2E.
Since r ≠ r′, we immediately get stuck.
(a2-2) A nonterminal α1 of (6) is applied to. Then, we have
(6) : Bw1[r]α1α2(r)w2E =⇒ Bw1[r]α1(r′)α2(r)w2E
=⇒ Bw1α1(r′)α2w2E or =⇒∗ Bw1α1′α2′α2(r)w2E.
The former of the last is nothing but the first step to start another simulation process for r′ : α1 → α1′α2′, while the latter is the same phase as the latter of (a1-4).
(a2-3) A nonterminal α1 of (7) is applied to. Then, we have
(7) : Bw1[r]α1α2w2E =⇒ Bw1[r]α1(r′)α2w2E =⇒ Bw1α1(r′)α2w2E.
This is the same phase as the former of (a2-2).
(a2-4) A nonterminal α2 of (7) is applied to. Then, we have
(7) : Bw1[r]α1α2w2E =⇒ Bw1[r]α1α2(r′)w2E
=⇒ Bw1α1α2(r′)w2E or =⇒∗ Bw1[r]α1α1′α2′w2E.
The former of the last is nothing but the first step to start another simulation process for r′ : α2 → α1′α2′, while the latter is the case when a new simulation process for r′ has been pre-processed before r.
(a2-5) A nonterminal α1 of (8) is applied to. Then, we have
(8) : Bw1α1α2(r)w2E =⇒ Bw1α1(r′)α2(r)w2E
=⇒ Bw1α1(r′)α2w2E or =⇒∗ Bw1α1′α2′α2(r)w2E.
The former of the last is the same phase as the former of (a2-2), while the latter is the case when a new simulation process for r′ : α1 → α1′α2′ has been pre-processed before r.
Case (a3): Suppose that a rule (r′.1) from Group 3 is chosen to apply. We see that this case is checked exactly in the same way as Case (a1), due to the fact that the two (r′.1)s from Groups 1 and 3, having the same structure, can apply to the same place.
Case (a4): Suppose that a rule (r′.2) : (X, λ/(r′), Y) from Group 3 is chosen to apply. There are seven cases to be considered:
(a4-1) A substring α1α2 of (6) is applied to. Then, we have
(6) : Bw1[r]α1α2(r)w2E =⇒ Bw1[r]α1(r′)α2(r)w2E =⇒∗ Bw1α1(r′)α2w2E.
The last string is nothing but the first step to start another simulation process for r′ : α1α2 → α1α2′. For two cases, (a4-2) a substring α1α2 of (7) is applied to, and (a4-3) a substring α1α2 of (8) is applied to, we eventually have the same phase as (a4-1).
(a4-4) A substring Xβ of (1) is applied to, where β is the leftmost nonterminal of w2. Then, we have
(1) : Bw1[r]Xw2E =⇒ Bw1[r]X(r′)w2E.
Since r ≠ r′, we immediately get stuck.
(a4-5) A substring βX of (2) is applied to, where β is the rightmost nonterminal of w1. Then, we have
(2) : Bw1X(r)w2E =⇒ Bw1(r′)X(r)w2E.
Since r ≠ r′, we immediately get stuck.
(a4-6) A substring α2β of (7) is applied to, where β is the leftmost nonterminal of w2. Then, we have
(7) : Bw1[r]α1α2w2E =⇒ Bw1[r]α1α2(r′)w2E =⇒ Bw1α1α2(r′)w2E.
The last string is nothing but the first step to start another simulation process for r′ : α2β → α2β′.
(a4-7) A substring βα1 of (8) is applied to, where β is the rightmost nonterminal of w1. Then, we have
(8) : Bw1α1α2(r)w2E =⇒ Bw1(r′)α1α2(r)w2E =⇒ Bw1(r′)α1α2w2E.
The last string is nothing but the first step to start another simulation process for r′ : βα1 → βα1′.
Case (a5): Suppose that a rule (r′.3) from Group 3 is chosen to apply. We also see that this case is checked exactly in the same way as Case (a2), due to the fact that these two rules, having the same structure, can apply to the same place.
Thus, in either case above, one can see that only valid strings are derived at best. Now, let us move to the second round. Suppose, in turn, rules (r.1), (r.2) and (r.3) from Group 3 are applied to Bw1XYw2E. Then, we have a derivational diagram illustrating all of the basic valid derivations in simulating XY → XZ, shown by (b) of Figure 2. Again, further applications of any rule to either w1 or
w2 alone in each intermediate string of (1) through (15) can be regarded as an independent trial of another simulation process.
Claim (B) Only valid strings are derived from (b) of Figure 2.
Again, to get started on a new simulation process with a rule r′ of G, there are five rules available in γ: (r′.1) and (r′.2) from Group 1; (r′.1), (r′.2) and (r′.3) from Group 3.
Case (b1): Suppose that a rule (r′.1) : (β, λ/[r′], X) from Group 1 is chosen to apply.
(b1-1) There are five subcases where the rule applies to X:
(b1-1.1) A nonterminal X of (2) is applied to. Then, we have
(2) : Bw1X(r)Yw2E =⇒ Bw1[r′]X(r)Yw2E =⇒∗ Bw1[r′]X(r)⟨r⟩w2E.
This eventually gets stuck, because r ≠ r′. Two other cases, (b1-1.2) for X of (6) and (b1-1.3) for X of (9), can be reduced to (b1-1.1).
(b1-1.4) A nonterminal X of (3) is applied to. Then, we have
(3) : Bw1XY⟨r⟩w2E =⇒ Bw1[r′]XY⟨r⟩w2E =⇒∗ Bw1[r′]X(r)⟨r⟩w2E.
This is the same phase as (b1-1.1) (and, therefore, eventually gets stuck).
(b1-1.5) A nonterminal X of (16) is applied to. Then, we have
(16) : Bw1XZ⟨r⟩w2E =⇒ Bw1[r′]XZ⟨r⟩w2E =⇒ Bw1[r′]XZw2E.
The last string is nothing but the first step to start another simulation process for r′ : X → α1α2.
(b1-2) There are three subcases where the rule applies to Y:
(b1-2.1) A nonterminal Y of (1) is applied to. Then, we eventually have
(1) : Bw1[r]XYw2E =⇒ Bw1[r]X[r′]Yw2E =⇒∗ Bw1[r]Xα1α2w2E.
The last string is nothing but the case when a new simulation process for r′ : Y → α1α2 has been pre-processed before r, and it can only survive as Bw1Xα1α2w2E by (r.9). (Note that Y ≠ α1.)
(b1-2.2) A nonterminal Y of (3) is applied to. Then, we eventually have
(3) : Bw1XY⟨r⟩w2E =⇒ Bw1X[r′]Y⟨r⟩w2E.
This immediately gets stuck, because r ≠ r′.
(b1-2.3) A nonterminal Y of (5) is applied to. Then, we have
(5) : Bw1[r]XY⟨r⟩w2E =⇒ Bw1[r]X[r′]Y⟨r⟩w2E =⇒ Bw1X[r′]Y⟨r⟩w2E.
This immediately gets stuck. (Note that r : XY → XZ.)
(b1-3) There are three subcases where the rule applies to Z:
(b1-3.1) A nonterminal Z of (15) is applied to. Then, we have
(15) : Bw1[r]XZ⟨r⟩w2E =⇒ Bw1[r]X[r′]Z⟨r⟩w2E =⇒∗ Bw1X[r′]Zw2E.
The last string is nothing but the first step to start another simulation process for r′ : Z → α1α2. Two other subcases, (b1-3.2) for a nonterminal Z of (16) and (b1-3.3) for a nonterminal Z of (17), can eventually be reduced to (b1-3.1).
Case (b2): Suppose that a rule (r′.2) : (X, λ/(r′), β) from Group 1 is chosen to apply.
(b2-1) There are six subcases where the rule applies to X:
(b2-1.1) A nonterminal X of (1) is applied to. Then, we have
(1) : Bw1[r]XYw2E =⇒ Bw1[r]X(r′)Yw2E =⇒ Bw1X(r′)Yw2E.
The last string is nothing but the first step to start another simulation process for r′ : X → α1α2.
(b2-1.2) A nonterminal X of (3) is applied to. Then, we have
(3) : Bw1XY⟨r⟩w2E =⇒ Bw1X(r′)Y⟨r⟩w2E =⇒∗ Bw1α1α2Y⟨r⟩w2E.
The last string is nothing but the case when a new simulation process for r′ : X → α1α2 has been pre-processed before r. (b2-1.3) A nonterminal X of (5) is applied to. This case can be reduced to (b2-1.2).
(b2-1.4) A nonterminal X of (15) is applied to. Then, we have
(15) : Bw1[r]XZ⟨r⟩w2E =⇒ Bw1[r]X(r′)Z⟨r⟩w2E =⇒∗ Bw1X(r′)Zw2E.
The last string is nothing but the first step to start another simulation process for r′ : X → α1α2. Two other cases, (b2-1.5) for X of (16) and (b2-1.6) for X of (17), can be reduced to (b2-1.4).
(b2-2) There are four subcases where the rule applies to Y:
(b2-2.1) A nonterminal Y of (1) is applied to. Then, we eventually have
(1) : Bw1[r]XYw2E =⇒ Bw1[r]XY(r′)w2E =⇒∗ Bw1[r]Xα1α2w2E.
This is the same phase as (b1-2.1).
(b2-2.2) A nonterminal Y of (2) is applied to. Then, we eventually have
(2) : Bw1X(r)Yw2E =⇒ Bw1X(r)Y(r′)w2E.
This eventually gets stuck, because r ≠ r′. Two other cases, (b2-2.3) with Y of (4) and (b2-2.4) with Y of (7), can be reduced to (b2-2.2).
(b2-3) The rule applies to a nonterminal Z of (17):
(17) : Bw1[r]XZw2E =⇒ Bw1[r]XZ(r′)w2E =⇒ Bw1XZ(r′)w2E.
The last string is nothing but the first step to start another simulation process for r′ : Z → α1α2.
Case (b3): Suppose that a rule (r′.1) from Group 3 is chosen to apply. We see that this case is checked exactly in the same way as Case (b1), due to the fact that the two (r′.1)s from Groups 1 and 3, having the same structure, can apply to the same place.
Case (b4): Suppose that a rule (r′.2) : (X, λ/(r′), Y) from Group 3 is chosen to apply. There are five cases to be considered:
(b4-1) There are three subcases where the rule applies to XY:
(b4-1.1) A substring XY of (1) is applied to. Then, at best we have
(1) : Bw1[r]XYw2E =⇒ Bw1[r]X(r′)Yw2E =⇒ Bw1X(r′)Yw2E.
The last string is nothing but the first step to start another simulation process for r′ : XY → XZ′.
(b4-1.2) A substring XY of (3) is applied to. Then, we have
(3) : Bw1XY⟨r⟩w2E =⇒ Bw1X(r′)Y⟨r⟩w2E.
Since r ≠ r′, we eventually get stuck. The case (b4-1.3), where a substring XY of (5) is applied to, is reduced to (b4-1.2).
(b4-2) There are three subcases where the rule applies to XZ:
(b4-2.1) A substring XZ of (15) is applied to. Then, we have
(15) : Bw1[r]XZ⟨r⟩w2E =⇒ Bw1[r]X(r′)Z⟨r⟩w2E =⇒∗ Bw1X(r′)Zw2E.
The last string is nothing but the first step to start another simulation process for r′ : XZ → XZ′. Two other cases, (b4-2.2) for XZ of (16) and (b4-2.3) for XZ of (17), can be reduced to (b4-2.1).
(b4-3) There are five subcases where the rule applies to βX, where β is the rightmost nonterminal of w1:
(b4-3.1) A substring βX of (2) is applied to. Then, we have
(2) : Bw1X(r)Yw2E =⇒ Bw1(r′)X(r)Yw2E.
This eventually gets stuck, because r ≠ r′.
(b4-3.2) A substring βX of (3) is applied to. Then, we have
(3) : Bw1XY⟨r⟩w2E =⇒ Bw1(r′)XY⟨r⟩w2E.
This eventually gets stuck, because r ≠ r′. Two other cases, (b4-3.3) with βX of (6) and (b4-3.4) with βX of (9), can be reduced to (b4-3.2).
(b4-3.5) A substring βX of (16) is applied to. Then, we have
(16) : Bw1XZ⟨r⟩w2E =⇒ Bw1(r′)XZ⟨r⟩w2E =⇒ Bw1(r′)XZw2E.
The last string is nothing but the first step to start another simulation process for r′ : βX → βX′.
(b4-4) There are four subcases where the rule applies to Yβ, where β is the leftmost nonterminal of w2:
(b4-4.1) A substring Yβ of (1) is applied to. Then, we have
(1) : Bw1[r]XYw2E =⇒ Bw1[r]XY(r′)w2E =⇒ Bw1XY(r′)w2E.
The last string is nothing but the first step to start another simulation process for r′ : Yβ → Yβ′.
(b4-4.2) A substring Yβ of (2) is applied to. Then, we have
(2) : Bw1X(r)Yw2E =⇒ Bw1X(r)Y(r′)w2E.
This eventually gets stuck, because r ≠ r′. Two other cases, (b4-4.3) with Yβ of (4) and (b4-4.4) with Yβ of (7), can be reduced to (b4-4.2).
(b4-5) A substring Zβ of (17) is applied to, where β is the leftmost nonterminal of w2. Then, we have
(17) : Bw1[r]XZw2E =⇒ Bw1[r]XZ(r′)w2E =⇒ Bw1XZ(r′)w2E.
[Figure 2 appears here: two derivation diagrams, (a) for the simulation of X → α1 α2 and (b) for the simulation of XY → XZ, showing the strings (1)–(17) connected by applications of the rules (r.1)–(r.10); the graphical layout is not recoverable in text form.]
Fig. 2. Basic valid derivations in simulating X → α1 α2 and XY → XZ

The last string is nothing but the first step to start another simulation process for r′ : Zβ → Zβ′. Case (b5): Suppose that a rule (r′.3) from Group 3 is chosen to apply. This case is checked in exactly the same way as Case (b2), due to the fact that these two rules, having the same structure, can apply to the same place. Thus, in each case discussed above, one eventually sees that at best only valid strings are derived. Consequently, γ can generate only strings in L(G). The following corollary is immediate from Theorem 2. Corollary 1. INS_2^1 DEL_1^1 = INS_1^1 DEL_1^2 = RE.
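Although the diagrammatic proof details above are specific to the construction, the underlying contextual insertion/deletion operations themselves are easy to state operationally. The following Python sketch (an illustration of the general InsDel operations, not part of the paper's construction) applies an insertion rule (u, x, v), which turns an occurrence of uv into uxv, and a deletion rule (u, x, v), which turns an occurrence of uxv into uv; the weight bounds (m, n; p, q) would constrain the lengths of x and of the contexts u, v:

```python
# Contextual insertion and deletion on strings (illustrative sketch).

def apply_insertion(w, u, x, v):
    """All strings obtained from w by one application of insertion rule (u, x, v):
    an occurrence of the context pair u..v receives x in between."""
    results = []
    ctx = u + v
    i = w.find(ctx)
    while i != -1:
        results.append(w[:i + len(u)] + x + w[i + len(u):])
        i = w.find(ctx, i + 1)
    return results

def apply_deletion(w, u, x, v):
    """All strings obtained from w by one application of deletion rule (u, x, v):
    an occurrence of x flanked by u and v is erased."""
    results = []
    ctx = u + x + v
    i = w.find(ctx)
    while i != -1:
        results.append(w[:i + len(u)] + w[i + len(u) + len(x):])
        i = w.find(ctx, i + 1)
    return results
```

For example, `apply_insertion("BXYE", "X", "r", "Y")` yields `["BXrYE"]`, and deleting `"r"` in the same context recovers `"BXYE"`; a derivation in an InsDel system is a sequence of such single-rule applications.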
4 Conclusions
Within the framework of InsDel systems of weight (m, n; p, q), a natural question, posed in [2], is to find values of (m, n; p, q) as small as possible
Akihiro Takahara and Takashi Yokomori
such that INS_m^n DEL_p^q = RE (the family of recursively enumerable languages). We have shown that the families of languages INS_1^1 DEL_1^2, INS_2^1 DEL_1^1 and INS_1^1 DEL_1^1 coincide with RE. These results shed some new light on the relationship between the weights and the generative capabilities of InsDel systems. In fact, the latter two characterization results answer the open questions posed in [2]. Table 1 summarizes the current status of the relationship among a variety of InsDel systems with total weight 4 and 5. Among the existing results on InsDel systems with total weight 4, only the result INS_1^1 DEL_2^0 = RE was previously known (at least to the authors' knowledge). Granted that InsDel systems of weight (1, 1; 2, 0) have a smaller total weight than those of weight (1, 1; 1, 2) or (2, 1; 1, 1) considered here, all of these are mutually incomparable in the nature of their computational mechanisms. Our main result (Theorem 2), clearly the strongest of the three in this paper, is rather surprising, given that the family INS_1^1 DEL_1^1 had been conjectured to be smaller than RE ([3]). (This leaves an open question concerning the family INS_1^2 DEL_1^0.) Finally, we remark that the construction method mentioned in Section 6 of [2] can be applied here to show the existence of universal (programmable) InsDel systems of weight (1, 1; 1, 1).
Table 1. Weights and language families of InsDel systems

total weight   (m, n; p, q)   family generated   references
     5         (1, 2; 1, 1)         RE           [2]
               (1, 2; 2, 0)         RE           [2]
               (2, 1; 2, 0)         RE           [2]
               (1, 1; 1, 2)         RE           Corollary 1
               (2, 1; 1, 1)         RE           Corollary 1
     4         (1, 1; 2, 0)         RE           [3]
               (1, 1; 1, 1)         RE           Theorem 2
               (1, 2; 1, 0)          ?
Acknowledgements This work is supported in part by the Okawa Foundation for Information and Telecommunications no.01-22, and Waseda University Grant for Special Research Projects no.2001B-009.
References
1. L. Kari and G. Thierrin: Contextual insertion/deletion and computability, Information and Computation, 131, 1 (1996), 47–61.
2. L. Kari, Gh. Păun, G. Thierrin, S. Yu: At the crossroads of DNA computing and formal languages: Characterizing RE using insertion-deletion systems, Proc. of 3rd DIMACS Workshop on DNA Based Computing, Philadelphia, (1997), 318–333.
3. Gh. Păun, G. Rozenberg and A. Salomaa: DNA Computing, Springer-Verlag, 1998.
Unexpected Universality Results for Three Classes of P Systems with Symport/Antiport

Mihai Ionescu¹, Carlos Martín-Vide², Andrei Păun³, and Gheorghe Păun²,⁴

¹ Faculty of Mathematics, University of Bucharest, Str. Academiei 14, 70109 București, Romania
² Research Group on Mathematical Linguistics, Rovira i Virgili University, Pl. Imperial Tàrraco 1, 43005 Tarragona, Spain, [email protected]
³ Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7
⁴ Institute of Mathematics of the Romanian Academy, PO Box 1-764, 70700 București, Romania
Abstract. Symport and antiport are biological ways of transporting molecules through membranes in “collaborating” pairs; in the case of symport the two molecules pass in the same direction, in the case of antiport the two molecules pass in opposite directions. Here we first survey the results about the computing power of membrane systems (P systems) using only symport/antiport rules (hence these systems compute only by using communication), then we introduce a novel way of defining the result of a computation in a membrane system: looking for the trace of certain objects in their movement through membranes. Rather unexpectedly, in this way we get characterizations of recursively enumerable languages by means of membrane systems with symport/antiport which work with multisets of objects (note the qualitative difference between the data structure used by computations – multisets: no ordering – and the data structure of the output – strings: linear ordering). A similar remark holds true for the case of analysing P systems: the sequence of certain distinguished objects taken from the environment during a computation is the string recognized by the computation. We also survey universality results from this area, with sketched proofs.
1 Membrane Systems with Symport/Antiport Rules
We assume that the reader has a minimal familiarity with membrane computing, so we do not recall the basic notions from this area used here. Details can be found at http://psystems.disco.unimib.it and in [15]. Also, we assume some familiarity with formal language theory; in this respect, the reader might want to consult [10] (available at the above-mentioned web page). We start from the biological observation [1], [2] that there are many cases where two chemicals pass at the same time through a membrane, with the help of
Corresponding author.
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 281–290, 2003. c Springer-Verlag Berlin Heidelberg 2003
Mihai Ionescu et al.
each other, either in the same direction, or in opposite directions; in the former case we say that we have a symport, in the latter case an antiport. Mathematically, we can capture the idea of symport by considering rules of the form (ab, in) or (ab, out) associated with a membrane, stating that the objects a, b can together enter, respectively exit, the membrane. For antiport we consider rules of the form (a, out; b, in), stating that a exits and at the same time b enters the membrane. Generalizing such kinds of rules, we can consider rules of the unrestricted forms (x, in), (x, out) (generalized symport) and (x, out; y, in) (generalized antiport), where x, y are strings representing multisets of objects, without any restriction on the length of these strings. Based on rules of these types, in [12] there are considered membrane systems (currently called P systems [14]) with symport/antiport in the form of constructs Π = (V, µ, w1 , . . . , wm , E, R1 , . . . , Rm , io ), where: (i) V is an alphabet (its elements are called objects); (ii) µ is a membrane structure consisting of m membranes, with the membranes (and hence the regions) injectively labeled with 1, 2, . . . , m; m is called the degree of Π; (iii) wi , 1 ≤ i ≤ m, are strings over V which represent the multisets of objects associated with the regions 1, 2, . . . , m of µ, present in the system at the beginning of a computation; (iv) E ⊆ V is the set of objects which are supposed to continuously appear in the environment in arbitrarily many copies; (v) R1 , . . . , Rm are finite sets of symport and antiport rules over the alphabet V associated with the membranes 1, 2, . . . , m of µ; (vi) io is the label of an elementary membrane of µ (the output membrane). For a symport rule (x, in) or (x, out), we say that |x| is the weight of the rule. The weight of an antiport rule (x, out; y, in) is max(|x|, |y|). The rules from a set Ri are used with respect to membrane i as explained above.
In the case of (x, in), the multiset of objects x enters the region defined by the membrane, from the immediately upper region; this is the environment when the rule is associated with the skin membrane. In the case of (x, out), the objects specified by x are sent out of membrane i, into the region immediately outside; this is the environment in the case of the skin membrane. The use of a rule (x, out; y, in) means expelling from membrane i the objects specified by x at the same time as bringing into membrane i the objects specified by y. The objects from E are supposed to appear in arbitrarily many copies in the environment (because we only move objects from one membrane to another, and hence do not create new objects in the system, we need a supply of objects in order to compute with arbitrarily large multisets). The rules are used in the nondeterministic maximally parallel manner specific to P systems with symbol-objects (the objects to evolve and the rules by which these objects evolve are nondeterministically chosen, but the set of rules used at any step is maximal: no further object can evolve at the same time). In this way, we obtain transitions between the configurations of the system. A configuration is described by the m-tuple of the multisets of objects present in the m regions of the system, as well as the multiset of objects which were sent out of the system during the
computation, other than the objects appearing in the set E; it is important to keep track of such objects because they appear in a finite number of copies in the initial configuration and can enter the system again. We do not need to take care of the objects from E which leave the system, because they appear in arbitrarily many copies in the environment (the environment is supposed to be inexhaustible: irrespective of how many copies of an object from E are introduced into the system, arbitrarily many still remain in the environment). The initial configuration is (w1 , . . . , wm , λ). A sequence of transitions is called a computation, and with any halting computation we associate an output, in the form of the number of objects present in membrane io in the halting configuration. The set of these numbers computed by a system Π is denoted by N (Π). The family of all sets N (Π), computed by systems Π of degree at most m ≥ 1, using symport rules of weight at most p and antiport rules of weight at most q, is denoted by N Pm (symp , antiq ); when any of the parameters m, p, q is not bounded, we replace it with ∗. P systems with symport/antiport rules were considered in many places, [8], [9], [13], [16], [6], [11], etc. We recall some universality results, especially from the last two papers, without proofs. (By REG, RE we denote the families of regular and of recursively enumerable languages, respectively; N RE is the family of recursively enumerable sets of natural numbers.) First, we mention a universality result with a minimal number of membranes. Theorem 1. N P1 (sym1 , anti2 ) = N RE. The antiport rules used in Theorem 1 have weight two, hence they are not very “bio-realistic”. One can eliminate this drawback (actually, one can eliminate the antiport rules altogether, which is a rather unexpected result!), at the price of using more membranes: four. The number of membranes can be reduced at the expense of using symport rules of a larger weight (and no antiport rule).
In fact, a trade-off relation seems to exist between the number of membranes and the weight of the symport rules necessary for obtaining computational universality. Theorem 2. N P4 (sym2 , anti0 ) = N P2 (sym3 , anti0 ) = N RE. At first sight, antiport rules are a generalization of symport rules, because a rule (u, out) can be transformed into (u, out; d, in), where d is a dummy object, and the same for rules (u, in). Actually, this is not true, as it is clear that no system Π using only antiport rules can compute both the number 0 and any non-null number. However, all recursively enumerable sets of non-null numbers (we denote this family by N RE) can be computed by P systems using only antiport rules. Theorem 3. N P1 (sym0 , anti2 ) = N RE. The number of membranes used for characterizing N RE can be reduced to three also when the symport rules have permitting or forbidding context conditions (promoting or inhibiting objects). A symport rule with a permitting condition is given in the form (x, in)a , (x, out)a , where a is an object; this object
should be present in membrane i when a rule (x, in)a , (x, out)a is applied (in the second case, a should not be counted as an element of x). A symport rule with a forbidding condition is given in the form (x, in)¬a , (x, out)¬a , and such a rule can be used only if the object a is not present in the membrane where the rule is applied. The use of permitting (forbidding) conditions is indicated by replacing sym with psym (fsym, respectively) in the notation N Pm (symp , antiq ). Theorem 4. N P3 (psym2 , anti0 ) = N P3 (fsym2 , anti0 ) = N RE. Note that all these theorems deal with N RE, that is, with the computation of sets of numbers. Considering the computation of sets of strings (languages) – as we will do in the next sections – is a qualitatively different matter.
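To make the rule semantics above concrete, here is a minimal Python sketch (illustrative only, not from the paper): regions are multisets represented as Counters, and each function applies a single symport or antiport rule once across one membrane, checking that the required multisets are present on the right sides. The nondeterministic maximally parallel mode is deliberately not modeled.

```python
from collections import Counter

def contains(region, multiset):
    # True iff the region contains at least the given multiset of objects.
    return all(region[o] >= n for o, n in multiset.items())

def move(src, dst, multiset):
    # Transfer a multiset of objects from one region to the other.
    src.subtract(multiset)
    dst.update(multiset)

def symport_in(outside, inside, x):
    """(x, in): the multiset x enters the membrane together; returns success."""
    x = Counter(x)
    if contains(outside, x):
        move(outside, inside, x)
        return True
    return False

def antiport(outside, inside, x, y):
    """(x, out; y, in): x exits while, at the same time, y enters."""
    x, y = Counter(x), Counter(y)
    if contains(inside, x) and contains(outside, y):
        move(inside, outside, x)
        move(outside, inside, y)
        return True
    return False
```

For instance, with `outside = Counter("ab")` and `inside = Counter("c")`, the antiport rule (c, out; ab, in) of weight two succeeds and swaps the contents across the membrane.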
2 Following the Traces of Objects
In the case of P systems with symport/antiport rules it is not natural to consider an external output, in the form of the sequence of objects sent into the environment during a computation (the objects arranged in the order they leave the system), because we already have objects in the environment. However, working with multisets of objects and trying to generate strings is a challenging task. In the case of purely communicative systems, as those with symport/antiport are, there is a nice possibility to answer such a question, namely, by considering strings associated with the itineraries of certain objects through membranes. We introduce this idea for the general case of using promoters/inhibitors. Specifically, we consider P systems of the form Π = (V, t, T, h, µ, w1 , . . . , wm , E, R1 , . . . , Rm ), where V is an alphabet, t ∈ V is a distinguished object (“the traveler”), T is an alphabet, h : {1, 2, . . . , m} −→ T ∪ {λ} is a weak coding, w1 , . . . , wm are strings over V representing the multisets of objects present in the m regions of µ, E is the set of objects present in the environment in arbitrarily many copies, and R1 , . . . , Rm are the sets of symport and antiport rules (with promoters/inhibitors) associated with the m membranes. The traveler is present in exactly one copy in the system, that is, |w1 . . . wm |t = 1 and t ∉ E. Let σ = C1 C2 . . . Ck , k ≥ 1, be a halting computation with respect to Π, with C1 = (w1 , . . . , wm ) the initial (internal) configuration, and Ci = (z1^(i) , . . . , zm^(i) ) the configuration at step i, 1 ≤ i ≤ k (we ignore the environment here, because we do not count it when defining the trace of the traveler). If |zj^(i) |t = 1 for some 1 ≤ j ≤ m, then we write Ci (t) = j (therefore, Ci (t) is the label of the membrane where t is placed in configuration Ci ). If |zj^(i) |t = 0 for all j = 1, 2, . . . , m (this means that the traveler is outside the system, in the environment), then we put Ci (t) = λ.
Then, the trace of t in the computation σ is trace(t, σ) = C1 (t)C2 (t) . . . Ck (t).
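This definition can be mirrored directly in code. In the following hypothetical Python sketch (representation choices are ours, not the paper's), a configuration is a dict from membrane labels to strings of objects, and the weak coding h is a dict from labels to terminals, with missing labels coding to λ:

```python
def trace(configs, t, h):
    """Return h(trace(t, sigma)) for a halting computation sigma = configs.

    For each configuration C_i, C_i(t) is the label of the membrane holding
    the traveler t; when t is in the environment, the step contributes λ."""
    out = []
    for config in configs:
        label = next((j for j, w in config.items() if t in w), None)
        if label is not None:            # t sits inside membrane `label`
            out.append(h.get(label, ""))  # apply the weak coding h
        # otherwise t is in the environment: contributes the empty word
    return "".join(out)
```

For example, a computation in which the traveler visits membranes 1, 2, the environment, and 1 again, with h(1) = a and h(2) = b, generates the string "aba".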
Note that the trace starts with the label of the membrane where t is placed in the initial configuration. The computation σ is said to generate the string h(trace(t, σ)); the language generated by Π is L(Π) = {h(trace(t, σ)) | σ is a halting computation in Π}. We denote by LPm (psymp , pantiq ) the family of languages L(Π) generated by P systems with at most m membranes, with symport rules of weight at most p and antiport rules of weight at most q, using promoters; when the rules use the associated conditional symbols in the forbidding mode, we write fsym, fanti instead of psym, panti; when the rules are used in the free mode (they have no promoter/inhibitor symbols associated), we remove the initial “p” and “f” from psym, panti and fsym, fanti. As usual, the subscripts m, p, q are replaced by ∗ when no bound on the corresponding parameter is considered. Obviously, the number of symbols appearing in the strings of a language induces an infinite hierarchy with respect to the number of membranes used by a P system able to generate that language: each symbol should correspond to a membrane, hence L ∈ LPm (psym∗ , panti∗ ) ∪ LPm (fsym∗ , fanti∗ ) implies card(alph(L)) ≤ m. Otherwise stated, the hierarchies on the number of membranes are infinite for all types of systems (and irrespective of the weights of the rules). That is why, for any family FL of languages, it is natural to consider the subfamily nFL of all languages in FL over alphabets with n symbols. The following result has been proved in [7] (by making use of the technique, frequent in membrane computing, of simulating a matrix grammar in a normal form by a P system): Theorem 5. LP∗ (psym2 , panti∗ ) = LP∗ (fsym2 , fanti∗ ) = RE. The result has been essentially improved in [6] (starting from register machines): Theorem 6. nRE = nLPn+1 (sym0 , anti2 ) = nLPn+1 (sym3 , anti0 ) = nLPn+2 (sym2 , anti0 ).
Thus, the permitting/forbidding conditions have been removed and, at the same time, the weight of the antiport rules has been bounded by small values. Note that the last equality is obtained in terms of symport rules only, of a weight as encountered in biology: two symbols at a time pass through a membrane. At first sight, this is a rather unexpected result, but the explanation lies in the natural connection with register machines and the fact that P systems with symport/antiport rules have an in-built context-sensitivity.
3 Analysing P Systems
Let us now consider a way to identify a language by means of a P system with symport/antiport rules which is dual to the generative one: we distinguish a subset T ⊆ V of terminal objects and consider the sequence of elements of T which enter the system during a halting computation. If several objects enter
the system at the same time, then all their permutations are accepted. In this way, a system Π accepts a language, A(Π), of all strings recognized as above. We denote by APm (symp , antiq ) the family of languages A(Π) recognized by systems with at most m membranes, using symport rules of weight at most p, and antiport rules of weight at most q. Such systems were considered in [4], where the following result is proved: Theorem 7. RE = AP1 (sym0 , anti2 ). This result provides an easy proof of the equality nRE = nLPn+1 (sym0 , anti2 ) from Theorem 6: consider an alphabet V = {a1 , . . . , an }; in the unique membrane of a system Π constructed for proving the equality RE = AP1 (sym0 , anti2 ), we introduce n membranes, one for each symbol of V ; in each membrane i we introduce a symbol bi , and in the skin membrane we introduce the traveler object t; when a symbol ai is brought into the system (by rules of Π), we move it to membrane i, by a rule (bi , out; ai t, in) (we also associate the rule (t, out; bi , in) with membrane i); in this way, the symbols of the string recognized by the starting P system “indicate” the way the traveler has to follow through the membranes, hence the trace of the traveler is exactly the string recognized by Π (note that t ends up in the skin region and each bi goes back to the corresponding membrane i). In the previous mode of analysing a string, if a string w is recognized by Π, then its symbols should exist in E; because each element of E appears in the environment in arbitrarily many copies, we cannot introduce a symbol of w into the system by using a symport rule: an arbitrarily large number of copies would be introduced, hence the string would not be finite. Antiport rules are thus obligatory. In order to cope with this difficulty, and in order to recognize strings in a way closer to automata style, we consider a restricted mode of accepting strings, called initial: take a string x = x(1)x(2) . . .
x(n), with x(i) ∈ T, 1 ≤ i ≤ n; in the steps 1, 2, . . . , n of a computation we place one copy of x(1), x(2), . . . , x(n), respectively, in the environment (together with the symbols of E); in each step 1, 2, . . . , n we request that the symbol x(1), x(2), . . . , x(n), respectively, is introduced into the system (by a symport or an antiport rule, alone or together with other symbols); after exhausting the string x, the computation may continue, maybe introducing further symbols into the system, but without allowing the symbols of x to leave the system; if the computation eventually halts, then the string x is recognized. The language of strings accepted by Π in the initial mode is denoted by AI (Π), and the family of languages AI (Π) corresponding to APm (symp , antiq ) is denoted by AI Pm (symp , antiq ). We do not know whether or not analysing P systems working in the initial mode are computationally universal, but this is true for systems using conditional rules. The proof is based on simulating register machines by means of P systems with symport/antiport rules (without conditions on the application of rules). Similar techniques have recently been used in many places ([6], [17], [4]), hence we do not recall them here; we simply denote by R2P the procedure, used in [4], of passing from a register machine to a P system (for each label q of an
instruction of the register machine, this procedure uses the objects q, q′, q′′, q′′′, which will appear in the construction below; also, a new symbol f is sometimes used). No symport rules, but only antiport rules of weight 2, are provided by the R2P procedure. Theorem 8. If L ∈ RE, L ⊆ V ∗ , for some alphabet V = {b1 , . . . , bk }, then L ∈ AI P1 (sym0 , pantik+2 ) ∩ AI P1 (sym0 , fantik+2 ). Proof. Consider a language L ⊆ V ∗ , V = {b1 , . . . , bk }, L ∈ RE. It is known (see, e.g., [5]) that there is a register machine M = (n, R, qs , qh ) (n is the number of registers, qs is the start label, qh is the halt label, and R is the set of instructions, of the form q1 : (op(r), q2 , q3 ); op(r) can be A(r), meaning “add 1 to register r”, and S(r), meaning “subtract 1 from register r”; in the successful case one goes to the instruction with label q2 , in the failure case one continues with the instruction with label q3 ; the set of all labels of instructions from R is denoted by lab(M )) which accepts exactly the set val(L) = {val(x) | x ∈ L}, where val(x) is the value in base k + 1 of the string x ∈ V ∗ , with the most significant digit at the left end of the string. Starting from M , we construct the analysing P system with promoters Π1 = (V ∪ E, V, [1 ]1 , cd, E, R1 ), with E = {q, q′, q′′, q′′′ | q ∈ lab(M )} ∪ {ar | 1 ≤ r ≤ n} ∪ {c, d}, and the set R1 containing the following rules:
(d, out; d bj a1^j , in)c , for all 1 ≤ j ≤ k,
(c, out; c, in),
(a1 , out; a1^{k+1} , in)c ,
(c, out; qs bj a1^j , in), for all 1 ≤ j ≤ k,
as well as all antiport rules (without promoters) for simulating the rules of M , provided by the R2P procedure. The system Π1 works as follows. The symbols of the string w to be recognized are introduced one by one into the system, and at the same time one computes the value in base k + 1 of the prefix of w already introduced, by means of the rule (a1 , out; a1^{k+1} , in)c , which is used in parallel for all copies of a1 . When the rule (c, out; qs bj a1^j , in) introduces the initial label of M into the system, we start the simulation of the work of M . The computation will eventually halt if and only if the string w is recognized by M , hence Π1 accepts exactly the strings of the language L. For the case of forbidding conditions we construct the system Π2 = (V ∪ E, V, [1 ]1 , de, E, R1 ), with E = {q, q′, q′′, q′′′ | q ∈ lab(M )} ∪ {ar | 1 ≤ r ≤ n} ∪ {gj | 1 ≤ j ≤ k} ∪ {c, d, e, f }, and the set R1 containing the following rules
(e, out; e bj a1^j , in)¬c , for all 1 ≤ j ≤ k,
(d, out; d, in),
(a1 , out; a1^{k+1} , in)¬c ,
(d, out; c gj bj , in), and (gj , out; qs a1^j , in), for all 1 ≤ j ≤ k,
as well as all antiport rules (without inhibitors) for simulating the rules of M . The task of checking the equality AI (Π2 ) = L is left to the reader. For one-letter languages we do not need promoters or inhibitors: Theorem 9. 1RE = 1AI P1 (sym0 , anti2 ). Proof. For a language L ⊆ {a}∗ with length(L) recognized by a register machine M = (n, R, qs , qh ), we construct the P system Π = (V, {a}, [1 ]1 , d, V − {a}, R1 ), with V = {ar | 1 ≤ r ≤ n} ∪ {q, q′, q′′, q′′′ | q ∈ lab(M )} ∪ {a, d, d′, f }, and R1 contains the rules (d, out; d′a, in), (d′, out; da1 , in), (d′, out; qs a1 , in), as well as the antiport rules for simulating the rules from R given by the R2P procedure. The equality AI (Π) = L is obvious. Note that the string a^t (implicitly, the number t) is recognized by the previous system, not generated, hence the above result is not directly comparable with those from Section 1. However, a link between these two classes of results can be established: Lemma 1. Let Q ∈ N Pm (symp , antiq ) be a set of numbers generated by a P system of the type associated with the family N Pm (symp , antiq ), which counts in the halting configuration the symbols from a set T present in an elementary membrane io , and such that the only rules associated with membrane io and involving elements of T are of the form (a, in), a ∈ T . Then {b^j | j ∈ Q} ∈ 1AI Pm+2 (symp′ , antiq ), where p′ = max(p, 2). Proof. Let Π be a (generative) symport/antiport system with the above property, Π = (V, µ, w1 , . . . , wm , E, R1 , . . . , Rm , io ). We construct the analysing system of degree m + 2, Π′ = (V ∪ {b, c}, T, µ′, w1 c, w2 , . . . , wm , λ, λ, E, R1′ , . . . , Rm′ , Rm+1 , Rm+2 ),
where b, c are symbols not in V , with µ′ obtained from µ by introducing two new membranes, with labels m + 1 and m + 2, inside membrane io , and with the following sets of rules:
Ri′ = Ri ∪ {(c, in), (c, out)} ∪ {(b, in)}, for all i ∉ {io , m + 1, m + 2},
Rio′ = Rio ∪ {(b, in), (c, in)},
Rm+1 = {(cα, in), (cα, out) | α ∈ T ∪ {b}},
Rm+2 = {(ba, in) | a ∈ T }.
The system Π′ works exactly as Π for generating numbers, in the form of multisets of objects from T sent into membrane io , but at the same time, in the first steps of a computation, it brings copies of the object b from the environment. These objects also go to membrane io . From here, pairs ba with a ∈ T can enter membrane m + 2 and remain there forever. In this way, we can compare the number of objects from T generated by the system Π with the number of symbols b brought into the system. If the two numbers are not equal, then the computation will never end: the extra copies of b, or the extra copies of objects from T , can wait in membrane io as long as c is not there (this can last an arbitrary number of steps, because of the rules (c, in), (c, out) present in all sets Ri′ with i different from io ), and will pass back and forth through membrane m + 1 after c is brought into membrane io . Consequently, Π′ recognizes exactly the strings b^j with j generated by Π. Theorem 10. 1RE = 1AI Pm (symp , antiq ) for (m, p, q) ∈ {(3, 2, 2), (5, 2, 0), (4, 3, 0)}. Proof. This is a direct consequence of the previous lemma and of Theorems 1 and 2. Note that the systems used in the proofs of these results in [6] and [11] have the properties from the statement of Lemma 1. The result from Theorem 9 is optimal with respect to the number of membranes and the weight of the symport and antiport rules (antiport rules of the form (x, out; y, in) with |x| = |y| = 1 do not change the number of objects present in the system). This is not necessarily the case with Theorem 10. We leave as an open problem the question of finding improved results.
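Stepping back to the proof of Theorem 8, the base-(k + 1) evaluation performed there has a simple arithmetic core: each step first multiplies the current number of copies of a1 by k + 1 (the rule (a1, out; a1^{k+1}, in)c applied in parallel to all copies) and then adds j copies of a1 (imported together with bj). A small Python illustration of this incremental val(x) computation (the digits-as-integers encoding is our choice, not the paper's):

```python
def val(x, k):
    """Value in base k+1 of a string over {b_1, ..., b_k}, most significant
    digit first, with b_j given as the integer digit j."""
    v = 0
    for j in x:
        # one step of Pi_1: multiply the a1 count by k+1, then add the new digit
        v = v * (k + 1) + j
    return v
```

For example, over V = {b1, b2} (so k = 2, base 3), the string b1 b2 evaluates to 1·3 + 2 = 5, mirroring the multiset of a1 objects held inside the membrane after two steps.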
We close this section by mentioning that analysing P systems with symport rules were also considered in [3], under the name of P automata, with two important differences from the case discussed here: one uses only rules of the form (x, in)y , where y can be a multiset (hence the communication is done in a one-way manner, top-down), and the transitions are defined in a sequential way, with at most one rule used in each step in each region (there are also some other, less important, differences which we do not mention here). Such systems with seven membranes were proved in [3] to be universal (but it is highly expected that the number of membranes sufficient for obtaining universality can be decreased).
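The register machines M = (n, R, qs, qh) simulated throughout this section also admit a compact interpreter. The following Python sketch (hypothetical; the labels-as-dict-keys encoding and the step bound are our assumptions) implements the semantics stated in the proof of Theorem 8: A(r) adds 1 to register r and continues at q2, while S(r) subtracts 1 and continues at q2 on success, or at q3 when register r is already 0:

```python
def run(program, n, qs, qh, input_value, max_steps=10_000):
    """Run M on input_value stored in register 1; accept iff label qh is reached.

    program maps a label q to a tuple (op, r, q2, q3) with op in {"A", "S"}."""
    regs = [0] * (n + 1)          # registers 1..n (index 0 unused)
    regs[1] = input_value
    q = qs
    for _ in range(max_steps):
        if q == qh:
            return True           # halt label reached: input accepted
        op, r, q2, q3 = program[q]
        if op == "A":
            regs[r] += 1          # A(r): increment, always "successful"
            q = q2
        elif regs[r] > 0:         # S(r), successful subtraction
            regs[r] -= 1
            q = q2
        else:                     # S(r), register already empty
            q = q3
    return False                  # did not halt within the step bound
```

As a usage example, the two-instruction machine that alternately decrements register 1 accepts exactly the even numbers (the failure branch of the first instruction leads to the halt label, that of the second to a non-halting trap).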
4 Final Remarks
Computing by communication (in the framework of membrane systems) proves to be surprisingly powerful – universality is reached by systems with a very small number of membranes, and this happens for the “standard” P systems with symport/antiport, when following the traces of certain objects through membranes, as well as when considering the analysing mode of using a system. The biologically well-motivated symport/antiport P systems deserve special attention, both from a mathematical and a computational point of view.
References
1. B. Alberts et al.: Essential Cell Biology. An Introduction to the Molecular Biology of the Cell. Garland Publ. Inc., New York, London, 1998.
2. I.I. Ardelean: The Relevance of Biomembranes for P Systems. Fundamenta Informaticae, 49, 1–3 (2002), 35–43.
3. E. Csuhaj-Varjú, G. Vaszil: P Automata. Pre-Proceedings of Workshop on Membrane Computing, Curtea de Argeș, Romania, 2002, MolCoNet Publication No. 1, 2002, 177–192.
4. R. Freund, M. Oswald: A Short Note on Analysing P Systems. Bulletin of the EATCS, 78 (October 2002).
5. R. Freund, Gh. Păun: On the Number of Non-terminal Symbols in Graph-controlled, Programmed and Matrix Grammars. Proc. Conf. Universal Machines and Computations, Chișinău, 2001 (M. Margenstern and Y. Rogozhin, eds.), LNCS 2055, Springer-Verlag, 2001, 214–225.
6. P. Frisco, H.J. Hoogeboom: Simulating Counter Automata by P Systems with Symport/Antiport. Pre-Proceedings of Workshop on Membrane Computing, Curtea de Argeș, Romania, 2002, MolCoNet Publication No. 1, 2002, 237–248.
7. M. Ionescu, C. Martín-Vide, Gh. Păun: P Systems with Symport/Antiport Rules: The Traces of Objects. Grammars, 5 (2002).
8. C. Martín-Vide, A. Păun, Gh. Păun: On the Power of P Systems with Symport Rules. Journal of Universal Computer Science, 8, 2 (2002), 317–331.
9. C. Martín-Vide, A. Păun, Gh. Păun, G. Rozenberg: Membrane Systems with Coupled Transport: Universality and Normal Forms. Fundamenta Informaticae, 49, 1–3 (2002), 1–15.
10. C. Martín-Vide, Gh. Păun: Elements of Formal Language Theory for Membrane Computing. Technical Report 21/01 of the Research Group on Mathematical Linguistics, Rovira i Virgili University, Tarragona, 2001.
11. A. Păun: Membrane Systems with Symport/Antiport: Universality Results. Pre-Proceedings of Workshop on Membrane Computing, Curtea de Argeș, Romania, 2002, MolCoNet Publication No. 1, 2002, 333–344.
12. A. Păun, Gh. Păun: The Power of Communication: P Systems with Symport/Antiport. New Generation Computing, 20, 3 (2002), 295–306.
13. A. Păun, Gh. Păun, G. Rozenberg: Computing by Communication in Networks of Membranes. International Journal of Foundations of Computer Science, to appear.
14. Gh. Păun: Computing with Membranes. Journal of Computer and System Sciences, 61, 1 (2000), 108–143.
15. Gh. Păun: Computing with Membranes: An Introduction. Springer-Verlag, Berlin, 2002.
16. Gh. Păun, M. Pérez-Jiménez, F. Sancho-Caparrini: On the Reachability Problem for P Systems with Porters. Proc. Automata and Formal Languages Conf., Debrecen, Hungary, 2002.
17. P. Sosík: P Systems Versus Register Machines: Two Universality Proofs. Pre-Proceedings of Workshop on Membrane Computing, Curtea de Argeș, Romania, 2002, MolCoNet Publication No. 1, 2002, 371–382.
Conformons-P Systems Pierluigi Frisco1 and Sungchul Ji2 1
L.I.A.C.S., Leiden University Niels Bohrweg 1, 2333 CA Leiden, The Netherlands [email protected] 2 Dep. of Pharm. and Toxic., Rutgers University Piscataway, N.J. 08855, U.S.A. [email protected]
Abstract. The combination of a theoretical model of the living cell with membrane computing suggests a new variant of a computational model based on membrane-enclosed compartments, defined and presented in this paper for the first time. This variant is based on simple and basic concepts: conformons, a combination of information and energy; locality of the interactions of conformons, permitted by the presence of membrane-enclosed compartments; communication via conformons between membrane-enclosed compartments. The computational power of this new system is sketched. Other possible variants of this model and links with Petri nets are outlined.
1 Introduction
One of the aspects of natural computing is to interpret all processes present in a cell as computational processes. The extrapolation of some basic principles of the functioning of a cell and their definition from a mathematical point of view have led to the creation of theoretical computational models. This contribution expounds the theoretical facet of biomolecular computing, also considering the investigation of the generative capability, complexity, universality, etc., of such models. In our research we have combined some basic principles of biocybernetics [9], a general molecular theory of living processes, with membrane computing, a novel distributed parallel way of computing. In [9] biocybernetics is formulated on the basis of principles, concepts and analogies imported from physics, chemistry and cybernetics. The most novel physical concept to emerge in that theory is that of gnergy, a hybrid physical entity composed of free energy and information that is postulated to be ultimately responsible for driving all molecular machines. Discrete physical entities carrying gnergy are called gnergons, and two examples of gnergons have been identified in biology: conformons, sequence-specific conformational strains of biopolymers,
Work partially supported by contribution of EU commission under The Fifth Framework Program, project "MolCoNet" IST-2001-32008
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 291–301, 2003. c Springer-Verlag Berlin Heidelberg 2003
and IDSs (intracellular dissipative structures), intracellular chemical and mechanical stress gradients and waves. Conformons and IDSs are utilized to formulate what appears to be the first coherent theoretical model of the living cell, known as the Bhopalator [8, 12]. Conformons (as better explained in Section 2) are visualized as a collection of a small number of catalytic residues of enzymes or segments of nucleic acids that are arranged in space and time with an appropriate force vector so as to cause chemical transformations or physical changes on a substrate or a bound ligand. Membrane computing is based on membrane systems (also called P systems), a new class of distributed and parallel computing devices introduced in [18]. In that paper the author considers systems based on a hierarchically arranged, finite cell-structure consisting of several cell-membranes embedded in a main membrane called the skin. The membranes delimit regions where objects, elements of a finite set, and evolution rules can be placed. The objects evolve according to given evolution rules associated with a region, and they may also move between regions. A computation starts from an initial configuration of the system, defined by a cell-structure with objects and evolution rules in each cell, and terminates (halts) when no further rule can be applied. It is possible to assign a result to a computation in two ways: (1) a multiset, considering the multiplicity of objects present in a specific (output) membrane in a halting configuration, or (2) a set of strings, composed of the strings over a specific alphabet sent out of the system. Combining the outputs of each possible computation, the behaviour of the system is obtained: a multiset-language (a set of vectors) or a string-language (a set of strings). In [18] the author examines three ways to view P systems: transition, rewriting and splicing P systems. Starting from these, several variants were considered (see for instance [2, 14, 15, 19, 21]).
Each of these variants has been shown to generate recursively enumerable sets of vectors of natural numbers. The latest information about P systems can be found at the URL http://psystems.disco.unimib.it/. Conformons-P systems, the variant of P systems presented in Section 3, consider as objects conformons: ordered pairs of name and value. This way of considering objects is substantially different from the others described in the literature. Until now an object has been considered either as a simple entity without internal structure or as a string, that is, an entity with a well defined structure. Objects considered in our research may be placed in between these two categories, as the only structure related to the name is the value associated with it. The main result of this paper, presented in Section 4, states that conformons-P systems with priorities generate precisely the family of recursively enumerable sets of natural numbers. In Section 5 we outline possible variants of conformons-P systems with priorities. Moreover, in that section we delineate how this model may be used to describe and study the massive parallelism so fundamental in biomolecular computing, and to which fields the computation based on conformons might be extended.
Due to lack of space we omit here the proofs of our results. The interested reader may refer to [4].
2 Conformons in Molecular Biology
The developments, during the second half of the last century, of automata theory, neural nets, genetic programming, evolutionary computation, and most recently DNA-based molecular computing all attest to the fecundity of the interaction between computer science and biology. The computer science-biology interactions are a two-way street: not only can biology serve as a rich source of ideas and inspirations for computer scientists aspiring to discover and design novel computational frameworks, but biology may also absolutely require computer science to formalize and test its theories in order to solve the mysteries of life on the molecular and cellular levels. The present contribution is primarily concerned with the former aspect of the computer science-biology interactions; the latter aspect has been dealt with elsewhere by one of us [8, 9, 10, 12]. One of the basic concepts to develop in molecular biology during the past three decades is the notion of conformons, defined as sequence-specific mechanical strains embedded in biopolymers, such as DNA supercoils and protein conformational deformations, that provide both the free energy and information needed for biopolymers to drive molecular processes essential for life [7, 11]. The free energy content of conformons has been estimated to be in the range of 5 ∼ 10 Kcal/mole in proteins and 500 ∼ 2,000 Kcal/mole in DNA, while the information content per conformon has been calculated to be in the range of 20 ∼ 40 bits (note that 20 ∼ 200 bits as reported in [10] is an error) in proteins and 200 ∼ 600 bits in DNA [9, 11]. Conformons and conformon-like entities invoked in the biological literature during the past three decades have been classified into 10 families based on their biological functions, including the origination of life, thermal fluctuations, substrate and product bindings, formation of the transition-state complex, free energy transfer, DNA replication, timing in proteins, and timing in DNA [11].
Given such a multiplicity of conformon families, each with a large number (from 10^3 to 10^6?) of members, it is possible, at least in principle, to account for all living processes in the cell in molecular terms. This has led to the postulates (1) that the conformons active in and utilized by living cells are finite in number and (2) that conformons are quanta of biological actions, akin to quanta of action in quantum mechanics [11]. Another fundamental feature of the living cell, postulated to be the smallest molecular computing system in nature [10], is the biological membrane, consisting of a phospholipid bilayer about 50 angstroms (i.e., 50 × 10^-8 cm) in thickness with many different kinds of proteins, either attached to its surface or deeply embedded in it. The basic function of biomembranes is to divide the Euclidean space into multiple compartments, to be referred to as membrane-enclosed compartments, or simply as membranes when there is no danger of ambiguity. The principle of biological membranes began to be capitalized on in developing new
computing paradigms during the past several years, giving rise to the P system [18, 19, 20, 21].
3 Basic Definitions
As sketched in the Introduction, the basic ideas underlying the concept of conformon and the interaction between two conformons have inspired us to define a new computability model. What in biocybernetics is a pair of information and free energy is defined in this section, from a mathematical point of view, as an ordered pair name-value; the interaction between two conformons is modeled as the passage of all or part of the value from one pair to another. Let V be a finite alphabet and N the set of natural numbers. A conformon is an element of the relation name-value V × N_0 (where N_0 = N ∪ {0}), denoted by [X, x]. We will refer to x as the value of X and to X as the name of the conformon [X, x]. The symbol X will also indicate the conformon itself; the context will help the reader to understand when we refer only to the name aspect of the conformon or to the whole conformon. Moreover, let r = (A, e, B), with A, B ∈ V, e ∈ N, be a rule (also indicated as A →^e B) defining the passage of (part of) the value from one conformon to another, so that:

[A, a] [B, b] ⇒_r [A, a − e] [B, b + e]    (1)

with a, b ∈ N_0, a ≥ e, indicating that [A, a] and [B, b] interact according to r. Informally this means that e is subtracted from the value of the conformon (with name) A and e is added to the value of the conformon (with name) B only if the value of A is at least e.

A multiset (over V) is a function M : V → N_0 ∪ {+∞}; for a ∈ V, M(a) defines the multiplicity of a in the multiset M. We will indicate this also with (a, M(a)). In case the multiplicity of an element of a multiset is 1, we will indicate just the element. The support of a multiset M is the set supp(M) = {a ∈ V | M(a) > 0}. Informally, we will say that a symbol belongs to a multiset M if it belongs to the support of M. Let M1, M2 : V → N_0 ∪ {+∞} be two multisets. The union of M1 and M2 is the multiset M1 ∪ M2 : V → N_0 ∪ {+∞} defined by (M1 ∪ M2)(a) = M1(a) + M2(a), for all a ∈ V. The difference M1 \ M2 is here defined only when M2 is included in M1 (which means that M1(a) ≥ M2(a) for all a ∈ V) and it is the multiset M1 \ M2 : V → N_0 ∪ {+∞} given by (M1 \ M2)(a) = M1(a) − M2(a) for all a ∈ V.

A conformons-P system with priorities of degree m, m ≥ 1, is a construct Π = (V, µ, l, a, L1, . . . , Lm, R1, . . . , Rm), where V is an alphabet; µ = (N, E) is the directed labeled graph underlying Π. The set N of vertices is a subset of N; for simplicity we define N = {1, . . . , m}. Each vertex in N defines a membrane of the system Π. The set E ⊆ N × N × pred(N_0) defines directed labeled edges between vertices, indicated by (i, j, pred(n)), where for each n ∈ N_0 we consider
Conformons-P Systems
295
pred(n) = {≥ n, ≤ n, = n}, a set of predicates. For x ∈ N_0, p ∈ pred(n), p(x) may be (≥ n)(x) or (≤ n)(x) or (= n)(x) (only one of them), indicating x ≥ n, x ≤ n and x = n, respectively. The symbol l ∈ N defines the final membrane, while a ∈ N defines the acknowledgment membrane, which is initially empty. The multisets Li over V × N_0 ∪ {+∞}, i ∈ N, contain conformons; Ri, i ∈ N, are finite sets of rules. Two conformons present in a membrane i may interact according to a rule r present in the same membrane, so that the multiset of conformons Mi changes into Mi′. So, for i ∈ N, [A, a], [B, b] ∈ Mi and r = (A, e, B) ∈ Ri, A, B ∈ V, a, b, e ∈ N_0, we have what is indicated in (1), so that Mi′ = (Mi \ {[A, a], [B, b]}) ∪ {[A, a − e], [B, b + e]}. A conformon [X, x] present in a membrane i may pass to a membrane j if it cannot interact with any other conformon present in the same membrane, if (i, j, p) ∈ E and p(x) holds, changing the multisets of conformons Mi and Mj to Mi′ and Mj′ respectively. In this case Mi′ = Mi \ {[X, x]} and Mj′ = Mj ∪ {[X, x]}. The fact that the passage of an object to a membrane is regulated by some features present in the membranes has already been discussed by others in the literature, when membranes with electrical charges and variable thickness have been used [20] or when only communication was used to compute [22]. The application of a rule and the passage of a conformon from one membrane to another are the only operations that may be performed by a conformons-P system with priorities. A conformon present in a membrane may be involved in one of these two operations or in none of them. It is important to note that the interaction between conformons has priority over the passage of a conformon to another membrane. This, anyhow, does not mean that if a conformon may interact with another one or pass to another membrane it has to. So the feature "all the objects which can evolve should evolve", present in most of the other variants of P systems introduced until now, is not applied here.
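As an illustrative sketch (not from the paper), the interaction rule (1) can be coded directly; the names `apply_rule` and `membrane`, and the list-based representation of conformons, are our own assumptions.

```python
# A conformon is a mutable [name, value] pair; rule r = (A, e, B) moves e
# units of value from a conformon named A to one named B, provided the
# A-conformon holds at least e. This mirrors interaction (1) in the text.

def apply_rule(conformons, rule):
    """conformons: list of [name, value] pairs in one membrane;
    rule: (A, e, B). Returns True (and mutates the list) if applicable."""
    a_name, e, b_name = rule
    src = next((c for c in conformons if c[0] == a_name and c[1] >= e), None)
    dst = next((c for c in conformons if c[0] == b_name and c is not src), None)
    if src is None or dst is None:
        return False          # rule not applicable in this membrane
    src[1] -= e
    dst[1] += e               # total value is conserved
    return True

membrane = [["A", 5], ["B", 2]]
apply_rule(membrane, ("A", 3, "B"))   # [A,5][B,2] => [A,2][B,5]
```

Note that an application neither creates nor destroys value, which matches the observation made after Theorem 1 that the total value in the system stays constant.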
The presence of such a universal clock, common in digital computers but not in biological processes, is very powerful from a computational point of view, as it forces the system to a maximal parallelism. Not considering it does not limit the parallelism of conformons-P systems with priorities, as it is possible that an operation is performed whenever it can be performed. The possibility to perform one of the two allowed operations in a same membrane, or none of them, allows conformons-P systems with priorities to be nondeterministic. Nondeterminism may also arise from the configurations of a conformons-P system with priorities if in a membrane a conformon may interact with more than one conformon. A configuration of Π is an m-tuple (M1, . . . , Mm) of multisets over V × N_0 ∪ {+∞}. The m-tuple (L1, . . . , Lm) is denoted as the initial configuration. For two configurations (M1, . . . , Mm), (M1′, . . . , Mm′) of Π we write (M1, . . . , Mm) ⇒ (M1′, . . . , Mm′), indicating a transition from (M1, . . . , Mm) to (M1′, . . . , Mm′), that is, the parallel application of operations, or of no operation, in each membrane of µ. If no operation is applied to a multiset Mi, then Mi′ = Mi. The reflexive and transitive closure of ⇒ is indicated by ⇒∗.
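The priority of interaction over passage can be sketched as follows; the helper names `can_interact` and `may_pass` are ours, and the edge predicates are modeled as plain Python callables. This is a sketch of the semantics, not the authors' implementation.

```python
def can_interact(conf, rules, others):
    """True if conformon conf (a [name, value] pair) could take part in
    some rule (A, e, B) together with a partner among `others`."""
    for (a, e, b) in rules:
        if conf[0] == a and conf[1] >= e and any(o[0] == b for o in others):
            return True
        if conf[0] == b and any(o[0] == a and o[1] >= e for o in others):
            return True
    return False

def may_pass(conf, rules, others, out_edges):
    """out_edges: list of (target, predicate). Returns the membranes conf
    may pass to; empty if interaction has priority or no predicate holds."""
    if can_interact(conf, rules, others):
        return []                     # interaction has priority over passage
    return [j for (j, p) in out_edges if p(conf[1])]

rules = [("A", 3, "B")]
# [A,5] has a partner [B,0]: interaction has priority, so it may not pass.
print(may_pass(["A", 5], rules, [["B", 0]], [(2, lambda v: v >= 1)]))  # []
# [A,5] alone cannot interact, so the edge predicate decides.
print(may_pass(["A", 5], rules, [], [(2, lambda v: v >= 1)]))          # [2]
```

Since neither operation is obligatory, a full simulator would choose nondeterministically among the applicable interactions and passages, or do nothing.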
A computation is a finite sequence of transitions between configurations of a system Π starting from (L1, . . . , Lm). Initially La = ∅. The result of a computation is given by the multisets of conformons present in membrane l and having elements of T as names when a (generic) conformon is present in membrane a. When this happens the computation halts, that is, no other operation is performed. This feature is new in the area of membrane computing: it provides an alternative to the way of defining successful computations as halting computations. When this happens the multisets of all such conformons present in membrane l define the language generated by Π, indicated by L(Π). Formally:

L(Π) = {supp(Ml′) | (L1, . . . , Lm) ⇒∗ (M1, . . . , Mm) ⇒∗ (M1′, . . . , Mm′), supp(Ma) = ∅, supp(Ma′) ≠ ∅}.

In Section 4 we sketch the proof that conformons-P systems with priorities are computationally complete. In the proof of this result we need the notion of program machines. Non-rewriting Turing machines were introduced by M. L. Minsky in [16] and then reconsidered in [17] under the name of program machines. After their introduction such machines and some variants of them have been studied under different names: in [5] they were called (multi)counter machines, in [1] multipushdown machines, in [13] register machines and in [6] counter automata. Such devices have counters (also called registers), each of infinite capacity, recording a natural number or zero. Simple operations can be performed on the counters: addition of one unit and conditional subtraction of one unit. After each of these operations the machine may change state. The main difference between the original model and some of the subsequent variants indicated above is that the latter may have a read-only tape where the input is recorded. In the model introduced by M. L. Minsky and considered by us such a tape is not present and the input is recorded as a number in one of the counters of the machine.
Formally, a program machine with n counters (n ∈ N) is defined as M = (S, R, s0, f), where S is a finite set of states, s0, f ∈ S are respectively called the initial and final states, and R ⊆ {(s, op(i), a, b) | s, a, b ∈ S, s ≠ f, op(i) ∈ {i+, i−}, 1 ≤ i ≤ n} is the set of instructions of the following form:

– (s, i−, a, b): in state s, if the contents of counter i is greater than 0, then subtract 1 from that counter and change state into a, otherwise change state into b;
– (s, i+, a, a): in state s, add 1 to counter i and change state into a.

A configuration of a program machine M with n counters is given by the (n+1)-tuple (s, x1, . . . , xn), s ∈ S, xi ∈ N_0. Given two configurations (s, x1, . . . , xn), (t, y1, . . . , yn) we define a computational step as (s, x1, . . . , xn) ⊢ (t, y1, . . . , yn) if (s, op(i), a, b) ∈ R and:

– if op(i) = i− and xi ≠ 0, then t = a, yi = xi − 1, yj = xj, j ≠ i, 1 ≤ j ≤ n; if op(i) = i− and xi = 0, then t = b, yj = xj, 1 ≤ j ≤ n;
– if op(i) = i+, then t = a, yi = xi + 1, yj = xj, j ≠ i, 1 ≤ j ≤ n.
A computation is a finite sequence of transitions between configurations of a program machine M starting from the initial configuration (s0, x1, . . . , xn) with x1 ≥ 0, xj = 0, 2 ≤ j ≤ n. If the last of such configurations has f as its state, then we say that M accepts the number x1. The language accepted by M is defined as L(M) = {x1 ∈ N | M accepts x1}. For every program machine it is possible to create another one accepting the same language and having all counters empty in the final state.
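The instruction semantics above can be sketched as a small simulator. This is our own deterministic sketch (one instruction per state, names `run_program_machine` and `prog` invented for illustration), not the construction used in the proof.

```python
def run_program_machine(instructions, s0, f, counters, max_steps=10_000):
    """Deterministic simulation of a program (counter) machine.
    instructions: {s: (op, i, a, b)} with op in {"+", "-"}; counter i of
    the text is counters[i-1]. Returns (accepted, final counters)."""
    s, steps = s0, 0
    while s != f and steps < max_steps:
        op, i, a, b = instructions[s]
        if op == "+":                    # (s, i+, a, a): increment, go to a
            counters[i - 1] += 1
            s = a
        elif counters[i - 1] > 0:        # (s, i-, a, b): test-and-decrement
            counters[i - 1] -= 1
            s = a
        else:
            s = b
        steps += 1
    return s == f, counters

# Toy machine: drain counter 1 into counter 2, then halt in state f.
prog = {"s0": ("-", 1, "s1", "f"), "s1": ("+", 2, "s0", "s0")}
print(run_program_machine(prog, "s0", "f", [3, 0]))  # (True, [0, 3])
```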
4 The System
In this section we discuss how a conformons-P system with priorities may generate all recursively enumerable (RE) sets of natural numbers, but first we introduce the notion of module. A module is a group of membranes in a conformons-P system with priorities able to perform a specific task. In the figures representing conformons-P systems with priorities in this paper, modules are depicted as unique vertices with a thicker line. Such modules will be elements of the set W in the graph underlying a conformons-P system with priorities. Each element of W will have a label indicating the kind of module. A subscript is added to differentiate labels referring to the same kind of module.

Lemma 1. (Splitter) There exists a module that, when a conformon [Z, z] with z ∈ {z1, . . . , zs}, zi < zi+1, 1 ≤ i ≤ s − 1, is present in a specific membrane of it, may pass such a conformon to other specific membranes according to its value z.

Sketch of the proof. No conformon and no rule is present in the initial configuration of this module. If a conformon passes to a specific membrane d, a subsequent filtering on decreasing levels of values is performed via several membranes. If in a membrane the value of a conformon is smaller than a certain quantity, then the conformon may pass to the next membrane; otherwise it leaves the module. 2

The number of membranes present in a splitter is equal to the number of edges outgoing from this module. The label for a splitter is spl. Considering that in [16] it is proved that a program machine with 2 counters may generate all RE sets of natural numbers, we can reach our aim by simulating such a machine.

Theorem 1. The class of numbers generated by conformons-P systems with priorities coincides with the one generated by program machines.

Sketch of the proof. For each counter of the simulated program machine there are infinitely many occurrences of a specific conformon in a membrane q of the conformons-P system with priorities.
Every time that a unit is added to a counter an occurrence of the related conformon passes from membrane q to the final one; conversely
for the inverse operation. Initially the final membrane contains the occurrences of the conformons related to the initial configuration of the simulated program machine. Considering that the passage of conformons between membranes is determined only by the value present in a conformon, the value of specific conformons may be increased so that the conformons may move to different membranes through splitters in order to perform different tasks. The addition or the conditional subtraction of one unit to a counter is simulated via other conformons related to the states of the simulated program machine. Only one of such conformons at a time may move through the system, so as to perform the operation associated to it. When the passage of the program machine to a final state is simulated, a conformon passes to the acknowledgment membrane, terminating the simulation. 2

Figure 1 represents a conformons-P system with priorities with 21 membranes simulating a program machine with one counter c. The simulation of program machines with more counters would not change the underlying graph of the conformons-P system with priorities, but only its sets Li and Ri. In this figure, conformons present in a membrane in the initial configuration are written in bold, while the others indicate in which membrane a conformon can be during the computation. What in Theorem 1 is indicated as membrane q is, in the picture, membrane 3; membrane 2 represents the acknowledgment membrane, while membrane 4 is the final one. In this system one interaction at a time may occur in a membrane, and the nondeterminism of the system is limited only to membrane 1. It is interesting to note that the just described conformons-P system with priorities increases or decreases the value of conformons, but the total amount of all values of the conformons present in it is always constant.
We consider an infinite supply of conformons with value 0 in membrane 3 and the result of the computation is given by the conformons with value 0 present in membrane 4 (see details in [4] and Figure 1).
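The filtering cascade of the splitter module (Lemma 1) can be sketched as follows; the function name `splitter` and the threshold representation are our own assumptions about how the cascade of value predicates routes a conformon.

```python
def splitter(value, thresholds):
    """Route a conformon by its value. thresholds: (bound, target) pairs
    sorted by decreasing bound; the conformon leaves toward the target of
    the first bound its value reaches, mimicking Lemma 1's cascade of
    '>= n' filters on decreasing levels of values."""
    for bound, target in thresholds:
        if value >= bound:
            return target
    return None        # value matches no filter: the conformon stays

routes = [(9, "m9"), (4, "m4"), (0, "m0")]
print([splitter(v, routes) for v in (10, 7, 0)])  # ['m9', 'm4', 'm0']
```

With thresholds chosen at the distinct values z1 < ... < zs of Lemma 1, each conformon [Z, z] is delivered to the membrane associated with its value.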
5 Final Remarks
It is possible to modify the definition of conformons-P systems with priorities given in Section 3, removing or adding features in order to obtain variants. The simplest variant we were able to imagine is the one in which there is no priority between the interaction and the passage of conformons to other membranes. We can also imagine that both the value and the number of conformons present in a system are finite, or that a conformon cannot accumulate more than a fixed value defined by a membrane or by the total system. Moreover, an operation, that is the interaction between two conformons or the passage of a conformon from one membrane to another, may subtract a finite amount from the value of the conformons involved in the operation. By the same token, we can also imagine that the value of conformons may increase upon entering membranes.
[Figure: the diagram originally shown here depicts the 21-membrane conformons-P system with priorities of Theorem 1, built from splitter modules spl1, spl2 and spl3, with membrane 3 holding an infinite supply of conformons ([c, 0], +∞), membrane 2 as the acknowledgment membrane and membrane 4 as the final membrane.]

Fig. 1. The conformons-P system with priorities related to Theorem 1
It is also possible to consider restrictions to the graph underlying a conformons-P system, limiting it, for instance, to a tree, or to see how the power of a conformons-P system changes if maximal parallelism is present. These variants are currently under investigation. Also under investigation is the possibility to model the parallelism present in conformons-P systems with Petri nets. This kind of nets were introduced by C. A. Petri in his seminal PhD thesis in 1964 to model concurrent and distributed systems. The parallelism, basic in the theoretical facet of biomolecular computing, has not yet been studied and formalized.

Fig. 2. A Petri net representing the interaction of two conformons [A, a] and [B, b], producing [A, a − e] and [B, b + e]
The interaction between two conformons defined in (1) on page 294 may be represented by the Petri net presented in Figure 2, where we used the notation of the first chapter of [23]. The concept of the conformon and the filtering process performed by membranes may be interpreted in a different way from the one described in this paper. The value associated with a conformon may also be seen as its electrical charge, mass, size, momentum, spin, speed or frequency. The interaction between conformons would allow the passage of one or more of such features from one conformon to another, and membranes might allow the passage of conformons depending on one or more such parameters. Moreover, a conformon may be seen not only as associated with a biopolymer, but also as a generic molecule or a particle akin to a photon or an electron.
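The Petri-net view of interaction (1) can be sketched with a minimal transition-firing function; the encoding of conformons as places and the names `fire`, `m`, `t` are our own illustrative assumptions, not the notation of [23].

```python
def fire(marking, transition):
    """One Petri-net transition firing. transition = (inputs, outputs),
    each a dict place -> tokens. Returns the new marking, or None if the
    transition is not enabled (some input place lacks tokens)."""
    ins, outs = transition
    if any(marking.get(p, 0) < n for p, n in ins.items()):
        return None
    new = dict(marking)
    for p, n in ins.items():
        new[p] -= n
    for p, n in outs.items():
        new[p] = new.get(p, 0) + n
    return new

# Interaction (1) with e = 3: consume [A,5] and [B,2], produce [A,2], [B,5].
m = {"[A,5]": 1, "[B,2]": 1}
t = ({"[A,5]": 1, "[B,2]": 1}, {"[A,2]": 1, "[B,5]": 1})
print(fire(m, t))  # {'[A,5]': 0, '[B,2]': 0, '[A,2]': 1, '[B,5]': 1}
```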
References
[1] B. S. Baker and R. V. Book. Reversal-bounded multipushdown machines. Journal of Computer and System Sciences, 8:315–332, 1974.
[2] J. Castellanos, G. Păun, and A. Rodriguez-Paton. P systems with worm objects. In IEEE 7th Intern. Conf. on String Processing and Information Retrieval, SPIRE 2000, La Coruna, Spain, pages 64–74, 2000. See also CDMTCS TR 123, Univ. of Auckland, 2000 (www.cs.auckland.ac.nz/CDMTCS).
[3] P. Frisco and S. Ji. About Conformons-P systems, 2002. (manuscript in preparation).
[4] P. Frisco and S. Ji. Conformons-P systems. In DNA Computing, 8th International Workshop on DNA-Based Computers, DNA 2002, Sapporo, Japan, 10-13 June 2002, pages 161–170. Hokkaido University, 2002.
[5] S. A. Greibach. Remarks on blind and partially blind one-way multicounter machines. Theoretical Computer Science, 7:311–324, 1978.
[6] H. J. Hoogeboom. Carriers and counters. P-systems with carriers vs. (blind) counter automata.
[7] S. Ji. A general theory of ATP synthesis and utilization. Ann. N.Y. Acad. Sci., 227:211–226, 1974.
[8] S. Ji. The Bhopalator: a molecular model of the living cell based on the concepts of conformons and dissipative structures. J. Theoret. Biol., 116:395–426, 1985.
[9] S. Ji. Biocybernetics: a machine theory of biology. In S. Ji, editor, Molecular Theories of Cell Life and Death, pages 1–237. Rutgers University Press, New Brunswick, 1991.
[10] S. Ji. The cell as the smallest DNA-based molecular computer. BioSystems, 52:123–133, 1999.
[11] S. Ji. Free energy and information contents of conformons in proteins and DNA. BioSystems, 54:107–214, 2000.
[12] S. Ji. The Bhopalator: an information/energy dual model of the living cell (II). Fundamenta Informaticae, 49(1-3), 2002.
[13] J. P. Jones and Y. V. Matijasevič. Register machine proof of the theorem on exponential Diophantine representation of enumerable sets. Journal of Symbolic Logic, 49(3):818–829, September 1984.
[14] M. Madhu and K. Krithivasan. P systems with active objects: Universality and efficiency. MCU, Chişinău, Moldova, 2001. Proceedings in LNCS 2055, 276–287.
[15] C. Martin-Vide and V. Mitrana. P systems with valuations. In I. Antoniou, C. S. Calude, and M. J. Dinneen, editors, Unconventional Models of Computation, UMC'2K, Solvay Institutes, Brussel, 13-16 December 2000, DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, pages 154–166. Springer-Verlag, Berlin, Heidelberg, New York, 2000.
[16] M. L. Minsky. Recursive unsolvability of Post's problem of "tag" and other topics in theory of Turing machines. Annals of Mathematics, 74(3):437–455, November 1961.
[17] M. L. Minsky. Computation: Finite and Infinite Machines. Prentice-Hall, 1967.
[18] G. Păun. Computing with membranes. Journal of Computer and System Sciences, 61(1):108–143, 2000. See also Turku Centre for Computer Science TUCS Report No. 208, 1998, http://www.tucs.fi.
[19] G. Păun. Computing with membranes. A variant: P systems with polarized membranes. Intern. J. of Foundations of Computer Science, 11(1):167–182, 2000.
[20] G. Păun. P systems with active membranes: attacking NP-complete problems. J. Automata, Languages and Combinatorics, 6(1):75–90, 2001.
[21] G. Păun and T. Yokomori. Membrane computing based on splicing. In E. Winfree and D. K. Gifford, editors, Proceedings 5th DIMACS Workshop on DNA Based Computers, held at the Massachusetts Institute of Technology, Cambridge, MA, USA, June 14-15, 1999, pages 217–232. American Mathematical Society, 1999.
[22] A. Păun, G. Păun, and G. Rozenberg. Computing by communication in networks of membranes. Submitted, 2001.
[23] W. Reisig and G. Rozenberg, editors. Lectures on Petri Nets I: Basic Models, volume 1491 of LNCS. Springer-Verlag, Berlin, Heidelberg, New York, 1998.
Parallel Rewriting P Systems with Deadlock Daniela Besozzi1 , Claudio Ferretti2 , Giancarlo Mauri2 , and Claudio Zandron2 1
Università degli Studi dell'Insubria Dipartimento di Scienze Chimiche, Fisiche e Matematiche Via Valleggio 11, 22100 Como, Italy [email protected] 2 Università degli Studi di Milano-Bicocca Dipartimento di Informatica, Sistemistica e Comunicazione Via Bicocca degli Arcimboldi 8, 20136 Milano, Italy ferretti/mauri/[email protected]
Abstract. We analyze P systems with different parallel methods for string rewriting. The notion of deadlock state is introduced when some rules with mixed target indications are simultaneously applied on a common string. The computational power of systems with and without deadlock is analyzed and a lower bound for the generative power is given, for some parallelism methods. Some open problems are also formulated.
1
Introduction
The P systems were introduced in [7] as a class of distributed parallel computing devices of a biochemical type. The basic model consists of a membrane structure composed by several cell-membranes, hierarchically embedded in a main membrane called the skin membrane. The membranes delimit regions and can contain objects, which evolve according to given evolution rules associated with the regions. Such rules are applied in the following way: in one step all regions are processed simultaneously by using the rules in a nondeterministic and maximally parallel manner, and at each step all the objects which can evolve should evolve. All the evolved objects are then communicated to the prescribed regions, which are always specified by a target indication associated with each rule. A computation device is obtained: we start from an initial configuration and we let the system evolve. A computation halts when no further rule can be applied. The objects expelled through the skin membrane (or collected inside a specified output membrane) are the result of the computation. (A survey and an up-to-date bibliography can be found at the web address http://bioinformatics.bio.disco.unimib.it/psystems.) In this paper we consider rewriting P systems ([7], [11], [3]) and our aim is to extend the application of evolution rules from sequential rewriting to the parallel one (see also [4], [5], [6]). This fact is also biological motivated, as a
Work partially supported by the contribution of the EU Commission under the Fifth Framework Programme, project "MolCoNet" IST-2001-32008.
M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 302–314, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Parallel Rewriting P Systems with Deadlock
cellular substance could be processed by many chemical reactions (each on a different site) at the same time. Observe that using parallel rewriting means that at each step of a computation a string has to be processed, if possible, by more than one rule in parallel, according to the prescribed parallel rewriting method. So in parallel rewriting P systems we have a three-level parallelism, involving membranes, objects, and rules as well.

On the other hand, if the rules we apply to the same string have mixed target indications, then we have consistency problems for the communication of the resulting string, as there are contradictory indications about the region where the string should be at the next step. This problem has been faced before and solved with different strategies ([4]), for example by counting the number of target indications of types here, in, out appearing after the parallel application of rules, and then communicating the string to the region corresponding to the maximal number of indications. Another possibility consists in choosing as target region the one which corresponds to the indication (if any) that appears exactly once after the parallel application of rules.

Here we use a different approach to the problem: we say that when rules with mixed target indications are applied at the same time to a common string, then we have a deadlock state inside the system (see, e.g., [10] for a definition of deadlock and the ways of dealing with it in the field of concurrent systems). When a situation of deadlock arises for a string, the string is not sent to outer or inner regions but remains inside the current membrane, though it will not be processed anymore by any other rule. Hence the deadlock state for that string causes its further processing and communication to be stopped.
In this paper we do not consider any biological counterpart of deadlock; we only propose a theoretical analysis of parallel rewriting and of its consequences. We begin here the analysis of P systems which use a few different parallel rewriting methods and whose configurations may or may not present deadlock states. In particular, we compare parallel P systems (with and without deadlock) to (1) Lindenmayer systems and (2) other parallel P systems, in order to determine whether the possibility of having deadlock states modifies the generative power. Several open problems remain to be studied, associated with some types of parallel rewriting P systems not yet analyzed.
2 Formal Language Notions: Parallel Rewriting Methods
We denote by V∗ the free monoid generated by the alphabet V under the operation of concatenation. The empty string is denoted by λ, V+ = V∗ − {λ} is the set of non-empty strings over V, |w| represents the length of a string w ∈ V+, and #a(w) is the number of occurrences of a symbol a ∈ V in the string w. We refer to [1] and [8] for other formal language theory notions.
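For concreteness, these notational conventions can be read off a small example (an illustrative Python sketch; the alphabet and string are hypothetical and not taken from the paper):

```python
V = {"a", "b", "c"}     # the alphabet V
w = "abcab"             # a non-empty string, w in V+

length = len(w)         # |w|, the length of w
occ_a = w.count("a")    # #_a(w), the number of occurrences of a in w

print(length, occ_a)    # -> 5 2
```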
Daniela Besozzi et al.
In this section we present some kinds of parallel rewriting methods for context-free rules. We assume that no more than one rule is allowed to rewrite a symbol at the same time, as in interactionless Lindenmayer systems ([8], [9]). In this paper we will use the following methods of parallel rewriting:

(M) With a maximal parallelism rewriting step, all occurrences of all symbols (which can be the subject of a rewriting rule) are simultaneously rewritten by rules which are nondeterministically chosen from the set of all applicable rewriting rules. That is, if the string w = x1 a1 x2 a2 x3 a3 . . . xn an xn+1, with w ∈ V+, a1, . . . , an ∈ V (not necessarily distinct) and xi ∈ (V \ {a1, . . . , an})∗ for all i = 1, . . . , n + 1, is such that there are no rules defined over the symbols in the strings x1, . . . , xn+1, and there are some rules r1 : a1 → α1, . . . , rn : an → αn, not necessarily distinct, then in one maximal parallel rewriting step we obtain the string w′ = x1 α1 x2 α2 x3 α3 . . . xn αn xn+1.

(T) As in (E)T0L systems, we can consider the set of rewriting rules divided into subsets of rules, that is, tables of rules. In this case, if we have a string w and some tables t1 : [r1,1 : a1 → α1,1, . . . , r1,k : ak → α1,k], . . . , tl : [rl,1 : a1 → αl,1, . . . , rl,k : ak → αl,k], then in one step only the rules from a single table (which is nondeterministically chosen) can be applied, and these rules must be applied in parallel to all occurrences of all symbols in w, but not necessarily following the order in which the rules appear in the table. Moreover, if some rules in the chosen table are defined over symbols not in w, or if the number of rules in the table exceeds the length of the string, then we skip those (undefined or exceeding) rules without forbidding the application of the entire table.
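As an illustration, one step of each method can be sketched in a few lines of Python (a simplified hypothetical model, not part of the paper: `rules` maps each symbol to the right-hand sides of the context-free rules defined over it, and a table is simply such a mapping):

```python
import random

def maximal_parallel_step(w, rules):
    """(M): every occurrence of every symbol that is the subject of
    some rule is rewritten by a nondeterministically chosen applicable
    rule; symbols with no rule defined over them are copied unchanged."""
    return "".join(random.choice(rules[s]) if s in rules else s for s in w)

def table_parallel_step(w, tables):
    """(T): one table is nondeterministically chosen and its rules are
    applied in parallel to all occurrences of all symbols of w; rules
    defined over symbols not occurring in w are simply skipped."""
    table = random.choice(tables)
    return "".join(random.choice(table[s]) if s in table else s for s in w)

# Deterministic example (one rule per symbol, one table):
rules = {"a": ["bb"], "b": ["c"]}
print(maximal_parallel_step("aba", rules))    # -> bbcbb
print(table_parallel_step("aba", [rules]))    # -> bbcbb
```

With several right-hand sides per symbol, or several tables, `random.choice` models the nondeterminism of the rewriting step.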
3 Parallel Rewriting P Systems: Introducing Deadlock
A membrane structure µ is a construct consisting of several membranes hierarchically embedded in a unique membrane, called the skin membrane. We identify a membrane structure with a string of correctly matching square parentheses, placed in a unique pair of matching parentheses; each pair of matching parentheses corresponds to a membrane. We can also associate a tree with the structure in a natural way; the height of the tree is the depth of the structure itself. Each membrane identifies a region, delimited by it and by the membranes (if any) immediately inside it. A membrane is said to be elementary if it does not have any other membrane inside. If we place in the regions multisets of objects from a specified finite set V, we get a super-cell. A super-cell system (or P system) is a super-cell provided with evolution rules for its objects. We will work with string-objects, so with every region i = 0, 1, . . . , n of µ we associate a multiset of finite support over V, that is, a map Mi : V∗ → N, where Mi = {(x1, Mi(x1)), . . . , (xp, Mi(xp))} for some xk ∈ V+ such that Mi(xk) > 0 for all k = 1, . . . , p. A parallel rewriting P system of degree n + 1 is defined by the construct

Π = (V, T, µ, M0, . . . , Mn, R0, . . . , Rn)
where:

1. V is the alphabet of the system;
2. T ⊆ V is the terminal alphabet;
3. µ is a membrane structure with n + 1 membranes, which are injectively labeled by numbers in the set {0, 1, . . . , n};
4. M0, . . . , Mn are multisets over V, representing the strings initially present in the regions 0, 1, . . . , n of the system;
5. R0, . . . , Rn are finite sets of evolution rules of the form a → (α, tar), with a ∈ V, α ∈ V∗, tar ∈ {here, out, in}, associated with the regions of µ. (The indication here is often omitted.)

The application of evolution rules is performed as follows: in one step all regions are processed simultaneously by using the rules in a nondeterministic and maximally parallel manner. This means that in each region the objects to evolve and the rules to be applied to them are nondeterministically chosen, but all objects which can evolve should evolve. Moreover, at each step of a computation a string has to be processed, if possible, by more than one rule in parallel, according to the chosen parallelism method. The strings resulting from the parallel application of the rules must be communicated to the prescribed region, which is always specified by the target indication associated with each rule.

When we apply two or more rules to the same string, we have to check that their target indications match before communicating the resulting string to the right region. To this aim, for every region i = 0, . . . , n of the membrane structure we divide the set Ri of evolution rules into mutually disjoint subsets of rules which have the same target indications, that is, Ri = Hi ∪ Oi ∪ Ii, where Hi = {r ∈ Ri | tar(r) = here}, Oi = {r ∈ Ri | tar(r) = out} and Ii = {r ∈ Ri | tar(r) = in}. Observe that for every elementary membrane the set Ii will always be empty, and that for any other region any subset of rules could be empty as well. Consider now some rules r1, . . . , rm, for some m ≥ 2, which can be applied to a common string w at the same time.
If it holds that (1) r1, . . . , rm ∈ Hi, or (2) r1, . . . , rm ∈ Ii, or (3) r1, . . . , rm ∈ Oi, that is, we apply in parallel only rules which agree on the target membrane, then the resulting string w′ (obtained after the parallel application of r1, . . . , rm) (1) remains inside the current region i, (2) is communicated to a (nondeterministically chosen) inner region, or (3) is communicated to the outer region, respectively. In particular, if the string exits the system, it can never come back, and it may contribute to the output of the system. Otherwise, if the set of rules R = {r1, . . . , rm} which we want to apply has mixed target indications, that is, for instance, R ∩ Hi ≠ ∅ ∧ R ∩ Ii ≠ ∅, then we have consistency problems for the communication of the resulting string, as there are contradictory indications about the region where the string should be at the next step. This problem has been faced before and solved with different strategies ([4]); here we choose a different approach and we say that when rules with mixed target indications are applied at the same time to a common string,
then we have a deadlock state inside the system. The string is not sent to outer or inner regions but remains inside the current membrane, though it will not be processed anymore by any other rule; this choice does not mean that the indication here determines the target region, it means that further string processing and communication are stopped when a situation of deadlock arises for that string. In particular, we may choose to consider a deadlock state only for that string, or to define the deadlock state for the entire membrane where such a string is placed. In other words, the membrane could be seen and used in two distinct ways:

(D1) other strings can enter the membrane and be processed by local rules, but they can never exit the region, even if they are not in a deadlock state;
(D2) other strings can enter the membrane, be processed by local rules and even exit the region (if they are not in a deadlock state after the application of local rules).

In the first case, (D1), a consistency problem for target matching on a single string causes the system to lose an entire computing unit, as no strings are allowed to exit that membrane anymore. This interpretation differs, for example, from the dissolving action δ of membranes (see [7]): after dissolving a membrane we lose the membrane but we recover its objects in the outer region, while for a deadlocked membrane both the membrane and its objects are lost. In the second case, (D2), the membrane acts like a filter for wrong or right strings, that is, for strings with or without deadlock, stopping the wrong ones and letting the right ones proceed. A wrong string could be seen as an error taking place during the computation of the P system, and hence it must be stopped.

At a given time, the membrane structure together with all multisets of objects associated with the regions represents the configuration of the system at that time. The (n + 2)-tuple C0 = (µ, M0, . . .
, Mn) constitutes the initial configuration of the system. For two configurations Ct = (µ, M0t, . . . , Mnt) and Ct+1 = (µ, M0t+1, . . . , Mnt+1) of the system we say that there is a transition from Ct to Ct+1 if we can apply the rules present in the regions according to the above prescriptions. We say that a generic configuration Ct = (µ, M0t, . . . , Mnt) is free if there are no deadlock states inside the system at that time. Otherwise, we say that the system is in a deadlock configuration, and we denote by M̄jt the multisets which contain at least one deadlock string. A transition starting from a deadlock configuration will always reach another deadlock configuration, that is, we do not consider the possibility of removing deadlock states. So if Ct = (µ, M0t, . . . , M̄jt, . . . , Mnt) is a deadlock configuration, then for all t′ ≥ t it holds that Ct′ = (µ, M0t′, . . . , M̄jt′, . . . , Mnt′), where other multisets besides M̄jt′ could have reached a deadlock state. We remark that the multiset M̄jt′ still represents a deadlock state even though it is not necessarily equal to M̄jt, because other strings may have entered membrane j.
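The target-consistency check underlying these definitions can be sketched as follows (a hypothetical Python model; each applied rule is abstracted as a (symbol, target) pair with target in {here, in, out}):

```python
def communicate(applied_rules):
    """Decide the fate of a string after the parallel application of
    applied_rules, given as (symbol, target) pairs.  If all targets
    agree, the string is communicated accordingly; mixed targets
    freeze the string in a deadlock state."""
    targets = {target for _, target in applied_rules}
    if len(targets) == 1:
        return targets.pop()    # consistent target indications
    return "deadlock"           # mixed targets: string stops here

print(communicate([("a", "in"), ("b", "in")]))      # -> in
print(communicate([("a", "here"), ("b", "out")]))   # -> deadlock
```

In the deadlock case the string stays in its membrane and is never rewritten again, matching the (D1)/(D2) interpretations above.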
A configuration where all multisets are in a deadlock state is said to be a global deadlock configuration; otherwise we talk about local deadlock configurations. A sequence of transitions between free and (local) deadlock configurations forms a computation. We say that a computation is halting when no rule can be further applied in the current configuration. If we interpret deadlock as in (D1), then a global deadlock configuration always causes the computation to halt. Observe that if the P system is processing a single string and there are no rules which can increase the number of strings, then in both cases (D1) and (D2) even a local deadlock configuration causes the computation to halt. A computation is said to be non-halting if there is at least one rule which can be applied forever.

In this paper we consider extended P systems. The output of an extended system is the set of strings over T (if any) sent out of the system during a computation which eventually halts. A string which exits the system but contains symbols not in T does not contribute to the generated language. Non-halting computations provide no output. We denote by EParRP_n^k(π, ∆) the family of languages generated by extended parallel rewriting P systems of degree n and depth k, where π ∈ {(M), (T)} denotes the parallelism method used and ∆ ∈ {D, nD} denotes systems with or without the possibility of deadlock states, respectively. Similarly, we use the notation EParRP(π, ∆) for systems whose depth or degree is not specified.
4 Analysis of Deadlock
Let us now consider a generic string w ∈ V+ which, at a given time during the computation, is inside membrane i, for some i = 0, . . . , n. We want to analyze under which circumstances membrane i can be considered a safe region ([2]) for the string w; that is, given the pair (w, Ri), we want to determine whether there is no possibility for w to be in a deadlock state after the parallel application of some rules from Ri. Otherwise, we say that membrane i is an unsafe region for the string w. To this aim, suppose that the set Ri contains m evolution rules which are defined over a set of distinct symbols a1, . . . , al ∈ V, with l ≤ m and with the condition that, for all j = 1, . . . , l, there exists at least one rule in Ri which is defined over aj. Moreover, suppose that #aj(w) ≥ 0 for all j = 1, . . . , l and that |w| ≥ 2 (so that parallel rewriting is possible). Given the above assumptions, consider the following conditions:

($) there exists at least one pair of symbols aj1, aj2 such that #aj1(w) > 0 and #aj2(w) > 0 (for some j1, j2 ∈ {1, . . . , l}, j1 ≠ j2), and there exists at least one pair of rules rj1, rj2 ∈ Ri, defined over aj1, aj2 respectively, such that tar(rj1) ≠ tar(rj2);

($$) there exists at least one symbol aj such that #aj(w) > 1 (for some j ∈ {1, . . . , l}), and there exists at least one pair of rules rj1, rj2 ∈ Ri, both defined over aj, such that tar(rj1) ≠ tar(rj2).
Depending on the parallel method we decide to use, the conditions ($), ($$) define the only possibilities for membrane i to be an unsafe region for w. Observe that one condition does not exclude the other; both could hold at the same time in the same membrane. For maximal and table parallelism, any membrane i turns out to be unsafe for a string w if condition ($) or ($$) holds; for (T)-parallelism, obviously, this is true only if the rules in conditions ($), ($$) belong to the same table.
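Conditions ($) and ($$) translate directly into a predicate on the pair (w, Ri) (a hypothetical Python sketch; Ri is abstracted as a list of (symbol, target) pairs, and tables are ignored, so this corresponds to the maximal-parallelism case):

```python
def is_unsafe(w, rules):
    """True iff membrane i is an unsafe region for w: some parallel
    application of rules from Ri can put w into a deadlock state."""
    # for each symbol occurring in w, collect the targets of the
    # rules of Ri defined over it
    targets = {}
    for symbol, target in rules:
        if symbol in w:
            targets.setdefault(symbol, set()).add(target)
    # ($$): a symbol occurring more than once whose rules have
    # at least two different targets
    if any(w.count(s) > 1 and len(ts) > 1 for s, ts in targets.items()):
        return True
    # ($): two distinct occurring symbols rewritable by rules with
    # non-matching targets
    symbols = list(targets)
    return any(len(targets[s1] | targets[s2]) > 1
               for i, s1 in enumerate(symbols)
               for s2 in symbols[i + 1:])

print(is_unsafe("ab", [("a", "here"), ("b", "in")]))   # -> True, by ($)
print(is_unsafe("aa", [("a", "here"), ("a", "out")]))  # -> True, by ($$)
print(is_unsafe("ab", [("a", "in"), ("b", "in")]))     # -> False
```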
5 The Computational Power
In this section we begin the analysis of the computational power of parallel P systems which use maximal or table parallelism. These systems are compared to extended Lindenmayer systems (with tables). Then we show that, when using maximal parallelism, the possibility of having deadlock states does not modify the generative power with respect to the corresponding P systems without deadlock.

5.1 Relations with L Systems
In this section we need the notion of an ET0L system (see [8], [9]), which is a construct G = (V, T, w, P1, . . . , Pm), m ≥ 1, where V is an alphabet, T ⊆ V, w ∈ V+, and Pi, 1 ≤ i ≤ m, are finite sets (tables) of context-free rules over V such that for each A ∈ V there is at least one rule A → x in each set Pi (we say that these tables are complete). In a derivation step, all the symbols present in the current sentential form are rewritten using one table. The language generated by G, denoted by L(G), consists of all strings over T which can be generated in this way, starting from w.

We show that the family of languages generated by ET0L systems is included in the family of languages generated by parallel rewriting P systems with deadlock when using maximal parallelism, and with and without deadlock when using table parallelism.

Theorem 1. ET0L ⊆ EParRP_3^2((M), D).

Proof. According to Theorem 1.3 in [9], for each language L ∈ ET0L there exists an ET0L system G which generates L and contains only two tables, that is, G = (V, T, w, P1, P2). At the first step of a derivation, we use table P1. After using table P1, we either use table P1 again or we use table P2, and after each use of table P2 we either use table P1 or we stop the derivation. Making use of this observation, we construct the P system Π = (V′, T, [0 [1 ]1 [2 ]2 ]0, M0, ∅, ∅, R0, R1, R2), such that L(Π) ∈ EParRP_3^2((M), D), where the alphabet is V′ = V ∪ T ∪ {X, X1, X2, X3, X4, †}, with {X, X1, X2, X3, X4, †} ∩ V = ∅, and the initial multiset is M0 = Xw, with w ∈ V+. The system contains the following sets of rules:
R0 = {X → X1(in), X2 → X1(in), X2 → X3(in), X4 → X1(in), X4 → λ(out)};
R1 = {A → xX2(out) | A → x ∈ P1} ∪ {X1 → λ(out), X3 → †(out)};
R2 = {A → xX4(out) | A → x ∈ P2} ∪ {X3 → λ(out), X1 → †(out)}.

Each table Pi, i = 1, 2, is simulated in membrane i, respectively. The computation starts in the skin membrane. Using the rule X → X1(in), the string Xw can be sent to any of the inner membranes. If it enters membrane 2, then the trap symbol † is introduced and no output will be produced. In this way, we correctly simulate the fact that a derivation in G must always begin with the productions from table P1. Otherwise, if the string enters membrane 1, we simulate the application of table P1: the symbol X1 is erased and some occurrences of the new symbol X2 are introduced. The string then returns to membrane 0, where both rules X2 → X1(in) and X2 → X3(in) can be applied. If both rules are simultaneously applied, then in any case the trap symbol will be introduced in membrane 1 or 2, by means of the rule X3 → †(out) or X1 → †(out) respectively. Again, no output will be produced. Instead, if only one of the two rules is applied in the skin membrane to all occurrences of X2, then the string will enter membrane 1 or 2 and the computation can proceed. In membrane 1 we simulate table P1 once more, and the procedure can be iterated in the same way. In membrane 2 we simulate table P2; in this case we erase the symbol X3 and introduce some occurrences of the new symbol X4, and then the string returns to membrane 0. At this moment, we can either choose to stop the simulation of G by using the rule X4 → λ(out) (which deletes all occurrences of X4), or to start another simulation of table P1 by using the rule X4 → X1(in) (which rewrites all occurrences of X4 into X1). In any case, we correctly simulate the fact that after using table P2 we cannot use it again immediately.
If X4 → λ(out) is used, the string exits the system: if it is a terminal string, then it will be accepted, otherwise it will not contribute to the generated language. Observe that if both rules over X4 are simultaneously applied in the skin membrane, then we have a deadlock state and no output will be produced. It follows that L(Π) = L(G).

Actually, ET0L is strictly included in the family of languages generated by parallel rewriting P systems with deadlock which use the maximal parallelism method:

Corollary 1. ET0L ⊂ EParRP_4^4((M), D).

Proof. The inclusion ET0L ⊆ EParRP_4^4((M), D) follows from Theorem 1. It is possible to show that the language L = {(ab^n)^m | m ≥ n ≥ 1}, which does not belong to ET0L, can be generated by a P system of type EParRP_4^4((M), D).

The family of languages ET0L is also included in the class of languages generated by parallel rewriting P systems which use table parallelism.

Theorem 2. ET0L ⊆ EParRP_1^1((T), nD) ⊆ EParRP_1^1((T), D).
Proof. The second inclusion follows from the definitions. To show the first inclusion, consider again an ET0L system G = (V, T, w, P1, P2) in the normal form described above. We construct the system Π = (V′, T, [0 ]0, M0, R0), such that L(Π) ∈ EParRP_1^1((T), nD), where the alphabet is V′ = V ∪ T ∪ {X, X′, †}, with {X, X′, †} ∩ V = ∅, and the initial multiset is M0 = Xw, with w ∈ V+. The set R0 contains the following tables of rules:

t1 = [{A → x(here) | A → x ∈ P1}, X → X(here), X′ → †(here)];
t2 = [{A → x(here) | A → x ∈ P1}, X → X′(here), X′ → †(here)];
t3 = [{A → x(here) | A → x ∈ P2}, X′ → X(here), X → †(here)];
t4 = [{A → x(out) | A → x ∈ P2}, X′ → λ(out), X → †(out)].
Each table Pi, i = 1, 2, is simulated by means of the tables t2i−1 and t2i, respectively. At the first step of a computation, only table t1 or t2 can be used; otherwise the trap symbol † is introduced and we will have no output. The table t1 can be applied as many times as we want, but after one application of t2 we can apply neither t1 nor t2 anymore, and we have to use table t3 or t4. In fact, when we use t2 we introduce the symbol X′, which forbids the application of both t1 and t2. If the string is now rewritten using the rules from t3, then we reintroduce the symbol X and the computation can proceed only with the application of t1 or t2. Otherwise, if the string is rewritten using the rules from t4, then the symbol X′ is deleted and the string exits the system. If it is a terminal string, then we obtain the output; otherwise it is ignored. Hence L(Π) = L(G).

5.2 Deadlock vs. Non-Deadlock
A natural question concerning systems with parallel application of the rules is whether or not the possibility of having deadlock states modifies the computational power of such systems. We show that this is not the case, at least when considering systems with maximal parallelism rewriting steps. However, the advantage of using a parallel P system with deadlock is that it needs a smaller number of membranes and a smaller depth with respect to its equivalent parallel P system without deadlock.

Theorem 3. (i) EParRP_n^k((M), nD) ⊆ EParRP_n^k((M), D);
(ii) EParRP_n^k((M), D) ⊆ EParRP_{7n}^{k+2}((M), nD).

Proof. (i) The inclusion EParRP_n^k((M), nD) ⊆ EParRP_n^k((M), D) directly follows from the definitions.
(ii) Let Π = (V, T, µ, M0, . . . , Mn−1, R0, . . . , Rn−1) be a system such that L(Π) ∈ EParRP_n^k((M), D), where the rules are applied according to the maximal parallelism method and with the possibility of having deadlock states. We assume that m0 is the skin membrane in µ.
We show how to construct a P system Π′ = (V′, T, µ′, M′0, . . . , M′m, R′0, . . . , R′m) such that L(Π′) ∈ EParRP_m^{k+2}((M), nD), where no deadlock states can occur and which generates the same language as Π. Its alphabet is V′ = V ∪ V̄ ∪ {X, X1, X2, Xhere, Xin, Xout, †}, where V̄ = {Ā | A ∈ V} and the symbols X, X1, X2, Xhere, Xin, Xout, † do not occur in V ∪ V̄. Consider a generic membrane mi of Π, for any i = 0, . . . , n − 1. Such a membrane contains a set of strings Mi, a set of rules Ri and, possibly, a set of other membranes; we will denote by mi,1, . . . , mi,h the membranes placed immediately inside mi. We show how to simulate this generic membrane in the system Π′, and we point out that the simulation of all other membranes follows the same recursive description below. The corresponding membrane m′i in Π′ is obtained by replacing every string w in Mi with the string Xw, where X is a symbol not in V. The set of rules of membrane m′i will be:

R′i = {X → X1(in), Xhere → X, Xin → X(in), Xout → X(out)}.

(In the skin membrane (i = 0), the last rule is replaced with Xout → λ(out).) Then, immediately inside m′i, we add three new membranes, denoted by m′i(here), m′i(in), m′i(out). Each new membrane m′i(tar), with tar ∈ {here, in, out}, will contain the following rules:

R′i(tar) = {A → ȳ(in) | A → y(tar) ∈ Ri} ∪ {B̄ → B(out) | B̄ ∈ V̄} ∪ {X2 → Xtar(out), X → †(out)}.

Moreover, each membrane m′i(here), m′i(in) and m′i(out) will contain a single inner membrane, denoted by m′i(here),check, m′i(in),check and m′i(out),check respectively. The rules belonging to m′i(tar),check, for each tar ∈ {here, in, out}, are:

R′i(tar),check = {A → †(out) | A → y(tar′) ∈ Ri, tar′ ∈ {here, in, out} and tar′ ≠ tar} ∪ {X1 → X2(out)}.

Finally, we add the rule X1 → †(out) to each membrane m′i,1, . . . , m′i,h, which are all placed inside m′i and correspond to the membranes mi,1, . . . , mi,h originally placed in membrane mi. (Observe that, if i ≠ 0, then the rule X1 → †(out) will also be placed inside m′i itself, because of the recursive construction of the system.) From the construction of µ′, it follows that we need seven membranes in Π′ to simulate each membrane of Π, hence m = 7n, and that the depth is increased from the value k to the new value k + 2.

Let us now compare the computations in Π and Π′. Consider a string w in a membrane mi of Π and the string Xw in the corresponding membrane m′i. First of all, notice that in membrane m′i we always have to apply the rule X → X1(in). The obtained string X1w is then sent to one of the inner membranes. If the string reaches a membrane nondeterministically chosen among
m′i,1, . . . , m′i,h, we have to apply the rule X1 → †(out). This rule introduces a trap symbol which will never be removed; hence, no output is produced. Of course, this is not always the case: the string can be sent to one of the added membranes m′i(here), m′i(in), m′i(out). In this case, we have four different possibilities, depending on the computation in Π:

1. No rule in Ri can be applied to w. When the string X1w reaches one of the added membranes m′i(here), m′i(in), m′i(out), no other rule can be applied and the string will remain in that membrane forever. No output is produced, as in Π.

2. The rules in Ri that have to be applied to w surely lead to a deadlock condition. (We assume, for instance, that some rules which have to be applied to w have target here while other rules have target in. It is easy to see that the following considerations remain valid, with only minor changes, for different conflicting targets.) If the corresponding string X1w reaches the membrane m′i(here), we apply all the rules of the form A → ȳ(in) which correspond to applicable rules of the form A → y(here) in mi. The obtained string X1w̄ is sent to membrane m′i(here),check, where we check for the applicability of other rules in mi that lead to deadlock. In fact, we have to apply the rules of the form B → †(out) (corresponding to the rules B → z(in) in Ri); in this way we introduce the trap symbol, hence no output is obtained. It is easy to see that the trap symbol is introduced even in the case that the string X1w first reaches the membrane m′i(in) and then the membrane m′i(in),check. If the string X1w reaches the membrane m′i(out), then no rule can be applied, and still we produce no output.

3. All the rules in Ri which can be applied to w have the same target, that is, no deadlock can occur. Let us assume that the target of the applicable rules is in.
If the string X1w reaches the membrane m′i(here) or the membrane m′i(out), the computation cannot proceed any further and no output is produced. To correctly simulate the productions in Ri, we have to send the string to membrane m′i(in); here, we can apply the rules of the form A → ȳ(in) which correspond to the rules of the form A → y(in) that can be applied in Ri. The string is sent to membrane m′i(in),check, where we check for the application of rules with a different target. No such rule can be applied, hence the trap symbol is not introduced. The only applicable rule is X1 → X2(out). The string is sent back to membrane m′i(in), where we have to apply the rules of the form B̄ → B(out) and the rule X2 → Xin(out). In membrane m′i we obtain the string Xinw1, where w1 is the string which corresponds to a correct simulation of the rules in Ri that can be applied to w. Now we can apply the rule Xin → X(in), which sends the string Xw1 to an inner membrane; if the string again reaches one of the membranes m′i(here), m′i(in) or m′i(out), we apply the rule X → †(out). If, on the contrary, it is sent to one of the membranes m′i,1, . . . , m′i,h, we can correctly continue the computation on the string there. The cases of the simulation of rules with target here or out are similar. The only difference is in the last rule applied in m′i: if we simulate the application of rules with target here, the
last production we apply will be Xhere → X(here), and the string is ready for a new simulation in the same membrane. If we simulate the application of rules with target out, the last production we apply will be Xout → X(out), and the string is sent to the membrane immediately outside, where we can start the simulation of a new set of rules.

4. The rules in Ri that have to be applied to w can lead to a deadlock condition. This is the case when we have a deadlock condition only for some nondeterministic choices of the rules to be applied, while other nondeterministic choices do not lead to deadlock. Of course, the rewriting of the string w in Π proceeds only if the nondeterministic choice of the rules to be applied does not lead to a deadlock condition. In the other cases, w will never exit the system; such situations are not simulated in Π′, so if we have some output in Π then we have the same output in Π′, and otherwise no string will be produced in Π′, since it cannot be produced in Π either. In Π′ the nondeterministic choice of the rules to be applied is made in m′i. In fact, as previously said, starting from Xw we obtain X1w, and this string is sent to an inner membrane. The nondeterministic choice of the membrane in Π′ corresponds to the nondeterministic choice of the rules to be applied in Π. Once the string has reached one membrane among m′i(here), m′i(in) or m′i(out), we simulate the application of the rules with the corresponding target and then we check for the application of other rules with a different target, as previously described. All the choices in mi which do not lead to deadlock are correctly simulated in m′i, and no string that cannot be rewritten in mi can be rewritten in m′i. Hence, the set of strings generated by Π′ is exactly the same as the set generated by Π, that is, L(Π′) = L(Π).

The same result still holds when the depth of the systems is kept fixed at the value k = 2.

Corollary 2. EParRP_n^2((M), D) ⊆ EParRP_{4n+3}^2((M), nD).
Proof (sketch). Consider a system Π = (V, T, µ, M0, . . . , Mn−1, R0, . . . , Rn−1) such that L(Π) ∈ EParRP_n^2((M), D), where the skin membrane is labeled with 0 and the inner membranes with i = 1, . . . , n − 1. It is possible to show how to construct a system Π′ = (V′, T, µ′, M′0, . . . , M′m, R′0, . . . , R′m) such that L(Π′) ∈ EParRP_m^2((M), nD) and L(Π′) = L(Π). In this case, the skin membrane m′0 and six new membranes m′0(tar), m′0(tar),check, for tar ∈ {here, in, out}, are used in Π′ to simulate the skin membrane m0 of Π, while any other membrane mi of Π is simulated by means of four new membranes m′i(tar), m′i(tar),check, for tar ∈ {here, out}. These membranes work similarly to those of Theorem 3. So 7 + 4(n − 1) membranes are needed in Π′ to correctly simulate each and every computation of Π, hence m = 4n + 3, while the depth of the system is left unchanged.
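The membrane and depth bookkeeping of Theorem 3(ii) and Corollary 2 can be restated as a two-line computation (an illustrative Python sketch of the counts established above, nothing more):

```python
def size_general(n, k):
    """Theorem 3(ii): every membrane of the original system is
    simulated by itself plus three target membranes and three check
    membranes (7 in total); nesting the target/check membranes
    inside each simulated membrane adds two levels of depth."""
    return 7 * n, k + 2

def size_depth_two(n):
    """Corollary 2: the skin needs the full 6 extra membranes, while
    each of the n - 1 (elementary) inner membranes needs only 4,
    since the target in never occurs there; depth stays 2."""
    return 7 + 4 * (n - 1), 2

print(size_general(3, 2))   # -> (21, 4)
print(size_depth_two(3))    # -> (15, 2), i.e. 4n + 3 membranes
```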
314
Daniela Besozzi et al.
The result from Theorem 3 also states that the possibility of having deadlock states does not increase the power of such systems (with no bounds on the number and depth of membranes): Corollary 3. EParRP((M), D) = EParRP((M), nD).
6
Final Remarks
Section 3 defines several families of languages generated by parallel rewriting P systems, obtained by changing the bounds on the number and depth of membranes and by choosing different ways of parallel rewriting. This paper considers only some of those families, comparing them to ET0L and analyzing whether the possibility of having deadlock states affects the computational power of a system. Many other comparisons and hierarchies remain open for research within the formalisms we have introduced. Concerning L systems, referred to in Section 5.1, we mention that they have also been compared to usual rewriting P systems in [3].
References
1. J. Dassow, Gh. Păun, Regulated Rewriting in Formal Language Theory, Springer-Verlag, Berlin, 1989.
2. L. Fajstrup, E. Goubault, M. Raussen, Detecting deadlocks in concurrent systems, Concurrency Theory (CONCUR'98), LNCS 1466, 332–347, Springer-Verlag, Berlin, 1998.
3. C. Ferretti, G. Mauri, C. Zandron, Gh. Păun, On Three Variants of Rewriting P Systems, Theoretical Computer Science, to appear.
4. S. N. Krishna, Languages of P systems: computability and complexity, PhD Thesis, 2001.
5. S. N. Krishna, R. Rama, A Note on Parallel Rewriting in P Systems, Bulletin of the EATCS, 73 (February 2001), 147–151.
6. S. N. Krishna, R. Rama, On the Power of P Systems Based on Sequential/Parallel Rewriting, Intern. J. Computer Math., 77, 1-2 (2000), 1–14.
7. Gh. Păun, Computing with membranes, Journal of Computer and System Sciences, 61, 1 (2000), 108–143 (see also Turku Center for Computer Science-TUCS Report No 208, 1998, www.tucs.fi).
8. G. Rozenberg, A. Salomaa, eds., Handbook of Formal Languages, Springer-Verlag, Heidelberg, 1997.
9. G. Rozenberg, A. Salomaa, The Mathematical Theory of L Systems, Academic Press, New York, 1980.
10. A. S. Tanenbaum, Modern Operating Systems, Prentice Hall, 2001.
11. C. Zandron, A Model for Molecular Computing: Membrane Systems, PhD Thesis, 2001.
A DNA-based Computational Model Using a Specific Type of Restriction Enzyme Yasubumi Sakakibara1 and Hiroshi Imai2 1
Department of Biosciences and Informatics, Keio University CREST, JST 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, 223-8522, Japan ([email protected]) 2 NEC Corporation, Japan
Abstract. The restriction enzyme is an important device which provides cutting operations on DNA strands for constructing DNA-based computational models such as splicing systems [3]. In this paper, we employ a specific type of restriction enzyme which cuts on both sides of its recognition sequences [6], and propose a new DNA-based computational model which has several advantages compared with conventional models. The new computational model is shown to achieve universal computability using only natural DNA-based methods, such as annealing, cutting, ligation and circular strands, without any practically hard assumption. Furthermore, while the generative power of the computational model is shown to be universal, its parsing (accepting) ability is even more appealing. That is, given any string, the model computes whether it accepts the string, whereas most conventional DNA-based models do not offer such an accepting process. We show that the new computational model efficiently computes the parsing process for context-free grammars and finite sequential transducers.
1
Introduction
The restriction enzyme is an important device which provides cutting operations on DNA strands for constructing DNA-based computational models such as splicing systems. A restriction enzyme binds to DNA at a specific recognition site and then cuts the DNA, mostly within this recognition site. The cut can be blunt, or staggered, leaving sticky ends [6]. Combined with ligases, which link two fragments of DNA strands, the restriction enzyme leads to a formal computational model of the recombinant behavior of DNA strands, called splicing systems [3]. A disadvantage of splicing systems using restriction enzymes which cut within their recognition sites is that the system requires finitely many restriction enzymes with different recognition sites to implement different computational rules. On the other hand, some restriction enzymes cut DNA strands outside of their recognition sites, and from the viewpoint of constructing DNA computers, these enzymes are more useful and interesting. Shapiro et al. [1] have successfully M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 315–325, 2003. c Springer-Verlag Berlin Heidelberg 2003
316
Yasubumi Sakakibara and Hiroshi Imai
implemented a finite-state machine by the sophisticated use of a restriction enzyme (actually, FokI) which cuts outside of its recognition site. For example, the restriction enzyme FokI cleaves a double-stranded DNA at a fixed distance (in bases) away from its recognition site, shown as follows:

5’ GGATGNNNNNNNNN^        3’
3’ CCTACNNNNNNNNNNNNN^    5’
where “GGATG” is the recognition site and N represents any base (A, C, G, or T). Thus, an important feature of using this type of restriction enzyme is that any intended encoding can be put into the subsequences at the sticky ends, and therefore only one restriction enzyme is enough to implement many different computation rules. In this paper, we employ a more specific type of restriction enzyme which cuts on both sides of its recognition sites [2,6]. For example, the restriction enzyme Bsp24I cleaves a double-stranded DNA as follows:

5’ ^NNNNNNNNGACNNNNNNTGGNNNNNNNNNNNN^ 3’
3’ ^NNNNNNNNNNNNNCTGNNNNNNACCNNNNNNN^ 5’
where “GAC” and “TGG” are the two recognition sites and N represents any base (A, C, G, or T). As can be seen, the part between the two recognition sites is of a fixed length and may consist of any bases. Any intended encoding can be put into this part, and this is the most useful feature of this type of restriction enzyme, which we will fully utilize for implementing a DNA-based computational model.
2
Operations Using the Specific Restriction Enzyme
In this section, we propose a main operation using the specific type of restriction enzyme introduced in the previous section. We employ the restriction enzyme, a ligase, and linear and circular single DNA strands to construct a new DNA-based computational model. We illustrate our new DNA-based operation in Figures 1 and 2. First, we prepare a linear single DNA strand which contains a subsequence urL xrR v, consisting of two recognition sites rL , rR for the restriction enzyme cutting on both sides and some subsequence x between the two sites, and a circular single DNA strand ūr̄L x̄r̄R v̄ wˆ which contains a complementary subsequence for urL xrR v (Figure 1, (a)). Here, ȳ denotes the complementary sequence of y in the sense of Watson-Crick complementarity and yˆ denotes a circular sequence. Second, the subsequence urL xrR v in the linear single DNA strand is annealed to the complementary subsequence ūr̄L x̄r̄R v̄ in the circular single DNA strand and constitutes a double-stranded DNA subsequence (Figure 1, (b)). Third, the annealed double-stranded DNA subsequence is cut by the restriction enzyme, and the cut leaves two sticky ends (Figure 1, (c)). Fourth, we put in special DNA strands, called hinge molecules, which have a hairpin structure and complementary sticky ends for the sticky ends produced
Fig. 1. (a) Two initial DNA strands, (b) Annealing, (c) Cut by the restriction enzyme.
by the restriction enzyme cut (Figure 2, (d)). (Sometimes, this type of hairpin DNA molecule is called a “cap” molecule.) Fifth, the special DNA strands are annealed to the sticky ends and ligated to the main DNA strand (Figure 2, (e)). Lastly, by denaturation, the main DNA strand becomes a linear single DNA strand which consists of y, w and z. Thus, this operation has the effect of deleting the part rL xrR from the initial linear DNA strand and replacing it with w, which is a subsequence contained in the initial circular DNA strand. We mathematically abstract the above operation and formally define the following computation (rewriting) rule, called the replacement rule. Let V denote an alphabet (a set of abstract symbols) on which a DNA-based computational model works. The rule is defined as follows: (y rL x rR z, r̄L x̄ r̄R wˆ) → y w z
Fig. 2. (d) Hinge molecules, (e) Annealing, (f) Final linear DNA strand.

where rL , rR are two recognition sites for the restriction enzyme cutting on both sides, x, y, z, w ∈ V ∗ , and x is of a fixed length. In the following sections, we construct a new DNA-based computational model using this replacement rule and examine the computational ability of this operation. In the construction of the DNA-based computational model, we enjoy several advantages of this operation:
1. The subsequences x, y, z and w in the initial linear DNA sequence and circular DNA sequence are allowed to be any sequences over the alphabet V, so that any intended encoding can be put into these parts. It turns out that only one replacement rule is enough to construct a universal DNA-based computational model.
2. Both recognition sites rL and rR are cut out after the application of the replacement rule.
3
A New DNA-based Computational Model and Computability
In this section, we define a formal computational model which generates a language (a set of strings) using the replacement rule, and show that the computational model has universal computability in the sense that it generates the class of recursively enumerable languages. The computational model is a mathematical model which abstracts the biological operations discussed in the previous section. It consists of a set of strings (called axioms) and the replacement rules, and iteratively applies the replacement rules to a set of strings to generate a language. The model could be viewed as a variant of the splicing system [3,5]. An alphabet is a finite nonempty set of abstract symbols. For an alphabet V we denote by V ∗ the set of all strings of symbols in V . The empty string is denoted by λ. Each subset of V ∗ is called a language over V . For x ∈ V ∗ , |x| denotes the length of the string x. For a symbol a in the alphabet V , we denote by ā the complementary symbol of a, associated with some predefined complementarity relation. Let V̄ denote the set of complementary symbols for the alphabet V , that is, V̄ = {ā | a ∈ V }. Further, we consider two subsets of the alphabet V : T ⊂ V is the terminal alphabet and E ⊂ V is the recognition-site alphabet. We assume T ∩ E = ∅. A circular string over V is a sequence x = a1 a2 . . . an for ai ∈ V (1 ≤ i ≤ n), with the assumption that a1 follows an . We denote by xˆ the circular string associated with the linear string x ∈ V ∗ . A replacement rule r consists of a pair of strings over the recognition-site alphabet E and a positive integer which indicates the fixed length (the fixed number of bases) between the two recognition sites: r = (rL , rR , d) for some rL , rR ∈ E ∗ and a positive integer d. With respect to such a rule r and for α, β, γ ∈ V ∗ , we write (α, βˆ) ⇒r γ iff α = yrL xrR z, β = r̄L x̄r̄R w, γ = ywz, for some x, y, z, w ∈ V ∗ such that |x| = d. We consider two sets of axioms A and B, where one consists of linear strings and the other consists of circular strings.
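As a concreteness check, the relation (α, βˆ) ⇒r γ can be sketched in a few lines of Python (our own illustration; the paper gives no implementation). Symbols are plain strings, a leading "~" marks a complemented symbol, and the circular string β is tried in all rotations:

```python
def comp(seq):
    """Complement of an abstract symbol sequence (the bar notation)."""
    return tuple(s[1:] if s.startswith("~") else "~" + s for s in seq)

def rotations(circ):
    """All rotations of a circular string."""
    return [circ[i:] + circ[:i] for i in range(len(circ))]

def apply_rule(alpha, beta, rL, rR, d):
    """All gamma with (alpha, beta^) => gamma for the rule (rL, rR, d):
    alpha = y rL x rR z, beta = comp(rL x rR) w, gamma = y w z, |x| = d."""
    results = set()
    n = len(rL) + d + len(rR)
    for i in range(len(alpha) - n + 1):
        site = alpha[i:i + n]                   # candidate rL x rR, |x| = d
        if site[:len(rL)] == rL and site[len(rL) + d:] == rR:
            for rot in rotations(beta):
                if rot[:n] == comp(site):       # the circular string anneals
                    results.add(alpha[:i] + rot[n:] + alpha[i + n:])
    return results

alpha = ("y", "rL", "x1", "x2", "rR", "z")
beta = comp(("rL", "x1", "x2", "rR")) + ("w1", "w2")
print(apply_rule(alpha, beta, ("rL",), ("rR",), 2))
# -> {('y', 'w1', 'w2', 'z')}
```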
A replacement computational system is a 6-tuple M = (V, T, E, A, B, R) where V is an alphabet, T ⊂ V , E ⊂ V , A is a finite set of linear strings over V , B is a finite set of circular strings over V , and R is a finite set of replacement rules. We define the language generated by a replacement computational system M = (V, T, E, A, B, R) as follows. For a triple σ = (V, B, R) of the alphabet V ,
the set B of circular strings over V and the set R of replacement rules, and for a language L ⊆ V ∗ , we define σ(L) = {γ ∈ V ∗ | (α, βˆ) ⇒r γ, for some α ∈ L, βˆ ∈ B, r ∈ R}. Further, we consider the iterative application of the operation σ, that is, the iterative application of ⇒r (replacement rules). A successive application of ⇒r to the result of a previous application of ⇒r is defined as follows: (α1 , β1ˆ) ⇒r1 α2 , (α2 , β2ˆ) ⇒r2 α3 , for β1ˆ, β2ˆ ∈ B and r1 , r2 ∈ R. This can happen because the new linear string α2 will have a new recognition site which is originally contained in the circular string β1ˆ and is inserted into α2 by the replacement operation ⇒r1 . We define the iterative application of the operation σ as follows: σ 0 (L) = L, σ i+1 (L) = σ i (L) ∪ σ(σ i (L)), i ≥ 0, σ ∗ (L) = ∪i≥0 σ i (L).
The language generated by M = (V, T, E, A, B, R) is defined by L(M ) = σ ∗ (A) ∩ T ∗ . Now, we show a main theorem that the replacement computational system has universal computability. The main idea is to construct a replacement computational system which simulates a grammar in the Geffert normal form for any recursively enumerable language. Theorem 1. For every recursively enumerable language L ⊆ T ∗ , there exists a replacement computational system M such that L(M ) = L. Further, the replacement computational system M has only one replacement rule. Proof. We give a brief sketch of the proof. First, we quote the Geffert normal form theorem. That is, every recursively enumerable language can be generated by a grammar G = (N, T, S, P ) with N = {S, A, B, C, D} and the rules in P of the forms S → uSv, S → x, with u, v, x ∈ (T ∪ {A, B, C, D})∗ , and only two non-context-free rules, AB → λ and CD → λ. For a grammar G = (N, T, S, P ) in the Geffert normal form, we construct the replacement computational system M = (V, T, E, A, B, R) as follows: V = N ∪ T ∪ E, A = {rL SSrR }, B = {rL SSrR u rL SSrR v ˆ | S → uSv ∈ P }, ∪ {rL SSrR xˆ | S → x ∈ P, x ∈ (T ∪ {A, B, C, D})∗ }, ∪ {rL ABrRˆ, rL CDrRˆ}, R = {(rL , rR , 2)},
where rL , rR ∈ E ∗ are the only pair of recognition sites, and, for a string x in (T ∪ {A, B, C, D})∗ , x′ denotes the string x in which every occurrence of A is replaced with rL A, every B with BrR , every C with rL C, and every D with DrR . These modifications are necessary to simulate the two non-context-free rules with the axioms {rL ABrRˆ, rL CDrRˆ}. The idea of this construction is as follows. The axioms rL SSrR u′ rL SSrR v′ˆ and rL SSrR x′ˆ simulate the context-free rules S → uSv and S → x in P , respectively, together with the replacement rule (rL , rR , 2). The axioms rL ABrRˆ and rL CDrRˆ simulate the two non-context-free rules AB → λ, CD → λ when A is adjacent to B or C is adjacent to D. The simulation starts with the only linear axiom rL SSrR in A, which corresponds to a derivation from the start symbol S in G. For example, for a simple grammar G = (N, T, S, P ) in the Geffert normal form:

N = {S, A, B, C, D}, T = {a, b},
P = {S → aaSb, S → bSB, S → bbA, AB → λ, CD → λ},

we construct the replacement computational system M = (V, T, E, A, B, R):

V = N ∪ T ∪ E, A = {rL SSrR },
B = {rL SSrR aarL SSrR bˆ, rL SSrR brL SSrR BrRˆ, rL SSrR bbrL Aˆ, rL ABrRˆ, rL CDrRˆ},
R = {(rL , rR , 2)}.

Now, the following derivation of the grammar G from the start symbol S:

S ⇒G aaSb ⇒G aabSBb ⇒G aabbbABb ⇒G aabbbb

is simulated by the replacement computational system M as follows:

(rL SSrR , rL SSrR aarL SSrR bˆ) ⇒r aarL SSrR b
(aarL SSrR b, rL SSrR brL SSrR BrRˆ) ⇒r aabrL SSrR BrR b
(aabrL SSrR BrR b, rL SSrR bbrL Aˆ) ⇒r aabbbrL ABrR b
(aabbbrL ABrR b, rL ABrRˆ) ⇒r aabbbb.

Since there are only three patterns of complementary subsequences in the axioms in B, namely rL SSrR , rL ABrR and rL CDrR , any other replacement of the form rL XY rR for X, Y ∈ V is impossible, even with the replacement rule (rL , rR , 2). Consequently, L(G) = L(M ).
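The derivation above can be replayed mechanically. In the following Python sketch (our own illustration, not the paper's implementation), each circular axiom is abstracted as a pair (site, payload): the subsequence it anneals to in the linear string and the subsequence that replaces it, with complementation left implicit:

```python
# Circular axioms of B for S -> aaSb, S -> bSB, S -> bbA, AB -> lambda,
# CD -> lambda, each abstracted as (annealing site, replacement payload).
rL, rR = "rL", "rR"
axioms = [
    ((rL, "S", "S", rR), ("a", "a", rL, "S", "S", rR, "b")),
    ((rL, "S", "S", rR), ("b", rL, "S", "S", rR, "B", rR)),
    ((rL, "S", "S", rR), ("b", "b", rL, "A")),
    ((rL, "A", "B", rR), ()),
    ((rL, "C", "D", rR), ()),
]

def step(linear, site, payload):
    """Replace the first occurrence of `site` in `linear` by `payload`."""
    n = len(site)
    for i in range(len(linear) - n + 1):
        if linear[i:i + n] == site:
            return linear[:i] + payload + linear[i + n:]
    return None  # the axiom is not applicable

s = (rL, "S", "S", rR)            # the linear axiom rL S S rR
for k in (0, 1, 2, 3):            # the axioms used in the derivation above
    s = step(s, *axioms[k])
print("".join(s))                 # -> aabbbb
```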
4
Parsing Computation Using the New Model
While we have shown the universal computability of our model in the previous section, the most important feature of the model is that it offers a parsing (accepting) process for a given sequence with formal grammars, solving the membership problem. That is, for a given grammar and any sequence, the model computes whether the grammar accepts the sequence. Most conventional DNA-based models do not offer such an accepting process. Combined with the massive parallelism of DNA-based computations, we show that the parsing processes for context-free grammars and finite sequential transducers are efficiently computed by replacement computational systems. 4.1
Parsing for Context-Free Grammars
A context-free grammar (CFG) is defined by a quadruple G = (N, T, P, S), where N is a nonterminal alphabet, T is a terminal alphabet, P is a set of production rules, and S is the start symbol. Each production rule in P has the form X → α for X ∈ N and α ∈ (N ∪ T )∗ , indicating that the nonterminal X can be replaced by the sequence α. The language generated by a CFG G is denoted L(G). The generative power of CFGs is strictly greater than that of finite automata. CFGs are often used for designing programming languages, and there exist several parsing algorithms, such as the Cocke-Younger-Kasami algorithm of time complexity O(m3 ) for a CFG G = (N, T, P, S) and a string w, to decide whether w ∈ L(G), where m is the length of the input string w. A simple example of a CFG and its derivation process to generate the string “abbc” are shown below:

G = (N, T, P, S), N = {S, A, B}, T = {a, b, c},
P = {S → aAB, A → bA, A → b, B → c}.

The derivation for “abbc”: S ⇒G aAB ⇒G abAB ⇒G abbB ⇒G abbc. For this CFG G, we construct the following replacement computational system M = (V, T, E, A, B, R) which accepts a string in L(G):

V = N ∪ T ∪ {%, #} ∪ E,
B = {rL SarR rL BrL Aˆ, rL AbrR rL Aˆ, rL AbrRˆ, rL BcrRˆ},
R = {(rL , rR , 2)},
where rL , rR ∈ E ∗ are the recognition sites, and %, # are special symbols used for the parsing process. The accepting process for the input string “abbc” with this replacement computational system M goes as follows:
1. The input string is encoded as the sequence “%rL S a rR b rR b rR c rR #”.
2. The accepting process begins with this sequence and goes as follows:

(%rL SarR brR brR crR #, rL SarR rL BrL Aˆ) ⇒r %rL BrL AbrR brR crR #
(%rL BrL AbrR brR crR #, rL AbrR rL Aˆ) ⇒r %rL BrL AbrR crR #
(%rL BrL AbrR crR #, rL AbrRˆ) ⇒r %rL BcrR #
(%rL BcrR #, rL BcrRˆ) ⇒r %#

3. If the accepting process finally produces the sequence %#, the system accepts the input sequence. Otherwise, it does not accept.

As in the above example, we need to slightly modify the definition of the replacement computational system to execute the accepting process: 1. We construct a replacement computational system M = (V, T, E, A, B, R) without the definition of the set of axioms A (or, equivalently, assuming A = ∅). 2. For a string w over the terminal alphabet T , we encode w as the sequence “%rL Sφ(w)#”, where φ is a substitution mapping such that φ(a) = arR for a ∈ T , and S is the start symbol of the original CFG G = (N, T, P, S). 3. For a successive computation of M = (V, T, E, A, B, R): (x1 , y1ˆ) ⇒r1 x2 , (x2 , y2ˆ) ⇒r2 x3 , (x3 , y3ˆ) ⇒r3 x4 , · · · , (xn , ynˆ) ⇒rn xn+1 , for yiˆ ∈ B and ri ∈ R (1 ≤ i ≤ n), we write (x1 , y1ˆ) =⇒∗ xn+1 . For the simple CFG G and the corresponding replacement computational system M in the above example, it can be shown that S ⇒∗G w for w ∈ T ∗ if and only if (%rL Sφ(w)#, yˆ) =⇒∗ %# for some y ∈ B. Now, we assume a λ-free CFG G = (N, T, S, P ) in the Greibach normal form and construct the corresponding replacement computational system M (G) = (V, T, E, A, B, R) for G as follows:

V = N ∪ T ∪ {%, #} ∪ E,
B = {rL AarR rL Bn rL Bn−1 · · · rL B2 rL B1ˆ | A → aB1 B2 · · · Bn−1 Bn ∈ P, a ∈ T, B1 , . . . , Bn ∈ N } ∪ {rL AarRˆ | A → a ∈ P, a ∈ T },
R = {(rL , rR , 2)},

where %, # are special symbols not in N , T and E.
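The accepting process can also be replayed mechanically. Below is a Python sketch (our own illustration, ignoring the complementation of the annealing site): each circular axiom is a (site, payload) pair, and a breadth-first search over all nondeterministic choices stands in for the massive parallelism of the DNA computation:

```python
# Circular axioms for the example CFG (S -> aAB, A -> bA, A -> b, B -> c),
# abstracted as (annealing site, replacement payload) pairs.
rL, rR = "rL", "rR"
axioms = [
    ((rL, "S", "a", rR), (rL, "B", rL, "A")),
    ((rL, "A", "b", rR), (rL, "A")),
    ((rL, "A", "b", rR), ()),
    ((rL, "B", "c", rR), ()),
]

def encode(w):
    """% rL S phi(w) #  with phi(a) = a rR for each terminal a."""
    return ("%", rL, "S") + tuple(t for a in w for t in (a, rR)) + ("#",)

def parse(w):
    frontier = {encode(w)}
    while frontier:
        if ("%", "#") in frontier:
            return True          # the sequence %# was produced: accept
        nxt = set()
        for s in frontier:
            for site, payload in axioms:
                n = len(site)
                for i in range(len(s) - n + 1):
                    if s[i:i + n] == site:
                        nxt.add(s[:i] + payload + s[i + n:])
        frontier = nxt
    return False                 # no string can be rewritten any further

print(parse("abbc"))  # -> True   ("abbc" is in L(G))
print(parse("ac"))    # -> False
```

The search terminates because the first axiom consumes the single S and every other axiom strictly shortens the string.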
Since the massive parallelism of DNA-based computations can efficiently handle the non-determinism (non-deterministic derivations) of CFGs, we have the following: Theorem 2. For every CFG G in the Greibach normal form, there exists a replacement computational system M (G) such that w ∈ L(G) for w ∈ T ∗ if and only if (%rL Sφ(w)#, yˆ) =⇒∗ %# for some y ∈ B. Further, the accepting process (%rL Sφ(w)#, yˆ) =⇒∗ %# is executed in time linear in the length of the input string w, that is, in time O(|w|), on DNA-based computation.
4.2
Finite Sequential Transducer
A finite sequential transducer is a finite automaton with outputs, and is defined by a 6-tuple D = (Q, T1 , T2 , δ, q0 , F ), where Q is a finite set of states, T1 is an input alphabet, T2 is an output alphabet, δ is a state-transition function such that δ : Q × T1 −→ Q × T2∗ (in the deterministic case), q0 is the initial state and F is a set of final states. In the same manner as for CFGs, we construct a replacement computational system which simulates a given finite sequential transducer D. Here, we give a simple example. A finite sequential transducer D = (Q, T1 , T2 , δ, q0 , F ) is defined by:

Q = {q0 , q1 , q2 , q3 }, T1 = {a, b, c}, T2 = {x, y, z},
δ : {δ(q0 , a) = (q1 , x), δ(q1 , b) = (q2 , xy), δ(q2 , c) = (q3 , xyz)},
F = {q3 },

and the derivation process for the input string “abc” in the form of instantaneous descriptions is:

q0 abc ⊢ xq1 bc ⊢ xxyq2 c ⊢ xxyxyzq3 .

We construct the replacement computational system M = (V, T, E, A, B, R):

V = Q ∪ T1 ∪ T2 ∪ E,
B = {rL q0 arR xrL q1ˆ, rL q1 brR xyrL q2ˆ, rL q2 crR xyzˆ},
R = {(rL , rR , 2)},

and the derivation process of the finite sequential transducer D for the input string “abc” is simulated by M as follows:

(%rL q0 arR brR crR #, rL q0 arR xrL q1ˆ) ⇒r %xrL q1 brR crR #
(%xrL q1 brR crR #, rL q1 brR xyrL q2ˆ) ⇒r %xxyrL q2 crR #
(%xxyrL q2 crR #, rL q2 crR xyzˆ) ⇒r %xxyxyz#

Thus, the output of M becomes the sequence “%xxyxyz#”.
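The same (site, payload) abstraction replays the transduction. This is our own sketch, not the paper's implementation; since D is deterministic here, greedy rule application suffices:

```python
# Circular axioms encoding the transitions of D, one per entry of delta:
# the annealing site rL q a rR is replaced by the output plus the new state.
rL, rR = "rL", "rR"
axioms = [
    ((rL, "q0", "a", rR), ("x", rL, "q1")),
    ((rL, "q1", "b", rR), ("x", "y", rL, "q2")),
    ((rL, "q2", "c", rR), ("x", "y", "z")),
]

def transduce(w):
    # Encode w as % rL q0 phi(w) # with phi(a) = a rR.
    s = ("%", rL, "q0") + tuple(t for a in w for t in (a, rR)) + ("#",)
    changed = True
    while changed:
        changed = False
        for site, payload in axioms:
            n = len(site)
            for i in range(len(s) - n + 1):
                if s[i:i + n] == site:
                    s = s[:i] + payload + s[i + n:]
                    changed = True
                    break
            if changed:
                break
    return "".join(s)

print(transduce("abc"))  # -> %xxyxyz#
```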
5
Related Works and Discussion
The works most closely related to ours are the insertion-deletion system [4] and the circular splicing system [5]. Compared with those works, the fundamental differences of our model are that the recognition sites are cut out after the application of the rule, and that our model offers not only the generative power of languages but also parsing abilities for CFGs and finite sequential transducers. It may be far from practical to implement the replacement rule in a biological laboratory. Hence, we will need to discuss the implementation issues of our replacement operation.
Acknowledgments This work is supported in part by “Research for the Future” Program No. JSPSRFTF 96I00101 from the Japan Society for the Promotion of Science. This work was also performed in part through Special Coordination Funds for Promoting Science and Technology from the Ministry of Education, Culture, Sports, Science and Technology, the Japanese Government.
References
1. Y. Benenson, T. Paz-Elizur, R. Adar, E. Keinan, Z. Livneh, and E. Shapiro. Programmable and autonomous computing machine made of biomolecules. Nature, 414, 430–434, 2001.
2. S.K. Degtyarev, N.I. Rechkunova, Y.P. Zernov, V.S. Dedkov, V.E. Chizikov, M. Van Calligan, R. Williams, and E. Murray. Bsp24I, a new unusual restriction endonuclease. Gene, 131, 93–95, 1993.
3. T. Head. Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors. Bulletin of Mathematical Biology, 49, 737–759, 1987.
4. L. Kari and G. Thierrin. Contextual insertion/deletion and computability. Information and Computation, 131, 47–61, 1996.
5. Gh. Păun, G. Rozenberg, and A. Salomaa. DNA Computing. Springer-Verlag, Heidelberg, 1998.
6. Official REBASE Homepage: http://rebase.neb.com/
Time-Varying Distributed H Systems of Degree 2 Can Carry Out Parallel Computations Maurice Margenstern1 , Yurii Rogozhin2 , and Sergey Verlan3 1
Laboratoire d’Informatique Théorique et Appliquée, Université de Metz, France [email protected] 2 Institute of Mathematics and Computer Science of the Academy of Sciences of Moldova [email protected] 3 Laboratoire d’Informatique Théorique et Appliquée, Université de Metz, France [email protected]
Abstract. A time-varying distributed H system (TVDH system) is a splicing system with the following feature: at different moments, different sets of splicing rules are used (these sets are called the components of the TVDH system). The number of components is called the degree of the TVDH system. The passage from one component to another is specified in a cycle. It was proved by the first two authors (2001) that TVDH systems of degree one generate all recursively enumerable languages. The proof was made by a sequential modelling of Turing machines; this solution is not a fully parallel one. A. Păun (1999) presented a completely parallel solution for TVDH systems of degree four by modelling type-0 formal grammars. His result was improved by the first two authors, who reduced the number of components of such TVDH systems to three (2000). In this paper we improve the latter result by reducing the number of components of such TVDH systems to two. The question remains open for one component, i.e. is it possible to construct TVDH systems of degree one which fully use the parallel nature of molecular computations based on splicing operations (say, which model type-0 formal grammars)?
1
Introduction
Starting from [10], a grounding paper on splicing computations, many studies were devoted to various extensions of H systems originating from [3], in particular, pointing at their possible universality power. Using the extension given in [10] of the original definition of [3], paper [1] defined the notion of test tube and proved the universality of an extended H system with a finite number of test tubes. But no indication of the number of test tubes needed in order to obtain the universality of the computation was given in [1]. That number was first set to 9 [2], then to 6 [11], and finally established at 3 in [15]. The latter result is very near to the real frontier between decidability and undecidability M. Hagiya and A. Ohuchi (Eds.): DNA8, LNCS 2568, pp. 326–336, 2003. c Springer-Verlag Berlin Heidelberg 2003
Time-Varying Distributed H Systems of Degree 2
327
or, in other terms, between the possibility and the impossibility of predicting the eventual behaviour of such systems, depending on the number of test tubes. Indeed, as it is known that for a single test tube the generated languages are regular [14], it remains to examine the case of two test tubes, which is still open up to now. Time-varying distributed H systems were recently introduced in [11], [12,13] as another theoretical model of biomolecular computing (DNA computing), based on splicing operations. Instead of considering test tubes, these models introduce components (see the formal definition below), which cannot all be used at the same time but one after another, periodically. This aims at giving an account of real biochemical reactions, where the work of enzymes essentially depends on the environment conditions. In particular, at any moment, only a subset of all available rules is in action. If the environment is changed periodically, then the active enzymes also change periodically. In [12], it is proved that 7 different components are enough in order to generate any recursively enumerable language. In [5], the first two of the present authors proved that two components are enough to construct a universal time-varying distributed H system, i.e. a time-varying distributed H system which is able to simulate the computation of any Turing machine. Recently [7], the same authors proved that one component is enough in order to obtain universality. Universality of computation and generating any recursively enumerable language are equivalent properties, but it is a priori not necessarily true that the universality of some time-varying distributed H system with one component entails that time-varying distributed H systems generate all recursively enumerable languages with only one component. So the same authors proved in [8] that this is the case: one component is enough in order to generate any recursively enumerable language.
The proof was made by a sequential modelling of Turing machines. Let us clarify this point. Recall that a sequential process is defined by the presence of a clock and by actions which are performed in such a way that at most one action is performed at each top of the clock. It is plain that a deterministic Turing machine performs a sequential computation. In [8], the proof consists namely in simulating such a deterministic machine, and so it is a sequential computation, not a parallel one. By contrast, a parallel computation is performed by several processes which either produce actions independently of any clock or produce them at each top of a clock: in this case, it is possible that several processes perform an action at the same top of the clock and that the performed actions differ from one another. When the processes perform their actions at the tops of the same clock, we speak of a synchronised parallel computation. According to these definitions, a sequential computation is a very particular case of a synchronised parallel computation. The rules of a type-0 grammar may contain several occurrences of the same nonterminal symbol in their right part. Accordingly, the description of all possible computations starting from the axiom leads to a tree, whose branches are the
328
Maurice Margenstern, Yurii Rogozhin, and Sergey Verlan
different possible computations. Now, it is not required that, when a branch is followed by some process, the clock of this process be the same as the clock of another process following another branch. And so, we can see the simulation of any type-0 formal grammar as a good criterion for a parallel computation. This is why, in the literature on DNA-simulating systems, it is considered that the simulation of any type-0 formal grammar is a criterion for a parallel computation. By the time we published [5], A. Păun had published [9], where he proved that time-varying distributed H systems with 4 components generate all recursively enumerable languages. His solution is a parallel one according to what we said above: its proof consists in modelling any type-0 grammar. A. Păun's result was improved by the first two authors of the present paper, who reduced the number of components of TVDH systems which model type-0 formal grammars to 3 [6]. Now we improve the latter result by reducing the number of components of TVDH systems which model type-0 formal grammars to two. The corresponding question for one component is still open.
2
Basic Definitions
We recall some notions. An alphabet V is a finite, non-empty set whose elements are called letters. A word (over some alphabet V ) is a finite (possibly empty) concatenation of letters (from V ). The empty concatenation of letters is also called the empty word and is denoted by ε. The set of all words over V is denoted by V ∗ . A language (over V ) is a set of words (over V ). A formal grammar G is a tuple G = (N, T, P, S) of an alphabet N of so-called non-terminal letters, an alphabet T of so-called terminal letters, with N ∩ T = ∅, an initial letter S from N , and a finite set P of rules of the form u → v with u, v ∈ (N ∪ T )∗ , where u contains at least one letter from N . Any rule u → v ∈ P is a substitution rule allowing one to substitute any occurrence of u in some word w by v. Formally, we write w ⇒G w′ if there is a rule u → v in P and words w1 , w2 ∈ (N ∪ T )∗ with w = w1 uw2 and w′ = w1 vw2 . We denote by ⇒∗G the reflexive and transitive closure of ⇒G . That is, w ⇒∗G w′ means that there is an integer n and words w1 , . . . , wn with w = w1 , w′ = wn and wi ⇒G wi+1 for all i, 1 ≤ i < n. The sequence w1 ⇒ w2 ⇒ · · · ⇒ wn is also called a computation (from w1 to wn , of length n − 1). A terminal word is a word in T ∗ ; all terminal words computable from the initial letter S form the language L(G) generated by G. More formally, L(G) = {w ∈ T ∗ | S ⇒∗G w}. An (abstract) molecule is simply a word over some alphabet. A splicing rule (over alphabet V ) is a quadruple (u1 , u2 , u′1 , u′2 ) of words u1 , u2 , u′1 , u′2 ∈ V ∗ , which is often written in a two-dimensional way as the fraction u1 u2 / u′1 u′2 . A splicing rule r = (u1 , u2 , u′1 , u′2 ) is applicable to two molecules m1 , m2 if there are words w1 , w2 , w′1 , w′2 ∈ V ∗ with m1 = w1 u1 u2 w2 and m2 = w′1 u′1 u′2 w′2 ,
Time-Varying Distributed H Systems of Degree 2
329
and produces two new molecules m1' = w1 u1 u2' w2' and m2' = w1' u1' u2 w2. In this case, we also write (m1, m2) ⊢r (m1', m2'). A pair h = (V, R), where V is an alphabet and R is a finite set of splicing rules, is called a splicing scheme or an H scheme. For an H scheme h = (V, R) and a language L ⊆ V* we define:
σh(L) =def σ(V,R)(L) = {w, w' ∈ V* | ∃w1, w2 ∈ L : ∃r ∈ R : (w1, w2) ⊢r (w, w')}.
A Head-splicing-system [3,4], or H system, is a construct H = (h, A) = ((V, R), A) consisting of an alphabet V, a set A ⊆ V* of initial molecules over V, the axioms, and a set R ⊆ V* × V* × V* × V* of splicing rules. H is called finite if A and R are finite sets. For any H scheme h and language L ⊆ V* we define: σh0(L) = L, σhi+1(L) = σhi(L) ∪ σh(σhi(L)), σh*(L) = ∪i≥0 σhi(L). The language generated by an H system H is L(H) =def σh*(A). Thus, the language generated by H system H is the set of all molecules that can be generated in H starting with A as initial molecules by iteratively applying splicing rules to copies of the molecules already generated.
A time-varying distributed H system [12] (of degree n, n ≥ 1), in short a TVDH system, is a construct D = (V, T, A, R1, R2, ..., Rn), where V is an alphabet, T ⊆ V the terminal alphabet, A ⊆ V* a finite set of axioms, and the components Ri, 1 ≤ i ≤ n, are finite sets of splicing rules over V. At each moment k = n·j + i, for j ≥ 0, 1 ≤ i ≤ n, only component Ri is used for splicing the currently available strings. Specifically, we define L1 = A, Lk+1 = σhi(Lk), for i ≡ k (mod n), k ≥ 1, 1 ≤ i ≤ n, hi = (V, Ri). Therefore, from a step k to the next step, k + 1, one passes only the result of splicing the strings in Lk according to the rules in Ri for i ≡ k (mod n); the strings in Lk which cannot enter a splicing are removed. We say that a component Ri of a TVDH system rejects the word w if w cannot enter any splicing rule from Ri. In this case we write w ↑Ri. We may omit Ri if the context allows us to do so. The language generated by D is L(D) =def (∪k≥1 Lk) ∩ T*.
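The splicing operation and the filtering step of a TVDH system can be sketched as follows (not from the paper; the toy rule and molecules are hypothetical):

```python
# Sketch (not from the paper) of a splicing rule r = (u1, u2, u1', u2')
# applied to molecules m1 = w1 u1 u2 w2 and m2 = w1' u1' u2' w2',
# producing w1 u1 u2' w2' and w1' u1' u2 w2, plus one TVDH step in
# which every string that enters no splicing is dropped.

def splice(rule, m1, m2):
    """Yield all pairs (m1', m2') that rule produces on (m1, m2)."""
    u1, u2, u1p, u2p = rule
    for i in range(len(m1) + 1):          # cut position in m1
        if m1[:i].endswith(u1) and m1[i:].startswith(u2):
            for j in range(len(m2) + 1):  # cut position in m2
                if m2[:j].endswith(u1p) and m2[j:].startswith(u2p):
                    yield (m1[:i] + m2[j:], m2[:j] + m1[i:])

def tvdh_step(L, rules):
    """L_{k+1} = sigma_{h_i}(L_k): only splicing products survive."""
    out = set()
    for m1 in L:
        for m2 in L:
            for r in rules:
                for p, q in splice(r, m1, m2):
                    out.update((p, q))
    return out

# Toy rule c|Y / Z|W on molecules "XcY" and "ZW":
print(tvdh_step({"XcY", "ZW"}, [("c", "Y", "Z", "W")]))  # the set {'XcW', 'ZY'}
```

Note how `tvdh_step` keeps only the splicing products; this models the removal of strings that cannot enter any splicing.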
We denote by REG the set of all regular languages, by RE the set of all recursively enumerable languages, by V DHn , n ≥ 1, the family of languages generated by time-varying distributed H systems of degree at most n, and by V DH∗ the family of all languages of this type.
330
Maurice Margenstern, Yurii Rogozhin, and Sergey Verlan
3 TVDH Systems of Degree 2
Theorem 1. For any type-0 formal grammar G = (N, T, P, S) there is a TVDH system DG = (V, T, A, R1, R2) of degree 2 which simulates G, i.e. L(G) = L(DG).
In order to prove the theorem we shall prove the following two inclusions: (i) L(G) ⊆ L(DG) and (ii) L(G) ⊇ L(DG). Inclusion (i) is the more difficult part of the proof and we will focus our attention on it. We construct DG = (V, T, A, R1, R2) as follows. Let N ∪ T ∪ {B} = {a1, a2, ..., an} (an = B) with B ∉ N ∪ T. In what follows we assume that 1 ≤ i ≤ n, 1 ≤ j ≤ n − 1, 2 ≤ k ≤ n, and a ∈ N ∪ T ∪ {B}.
Alphabet:
V = N ∪ T ∪ {B} ∪ {X, Y, Z, Z', Z'', Xi, Yi, X'j, Y'j, X''j, Y''j, X', Y', X'', Y'', C1, C2, D1, D2}.
The terminal alphabet is T, the same as for the formal grammar G.
Axioms:
A = {XSBY} ∪ {ZvYj | ∃u → vaj ∈ P} ∪ {ZYi, ZY'j, ZY''j, X'Z, X''Z, ZY, Xi ai Z, X'j Z, X''j Z, ZY', ZY'', XZ, Xj Z, C1 Z'', D1, Z' C2, D2}.
Component R1 (the cut of each splicing site is marked by a bar, the top of the rule is written before the slash and the bottom after it):
1.1: ε|uY / Z|vYj, if ∃u → vaj ∈ P;
1.1': ε|ai uY / Z|Yi, if ∃u → ε ∈ P;
1.2: ε|ai Y / Z|Yi;
1.3: a|Yk / Z|Y'k−1;
1.4: a|Y'j / Z|Y''j;
1.5: X1|a / X'|Z;
1.6: X'|a / X''|Z;
1.7: a|Y'' / Z|Y;
1.8: a|BY'' / Z'|ε;
1.9: a|Y''j / Z|Yj;
1.10: C1|Z'' / ε|D1.
Component R2:
2.1: X|a / Xi ai|Z;
2.2: Xk|a / X'k−1|Z;
2.3: X'j|a / X''j|Z;
2.4: a|Y1 / Z|Y';
2.5: a|Y' / Z|Y'';
2.6: X''|a / X|Z;
2.7: X''|a / ε|Z'';
2.8: X''j|a / Xj|Z;
2.9: Z'|C2 / D2|ε.
Note. Components R1 and R2 also contain the rule α|ε / α|ε for each axiom α ∈ A, except XSBY; these rules keep the axioms alive from one step to the next.
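The rejection mechanism underlying the construction (a word vanishes from a component if it can enter no splicing there) can be sketched as follows; the test below is an approximation that only checks for the presence of a splicing site, assuming a partner molecule is available, and the component shown is hypothetical:

```python
# Sketch of the rejection test "w ↑ R_i" (not from the paper): a word
# survives component R_i only if it can enter some splicing there. As an
# approximation we check whether the word contains the top site u1u2 or
# the bottom site u1'u2' of some rule, assuming a partner is available.

def can_enter(word, rules):
    """True if word contains a splicing site of some rule in the component."""
    for u1, u2, u1p, u2p in rules:
        if (u1 + u2) in word or (u1p + u2p) in word:
            return True
    return False

# Hypothetical component with a single rule shaped like 1.2 (epsilon|aY / Z|Y1):
component = [("", "aY", "Z", "Y1")]
print(can_enter("XwaY", component))  # True: contains the site "aY"
print(can_enter("XwY2", component))  # False: the word is rejected, XwY2 ↑
```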
Proof of (i). Our simulation is based on the method of rotating words [1,2] and the corresponding technique proposed by Gh. Păun [12]. Let us recall them briefly. For any word w = w'w'' ∈ (N ∪ T)* of the formal grammar G, we say that the word Xw''Bw'Y (X, Y, B ∉ N ∪ T) of the TVDH system DG is a "rotation version" of the word w. We also say that going from one rotation version of the word to another is rotating the word. TVDH system DG models the formal grammar G as follows. System DG rotates the word Xw1 uw2 BY into Xw2 Bw1 uY and applies the splicing rule ε|uY / Z|vYj. Thus the rotation is used to bring the occurrence of u to the end of the word. System DG rotates the word Xwai Y (ai ∈ N ∪ T ∪ {B}) "symbol by symbol", i.e. the word Xai wY is obtained after performing a few steps of system DG. Rotating a word is done as follows. We start with the word Xwai Y in component R1. Component R2 receives the word XwYi from component R1. Component R1 then receives the words Xj aj wYi (1 ≤ j ≤ n) from component R2. After that point, system DG works in a cycle in which the indices i and j decrease simultaneously. If j ≠ i, the words derived with these indices are eventually ruled out. When i = j we obtain the word X1 ai wY1. From that word we obtain the word Xai wY (so we have rotated the word Xwai Y) and the word ai w' (when ai w = ai w'B). So, if ai w' ∈ T* then ai w' ∈ L(G) and ai w' ∈ L(DG). TVDH system DG works as follows. The computation follows the flow-chart shown in Fig. 1. The vertices of the flow-chart show the configurations of molecules during the computation. We enumerate all configurations; their numbers are in the upper right corner. In the configurations, w is treated as a variable: it has different values in different configurations. For example, if in configuration 1 w is equal to w'ai, then in configuration 2 w may be w'. We will show that the computation follows the flow-chart, i.e. all molecules produced from one configuration are eliminated, except the molecules of the next configuration.
Description of the Rotation
We have the word Xwai Y (w ∈ (N ∪ T ∪ {B})*), 1 ≤ i ≤ n. We are in configuration 1.
(Xw|ai Y, Z|Yi) ⊢1.2 (XwYi, Zai Y ↑). String Zai Y cannot enter a splicing in R2 and therefore is removed. So we start to rotate the word wai. Now we are in configuration 2.
(X|wYi, Xk ak|Z) ⊢2.1 (Xk ak wYi, XZ), 1 ≤ i, k ≤ n. String XZ is an axiom. If i = 1 we may also have the following computation: (Xw|Y1, Z|Y') ⊢2.4 (XwY' ↑, ZY1). String ZY1 is an axiom and string XwY' is ruled out.
We have arrived at configuration 3. Now there are 4 possible cases:
a) Xj as wY1; b) X1 as wYi; c) Xj as wYi; d) X1 as wY1; where i, j > 1 and 1 ≤ s ≤ n.
a) Xj as wY1, j > 1.
[Figure: flow-chart with vertices for the configurations of the computation: start ⇒ XwY (1) → XwYi (2) → Xj wYi (3); the index-decreasing cycle (3) → Xj wY'i−1 (9) → X'j−1 wY'i−1 (10) → X'j−1 wY''i−1 (11) → X''j−1 wY''i−1 (12) → X''j−1 wYi−1 (13) → (3); and, for j = i = 1, X'wY1 (4) → X'wY' (5) → X''wY' (6) → X''wY'' (7) → X''wY (8) → XwY (1), with the terminal exits X''w' and w' branching off configurations (7) and (8).]
Fig. 1. The flow-chart of the computation
String Xj as wY1 ↑ cannot enter a splicing in R1 and therefore is removed.
b) X1 as wYi, i > 1. There are two possibilities:
(i) (X1 as w|Yi, Z|Y'i−1) ⊢1.3 (X1 as wY'i−1 ↑, ZYi). String ZYi is an axiom. String X1 as wY'i−1 cannot enter a splicing in R2 and therefore is removed.
(ii) (X1|as wYi, X'|Z) ⊢1.5 (X' as wYi ↑, X1 Z). String X1 Z is an axiom. String X' as wYi cannot enter a splicing in R2 and therefore is removed.
So, the computation in this case is "mortal", i.e. it completes without results.
c) Xj as wYi, i, j > 1.
(Xj as w|Yi, Z|Y'i−1) ⊢1.3 (Xj as wY'i−1, ZYi). String ZYi is an axiom. We are in configuration 9.
(Xj|as wY'i−1, X'j−1|Z) ⊢2.2 (X'j−1 as wY'i−1, Xj Z). String Xj Z is an axiom.
We are in configuration 10.
(X'j−1 as w|Y'i−1, Z|Y''i−1) ⊢1.4 (X'j−1 as wY''i−1, ZY'i−1). String ZY'i−1 is an axiom.
We are in configuration 11.
(X'j−1|as wY''i−1, X''j−1|Z) ⊢2.3 (X''j−1 as wY''i−1, X'j−1 Z). String X'j−1 Z is an axiom.
We are in configuration 12.
(X''j−1 as w|Y''i−1, Z|Yi−1) ⊢1.9 (X''j−1 as wYi−1, ZY''i−1). String ZY''i−1 is an axiom.
We are in configuration 13. Now we have 2 possible continuations:
(i) (X''j−1 as w|Y1, Z|Y') ⊢2.4 (X''j−1 as wY' ↑, ZY1). String ZY1 is an axiom and string X''j−1 as wY' is ruled out.
(ii) (X''j−1|as wYi−1, Xj−1|Z) ⊢2.8 (Xj−1 as wYi−1, X''j−1 Z). String X''j−1 Z is an axiom. And we arrive again at configuration 3. It is easy to observe that the indices of X and Y were decreased simultaneously.
d) X1 as wY1.
(X1|as wY1, X'|Z) ⊢1.5 (X' as wY1, X1 Z). String X1 Z is an axiom.
We are in configuration 4. (X' as w|Y1, Z|Y') ⊢2.4 (X' as wY', ZY1). String ZY1 is an axiom.
We are in configuration 5. (X'|as wY', X''|Z) ⊢1.6 (X'' as wY', X'Z). String X'Z is an axiom.
We are in configuration 6. We now have 3 possibilities:
(i) (X''|as wY', X|Z) ⊢2.6 (Xas wY' ↑, X''Z). String X''Z is an axiom. String Xas wY' cannot enter a splicing in R1 and therefore is removed.
(ii) (X''|as wY', ε|Z'') ⊢2.7 (as wY' ↑, X''Z'' ↑). Strings as wY' and X''Z'' cannot enter a splicing in R1 and therefore are removed.
(iii) (X'' as w|Y', Z|Y'') ⊢2.5 (X'' as wY'', ZY'). String ZY' is an axiom. And we arrive at configuration 7.
Now there are 2 possible continuations:
1) (X'' as w|Y'', Z|Y) ⊢1.7 (X'' as wY, ZY''). String ZY'' is an axiom. We are in configuration 8.
(X''|as wY, X|Z) ⊢2.6 (Xas wY, X''Z). String X''Z is an axiom. We arrive at configuration 1, and so we have rotated the symbol as.
We could also apply the rule 2.7: (X''|as wY, ε|Z'') ⊢2.7 (as wY, X''Z'' ↑). String X''Z'' is ruled out. Then
(as w'|at Y, Z|Yt) ⊢1.1', 1.1 or 1.2 (as w'Yt, Zat Y ↑), 1 ≤ t ≤ n. String Zat Y is ruled out.
So, if t ≠ 1 then string as w'Yt is ruled out. If t = 1 then (as w'|Y1, Z|Y') ⊢2.4 (as w'Y' ↑, ZY1). String ZY1 is an axiom and string as w'Y' is ruled out. So, this computation is "mortal".
2) If we have B at the end of the word we may apply the rule 1.8:
(X'' as w'|BY'', Z'|ε) ⊢1.8 (X'' as w', Z'BY'' ↑). String Z'BY'' is ruled out.
(X''|as w', ε|Z'') ⊢2.7 (as w' ↑, X''Z'' ↑). Strings X''Z'' and as w' are ruled out. If string as w' ∈ T* then as w' is a result.
We could also apply the rule 2.6: (X''|as w', X|Z) ⊢2.6 (Xas w' ↑, X''Z). String X''Z is an axiom and Xas w' is ruled out.
Simulating the Productions of the Grammar
If we have a production u → vai (resp. u → ε) and a word Xw'uY (resp. Xw'ai uY) we may apply the rule 1.1 (resp. 1.1').
(Xw'|uY, Z|vYi) ⊢1.1 (Xw'vYi, ZuY ↑). String ZuY cannot enter a splicing in R2 and therefore is removed.
(Xw'|ai uY, Z|Yi) ⊢1.1' (Xw'Yi, Zai uY ↑). String Zai uY cannot enter a splicing in R2 and therefore is removed.
After that point the system continues as in the rotation case. Thus we model the application of the rule u → vai (resp. u → ε).
Other Computations with Axioms
We may have the following computations between axioms, which do not lead to new results. This exhausts all other possibilities.
(Zv|Yi, Z|Y'i−1) ⊢1.3 (ZvY'i−1 ↑, ZYi), 2 ≤ i ≤ n.
(X1|a1 Z, X'|Z) ⊢1.5 (X'a1 Z ↑, X1 Z).
(Xk|ak Z, X'k−1|Z) ⊢2.2 (X'k−1 ak Z ↑, Xk Z), 2 ≤ k ≤ n.
(Zv|Y1, Z|Y') ⊢2.4 (ZvY' ↑, ZY1).
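The individual splicing steps traced above can be replayed mechanically. The following sketch (not from the paper) replays the first rotation step, (Xw|ai Y, Z|Yi) ⊢1.2 (XwYi, Zai Y), with indices flattened into plain characters ("Yi" stands for Yi, "a" for ai):

```python
# Sketch (not from the paper) replaying the first rotation step,
# (Xw|aY, Z|Yi) |-1.2 (XwYi, ZaY), on concrete strings.

def splice(rule, m1, m2):
    """All pairs produced by rule (u1, u2, u1', u2') on (m1, m2)."""
    u1, u2, u1p, u2p = rule
    out = []
    for i in range(len(m1) + 1):
        if m1[:i].endswith(u1) and m1[i:].startswith(u2):
            for j in range(len(m2) + 1):
                if m2[:j].endswith(u1p) and m2[j:].startswith(u2p):
                    out.append((m1[:i] + m2[j:], m2[:j] + m1[i:]))
    return out

rule_12 = ("", "aY", "Z", "Yi")        # a rule 1.2 instance, epsilon|aY / Z|Yi
print(splice(rule_12, "XwaY", "ZYi"))  # [('XwYi', 'ZaY')]
```

The second product, ZaY, is exactly the string marked ↑ in the proof: it enters no splicing in R2 and is removed.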
Proof of (ii). Let us see which words are produced by DG. First of all, we can see that DG correctly simulates a rule u → v ∈ P, because all additional molecules are eliminated. The letters X, Y and B are removed only when B is at the end of the word. This means that we use the right "rotation version" of the word in order to obtain the corresponding terminal string. So, if w ∈ L(DG) then w ∈ L(G).
Acknowledgements. The authors acknowledge the very helpful contribution of the NATO project PST.CLG.976912 and of the IST-2001-32008 MolCoNet project, which enhanced their cooperation and gave the best conditions for producing the present result. For the same reasons they also acknowledge the help of the French Ministry of Education.
References
1. Csuhaj-Varjú, E., Kari, L., Păun, G.: Test tube distributed systems based on splicing. Computers and AI 2–3 (1996) 211–232
2. Ferretti, C., Mauri, G., Zandron, C.: Nine test tubes generate any RE language. Theoretical Computer Science 231, no. 2 (2000) 171–180
3. Head, T.: Formal language theory and DNA: an analysis of the generative capacity of recombinant behaviors. Bulletin of Mathematical Biology 49 (1987) 737–759
4. Head, T., Păun, Gh., Pixton, D.: Language theory and molecular genetics. Generative mechanisms suggested by DNA recombination. Chapter 7 in vol. 2 of: Rozenberg, G., Salomaa, A. (eds.): Handbook of Formal Languages, 3 volumes. Springer-Verlag, Heidelberg (1997)
5. Margenstern, M., Rogozhin, Yu.: A universal time-varying distributed H-system of degree 2. Biosystems 52 (1999) 73–80
6. Margenstern, M., Rogozhin, Yu.: About time-varying distributed H systems. In: Proceedings of DNA6 (Leiden, The Netherlands, June 13–17, 2000). Lecture Notes in Computer Science 2054, Springer (2001) 53–62
7. Margenstern, M., Rogozhin, Yu.: A universal time-varying distributed H system of degree 1. In: Proceedings of DNA7, Seventh International Meeting on DNA Based Computers (University of South Florida, USA, June 10–13, 2001). LNCS (2002), to appear
8. Margenstern, M., Rogozhin, Yu.: Time-varying distributed H systems of degree 1 generate all recursively enumerable languages. In: Ito, M., Păun, Gh., Yu, S. (eds.): Words, Semigroups and Transductions. World Scientific, Singapore (2001) 329–340
9. Păun, A.: On time-varying H systems. Bulletin of the EATCS 67 (1999) 157–164
10. Păun, G., Rozenberg, G., Salomaa, A.: Computing by splicing. Theoretical Computer Science 168, no. 2 (1996) 321–336
11. Păun, G.: DNA computing: distributed splicing systems. In: Structures in Logic and Computer Science. A Selection of Essays in Honor of A. Ehrenfeucht. Lecture Notes in Computer Science 1261 (1997) 353–370
12. Păun, G.: DNA computing based on splicing: universality results. Theoretical Computer Science 231, no. 2 (2000) 275–296
13. Păun, G., Rozenberg, G., Salomaa, A.: DNA Computing: New Computing Paradigms. Springer, Heidelberg (1998)
14. Pixton, D.: Regularity of splicing languages. Discrete Applied Mathematics 69 (1996) 101–124
15. Priese, L., Rogozhin, Yu., Margenstern, M.: Finite H-systems with 3 test tubes are not predictable. In: Altman, R.B., Dunker, A.K., Hunter, L., Klein, T.E. (eds.): Proceedings of the Pacific Symposium on Biocomputing (Kapalua, Maui, January 1998). World Scientific, Singapore (1998) 545–556
Author Index
Andronescu, M., 182
Arita, M., 205
Augh, S.J., 73
Bancroft, C., 168
Barua, R., 124
Basu, S., 61
Besozzi, D., 302
Bi, H., 196, 252
Chai, Y.-G., 143, 156
Chen, J., 196, 252
Cohen, B., 182
Condon, A.E., 182, 215, 229
Deaton, R., 196, 252
Dees, D., 182
Ferretti, C., 302
Frisco, P., 291
Garzon, M., 196
Goode, E., 262
Hashimoto, A., 85
Head, T., 262
Heitsch, C.E., 215
Hoos, H.H., 215, 229
Hug, H., 133
Imai, H., 315
Ionescu, M., 281
Jang, H.-M., 143, 156
Ji, S., 291
Jonoska, N., 1
Kameda, A., 112
Karig, D., 61
Kashiwamura, S., 112
Kim, D., 242
Kobayashi, S., 205
Kondo, T., 205
LaBean, T.H., 10
Lee, I.-H., 156, 242
Lee, J.Y., 73
Lim, H.-W., 143
Liu, D., 10
Margenstern, M., 326
Martín-Vide, C., 281
Matsuda, D., 38
Mauri, G., 302
Misra, J., 124
Ohuchi, A., 112
Park, J.-Y., 156
Park, T.H., 73
Păun, A., 281
Păun, G., 281
Pixton, D., 262
Reif, J.H., 10, 22
Rogozhin, Y., 326
Rose, J.A., 47, 252
Rubin, H., 196
Sa-Ardyen, P., 1
Sakakibara, Y., 315
Schuler, R., 133
Seeman, N.C., 1
Shiba, T., 112
Shin, S.-Y., 73, 242
Skiena, S., 182
Slaybaugh, L., 182
Suyama, A., 47
Takahara, A., 269
Takano, M., 47
Takenaka, Y., 85
Taylor Clelland, C.L., 168
Torre, P. de la, 95
Tulpan, D.C., 229
Verlan, S., 326
Weiss, R., 61
Wood, D.H., 168, 196
Yamamoto, M., 112
Yamamura, M., 38
Yokomori, T., 269
Yoo, S.-I., 143
Yun, J.-E., 143
Zandron, C., 302
Zhang, B.-T., 73, 143, 156, 242
Zhao, Y., 182