DATA HANDLING IN SCIENCE AND TECHNOLOGY - VOLUME 6
Scientific Computing and Automation (Europe) 1990
DATA HANDLING IN SCIENCE AND TECHNOLOGY
Advisory Editors: B.G.M. Vandeginste and O.M. Kvalheim
Other volumes in this series:
Volume 1  Microprocessor Programming and Applications for Scientists and Engineers by R.R. Smardzewski
Volume 2  Chemometrics: A textbook by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3  Experimental Design: A Chemometric Approach by S.N. Deming and S.L. Morgan
Volume 4  Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology by P. Valkó and S. Vajda
Volume 5  PCs for Chemists, edited by J. Zupan
Volume 6  Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June, 1990, Maastricht, The Netherlands, edited by E.J. Karjalainen
DATA HANDLING IN SCIENCE AND TECHNOLOGY - VOLUME 6 Advisory Editors: B.G.M. Vandeginste and O.M. Kvalheim
Scientific Computing and Automation (Europe) 1990
Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June, 1990, Maastricht, The Netherlands
edited by
E.J. KARJALAINEN Department of Clinical Chemistry, University of Helsinki, SF-00290 Helsinki, Finland
ELSEVIER Amsterdam - Oxford - New York - Tokyo
1990
ELSEVIER SCIENCE PUBLISHERS B.V.
Sara Burgerhartstraat 25
P.O. Box 211, 1000 AE Amsterdam, The Netherlands

Distributors for the United States and Canada:
ELSEVIER SCIENCE PUBLISHING COMPANY INC.
655 Avenue of the Americas
New York, NY 10010, U.S.A.
Library of Congress Cataloging-in-Publication Data

Scientific Computing and Automation (Europe) Conference (1990 : Maastricht, Netherlands)
  Scientific computing and automation (Europe) 1990 : proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June 1990, Maastricht, the Netherlands / edited by E. Karjalainen.
  p. cm. (Data handling in science and technology ; v. 6)
  Includes bibliographical references and index.
  ISBN 0-444-88949-3
  1. Science--Data processing--Congresses. 2. Technology--Data processing--Congresses. 3. Electronic digital computers--Scientific applications--Congresses. 4. Computer engineering--Congresses. I. Karjalainen, E. (Erkki) II. Title. III. Series.
  Q183.9.S3 1990
  502.85--dc20    90-22010 CIP
ISBN 0-444-88949-3
© Elsevier Science Publishers B.V., 1990

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science Publishers B.V. / Physical Sciences & Engineering Division, P.O. Box 330, 1000 AH Amsterdam, The Netherlands.

Special regulations for readers in the USA - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the USA. All other copyright questions, including photocopying outside of the USA, should be referred to the publisher.

No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Although all advertising material is expected to conform to ethical (medical) standards, inclusion in this publication does not constitute a guarantee or endorsement of the quality or value of such product or of the claims made of it by its manufacturer.

This book is printed on acid-free paper.

Printed in The Netherlands
Contents

Preface

I. Scientific Visualization and Supercomputers
1. An overview of visualization techniques in computational science: State of the delivered art at the National Center for Supercomputing Applications
   Hardin J, Folk M
2. The application of supercomputers to chemometrics
   Hopke PK
3. Parallel computing of resonance Raman intensities using a transputer array
   Efremov RG
4. A user interface for 3D reconstruction of computer tomograms or magnetic resonance images
   Frühauf M
5. Automatic correspondence finding in deformed serial sections
   Zhang YJ
6. Biological applications of computer graphics and geometric modelling
   Barrett AN, Summerbell D

II. Statistics
7. Experimental optimization for quality products and processes
   Deming SN
8. Experimental design, response surface methodology and multi criteria decision making in the development of drug dosage forms
   Doornbos DA, Smilde AK, de Boer JH, Duineveld CAA
9. The role of exploratory data analysis in the development of novel antiviral compounds
   Lewi PJ, Van Hoof J, Andries K
10. Some novel statistical aspects of the design and analysis of Quantitative Structure Activity Relationship studies
    Borth DM, Dekeyser MA
11. Site-directed computer-aided drug design: Progress towards the design of novel lead compounds using "molecular" lattices
    Lewis RA, Kuntz ID
12. A chemometrics / statistics / neural networks toolbox for MATLAB
    Haario H, Taavitsainen V-M, Jokinen PA

III. Data Analysis and Chemometrics
13. Neural networks in analytical chemistry
    Kateman G, Smits JRM
14. Electrodeposited copper fractals: Fractals in chemistry
    Hibbert DB
15. The use of fractal dimension to characterize individual airborne particles
    Hopke PK, Casuccio GS, Mershon WJ, Lee RJ
16. Use of a rule-building expert system for classifying particles based on SEM analysis
    Hopke PK, Mi Y
17. Partial Least Squares (PLS) for the prediction of real-life performance from laboratory results
    Lewi PJ, Vekemans B, Gypen LM
18. Dynamic modelling of complex enzymatic reactions
    Ferreira EC, Duarte JC
19. From chemical sensors to bioelectronics: A constant search for improved selectivity, sensitivity and miniaturization
    Coulet PR
20. A Turbo Pascal program for on-line analysis of spontaneous neuronal unit activity
    Gaál L, Molnár P

IV. Laboratory Robotics
21. Automation of screening reactions in organic synthesis
    Josses P, Joux B, Barrier R, Desmurs JR, Bulliot H, Ploquin Y, Metivier P
22. A smart robotics system for the design and optimization of spectrophotometric experiments
    Settle Jr FA, Blankenship J, Costello S, Sprouse M, Wick P
23. Laboratory automation and robotics - Quo vadis?
    Linder M
24. Report of two years long activity of an automatic immunoassay section linked with a laboratory information system in a clinical laboratory
    Dorizzi RM, Pradella M

V. LIMS and Validation of Computer Systems
25. An integrated approach to the analysis and design of automated manufacturing systems
    Maj SP
26. A universal LIMS architecture
    Mattes DC, McDowall RD
27. Designing and implementing a LIMS for the use of a quality assurance laboratory within a brewery
    Dickinson K, Kennedy R, Smith P
28. Selection of LIMS for a pharmaceutical research and development laboratory - A case study
    Broad LA, Maloney TA, Subak Jr EJ
29. A new pharmacokinetic LIMS-system (KINLIMS) with special emphasis on GLP
    Timm U, Hirth B
30. Validation and certification: Commercial and regulatory aspects
    Murphy M
31. Developing a data system for the regulated laboratory
    Yendle PW, Smith KP, Farrie JMT, Last BJ

VI. Standards Activities
32. Standards in health care informatics, a European AIM
    Noothoven van Goor J
33. EUCLIDES, a European standard for clinical laboratory data exchange between independent medical information systems
    Sevens C, De Moor G, Vandewalle C
34. Conformance testing of graphics standard software
    Ziegler R

VII. Databases and Documentation
35. A system for creating collections of chemical compounds based on structures
    Bohanec S, Tusar M, Tusar L, Ljubic T, Zupan J
36. TICA: A program for the extraction of analytical chemical information from texts
    Postma GJ, van der Linden B, Smits JRM, Kateman G
37. Databases for geodetic applications
    Ruland D, Ruland R
38. Automatic documentation of graphical schematics
    May M

VIII. Tools for Spectroscopy
39. Developments in scientific data transfer
    Davies AN, Hillig H, Linscheid M
40. Hypermedia tools for structure elucidation based on spectroscopic methods
    Farkas M, Cadisch M, Pretsch E
41. Synergistic use of multi-spectral data: Missing pieces of the workstation puzzle
    Wilkins CL, Baumeister ER, West CD
42. Spectrum reconstruction in GC/MS. The robustness of the solution found with Alternating Regression (AR)
    Karjalainen EJ

Author Index

Subject Index
Preface
The second European Scientific Computing and Automation meeting, SCA 90 (Europe), was held in June 1990 in Maastricht, the Netherlands. This book contains a broad selection of the papers presented at the meeting.

Science is getting more specialized and is divided into narrow special subjects. But there are other forces at work: the computer is bringing new unity to science. Computers are used for making measurements, interpreting the data, and filing the results. Mathematical models are coming into wider use. Because these computer-based tools are common to many scientific fields, SCA tries to concentrate on the common tools that are useful in several disciplines.

Computers can produce numbers at a furious pace. Trying to see what is going on during the computing process is a frustrating experience; it is like trying to take a sip of water from a fire hose. A new discipline, scientific visualization, is evolving to help the researcher come to grips with the numbers. The opening talk was given by Hardin and Folk from the National Center for Supercomputing Applications (NCSA) at the University of Illinois. The main element of their presentation was the dramatic color animations produced on supercomputers and workstations. They describe the public-domain visualization tools behind this work.

Supercomputers are useful in chemometrics. Hopke gives examples of problems that benefit from the distribution of the computations over a number of parallel processor units. The parallel computer used by Efremov is different: he installed a number of transputers in an AT-type PC to calculate Raman spectra, speeding the AT up by a factor of 200. Medical personnel need tools to manipulate three-dimensional images; Frühauf covers the design of intuitive user interfaces for tomography and MR images. Zhang analyzes three-dimensional structures from light microscopy of serial sections and shows how a series of images from deformed tissue sections can be linked together into a complete three-dimensional structure. The geometric laws of the growth process in embryonic limbs are described by Barrett and Summerbell; it is possible to describe a morphological process with a small number of parameters in a geometric model.

Statistical methods are needed in all scientific disciplines. Deming describes the role of experimental design in his article. Doornbos et al. use experimental design to optimize different dosage forms of drugs. Lewi shows how chemometric tools are used in industry
for designing drugs. Borth handles statistical problems with censored data in quantitative structure-activity relationships (QSAR) and drug design. Lewis and Kuntz describe how the idea of molecular lattices is used to find novel lead compounds in drug design. Haario, Taavitsainen and Jokinen have developed a statistical toolbox for use in chemometrics; the routines are built in MATLAB, a matrix-based language for mathematical programming.

Chemometrics is a term used to cover a broad field of computer applications in chemistry: it covers statistics, expert systems and many types of mathematical modeling. The chemical applications of neural networks are handled in a tutorial by Kateman et al. Hibbert describes uses of fractals in chemistry. Hopke analyzes the surface texture of individual airborne smoke particles by fractals. Ferreira optimizes the manufacturing process that produces ampicillin; compounds that could not be measured directly are estimated from a dynamic model. Coulet gives a broad tutorial on biosensors, where mathematical models are used to obtain more specific measurements. Gaál and Molnár describe a computer program for analyzing the electrical activity of single neurons in pharmaceutical research.

Laboratories use analyzers built for a fixed purpose. A programmable arm, a laboratory robot, can be programmed by the user for many operations in the laboratory. In principle, most tasks in the laboratory can be automated with robots. The early users have often found the programming costs high, but there are still uses where the flexibility of robots is needed. Josses et al. show how a pharmaceutical company uses laboratory robotics to develop new methods for organic synthesis. Settle et al. describe how expert systems are linked with laboratory robotics. Linder ponders the philosophy of robotics and recommends using independent workstations with local intelligence. Dorizzi and Pradella describe the interfacing of immunoassay pipetting stations to a small commercial LIMS system; the gain in productivity for a rather small investment was impressive.

Laboratory Information Management Systems (LIMS) were one of the main themes of the meeting. People feel that LIMS is still a problem, and the interest ranges all the way from single instrument users to the problems of management. The development of computer software often requires large projects. Maj describes a formal approach to the analysis and design of automated manufacturing systems. Mattes and McDowall emphasize the role of system architecture for the long-term viability of LIMS systems. Dickinson et al. analyze a LIMS development project for a quality assurance laboratory in a brewery. A project for the choice of a commercial LIMS supplier for a multinational company is described by Broad et al.; the selected LIMS is used in a pharmaceutical research laboratory. Timm and Hirth developed a custom-made LIMS system for documenting
pharmacokinetic measurements. The KINLIMS system emphasizes the requirements of GLP. Murphy describes the legal and commercial aspects of validation and certification processes. Yendle et al. give a practical example of how an industrial software project developed and documented a chromatography integration package; one of the goals in product design was to facilitate independent validation by the user.

Standards are needed for building larger systems. If we do not use proper standards, the wheel has to be reinvented every time. Noothoven van Goor reports how the AIM (Advanced Informatics in Medicine) program is catalyzing the development of standards for medical informatics in Europe. Sevens, De Moor and Vandewalle give an example in the EUCLIDES project, which defines standards for the interchange of clinical laboratory data. Ziegler shows how computer graphics standards are tested for conformance to the original specification.

The costs of computer data storage are decreasing, and as a result databases are getting bigger. Large on-line databases for literature searches are accessible to all researchers, yet there is still a need for specialized local databases. Bohanec et al. present a suite of computer programs that handle collections of spectral and other chemical data; the information is accessed through the chemical structures and indexed by descriptors of molecular features derived from connectivity tables. Postma describes how databases can be built up automatically from the chemical literature with an experimental parser program called TICA; the program is limited to interpreting abstracts about titration methods. A computer database for geodetic applications is demonstrated by Ruland; the system combines all phases of the work. Technical documents can be generated by computers, and May gives examples of algorithms for producing drawings.

Users of spectroscopic instruments have many needs: they need tools for data interchange and powerful software as part of the instrument. Davies, Hillig, and Linscheid describe their experiences with standards for the exchange of spectroscopic data; vendor formats are being replaced by general formats like JCAMP-DX. The interpretation of spectral data can be supported by the new hypertext-based tools described by Farkas, Cadisch and Pretsch; the HyperCard-based tools should be very useful in chemical education. Wilkins builds "hyphenated" instruments by combining FT-IR spectroscopy and mass spectrometry with gas chromatography. The combination produces more information than the single instruments, and more advanced instrument software is needed to fully utilize the information it produces. Karjalainen describes how overlapping spectra from hyphenated instruments are dissected into distinct components; validation and quality assurance aspects of the spectrum decomposition process are emphasized.
Many people contributed to the success of the meeting. The members of the program board, Prof. D.L. Massart, Prof. Chr. Trendelenburg, Prof. R.E. Dessy, Dr. R. McDowall, and Prof. M.J. de Matos Barbosa, contributed their experience to the selection of topics and papers. I want to express my sincere thanks to them; without their help the congress would not have been the meeting for multiple disciplines that it was. Fiona Anderson from Nature magazine handled the exhibition, a central element in the meeting. Robi Valkhoff and her team of conference organizers from Reunion kept the congress in good order. I want to thank all contributors for their papers. Finally, thanks to Ulla Karjalainen, Ph.D., for laying out the pages using Quark XPress on a Macintosh II computer.
September, 1990
Erkki J. Karjalainen Helsinki
Scientific Visualization and Supercomputers
Illustrations to Chapter 1 by J. Hardin and M. Folk
Color plate 5. An ab initio chemistry study of catalysis showing the shape of the electron potential fields surrounding a disassociating Niobium trimer. NCSA Image / Harrell Sellers, NCSA.
Color plate 6. An image generated during an earlier run of the ab initio chemistry study points out a problem in part of the researcher’s code. NCSA DataScope / Harrell Sellers, NCSA.
Illustrations to Chapter 1 by J. Hardin and M. Folk (cont...)
Illustrations to Chapter 4 by M. Frühauf
Photos taken from the screen, using a normal camera, were provided by the authors. The 35 mm slides were scanned into a Macintosh II computer using a BarneyScan slide scanner. Page layout was done with Quark XPress; color separations were made using the Spectre series programs.
CHAPTER 1
An Overview of Visualization Techniques in Computational Science: State of the Delivered Art at the National Center for Supercomputing Applications
J. Hardin and M. Folk
National Center for Supercomputing Applications, University of Illinois, USA
Introduction

Over the last decade we have witnessed the emergence of a number of useful techniques and capabilities in the field of scientific computing. At NCSA, we have focused on the interactions between the simultaneous emergence of massive computational engines such as the Cray line of supercomputers, the development of increasingly functional workstations that sit on the user's desktop, and the use of visual techniques to make sense of the work done on these systems. The keynote presentation given at the 1990 SCA (Europe) Conference in Maastricht was in the form of a report, or update, on this work.

The kind of full color, 3D, fully rendered, animated images that have been produced from supercomputer numerical simulations of developing tornadic clouds (see Color plate 1; the Color plates are printed separately) represent the state of the art in visualization techniques. A 10 minute video of this required a team of visualization specialists working for weeks in collaboration with the scientists who produced the simulation program. The video hardware the images were produced on is part of an advanced digital production system that, with dedicated hardware and software, cost between 1/2 and 1 million dollars. For most sections of the animation, individual frames took 10 to 20 minutes to render. At thirty frames a second, the 10 minute video required close to 20,000 frames. The voice-over required the use of the above-mentioned video production facilities and professional support. All in all, this is a tour de force of current technologies, techniques and technicians.

Obviously, not everyone has access to such resources. But increasing numbers of researchers have seen the benefits of visualizing their data and want to use such techniques in their work. The need for such capabilities comes from the mountains of data produced by numerical simulations on high performance computing systems, and from equally endless sets of data generated by satellites, radio telescopes, or any observational
data collecting device. As access to high performance systems increases, through government programs like the NSF Centers in the U.S., or simply through the rapidly decreasing cost of these systems, so does the need for visualization techniques. Recognizing that such needs of researchers have increased dramatically in the past few years, but that most researchers work with limited resources, the question we want to pursue is: what can be done currently, by naive users, on their desktops, specifically with NCSA-developed software that we can give to users now? In one sense we are asking: how many of the techniques seen in the storm simulation can be done interactively on existing installed hardware systems, and by what proportion of the user population? And what changes when we start talking about interactive analysis of data versus the presentation of data that we have in the storm video?
Desktop visualization

In answering these questions, we moved from the video to programs running on two computers that represent low and middle range capabilities for today's users: the Macintosh II and the SGI Personal Iris. The first demonstration was of NCSA DataScope, a data analysis package developed for the Macintosh II line, part of a set of such tools developed at NCSA by the Software Tools Group (STG). The choice of the Macintosh emphasizes the importance of the user interface in providing a wide base of users with access to methods of visual data analysis. It is an axiom among the developers of these tools that scientists do not like to program. Chemists like to do chemistry; physicists like to figure out physics; structural engineers like to solve engineering problems. In all these things, computer simulations of chemical processes and physical systems allow researchers to test their ideas, uncover new knowledge, and design and test new structures. The data from these simulations, or from experiments or remote sensing devices, may be complex enough, or simply large enough, that visualization techniques will be useful in its analysis. And those techniques themselves may be complex and sophisticated. But the focus of the scientist remains on the science, not the methods used to generate scientific knowledge. The user needs to be able to easily mobilize sophisticated visualization techniques and apply them to the data at hand, without becoming skilled in the means and methods of computer graphics. A good user interface makes this possible.

With DataScope (Color plate 2) the user can start with the raw data in 2-dimensional array form, displayed on the screen as a spreadsheet. By choosing a menu option the user can display the data as a 2D color raster image, and apply a more revealing palette of colors. The user is now able to see the entire data set, perhaps half a megabyte of data, at a glance. Shockwaves in a computational fluid dynamics simulation can be easily identified by shape and changes in color across the shock boundaries (Color plate 3). The pulse or current in a MOSFET device, and the depth it reaches in the substrate, is apparent in
the results of an electronic device simulation (Color plate 4). The shape of the electron potentials surrounding disassociating atoms in an ab initio chemistry study of catalysis is clear to see (Color plate 5). This is the basic data-revealing step that is common to all visualization techniques: the mass of numbers has been transformed into a recognizable image or visual pattern. The next step is to allow the user to further investigate both the image and the numbers. By pointing and clicking with the mouse, regions of interest in the image can be chosen and the corresponding numbers in the dataset are highlighted. The two views on the data, the spreadsheet and the image, are now synchronized. It is then possible to see the actual floating point values associated with particular regions or points in the image. The researcher can also do transformations of the data set by typing an equation into a notebook window. In this way a discrete derivative can be obtained, or a highpass filter run across the data set. Then the new data set can be immediately imaged and compared to the original. Other operations are possible, all of which allow the user to interrogate the data by moving easily between the images and the raw numbers.

The next tool that was demonstrated was NCSA Image. This tool was developed for investigating data images through various interactive manipulations of the color palette, and to view sequences of images that may, for instance, represent a series of time steps in a simulation. By dynamically changing the palette, or color table, associated with an image a researcher can look for unexpected contrasts or changes in values that had not been foreseen. The colors associated with a particular region of the data set that may represent the edge of a vortex, or a boundary in the simulation, can be zoomed in on, and the color contrast enhanced for a finer view of detail in that area. An animation reveals the dynamic relations of variables and the development of a physical system over time. Combinations of such techniques allow the researcher to develop as many ways of interrogating the data as possible.

So far we have only discussed the data analysis aspect of visualization techniques. But equally important for researchers developing and testing their numerical simulations is the ability of images like the ones described above to be used as debugging tools. Without such images, researchers are often reduced to sampling the huge data sets their codes generate, and are not able to see when boundary conditions have not been properly established, or when unexpected oscillations have been generated by their models. A quick look at an image of the system's output can alert the researcher to such problems, and, on occasion, also point in the direction of a solution. An example of this was the interesting image generated by an ab initio chemistry program under development by a researcher at NCSA. Color plate 5, above, shows a stage in the dissociation of a niobium trimer: the atom (ion?) is moving off to the right, leaving the dimer behind. Color plate 6 shows an image of an earlier run done during the development of that same code. By looking at this the developer not only was able to easily see that the program had gone wrong, but was also able to narrow down the search for where the problem lay. Instead of, as the
researcher put it, "having to slog my way through the entire wave function from beginning to end," he was able to go directly to the section of the code responsible for the problem. Not shown, but also part of the NCSA tool suite for the Macintosh, was NCSA PalEdit, which allows users to interactively manipulate and construct color palettes, and NCSA Layout, which is used to annotate images and compose them for slides that can be used to communicate a researcher's work and findings to colleagues (Color plate 7). This last tool adds presentation capabilities to the data analysis and code debugging uses of such visualization techniques.
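The workflow described above (load a 2D array, map it to a pseudocolor raster image, apply a transformation such as a discrete derivative, and image the result) can be sketched in a few lines of modern Python. This is only an illustration of the idea, not NCSA DataScope itself; it assumes the numpy and matplotlib packages and a hypothetical data file name.

```python
# Illustrative sketch only: mimics the DataScope-style workflow described above
# (2D array -> pseudocolor raster image -> discrete derivative -> new image).
# Assumes numpy and matplotlib are installed; "field.npy" is a hypothetical file.
import numpy as np
import matplotlib.pyplot as plt

data = np.load("field.npy")                    # 2D floating point array, e.g. a simulation slice

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(data, cmap="viridis")           # raster image with a revealing palette
axes[0].set_title("raw data")

ddx = np.diff(data, axis=1)                    # discrete derivative along one axis
axes[1].imshow(ddx, cmap="viridis")            # image the transformed data set
axes[1].set_title("discrete derivative")

plt.show()
```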
Standards for data formats

This is a good place to point out that a key feature of any scientific visualization system is interoperability among the tools and the scientists' data-producing software. Whether the data is from simulations or from instrumental observations, it should be easy for scientists both to get it into a form that the tools can work with, and to get it from where it is generated to the platforms where the tools reside. Furthermore, to the extent that the tools can operate on similar kinds of data, the transfer of data from one tool to another should, from the user's perspective, be trivial. For this it is necessary to provide standard data models that all of the tools understand, a file format that accommodates these models, is extensible to future models and is transportable across all platforms, and simple user interfaces that enable scientists with very little effort to store and retrieve their data.

To satisfy these requirements NCSA has developed a format called HDF. HDF is a self-describing format, allowing an application to interpret the structure and contents of a file without any outside information. Each data object in a file is tagged with an identifier that tells what it is, how big it is, and where it can be found. A program designed to interpret certain tag types can scan a file containing those types and process the corresponding data. In addition to the primitive tag types, a grouping mechanism makes it possible to describe, within the file, commonality among objects. User interfaces and utilities exist that make it easy for scientists' programs, as well as the tools, to read and write HDF files. Currently there are sets of routines for reading and writing 8-bit and 24-bit raster images, multi-dimensional gridded floating point data, polygonal data, annotations, and general record-structured data.

For example, by placing a few HDF calls in their program a user can have their numerical simulation, running on a supercomputer, generate HDF files of the output data. These files can then be moved, say by the common Unix method of file transfer (ftp), to a Macintosh where NCSA DataScope is running. The file can then be loaded into DataScope without any changes having to be made by the user. This exemplifies the desired transportability of files between machines and operating systems mentioned above. In addition, since each HDF file can contain a number of dissimilar objects, researchers can use HDF files to help organize their work. Upon completing the work
session described above in NCSA DataScope, the user could save the original floating point data set, the interpolated raster image of that data generated by DataScope, any analytic functions or notes typed during that session into the notebook, and the palette that the user had found most useful to view the data image with, all in one HDF file. Upon returning, a day or a week later, to the same problem, clicking on that data set launches NCSA DataScope and loads the data, image, palette, and notes.
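The self-describing, tagged-object idea behind HDF can be illustrated with a toy example. The sketch below is not the real HDF byte layout or API; it is only a minimal tag/length scheme, with made-up tag numbers, written to show why a reader can interpret such a file without any outside information.

```python
# Toy illustration of a self-describing, tagged file (NOT the real HDF layout).
# Each object is written as: 4-byte tag id, 4-byte length, then the raw bytes,
# so a reader can skip objects it does not understand and find the ones it does.
import struct

TAG_NOTE = 100           # hypothetical tag numbers for this toy format
TAG_FLOAT32_GRID = 200

def write_objects(path, objects):
    """objects: list of (tag, payload_bytes) pairs."""
    with open(path, "wb") as f:
        for tag, payload in objects:
            f.write(struct.pack(">ii", tag, len(payload)))
            f.write(payload)

def read_objects(path):
    """Yield (tag, payload_bytes) pairs; no outside information is needed."""
    with open(path, "rb") as f:
        while header := f.read(8):
            tag, length = struct.unpack(">ii", header)
            yield tag, f.read(length)

# A note and a small 2x2 float grid stored together in one file.
grid = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
write_objects("session.toy", [(TAG_NOTE, b"palette: rainbow"), (TAG_FLOAT32_GRID, grid)])
for tag, payload in read_objects("session.toy"):
    print(tag, len(payload), "bytes")
```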
3D visualization

Moving off of the Macintosh platform requires a user interface that retains as much of the Mac's ease of use as possible. The emerging X-windows environment from MIT, while not providing a complete graphic user interface, does provide a standard form of windows on a variety of platforms. The NCSA STG has developed much of the functionality described above in the X-windows environment, and in the analysis of what is commonly referred to as regularly gridded data, has moved into 3D functionality. NCSA XImage and NCSA XDataSlice are tools that allow users on platforms that support X-windows to use visualization techniques in the analysis of their data. A number of techniques are used to look at data cubes, including tiling multiple windows, animating a 2D image along the 3rd dimension (which can be done with NCSA Image also, see above), isosurface rendering, and the use of the 'slicer dicer' feature (Color plate 8). This last method allows the user to cut into a 3-dimensional data set and see arbitrary planes represented as pseudocolored surfaces in a 3D representation. This method combines flexibility in choice of data area to view, and rapid response to user choices, increasing the interactivity of the analysis.

The kind of data that is often the output of finite element or finite difference analyses, what is commonly referred to as polygonal or mesh data or points in space, presents heavier demands on the computational resources available on the desktop. (None of the tools discussed above allow the presentation or analysis of such data unless it is first translated into a grid form.) The need to rapidly interpret a large set of points, connectivity information and associated scalar values into a relatively complex 3D image requires that much of this be done in hardware if interactive speeds are to be achieved. The cost of such capabilities is rapidly decreasing, making tools that were considered esoteric 3 years ago standard items today. With this rapidly decreasing cost and the resulting increasing availability of systems capable of handling the demands of interactive 3D data visualization and analysis, the NCSA STG has developed a tool called PolyView (for Polygon Viewer) which runs on the SGI Personal Iris. The Personal Iris has a graphics pipeline in hardware and runs Unix, one of the emerging standards that NCSA has adopted. While the Personal Iris does not currently fully support a portable standardized graphic user interface, it is moving toward a complete X-windows environment in the near future. PolyView currently uses
the native windowing system to construct the all-important user interface. This allows naive users to load 3D data sets and display them as points, lines, or filled polygons (Color plate 9). The user can adjust shading, rotate the object, zoom in and out, and perform palette manipulations of associated scalars, for instance to investigate the heating of the base plate heat sink soldered to an induction coil (see Color plates 10-11). In this example, the heat generated by various power levels applied to the coil is simulated and the heat levels mapped to a palette, just as was done in the 2D examples above. The heat level is then easy to discriminate at various points on the 3D coil and plate. Interactive manipulation of the palette allows the user to focus on a particular range of values. The user can then select a point or region that appears interesting and ask that the actual values of the variable be displayed in floating point form.
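As a rough illustration of the slicing and palette-mapping ideas used by these tools, the sketch below pulls axis-aligned planes out of a 3D scalar field and shows them as pseudocolored images. It assumes numpy and matplotlib and a synthetic data cube; it is not any NCSA tool, and it does not attempt the arbitrary (non axis-aligned) cuts or isosurfaces those tools provide.

```python
# Minimal sketch of the "slicer dicer" idea: cut planes out of a 3D scalar field
# and show them as palette-mapped 2D images. Illustration only, on synthetic data.
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 64x64x64 data cube standing in for simulation output.
x, y, z = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
cube = np.exp(-(x**2 + 2 * y**2 + 3 * z**2))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, axis in zip(axes, range(3)):
    plane = cube.take(indices=32, axis=axis)   # middle slice along each axis
    ax.imshow(plane, cmap="plasma")            # pseudocolored view of the cut
    ax.set_title(f"slice through axis {axis}")
plt.show()
```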
Conclusion

This then was the brief update of the state of the delivered art at NCSA. Over the last year, indeed, since the talk at the beginning of the summer, work on bringing advanced visualization techniques closer to more users has progressed rapidly. Work at NCSA has pushed the limits of what can be done by users versus specialists. For instance, video production systems have decreased rapidly in price, and more and more of the functionality once reserved to the large team efforts exemplified by the storm video is available in systems quickly approaching the desktop. These developments, and similar advances in user interface design and implementation in the area of scientific visualization, mean continuing rapid improvements in the functionality available to a growing base of researchers.
Acknowledgements

The authors wish to thank all those who helped make this rather complicated demonstration and talk possible. Erkki Karjalainen, the conference chair, for the invitation and his enthusiastic assistance during the conference. Robi Valkhoff and Victorine Bos of CAOS and Keith Foley of Elsevier for their patience and gracious assistance setting up the presentation and solving problems, some of the authors' making, that cropped up. And the staff of the MECC in Maastricht for their timely and professional technical assistance.
CHAPTER 2
The Application of Supercomputers to Chemometrics
P.K. Hopke
Department of Chemistry, Clarkson University, Potsdam, NY 13699-5810, USA
Abstract

The improvements in speed and memory of computers have made applications of chemometric methods routine, with many procedures easily accomplished using microcomputers. What problems are sufficiently large and complex that the largest and most powerful computers are needed? One major class of problems is those methods based on sampling strategies. These methods include robust regression, bootstrapping, jackknifing, and cluster analysis of very large data sets. Parallel supercomputing systems are particularly well suited to this type of problem since each processor can be used to independently analyze a sample so that multiple samples can be examined simultaneously. Examples of these problems will be presented along with measures of the improvement in computer throughput resulting from the parallelization of the problem.
1. Introduction

During the last ten years, there have been revolutionary changes in the availability of computing resources for solving scientific problems. Computers range from quite powerful microcomputers now commonplace in the analytical laboratory up to "supercomputers" that are able to solve large numerical problems impossible to consider before their advent. A computer is not "super" simply because of its high clock speed. Supercomputers achieve their computational throughput capability because of a combination of short cycle time and a computing architecture that permits the simultaneous accomplishment of multiple tasks. There are several types of architectures, and these differences lead to differences in the kinds of problems that each may solve in an optimal manner. In the past, computers with multiple processor units have generally been used to provide simultaneous services to a number of users. In the supercomputers, multiple processors are assigned to a single job. Also, in different machines the processors are used in very different configurations.
Computers can be classified in terms of the multiplicity of instruction streams and the multiplicity of data streams [Flynn, 1966]. Computers can have a single instruction stream (SI) or a multiple instruction stream (MI) and a single data stream (SD) or a multiple data stream (MD). A traditional computer uses a single stream of instructions operating on a single stream of data and is therefore an SISD system. The MISD class is in essence the same as an SISD machine and is not considered to be a separate class [Schendel, 1984].

The SIMD processors can be divided into several types, including array processors, pipeline processors, and associative processors. Although there are differences in the details of the process between these types of systems, they all work by simultaneously performing the same operation on many data elements. Thus, if the serial algorithm can be converted so that vector operations are used, then a number of identical calculations equal to the length of the vector can be made for a single set of machine instructions. Many of the supercomputers such as the CRAY and CDC CYBER 200 systems employ this vectorization to obtain their increase in computational power. It is necessary to rewrite the application program to take advantage of the vector capabilities. The nature of these programming changes will be different for each machine. Thus, although there has been a major effort to improve the standardization of programming languages, machines with unique capabilities require special programming considerations.

Finally, there are MIMD systems where independent processors can be working on different data streams. An example of this type of system is the loosely-coupled array processor (LCAP) system at IBM [Clementi, 1988]. In these systems, the host processor controls and coordinates the global processing with a "master" or "host program" while the heavy calculations are carried out on the independent array processors running "slave programs". The slave processors are not directly connected to one another and the slave programs are not coordinated with each other; the host program is responsible for control over the slave programs and therefore for their coordination and synchronization.

Other multiple processor systems like the Alliant FX/8 use multiple processors with a shared memory. In the Alliant system, there are two types of parallel processors, interactive processors and computational elements. The interactive processors execute the interactive user jobs and run the operating system, thus providing the link to the multiple users of the system. The eight computational elements can work in parallel on a single application, with the system performing the synchronization and scheduling of the elements. These computational elements are such that, within the parallel structure, vectorized operations can be performed. Thus, depending on the problem and program structure, this system can be working on single or multiple data. The FORTRAN compiler automatically performs the optimization of the parallel and vector capabilities. This compiler control of optimization means fewer coordination problems for the programmer. However, it makes it more difficult to know exactly how the code is being optimized.
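As a rough illustration of what "converting a serial algorithm to vector operations" means in practice, the sketch below contrasts an explicit element-by-element loop with a whole-array (vector) expression. numpy is used here only as a convenient stand-in for the vectorizing hardware and compilers discussed in the text.

```python
# Illustration of scalar (one element per instruction) versus vectorized
# (whole-array) formulation of the same calculation.
import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)

# SISD-style serial loop: one multiply-add per iteration.
c_serial = np.empty_like(a)
for i in range(a.size):
    c_serial[i] = 2.0 * a[i] + b[i]

# Vectorized form: one expression over the whole arrays, which a vector
# machine (or numpy's compiled loops) can execute far more efficiently.
c_vector = 2.0 * a + b

assert np.allclose(c_serial, c_vector)
```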
Figure 1. Schematic outline of a parallel system consisting of a master host computer and a series of slave array processors.
2. Applications for parallel systems

The objective of this paper is to examine the utility of a parallel computer system for the implementation of statistical algorithms that are based on processing a large number of samples. Some aspects of parallel computing are illustrated by the implementation of algorithms for cluster analysis and robust regression. Similar applications to cross-validation and the bootstrap are also possible. The system which will be considered here is an MIMD system that consisted of a central host processor (master) connected by channels to 10 array processors (slaves). This system is termed a loosely-coupled array processor (LCAP) system and was implemented at the IBM Research Center at Kingston, NY [Clementi, 1988]. The particular system was designated as LCAP-1. The host processor was an IBM 3081 and the slaves were Floating Point Systems FPS-164 array processors (AP's), linked as in Figure 1. This architecture is "coarse-grained", in contrast with "fine-grained" systems that contain many but less powerful processors. The way to use the LCAP system is to let the host processor control and coordinate the global processing with a "master" or "host program" while the heavy calculations are carried out on the AP's running "slave programs". The slave processors are not directly connected, and the slave programs are not coordinated with each other: the host program is responsible for control over the slave programs and therefore for their coordination and synchronization.

Software running on the LCAP system must consist of a host FORTRAN program that calls subroutines (also in FORTRAN) to run on the slaves. There may be several such subroutines, which may also make use of their own subroutines. All these routines are
grouped into a single slave program that is run on some or all of the slaves. Duplication of the slave processing is avoided either by having different data used on the different slaves or by executing different subroutines (or both). Parallel execution is controlled by a specific set of instructions added to the master program as well as to the slave routines. A precompiler is then used (on the host computer) to generate the master and slave program source code. Subsequently, the master program and AP routines are compiled, linked and loaded for the run. For a detailed description of the LCAP system, see Clementi and Logan [1985]. A guide to the precompiler is given by Chin et al. [1985].
3. A sequential algorithm for clustering large data sets

Cluster analysis is the name given to a large collection of techniques to find groups in data. Usually the data consist of a set of n objects, each of which is characterized by p measurement values, and one is looking for a "natural" grouping of the objects into clusters. A clustering method tries to form groups in such a way that objects belonging to the same group are similar to each other, while objects belonging to different groups are rather dissimilar. The most popular approaches to clustering are the hierarchical methods, which yield a tree-like structure, and the partitioning methods, which we will focus on here. In the latter approach, one wants to obtain a single partition of the n objects into k clusters. Usually k is given by the user, although some methods also try to select this number by means of some criterion.

One of the ways to construct a partition is to determine a collection of k central points in space (called "centrotypes") and to build the clusters around them. The MASLOC method [see Massart et al., 1983] is somewhat different, because it searches for a subset of k objects that are representative of the k clusters. Next, the clusters are constructed by assigning each object of the data set to the nearest representative object. The sum of the distances of all objects to their representative objects measures the quality of the clustering. Indeed, a small value of this sum indicates that most objects are close to the representative object of their cluster. This observation forms the basis of the method. The k representative objects are chosen in such a way that the sum of distances from all the objects of the data set to the nearest of these is as small as possible. It should be noted that, within each cluster, the representative object minimizes the total distance to the cluster's members. Such an object is called a medoid, and the clustering technique is referred to as the k-medoid method.

Implementing the k-medoid method poses two computational problems. The first is that it requires a considerable amount of storage capacity, and the second that finding an optimal solution involves a large number of calculations, even for relatively small n. In practice it is possible to run an exact algorithm for data sets of up to about 50 objects. When using a heuristic algorithm that yields an approximate solution, one can deal with about 300 or 400 objects.
An extension for solving much larger problems has been developed by Kaufman and Rousseeuw [1986]. The corresponding program is called CLARA (from Clustering LARge Applications). It extracts a random sample of, say, m objects from the data set, clusters it using a heuristic k-medoid algorithm, and then assigns all the remaining objects of the data set to one of the found medoids. The set of k medoids that has just been found is then complemented with m - k randomly selected objects to form a new sample, which is then clustered in the same way. Then all of the objects are assigned to the new medoids. The value of this new assignment (the sum of the distances from each object to its medoid) is calculated and compared with the previous value. The set of medoids corresponding to the smallest value is kept, and used as the basis for the next sample. This process can be repeated for a given number of samples, or until no improvement is found during some iterations. The CLARA program was written in a very portable subset of FORTRAN and implemented on several systems, using variable sample sizes and numbers of samples.

The computational advantages of CLARA are considerable. First of all, the k-medoid method is applied to much smaller sets of objects (typically the entire data set might consist of several hundreds of thousands of objects, while a sample would only contain m). The number of calculations being of the order of a quadratic function of the number of objects being clustered, this considerably reduces the computation time. The actual reduction depends, of course, on the number of samples that are considered. Another advantage concerns the storage requirements. As seen above, the k-medoid method is based upon the distances between all objects that must be clustered. The total number of distances is also a quadratic function: for a set of 1,000 objects there are 499,500 such distances, which occupy a sizeable part of central memory. Of course it is possible to store the distances on an external device or to calculate them each time they are needed, but this would greatly increase the computation time. In the CLARA method the samples to be clustered contain few objects, and therefore few distances must be stored. It is true that during the assignment of the entire dataset, the distance of each object to each of the k medoids must be calculated. However, only the sum of the minimal distances must be stored, and not the individual distances. After the last iteration, the assignment of all objects to the final set of medoids is carried out once again, in order to obtain the resulting partition of the entire data set.
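A minimal sketch of the CLARA outer loop just described is given below. It assumes Euclidean distances, numpy, and a crude random-subset stand-in for the heuristic k-medoid search applied to each sample; it illustrates the sampling-and-assignment idea rather than reproducing the original FORTRAN program.

```python
# Sketch of the CLARA idea: cluster a small random sample with a k-medoid search,
# assign every object to its nearest medoid, and keep the best set of medoids.
# Illustration only; the sample-level search is a crude placeholder.
import numpy as np

def assignment_cost(data, medoid_idx):
    """CLARA criterion: sum over ALL objects of the distance to the nearest medoid."""
    d = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def sample_cost(data, sample_idx, medoid_idx):
    """Sum of distances from the sampled objects only to their nearest medoid."""
    d = np.linalg.norm(data[sample_idx][:, None, :] - data[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def kmedoids_on_sample(data, sample_idx, k, rng, tries=200):
    """Crude stand-in for the heuristic k-medoid search: best of `tries` random k-subsets."""
    best, best_cost = None, np.inf
    for _ in range(tries):
        cand = rng.choice(sample_idx, size=k, replace=False)
        cost = sample_cost(data, sample_idx, cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best

def clara(data, k=3, m=40, n_samples=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    best_medoids, best_total = None, np.inf
    for _ in range(n_samples):
        # Each new sample keeps the best medoids found so far plus random extra objects.
        keep = best_medoids if best_medoids is not None else np.array([], dtype=int)
        extra = rng.choice(np.setdiff1d(np.arange(n), keep), size=m - len(keep), replace=False)
        sample_idx = np.concatenate([keep, extra])
        medoids = kmedoids_on_sample(data, sample_idx, k, rng)
        total = assignment_cost(data, medoids)        # assign the ENTIRE data set
        if total < best_total:
            best_medoids, best_total = medoids, total
    labels = np.linalg.norm(
        data[:, None, :] - data[best_medoids][None, :, :], axis=2).argmin(axis=1)
    return best_medoids, labels, best_total
```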
4. Application of parallel processing to the clustering problem

The method just described lends itself quite well to parallel processing, and in particular to the LCAP computer system. Each processor can be independently running the k-medoid method on a particular sample. Of course, this requires the k-medoid code to be available on each of the slave processors.

Two strategies can be employed to take advantage of the additional throughput that becomes available through the parallelization. In the first strategy, each slave processor
receives a sample and the host program waits until all samples have been analyzed. After clustering its sample, the slave processor also assigns each object of the entire data set to the closest of the found medoids. The sum of the distances of the objects to the chosen medoids is also calculated. The medoids of the best sample obtained to date are then included in the next batch of samples. In this strategy a large number of samples can be analyzed, but there are often idle processors, waiting for the last sample to be completed. This is because the computation time of the k-medoid algorithm depends strongly on the sample it works on, and is therefore quite variable.

In the second strategy, the host waits for any sample to be finished, compares the objective value with the best one obtained so far, updates the current set of medoids (if required), and uses the best medoids for the next sample to be run on the currently idle slave processor. Thus there is only a very short waiting time, from the moment a slave processor completes a sample until the next sample is initiated. However, some slave processors may then still be working on samples now known not to include the currently best medoids.

Both of these strategies require very little communication between host and slave processors. At the beginning of a run, data on all objects, as well as the necessary code, are sent to each of the slaves. This software consists of the sample selection, the distance calculation, the k-medoid method, and the assignment routines. Once a sample is clustered, only the case numbers of the medoids (which amount to a few integers) and the total value of the clustering are sent back to the host. The only processing carried out by the host at this point is comparing the total value with the currently best one and, if it is less, replacing the best set of medoids by the new set. The host sends back the (possibly modified) best set of medoids, allowing the slave to generate a new sample. Finally, at the end (when no more samples must be drawn) the final clustering of the entire data set is determined.

Initial results were obtained using a created data set with a known structure. The data set includes three well-defined clusters and several types of outliers. It has been found that for these data strategy 2 is more efficient than strategy 1. Increasing the number of objects appears to improve the performance of strategy 1, and does not seem to slow down strategy 2.

In both strategies 1 and 2, the host processor has very little work to do. An alternative third strategy is to let it do the assignment job for each set of medoids coming from a sample. In this way, part of the code (the assignment routine) is kept in the host processor. Unfortunately, this also increases the probability that the host is busy at the instant a slave returns its set of medoids, forcing the slave to wait until it can obtain a new set. A possible way around this problem is to have that slave start a new sample using the previous best set of medoids. In the latter strategy the coordination between host and slaves is more complex, and it is in fact better suited for a system with a section of shared memory, in which the best set of medoids found so far can be stored. In such a situation,
measures must be taken to avoid one of the slaves retrieving medoids from the shared memory while another is depositing its results.
In general, it appears that the selection of a strategy must take two factors into account:
a. The amount of communication between host and slaves should be adapted to the system. For example in the LCAP system, in which communication is a limiting factor, it should be reduced to a minimum.
b. The workload given to the slaves should be balanced as well as possible, avoiding idle slaves or duplication of work.
Naturally, other considerations may also be in order, for instance having to do with restrictions on the storage capabilities of the host and the slaves.
The results of these studies of the parallelization of this clustering algorithm have also permitted a more intensive study of the CLARA algorithm. In these studies [Hopke and Kaufman, 1990], it was found that a strategy of fewer, larger samples provided partitions that were closer to the optimum solution obtained by solving the complete problem.
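To make the scheduling logic of the second strategy concrete, the following sketch uses Python worker processes standing in for the LCAP slaves (the original host/slave code is not reproduced in this chapter). The host hands a fresh sample to whichever worker finishes first and always seeds that sample with the best medoids found so far; the k-medoid step itself is only a crude greedy placeholder, and all names and parameter values are illustrative.

import math
import random
from concurrent.futures import ProcessPoolExecutor, as_completed

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def total_cost(objects, medoid_objs):
    # Sum of distances of every object to its closest medoid.
    return sum(min(dist(o, m) for m in medoid_objs) for o in objects)

def analyse_sample(data, sample, k):
    # Slave job: a crude greedy "k-medoid" on the sample (a stand-in for the
    # real k-medoid routine), followed by assignment of ALL objects.
    sample_objs = [data[i] for i in sample]
    medoids = []
    for _ in range(k):
        best = min((i for i in sample if i not in medoids),
                   key=lambda i: total_cost(sample_objs,
                                            [data[j] for j in medoids + [i]]))
        medoids.append(best)
    return medoids, total_cost(data, [data[j] for j in medoids])

def clara_strategy2(data, k, n_samples=20, sample_size=40, n_slaves=4, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = [], float("inf")

    def draw_sample():
        base = list(best_medoids)             # seed with the best medoids so far
        rest = [i for i in range(len(data)) if i not in base]
        return base + rng.sample(rest, sample_size - len(base))

    with ProcessPoolExecutor(max_workers=n_slaves) as ex:
        pending = {ex.submit(analyse_sample, data, draw_sample(), k)
                   for _ in range(min(n_slaves, n_samples))}
        submitted = len(pending)
        while pending:
            done = next(as_completed(pending))    # host waits for ANY slave
            pending.remove(done)
            medoids, cost = done.result()
            if cost < best_cost:                  # host only compares and updates
                best_medoids, best_cost = medoids, cost
            if submitted < n_samples:             # keep the freed slave busy
                pending.add(ex.submit(analyse_sample, data, draw_sample(), k))
                submitted += 1
    return best_medoids, best_cost

if __name__ == "__main__":
    pts = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
           for cx, cy in [(0, 0), (8, 0), (4, 7)] for _ in range(60)]
    print(clara_strategy2(pts, k=3))

The only traffic between "host" and "slaves" in this sketch is the sample (a list of indices) and the returned medoids with their total cost, mirroring the low-communication requirement discussed above.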
5. A parallel algorithm for robust regression
In reviewing other statistical techniques suitable for parallelization, algorithms that also proceed by repeated sampling were considered. One of these is for robust regression analysis. In the classical linear regression model

    y_i = x_{i1}\theta_1 + x_{i2}\theta_2 + \cdots + x_{ip}\theta_p + e_i \qquad (i = 1, 2, \ldots, n)
the errors e_i are assumed to be independent and normally distributed with zero mean and constant variance. The x_1, ..., x_p are called explanatory variables, and y is the response variable. The aim of regression analysis is to estimate the unknown regression coefficients θ_1, θ_2, ..., θ_p from a sample of n data points (x_{i1}, x_{i2}, ..., x_{ip}, y_i). The conventional method is least squares, defined by
    \min_{(\hat{\theta}_1, \ldots, \hat{\theta}_p)} \sum_{i=1}^{n} r_i^2

where the residuals r_i are given by

    r_i = y_i - x_{i1}\hat{\theta}_1 - \cdots - x_{ip}\hat{\theta}_p
The least squares technique has been very popular throughout because the solution can be obtained explicitly by means of some matrix algebra, making it the only feasible method in the pre-computer age (note that least squares was invented around 1800). Moreover,
the least squares estimator is the most efficient when the errors e_i are indeed normally distributed. However, real data often contain one or more outliers (possibly due to recording or transcription mistakes, misplaced decimal points, exceptional observations caused by earthquakes or strikes, or members of a different population), which may exert a strong influence on the least squares estimates, often making them completely unreliable. Such outliers may be very hard to detect, especially when the explanatory variables are outlying, because such "leverage points" do not necessarily show up in the least squares residuals. Therefore, it is useful to have a robust estimate that can withstand the effect of such outliers. The least median of squares method (LMS) is defined by
    \min_{\hat{\theta}} \; \operatorname*{median}_{i=1,\ldots,n} r_i^2
[Rousseeuw 1984]. It has a high breakdown point, because it can cope with up to 50% of outliers. By this we mean that the estimator remains trustworthy as long as the "good" data are in the majority. (It is clear that the fraction of outliers may not exceed 50%, because then it would become impossible to distinguish between the "good" and the "bad" points.) To calculate the LMS estimates, Rousseeuw and Leroy [1987] use the program PROGRESS (the latter name stands for Program for Robust reGRESSion). The algorithm can be outlined as follows: select at random p observations out of the n and solve the corresponding system of p linear equations with p variables:
    y_i = x_{i1}\theta_1 + \cdots + x_{ip}\theta_p \qquad (\text{for the } p \text{ selected observations})

The solution gives a trial estimate of the coefficients, denoted by (\tilde{\theta}_1, \ldots, \tilde{\theta}_p). Then calculate the objective value

    \operatorname*{median}_{i=1,\ldots,n} r_i^2
where the residuals correspond to this trial estimate:

    r_i = y_i - x_{i1}\tilde{\theta}_1 - \cdots - x_{ip}\tilde{\theta}_p \qquad (i = 1, 2, \ldots, n)
This procedure is carried out many times, and the estimate is retained for which the objective value is minimal. In the example of Figure 2, the model is y_i = θ_1 x_i + θ_2 + e_i with n = 9 and p = 2, so we consider samples with two observations. The line determined
by the sample (g,h) yields a large objective value, as does the line passing through (f,h). The line corresponding to (f,g) gives a much smaller objective value, and will be selected by the algorithm. Note that both f and g are "good" points, whereas h is an outlier. In general, the number of replications is determined by requiring that the probability that at least one of the samples is "good" is at least 95%. When n and p are small, all combinations of p points out of n may be considered, corresponding to the algorithm of Steele and Steiger (1986). Once the optimal (θ̂_1, ..., θ̂_p) has been found, the algorithm uses it to assign a weight of 1 ("good") or 0 ("bad") to each of the n data points. Subsequently, the points with weight 1 may be used in a classical least squares regression.

Figure 2. Example of simple regression with nine points. There are two distinct outliers in the lower right corner.

When implementing this method on the LCAP system, it was again necessary to minimize the amount of host-slave communication. This minimization is easier to achieve for PROGRESS than it was for CLARA, because now each sample (of p points) is independent of the previous ones, unlike CLARA where the new sample was built around the best medoids found from the earlier samples. Again, several strategies are possible for exploiting the parallel computer structure. In the parallel versions of CLARA discussed above, the host was continuously informed of the best objective value found so far, and it was directly involved in the sample selection process. In order to obtain good system performance, our LCAP implementation of PROGRESS is somewhat different. Indeed, the amount of computation for a single sample is almost constant, and relatively small since it comes down to solving a system of p linear equations and computing the median of n numbers. Thus, to send each sample from the host to the slave and then return the objective value involves considerable communication time relative to the single sample computation time. Therefore, an alternative strategy was chosen, in which the number of samples to be used was simply divided by the number of available slaves (10 in our case). Each slave then processes that number of samples (as soon as it is finished with one sample, it immediately proceeds with the next) and reports the best result upon completion. The random number generator for each slave is provided with a different seed to ensure that different sets of samples are used. In this strategy, a large number of calculations are performed in parallel with only a minimum of communication needed to initialize the system and to report the final results to the host. At the end, the host merely has to select the best solution. In this solution, the parallel algorithm yields exactly the same solution as the sequential one, provided the latter uses the same random samples. Moreover, the computation time is essentially divided by the number of slaves.
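As a concrete illustration of the resampling scheme just described (a sketch, not the PROGRESS program itself), the following code draws random p-subsets, solves each exact p x p fit, and keeps the coefficients with the smallest median of squared residuals. The helper replications_needed shows one common way of turning the 95% requirement on "good" subsets into a number of replications, under the assumption that a fraction eps of the points are outliers; names and defaults are illustrative.

import math
import random
import statistics

def replications_needed(p, eps, confidence=0.95):
    # Smallest k with P(at least one all-good p-subset) >= confidence,
    # assuming an outlier fraction eps:  1 - (1 - (1 - eps)**p)**k >= confidence.
    good = (1.0 - eps) ** p
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - good))

def solve_square(X, y):
    # Gauss-Jordan elimination with partial pivoting for a small p x p system.
    p = len(X)
    A = [row[:] + [y[i]] for i, row in enumerate(X)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[piv][col]) < 1e-12:
            return None                       # degenerate subset, skip it
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def lms_fit(X, y, n_trials=None, eps=0.4, seed=1):
    n, p = len(X), len(X[0])
    rng = random.Random(seed)
    if n_trials is None:
        n_trials = replications_needed(p, eps)
    best_theta, best_obj = None, float("inf")
    for _ in range(n_trials):
        idx = rng.sample(range(n), p)
        theta = solve_square([X[i] for i in idx], [y[i] for i in idx])
        if theta is None:
            continue
        res2 = [(y[i] - sum(t * x for t, x in zip(theta, X[i]))) ** 2
                for i in range(n)]
        obj = statistics.median(res2)
        if obj < best_obj:
            best_theta, best_obj = theta, obj
    return best_theta, best_obj

# Simple regression as in Figure 2: rows are (x_i, 1), so the second
# coefficient plays the role of the intercept.
X = [(float(i), 1.0) for i in range(1, 10)]
y = [2.0 * xi + 1.0 for xi, _ in X]
y[7], y[8] = 2.0, 1.0                         # two gross outliers
print(replications_needed(p=2, eps=0.4))      # -> 7
print(lms_fit(X, y, n_trials=200))            # approximately ([2.0, 1.0], 0.0)

Splitting the n_trials between several workers, each with its own random seed, reproduces the parallel strategy of the text: only the final coefficients and objective values need to travel back to the host.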
6. Some other parallelizable statistical techniques
In both problems discussed above, a large number of samples must be processed (almost) independently using identical code. At the beginning, the necessary code as well as all the data are sent to the slave processors. Subsequently, communication between host and slaves can be kept to a minimum. The same characteristics can also be found in other classes of statistical techniques, allowing them to be implemented in a similar way.
The bootstrap [Diaconis and Efron 1983] is a method of determining parameter estimates and confidence intervals by considering a large number of samples obtained by drawing (with replacement) the same number of objects as in the original data set. (The idea is that such resampling is more faithful to the data than simply assuming them to be normally distributed.) Each of the samples may be processed by a slave, while at the end the estimate and/or confidence interval are constructed by the host.
In the jackknife and some cross-validation techniques, the objects of the data set are excluded one at a time. The objective of the jackknife is to obtain better (less biased) parameter estimates and to set sensible confidence limits in complex situations [see Mosteller and Tukey 1977]. The purpose of cross-validation is to evaluate the performance of decision rules (an example is the leave-one-out procedure in discriminant analysis). Both techniques repeatedly carry out the same calculations (in fact, n times) and are therefore also well-suited for parallel computation.
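The same host/slave pattern carries over directly to the bootstrap. In the sketch below each worker draws its own resamples (with its own seed), computes the statistic, and the host assembles a percentile confidence interval at the end; the choice of the mean as the statistic and of Python worker processes in place of LCAP slaves are illustrative, not details from the text.

import random
import statistics
from concurrent.futures import ProcessPoolExecutor

def bootstrap_worker(data, n_rep, seed):
    rng = random.Random(seed)
    reps = []
    for _ in range(n_rep):
        resample = [rng.choice(data) for _ in range(len(data))]   # with replacement
        reps.append(statistics.mean(resample))
    return reps

def parallel_bootstrap(data, n_rep=2000, n_workers=4):
    per_worker = n_rep // n_workers
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        futures = [ex.submit(bootstrap_worker, data, per_worker, seed)
                   for seed in range(n_workers)]
        reps = sorted(r for f in futures for r in f.result())
    lo = reps[int(0.025 * len(reps))]
    hi = reps[int(0.975 * len(reps)) - 1]
    return statistics.mean(data), (lo, hi)

if __name__ == "__main__":
    sample = [random.gauss(10.0, 2.0) for _ in range(50)]
    print(parallel_bootstrap(sample))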
7. Conclusion
Computing systems have greatly improved during the last years. Computers available today range from quite powerful micro- and minicomputers up to supercomputers that are able to solve large and complex numerical problems. These supercomputers achieve their computational performance through a combination of advanced processors and an architecture that permits efficient algorithms. It appears that a variety of statistical methods that are based on the consideration of a large number of samples are ideally suited for parallel implementation. The development of parallel architectures thus opens new possibilities for these computationally intensive procedures.
Acknowledgement
The author wishes to thank Drs. L. Kaufman and P. Rousseeuw of the Vrije Universiteit Brussel for their collaboration in the studies presented in this work, the IBM Research Center in Kingston for the opportunity to use the LCAP system and Drs. D. Logan and S.
Chin of IBM for their assistance in implementing these algorithms. This work was supported in part by the U.S. National Science Foundation through Grants INT 85 15437 and ATM 89 96203.
References
Chin S, Domingo L, Carnevali A, Caltabiano R, Detrich J. Parallel Computation on the ICAP Computer System: A Guide to the Precompiler. Technical Report. Kingston, New York 12401: IBM Corporation, Data Systems Division, 1985.
Clementi E. Global scientific and engineering simulations on scalar, vector and parallel LCAP-type supercomputers. Phil Trans R Soc Lond 1988; A326: 445-470.
Clementi E, Logan D. Parallel Processing with the Loosely Coupled Array of Processors System. Kingston, New York 12401: IBM Corporation, Data Systems Division, 1985.
Diaconis P, Efron B. Computer-Intensive Methods in Statistics. Scientific American 1983; 248: 116-130.
Flynn MJ. Very High Speed Computing Systems. Proceedings of the IEEE 1966; 54: 1901-1909.
Hopke PK, Kaufman L. The Use of Sampling to Cluster Large Data Sets. Chemometrics and Intelligent Laboratory Systems 1990; 8: 195-205.
Kaufman L, Rousseeuw P. Clustering Large Data Sets. In: Gelsema E, Kanal L, eds. Pattern Recognition in Practice II. Amsterdam: Elsevier/North-Holland, 1986: 425-437 (with discussion).
Massart DL, Plastria F, Kaufman L. Non-Hierarchical Clustering with MASLOC. Pattern Recognition 1983; 16: 507-516.
Mosteller F, Tukey JW. Data Analysis and Regression. Reading, Massachusetts: Addison-Wesley, 1977.
Rousseeuw PJ. Least Median of Squares Regression. Journal of the American Statistical Association 1984; 79: 871-880.
Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. New York: Wiley-Interscience, 1987.
Schendel U. Introduction to Numerical Methods for Parallel Computers. Chichester, England: J. Wiley & Sons, Ltd., 1983.
Steele JM, Steiger WL. Algorithms and Complexity for Least Median of Squares Regression. Discrete Applied Mathematics 1986; 14: 93-100.
CHAPTER 3
Parallel Computing of Resonance Raman Intensities Using a Transputer Array
R.G. Efremov
Shemyakin Institute of Bioorganic Chemistry, USSR Academy of Sciences, ul. Miklukho-Maklaya 16/10, 117871 Moscow, V-437, USSR
1. Introduction
Recent progress in optical spectroscopy of biological molecules is closely connected to the development of computer power. Computational approaches are of great importance in solving the problems of spectral processing, interpretation and simulation. For example, computing the resonance Raman (RR) intensities of large molecules provides detailed information about the equilibrium geometry and dynamics of the resonant excited electronic state.
The traditional approach to calculating the RR cross section involves a summation over all the vibrational levels of the resonant electronic state. Such a sum-over-states algorithm performs a direct search for the complete set of excited state parameters that generate the best fit to the experimental RR data. Usually, when there are only a few RR active modes, the sum-over-states method is much more efficient compared with alternative approaches [1]. But in general, using a standard sequential sum-over-states procedure can be computationally intractable for large (especially biological) molecules, because the algorithm sets up calculations of the Raman cross sections of N vibrational modes of the molecule and for each mode there are N logically nested loops. Therefore, such a technique demands powerful computer resources. The problem becomes processing-intensive especially if a hard optimization is required, i.e., when the initial estimates of the excited state parameters are not known exactly.
An alternative approach to solving this task is based on the fact that the intensities of all modes can be calculated independently on different parallel processors. The transputer provides an ideal unit for inexpensive, high power parallel computers which can perform the algorithm in reasonable time and for real molecules of biological interest.
The transputer T800 manufactured by INMOS [2] combines a fast 32-bit RISC (reduced instruction set) processor (10 MIPS), fast static memory (4 Kb of on-chip RAM), a 64-bit floating point coprocessor (which can operate concurrently with the
central processor) and four very fast (20 Mbit/s) bidirectional serial communication links on one chip. Transputers use the links for synchronous point-to-point communications with each other. The links can be switched so as to create any network configuration.
Some applications of transputer arrays in the fields of computational chemistry, physics and biophysics have recently been described [3-5]. These studies clearly demonstrated the efficiency of the transputer architecture. It was shown that even IBM AT or microVAX computers equipped with transputer boards can be useful for such processing-intensive problems as Monte Carlo [3], direct SCF [4] and biomolecular energy [5] calculations.
This work presents a method of parallel sum-over-states computing of resonance Raman and absorption cross sections using a transputer array. We have implemented this approach for direct modeling of the experimental absorption spectrum of adenosine triphosphate (ATP) and the RR excitation profile of ATP in water solution. As a result, the set of excited state parameters of ATP that provide the best fit to the experimental data has been obtained.
2. Computing resonance Raman and absorption cross sections
The resonance Raman cross section in the Condon approximation can be written as the conventional vibronic sum over states [1]:

    \sigma_{i \to f}(E_L) = C_R \, M^4 E_s^3 E_L \left| \sum_v \frac{\langle f | v \rangle \langle v | i \rangle}{E_v - \varepsilon_i + E_0 - E_L - i\Gamma} \right|^2    (1)
Here |i⟩, |v⟩ and |f⟩ are the initial, intermediate, and final vibrational states; ε_i and E_v are the energies (in cm⁻¹) of the states |i⟩ and |v⟩; M is the electronic transition length; E_s and E_L are the energies of the scattered and incident photons; E_0 is the energy separation between the lowest vibrational levels of the ground and excited electronic states (zero-zero energy); Γ is the homogeneous linewidth (in cm⁻¹); and C_R and C_A are constants. The corresponding expression for the absorption cross section is:

    \sigma_A(E_L) = C_A \, M^2 E_L \sum_v \frac{\Gamma \, |\langle v | i \rangle|^2}{(E_v - \varepsilon_i + E_0 - E_L)^2 + \Gamma^2}    (2)
The absorption cross section σ_A (Å²/molecule) is related to the molar absorptivity ε (M⁻¹ cm⁻¹) by σ_A = 2.303 × 10¹⁹ ε/N_A, where N_A is Avogadro's number. In the simplest approach the vibrational frequencies and normal coordinates are identical in the ground and excited states, and a system of nmod vibrational modes can be
treated as a collection of N independent pairs of harmonic oscillators with frequencies Ω_k (k = 1, ..., nmod). Therefore, multidimensional Franck-Condon factors can be written as the products of one-dimensional overlaps:

    \langle f | v \rangle \langle v | i \rangle = \prod_{i=1}^{nmod} \langle f_i | v_i \rangle \langle v_i | i_i \rangle

and

    E_v - \varepsilon_i = \sum_{i=1}^{nmod} \hbar\Omega_i (v_i - i_i)
So equations (1) and (2) for the fundamental resonance Raman and absorption cross sections are given more explicitly as follows:

    \sigma_{i \to f}(E_L) = C_R \, M^4 E_s^3 E_L \left| \sum_{v_1} \cdots \sum_{v_{nmod}} \frac{\prod_{k=1}^{nmod} \langle f_k | v_k \rangle \langle v_k | i_k \rangle}{\sum_{k=1}^{nmod} \hbar\Omega_k (v_k - i_k) + E_0 - E_L - i\Gamma} \right|^2    (3)

    \sigma_A(E_L) = C_A \, M^2 E_L \Gamma \sum_{v_1} \cdots \sum_{v_{nmod}} \frac{\prod_{k=1}^{nmod} |\langle v_k | i_k \rangle|^2}{\left( \sum_{k=1}^{nmod} \hbar\Omega_k (v_k - i_k) + E_0 - E_L \right)^2 + \Gamma^2}    (4)
Here the Raman active mode has subscript 1. One-dimensional Franck-Condon factors can be calculated with a recursive relation [6]. If there are no changes in vibrational frequencies (Ω_k) in the excited electronic state, then the factors for each mode v_k can be shown to be only a function of its displacement d_k in the excited state. In equations (3) and (4) nmod vibrational modes are included in the summation over quantum numbers v_i = 0, 1, 2, ... of the upper electronic state for which the product of Franck-Condon factors exceeds a cutoff level (we used 10⁻⁴-10⁻⁵ times the magnitude of the zero-zero transition).
Raman and absorption cross sections also depend on environment effects, because in the condensed phase different scatterers may be either in different initial quantum states or in slightly different surroundings. Such phenomena lead to inhomogeneous broadening of RR excitation profiles and absorption spectra. The vibrational modes which are active in resonance Raman do not undergo large displacements and frequency shifts upon excitation. Therefore, little error results from neglecting thermal effects [1]. Variations in the local environment can be taken into account if a Gaussian distribution of zero-zero energies with standard deviation θ (in cm⁻¹) is proposed to describe the site broadening effects [7]:

    \sigma_R^{inh}(E_L) = \frac{1}{\theta\sqrt{2\pi}} \int dE_0 \, \exp\!\left[ -\frac{(E_0 - \langle E_0 \rangle)^2}{2\theta^2} \right] \sigma_R(E_L; E_0)    (5)

where ⟨E_0⟩ is the average zero-zero energy (in cm⁻¹). A similar equation also holds for the absorption cross section.
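For the special case just described (no frequency changes, molecule initially in its vibrational ground state), the one-dimensional Franck-Condon factors reduce to a Poisson distribution in the Huang-Rhys factor S_k = d_k²/2, and equations (2)-(5) can be evaluated directly. The Python sketch below only illustrates the structure of the nested summation with a cutoff and the finite-sum inhomogeneous average; the displacements, linewidths and the omitted overall prefactor are placeholders, not the ATP values obtained in this work.

import math
from itertools import product

def fc_factor(v, d):
    # |<v|0>|**2 for a displaced oscillator of unchanged frequency:
    # Poisson distribution in the Huang-Rhys factor S = d**2 / 2.
    s = 0.5 * d * d
    return math.exp(-s) * s ** v / math.factorial(v)

def absorption(e_laser, freqs, disps, e00, gamma, vmax=6, cutoff=1e-7):
    # Sum-over-states absorption profile in the spirit of Eq. (4);
    # the constant prefactor C_A * M**2 is omitted.
    total = 0.0
    for quanta in product(range(vmax + 1), repeat=len(freqs)):
        fc = 1.0
        for v, d in zip(quanta, disps):
            fc *= fc_factor(v, d)
        if fc < cutoff:                       # cutoff on the Franck-Condon product
            continue
        e_vib = sum(v * w for v, w in zip(quanta, freqs))
        total += fc * gamma / ((e_vib + e00 - e_laser) ** 2 + gamma ** 2)
    return e_laser * total

def absorption_inhomogeneous(e_laser, freqs, disps, e00_mean, theta, gamma,
                             n_steps=25):
    # Finite-sum average over a Gaussian distribution of zero-zero energies (Eq. 5).
    acc, norm = 0.0, 0.0
    for j in range(n_steps):
        e00 = e00_mean + theta * (6.0 * j / (n_steps - 1) - 3.0)   # +/- 3 theta
        w = math.exp(-0.5 * ((e00 - e00_mean) / theta) ** 2)
        acc += w * absorption(e_laser, freqs, disps, e00, gamma)
        norm += w
    return acc / norm

if __name__ == "__main__":
    freqs = [1302.0, 1333.0, 1482.0, 1510.0, 1582.0]   # mode energies, cm^-1
    disps = [0.2, 1.0, 0.5, 0.4, 0.4]                  # placeholder displacements
    for e_l in range(34000, 42001, 2000):
        print(e_l, absorption_inhomogeneous(e_l, freqs, disps,
                                            e00_mean=37600.0, theta=750.0,
                                            gamma=700.0))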
3. Hardware and software environment
The hardware used in the present work is shown in Figure 1. It consists of five INMOS transputers T800-25, each with 1 Mb RAM (the transputer array), of which one node (the root transputer) is attached to a host personal computer AT 386-25 running under MS-DOS. A file server program (3L, Ltd.) placed on the PC controls the access of the transputers to the disk, the screen, the keyboard etc. The transputer processor has four INMOS links to connect it with other transputers. Each link has two channels, one in each direction. They provide synchronized, unidirectional communication.
The hardware configuration as well as the logical interconnections between the processors and tasks were described by a configuration language included in the 3L Parallel FORTRAN package [8]. As can be seen from Figure 1, the server task placed on the host PC is not directly connected to the application program.
Figure 1. Hardware configuration of the transputer array.
The filter task is interposed between them. It runs in parallel with the server program and the application task and passes on messages traveling in both directions. Such a configuration file has the form:

PROCESSOR Host
PROCESSOR Root
PROCESSOR T1
..........
PROCESSOR T4
WIRE ? Host[0] Root[0]
WIRE ? Root[1] T1[0]
WIRE ? Root[2] T2[0]
WIRE ? Root[3] T3[0]
WIRE ? T1[1] T4[0]
TASK Afserver INS=1 OUTS=1
TASK Filter INS=2 OUTS=2 DATA=10K
TASK INS=2 OUTS=2 DATA=...K
TASK INS=1 OUTS=1 DATA=...K
TASK
................
PLACE
The program for the RR intensities calculation was written in 3L Parallel FORTRAN.
4. Algorithm
The general flowchart of the program is shown in Figure 2.
Figure 2. General flowchart of the parallel FORTRAN program for the calculation of resonance Raman and absorption cross sections. (The flowchart shows the input of the experimental cross sections and starting parameter values, the body of the optimization algorithm on the root transputer, the messages carrying d(i), θ and Γ to the slave transputers, the parallel calculation of the cross sections of the nmod vibrational modes on the root and slave transputers, the calculation of the U-function on the root transputer, and the termination message sent to the slave transputers once the optimal values of d(i), θ and Γ have been found.)
The main algorithm placed on the root processor performs all input/output instructions and starts the nonlinear optimization with Dickinson's random search and/or the Davidon-Fletcher-Powell gradient method [9]. For random number generation with the 32-bit processor we have used the function RAN4 [10]. The optimization function to be minimized was written in the form:

for resonance Raman:

    U_R(d, \theta, \Gamma) = \sum_{i=1}^{nmod} \left( I_i^{exp} - I_i^{calc} \right)^2

for absorption:

    U_A(d, \theta, \Gamma) = \sum_{i=1}^{N} \left( I_i^{exp} - I_i^{calc} \right)^2
where d = {d_j} (j = 1, ..., nmod) are the displacements of the excited state potential energy curve along the jth normal coordinate; I_i^exp and I_i^calc correspond to the experimental and calculated Raman or absorption cross sections; nmod is the number of vibrational modes (in this work nmod = 4 or 5); and N is the number of points in the digitized absorption spectrum. According to Equation (5) we calculated inhomogeneous broadening as a finite sum (usually 50-100 steps) over a Gaussian distribution of zero-zero energies with standard deviation θ. An average value of ⟨E_0⟩ and a set of I_i^exp (i = 1, ..., nmod) for ATP were estimated from the analysis of high-resolution absorption spectra as well as a detailed resonance Raman excitation profile (R.G. Efremov and A.V. Feofanov, unpublished data).
At the first step of the computational scheme, the excited state parameters (d, θ, Γ) which give the best fit to the absorption spectrum were found using the optimization function U_A(d, θ, Γ). Then a refinement of the data obtained was performed to achieve the minimal discrepancy between the experimental and calculated RR cross sections.
Before the calculation of the optimization function at fixed values of d(i) (i = 1, ..., nmod), θ and Γ, the main task sends the messages containing the parameters d, θ, Γ to the "computational" tasks placed on the slave processors T1 (for mode ν_2), T2 (for mode ν_3), etc., and also performs the same algorithm for the first vibrational mode (ν_1). So the data are divided between five processors working in parallel. One of the most important considerations is that each transputer performs a similar amount of work and all the resources are utilized at this step. When the cross sections (Raman or absorption) are calculated, the tasks immediately send the results obtained to the main module and the next step of optimization begins. It should be noted that all the processes are synchronized, so it is impossible for the main program to start the next iteration before it receives the messages from the slave transputers. Each transputer operates independently of the others. The processes on different transputers synchronize only if necessary, i.e., if there is a data transfer between the tasks.
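The synchronisation pattern, one task per vibrational mode with the root collecting every partial result before the next optimisation step, can be mimicked with ordinary Python processes standing in for the slave transputers. The per-mode routine below is only a stub (in the real program it evaluates the sum-over-states expression for that mode); everything here is illustrative rather than a transcription of the FORTRAN tasks.

from concurrent.futures import ProcessPoolExecutor

def mode_cross_section(mode_index, d, theta, gamma):
    # Stub for the per-mode resonance Raman cross-section calculation.
    return (d[mode_index] ** 2) / (1.0 + theta / gamma)   # dummy value only

def objective(params, i_exp, n_workers=5):
    d, theta, gamma = params
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        futures = [ex.submit(mode_cross_section, k, d, theta, gamma)
                   for k in range(len(d))]
        i_calc = [f.result() for f in futures]     # implicit synchronisation point
    return sum((e - c) ** 2 for e, c in zip(i_exp, i_calc))

if __name__ == "__main__":
    d0 = [0.2, 1.0, 0.5, 0.4, 0.4]
    print(objective((d0, 750.0, 700.0), i_exp=[0.1, 0.9, 0.3, 0.2, 0.2]))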
5. Speed and efficiency of parallel code
The timing and efficiency of the parallel implementation for the calculation of the RR cross sections of four and five vibrational modes (nmod = 4 or nmod = 5) are given in Table 1.
TABLE 1
CPU times for computing the resonance Raman cross sections of 4 and 5 vibrational modes with a PC and a PC-based transputer array.

Number of modes | CPU time (sec), PC, sequential code (MS-FORTRAN 4.10 compiler) | CPU time (sec), T800 transputer array, parallel code (3L Parallel FORTRAN compiler) | Speed-up coeff.
4               | 1,352                                                          | 7                                                                                    | 193
5               | 20,610                                                         | 101                                                                                  | 204
The actual speed-up obtained for this simulation using four and five transputers, when comparing the elapsed time on a heavily used PC AT 386-25, is approximately 193 and 204 times, respectively. Potentially greater reductions are available with an increasing number of processors for molecules which contain more than five normal modes revealed in resonance Raman spectra. However, as the number of vibrational modes and transputers in the array is increased, the time for communication between transputers may become important. Thus, the efficiency will decrease as the communication time becomes more dominant. For tasks dividing the data between different processors, each of which performs the same calculation, it was shown that the elapsed time drops linearly when the number of processors is less than 10 [Goodfellow JM [10]]. Therefore, it is reasonable to suggest that for molecules with a large number of intensive Raman active modes (nmod > 20) the computational efficiency of the transputer farm will not be so impressive and the elapsed time will be almost independent of the number of processors.
6. Application to analysis of absorption and resonance Raman spectra of ATP
We have followed the procedure described above to determine the excited state displacements (d(i)), the homogeneous width (Γ) as well as the inhomogeneous standard deviation (θ) for 5 vibrational modes of ATP. The experimental absorption spectrum and the calculated fit generated from the parameters estimated as the result of the sum-over-states modeling are shown in Figure 3. It is clearly seen that the experimental profile and the simulated one are in good agreement. There are only small discrepancies on the high-energy edge of the absorption.
Figure 3. Experimental (solid line) and calculated (dashed line) absorption spectra for ATP. The calculated curve was generated using the parameters found by means of the parallel computing procedure (d_i for the 1302, 1333, 1482, 1510 and 1582 cm⁻¹ vibrational modes are 0.14, 1.13, 0.55, 0.42 and 0.37; θ = 750 cm⁻¹; Γ = 725 cm⁻¹; ⟨E_0⟩ = 37594 cm⁻¹).

Figure 4. Experimental (solid line) and calculated (dashed line) resonance Raman excitation profiles of the 1333 cm⁻¹ mode of ATP. The calculated curve was generated using the set of best-fit excited state parameters (see Fig. 3). The experimental profile was taken from: Efremov RG and Feofanov AV, unpublished data.
At the next step we used the parameters obtained and the experimental resonance Raman spectra of ATP excited with different laser lines to calculate the refined set of parameters which provide the best fit to the resonance Raman absolute cross sections. The final best-fit resonance Raman excitation profile of ATP is shown in Figure 4. The calculated contour correctly reproduces the main features of the experimental one. It is interesting to note that the best-fit parameters are very similar to those obtained from the analysis of the absorption spectrum. The details of the experimental and calculated resonance Raman profiles will be published in the near future.
7. Conclusion
The use of five T800 transputers connected in a farm allowed the calculation of the absorption and resonance Raman cross sections of the ATP molecule with a sum-over-states technique. Large increases in speed were obtained using FORTRAN code and a simple approach to parallelism. The method proposed was shown to be efficient in searching for the set of excited state parameters which give the best fit to the spectral data: the spectra generated with such parameters are in good agreement with the experimental ones.
Recently we have succeeded in obtaining the UV-RR spectra of a functional complex of ATP with the membrane-bound protein Na+,K+-ATPase [11]. The substrate binding in the enzyme's active site has been shown to be accompanied by significant changes in the electronic-vibrational structure of the adenine ring of ATP. The computing presented in this study is an indispensable step in solving the problem of interpretation of the spectral changes induced by ATP binding in the active site of Na+,K+-ATPase.
References Mycrs AR, Mathics RA. In: Spiro TG, Ed. Biological Applicalions of Raman Spectroscopy. Vol. 2. New York: Wilcy, 1987, 1-58. 2. Homcwood M. May D, Shcphcrd D, Shcphcrd K.IEEE Micro 1987; 7: 10. 3. Gorrod MJ, Coc MJ, Ycarworth M. Nuclear Instrum and Methods in Physics Res 1089; A281: 156. 4. LVcdig U, Rurkhartlt A, v. Sclmcring HG. 2 Phys D 1989; 13: 377. 5. Goodfellow JM, Vovcllc F. Eur Biophys J 1989; 17: 167. 6. hlanncback C. Physica 1951; 17: 1001. 7. Mycrs AB,H a m s KA, Mathics RA. J Chem Phys 1983; 79: 603. 8. Parallel FORYRAN Uscr Guide 2.0. 3L Ltd., I’ccl Housc, Ladywell, Livingston EH53 6AG, S c o h x i , 1988. 9. Siddal JN. “OPTISEP” Designer’s Optimizalion Subroutines (ME’I71IDSNIREPl). Faculty of Enginccring, McMastcr Univcrsity, Hamilton, Ontario, Canada, 1971. 10. Prcss WH, Flanncry BP, Tcukolsky SA, Vcttcrling WT. hrumerical Rccipes. The Art of ScicnliJic Compuling. Carnbridgc: Cambridge University I’rcss, 1986,215. 11. Efrcmov KG, Fcofanov AV, Dzhandzliugazyan KN, Modyanov NN,Nabicv IR. FEOS LcIt 1990; 260: 257. 1.
CHAPTER 4
A User Interface for 3D Reconstruction of Computer Tomograms or Magnetic Resonance Images
M. Frühauf
Fraunhofer-Arbeitsgruppe Graphische Datenverarbeitung, Wilhelminenstraße 7, D-6100 Darmstadt, FRG
Abstract
Three-dimensional reconstruction of medical image data is accomplished by specially developed volume rendering techniques, e.g., the casting of rays through the reconstructed volume data set. To make these techniques available for daily use in hospitals for diagnosis, a special user interface is required for several reasons:
- Only interactive exploration and manipulation of data provides the best chances for diagnosis and therapy planning.
- Medical image data is very complex and, therefore, hard to render in real-time.
- Only volume rendering techniques produce high-quality 3D reconstructions of medical image data.
- Medical staff is not familiar with the principles of computer graphics such as rotation of objects or lighting and shading.
The most challenging part of the user interface is the real-time interactive definition of viewing parameters for volume rendering. Viewing parameters in this case are the viewpoint and cut planes through the volume data set. We use an approach for the fast rendering of volume data which is based on a scanline-based texture mapping on a cube's surface. Furthermore, we have developed tools for interactive colour assignment, lighting, segmentation and contour extraction which are included in the user interface. The user interface solves the problem of real-time rotation and slicing of 16 megabyte CT data sets on graphics workstations. It follows the natural understanding of rotation and is operated by using the workstation's mouse.
1. Introduction
Many systems for the three-dimensional reconstruction of medical images (CT, MRI or ultrasound) have been developed up to now. However, most of these systems could not
succeed in medical practice. One reason for this is that interactive work is not sufficiently supported by these systems. The interactive analysis of complex volume data sets and the interactive finding of the best possible representation for the required features is, however, a precondition for achieving scientific insights. The representation obtained for a medical volume data set is heavily influenced by the specification of the visualization parameters. Therefore methods for the interactive specification of these parameters have to be developed.
The procedures for the 3D representation of volume data in scientific visualization can be divided into two different categories. The first category comprises surface-oriented representations. Surfaces representing certain features of the data are explicitly calculated from the volume data set in hand. These surfaces are then rendered with "conventional" methods. Here, the calculation of the surface representation is computing-intensive or can be realized only with the support of the user, whereas the shaded representation of the surfaces can be produced relatively fast. Moreover, surface rendering is supported by the polygon-shading hardware of graphics workstations.
In the second category there are volume-oriented representations. Volume rendering procedures compute shaded 3D representations directly from the volume data set in hand. Here, the complete volume data set is always the basis for the calculation of the representation. Only a few simple preparatory data conversions have to be made prior to rendering. The representation method and quality, as well as the emphasis of single features of the data set in hand, are determined by choosing the representation parameters. Since the whole data set is always involved, this kind of 3D representation of volume data in scientific visualization is very computing-intensive and therefore very time-costly. High-quality representations cannot be calculated on graphics workstations in spaces of time which allow for an interactive analysis of the data sets. Therefore methods have to be developed and implemented with which the parameters controlling volume-oriented representation procedures can be chosen. These methods have to be appropriate for judging the effects on the final representation after the parameters have been chosen. The user has to be able to create the echo of the choice of the parameters in a short time, in order to support interactive work for the data analysis. The choice of the parameters should largely be realized in an interactive way and not by a command language. Volume-oriented procedures for the representation of complex data sets are very different with regard to complexity and picture quality. Representations which can be generated easily and fast (previews) can provide a quick survey of the data set in hand and of the choice of the parameters.
2. Visualization parameters
The volume visualization procedure we are using for volume rendering [1] applies the
following parameters to specify the representation to be created:
- Volume data set
- Viewpoint or rotation angle
- Cut planes through the volume
- Visibility
- Rendering method
- Method for the calculation of normals
- Shading method
- Method for the calculation of samples
- Light source position
- Reflection parameters
- Colour
- Transparency
- Resolution
Visibility, colour and transparency are individually defined for each volume element (voxel). They are defined by means of functions depending on the scalar value of the voxel. Such functions can be piecewise linear or piecewise constant. For these functions threshold values are then interactively specified. For the transparency and colour assignment to the voxels, as well as for the setting of the visibility attribute, the following chapter presents suitable echo types. The choice of the most appropriate methods for the actual procedure of volume visualization can understandably not be simulated by other methods. Here the user depends on his experience with the single methods. In order to accelerate the generation of the representation, pictures with lower resolution can be calculated. In general, however, these will only provide a rough impression of the representation quality to be expected.
One of the most difficult tasks in volume data visualization is to develop a tool with which even large data sets (e.g., 256³) on graphics workstations can be rotated and cut in real-time. The following chapter also presents a method for that task (see also [2]).
3. Echoes of the parameter specification
3.1 Colour and transparency assignment
Colour and transparency assignments to voxels are realized, in the simplest case, by means of threshold values in the range of values of the data. The definition of the threshold values is effected after the analysis of selected data set layers. The result of the assignment is checked, by way of example, on these or other layers (Color plate 1). Transparency values are mapped onto grey values. The interactive specification of the threshold values is supported by two devices: first, by inquiring about the value of single, mouse-identified data points, and second, by displaying a chosen threshold value in context with the
histogram of a scanline in one data set layer (Color plate 1). All these operations can be carried through in real-time on one data set layer each. A precondition for the implementation of these tools is, however, that one single pixel can be identified with the workstation's mouse.
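A minimal sketch of the kind of threshold-based assignment described here: a piecewise-linear opacity ramp and a piecewise-constant colour table driven by two user-chosen thresholds. The threshold values, colours and function names are illustrative, not taken from the original system.

def opacity(value, t_low, t_high):
    # Piecewise-linear opacity: fully transparent below t_low, opaque above t_high.
    if value <= t_low:
        return 0.0
    if value >= t_high:
        return 1.0
    return (value - t_low) / float(t_high - t_low)

def colour(value, t_low, t_high):
    # Piecewise-constant colour assignment (illustrative classes).
    if value < t_low:
        return (0.0, 0.0, 0.0)        # background, transparent anyway
    if value < t_high:
        return (0.8, 0.5, 0.4)        # "soft tissue"
    return (1.0, 1.0, 0.9)            # "bone"

def classify_slice(slice_2d, t_low, t_high):
    # Checking the assignment on one data set layer, as described in the text.
    return [[(colour(v, t_low, t_high), opacity(v, t_low, t_high)) for v in row]
            for row in slice_2d]

print(opacity(700, t_low=400, t_high=1200))   # 0.375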
3.2 Viewpoint and cut planes
Medical image data is arranged in a regular grid. Thus, a data cube is obtained from a stack of images. This simplifies the algorithm for the interactive specification of the viewpoint immensely. The user recognizes the cube's position from the position of the edges and vertices of the cube. Top and bottom, left and right, front and back are recognized on the basis of the inner structure of the cube's surfaces. Therefore we project 2D pixmaps from the volume data set onto the six surfaces of the cube. In the simplest case, the outermost layers of the data set are used for that purpose. If these layers, however, do not contain sufficient information, the volume data is orthogonally projected onto the cube's surfaces, above a threshold value defined by the user. This technique is especially appropriate for data sets which are not present in cube form. The calculation of these six pixmaps is effected during preprocessing. They are kept in the main memory of the workstation while the tool is in use.
During the rotation of the data cube, only the eight vertices of the cube are transformed. At any time, at most three faces of the cube are visible. The visible faces are identified with the aid of the normal vector. The orthogonal projection of the cube's faces onto the image plane is effected by simply eliminating the z-coordinate. The corresponding pixmaps are mapped onto the resulting parallelograms by scaling and shearing (Color plate 2). For that a scanline-based fill algorithm is used [3].
Cuts through the volume are specified by successively removing visible surfaces of the cube. In this case only one of the six pixmaps has to be recalculated. The quick recalculation of one pixmap is here supported by information from its z-buffer. In the case of removing one layer, the neighbouring faces are only reduced by one pixel (Color plate 3). For the recalculation of the visible layers it is important that the complete data set is present in the main memory at run time. Our experience shows that the benefit of the tool is not restricted when large data sets are reduced by a factor of two for this purpose.
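The geometric core of this preview technique can be written down in a few lines: transform only the eight vertices, keep the faces whose outward normals point towards the viewer, and project orthographically by dropping the z-coordinate. The sketch below leaves out the pixmap warp itself and assumes a viewer looking along the negative z-axis; it is an illustration, not the workstation implementation.

import math

VERTICES = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
# Faces as vertex indices, ordered counter-clockwise when seen from outside.
FACES = {"+x": (4, 6, 7, 5), "-x": (0, 1, 3, 2), "+y": (2, 3, 7, 6),
         "-y": (0, 4, 5, 1), "+z": (1, 5, 7, 3), "-z": (0, 2, 6, 4)}

def rotate_y(p, deg):
    a = math.radians(deg)
    x, y, z = p
    return (x * math.cos(a) + z * math.sin(a), y,
            -x * math.sin(a) + z * math.cos(a))

def face_normal(pts):
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = pts[0], pts[1], pts[2]
    u = (x1 - x0, y1 - y0, z1 - z0)
    v = (x2 - x0, y2 - y0, z2 - z0)
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def visible_faces(angle_deg):
    verts = [rotate_y(p, angle_deg) for p in VERTICES]
    result = {}
    for name, idx in FACES.items():
        pts = [verts[i] for i in idx]
        if face_normal(pts)[2] > 0:                      # facing the viewer
            result[name] = [(x, y) for x, y, _ in pts]   # drop z: orthographic projection
    return result

print(visible_faces(30.0))

The projected quadrilaterals returned here are the parallelograms onto which the precomputed face pixmaps would be warped by the scanline-based fill algorithm mentioned above.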
3.3 Light source and reflection parameters
The user interface has to explain the effects of specifying the light sources and reflection parameters to users who are not familiar with the principles of 3D representation in computer graphics (lighting and shading).
Figure 1. Interaction method for rotation.
Since shading of complex scenes cannot be effected in real-time, the echo generation for the lighting parameters is realized by means of a simple scene of geometric solids. A scene consisting of a sphere and a cube is particularly suitable here. For such a simple scene, the shading can be calculated very quickly on a workstation.
4. Interaction techniques
In interactive applications where three- or more-dimensional data are mapped onto a two-dimensional projection area, there is the problem of using input equipment with two degrees of freedom for the specification of more than two input parameters. In the concrete case this means: how can we realize the specification of the viewpoint of a 3D scene with the mouse of a graphics workstation?
Since in our tool we only apply parallel projection, the specification of the viewpoint can be substituted by rotating the object about three axes. In order to specify the rotation, the background of the projection area is divided into "rotation-sensitive" fields. Different fields imply different rotations (Fig. 1). These fields support an intuitive understanding of the rotation direction. Different mouse keys imply rotation increments varying in size (e.g., 20°, 5°, 1°). The application of this concept allows the viewpoint of complex volume data sets to be specified in real-time with a few mouse operations, without having to care about the orientation of the coordinate system or other geometric details.
For the specification of the cut planes, we use the concept of successively removing or adding picked surfaces. This procedure has proved convenient and also allows an echo generation in real-time on low-cost workstations.
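One possible reading of this interaction scheme in code form; the exact layout of the rotation-sensitive fields and the assignment of axes in Figure 1 are not spelled out in the text, so the mapping below is illustrative.

# Map a mouse click to a rotation: the border region of the projection area
# selects the axis and sense, the mouse button selects the increment.
STEP_BY_BUTTON = {"left": 20.0, "middle": 5.0, "right": 1.0}   # degrees

def rotation_for_event(x, y, width, height, button, border=0.15):
    """Return (axis, signed angle in degrees) for a click at pixel (x, y)."""
    step = STEP_BY_BUTTON[button]
    if x < border * width:
        return ("y", -step)          # left strip: turn left about the y axis
    if x > (1.0 - border) * width:
        return ("y", +step)          # right strip: turn right about the y axis
    if y < border * height:
        return ("x", -step)          # top strip: tilt backwards about the x axis
    if y > (1.0 - border) * height:
        return ("x", +step)          # bottom strip: tilt forwards about the x axis
    return ("z", step)               # central click: spin about the view axis

print(rotation_for_event(10, 300, width=800, height=600, button="middle"))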
5. Previewing
Another method of supporting the interactivity of volume visualization systems is the generation of fast rendered images of lower quality (previewing). In other fields of computer graphics, wireframe representations are used for that purpose. Shaded representations of high quality are only produced for the final view. In volume rendering the use of a wireframe representation is not possible for several reasons. The first reason is the lack of any explicit surface representation of objects in the data set. The second reason is that the interior of "objects" in the data set is not homogeneous. Neglecting that inhomogeneity, as in a wireframe representation, would complicate the orientation in the data set for the scientist. The third reason is that the structure and thus the surface of "objects" created by the interpretation of the data set is very complex. Therefore a wireframe representation is difficult to compute and would in most cases consist of many vectors. Furthermore, surface representations of volume data have to be recomputed after slicing the data set.
For fast volume rendering, three different strategies can be applied. First, simple rendering techniques may be used. Those methods produce less detailed images, but the images can be rendered quickly. The Back-to-Front projection strategy (BTF) [4] can be applied, for example, instead of the more costly raycasting method. In that case it has to be considered that the resolution of the image depends directly on the resolution of the data set when using BTF projection. Besides, the computation of the gradient for shading can be simplified. The methods for gradient computation have been published elsewhere and the differences in the quality of generated images have been investigated [5], [6], [7]. A very easy method of visualizing volume data sets is the additive reprojection [8]. This method produces visualizations similar to X-ray images.
The second class of strategies produces low-resolution images. Furthermore, the image generation can be based on a subset of the original medical image data set (Color plates 4 and 5). If only every second voxel in the data set is used, the amount of data is reduced by a factor of eight. Since the computation time depends linearly on the magnitude of the data set, the image generation time is decreased significantly.
The third method of accelerating image computation is called context-sensitive adaptive image generation. Marc Levoy developed a method in which pixels are generated in the first step on a coarse raster [9]. This raster is refined gradually. The refinement is performed first in regions of high frequencies in the image. Using this method, an image of high quality is computed from the first step on. The amount of detail is increased stepwise. Evaluations by Levoy show that raycasting algorithms with less than 20% of the rays already produce images of sufficient meaning.
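The second strategy, basing the preview on every second voxel, amounts to plain subsampling; with the volume held in a NumPy array (an implementation choice for this sketch, not something prescribed by the text) it is a one-line slice.

import numpy as np

# A synthetic int16 volume standing in for a CT/MR data set.
volume = np.random.randint(0, 4096, size=(256, 256, 64), dtype=np.int16)
preview = volume[::2, ::2, ::2]          # every second voxel: 1/8 of the data
print(volume.nbytes, preview.nbytes)     # 8388608 vs 1048576 bytes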
6. Conclusion
The described tools of the user interface for volume rendering in medical imaging are very suitable for the definition of rendering parameters. In the near future the tools will be combined into a complete user interface, which will be implemented using OSF/Motif under X Windows. Thus, this user interface will be available on many different hardware platforms.
References
1. Frühauf M. Volume Visualization on Workstations: Image Quality and Efficiency of Different Techniques. To appear in: Computers & Graphics 1990; 14(4).
2. Frühauf M, Karlsson K. The Rotating Cube: Interactive Specification of Viewing for Volume Visualization. In: Eurographics Workshop on Visualization in Scientific Computing, Clamart, France, April 23-24, 1990.
3. Hofmann GR. Nonpolygons and photographic components for naturalism in computer graphics. In: Hansmann W, Hopgood FRA, Strasser W, Eds. Eurographics '89. Amsterdam: North-Holland, 1989.
4. Frieder G, Gordon D, Reynolds RA. Back-to-Front Display of Voxel-Based Objects. IEEE Comp Graph App Jan. 1985: 52.
5. Chen L et al. Surface Shading in the Cuberille Environment. IEEE Comp Graph App Dec. 1985: 33-43.
6. Höhne KH, Bernstein R. Shading 3D-Images from CT Using Gray-Level Gradients. IEEE Trans Med Imag 1986; Vol MI-5, No. 1: 45-47.
7. Encarnacao JL, Frühauf M, Göbel M, Karlsson K. Advanced Computer Graphics Techniques for Volume Visualization. To appear in: Proc of the tutorials on Geometric Modelling: Methods and Applications. Böblingen, West Germany, May 9-11, 1990.
8. Johnson ER, Mosher CE. Integration of Volume Rendering and Geometric Graphics. In: Upson C, Ed. Chapel Hill Workshop on Volume Visualization. Department of Computer Science, University of North Carolina, Chapel Hill, 1989: 1-8.
9. Levoy M. Volume rendering by adaptive refinement. Visual Computer 1990; 6: 2-7.
10. Wendler Th. Cooperative Human-Machine Interfaces for Medical Image Workstations: A Scenario. In: Lemke HU, Hrsg. Proc Int Symp CAR'89. Berlin Heidelberg New York: Springer Verlag, 1989: 775-779.
11. Lemke HU et al. 3-D computer graphic workstation for biomedical information modelling and display. In: Proc SPIE-Int Soc Opt Eng, Conference Medical Imaging (SPIE). Newport Beach, CA, USA 1987; Vol. 767, pt. 2: 586-592.
12. Foley JD, Van Dam A. Fundamentals of Interactive Computer Graphics. Addison-Wesley, 1983.
CHAPTER 5
Automatic Correspondence Finding in Deformed Serial Sections
Y. J. Zhang
Information Theory Group, Department of Electrical Engineering, Delft University of Technology, The Netherlands
Abstract
Three-dimensional object reconstruction from serial sections is an important process in 3D image analysis. One critical step involved in this process is the registration problem. We propose a general approach for the automatic registration of deformed serial sections. The automatic profile registration is realized in two consecutive steps, that is, the one-to-one correspondence finding followed by the one-by-one alignment. An automatic method called pattern matching is proposed for the correspondence finding. This method is based on pattern recognition principles and consists of dynamical pattern forming and geometric pattern testing. A practical implementation of the pattern matching procedure in the context of the 3D reconstruction of megakaryocyte cells in bone marrow tissue of 2 μm sections is presented. The local geometric relationship among profiles of different objects in the same section and the statistical characteristics of section deformation have been employed to overcome the difficulty caused by section distortion. Satisfactory results have been obtained with rather seriously deformed tissue sections. As the matching patterns can be dynamically constructed and they are translation and rotation invariant, this method should also be useful in a variety of registration applications.
1. Introduction
In the 3D image analysis of biological structures, the reconstruction of 3D objects from serial sections is a popular process to study the inner and body structures of objects. One essential process involved in the 3D object reconstruction is the alignment of successive profiles belonging to the same objects in consecutive sections [1]. A more general term used is image registration. The goal of registration is to correct the eventual position, orientation, magnification and even grey level differences in the images of serial sections [2].
Two classes of registration techniques can be distinguished. One class is the correlation technique (see [2], [3]), which consists of forming a set of correlation measurements between two image sections. Then the location of maximum correlation is determined. Two images can be spatially registered according to this location via a section transformation. One important drawback of this technique is that it needs enormous computation time for complicated images, even though some techniques for reducing the computation effort have been proposed [4]. Moreover, correlation is also sensitive to geometrical distortion [5]. Another class is the landmark technique [6, 7], which consists in aligning sections with the help of landmarks, which can be artificial or anatomical ones. The use of artificial landmark techniques is often limited, because the landmarks must be introduced before cutting the sections and some distortion may also be introduced into the tissue during the creation of these marks. On the other hand, the choice of anatomical landmarks is very problem-oriented and it is difficult to detect them automatically. Recently, shape points obtained from object contours have been proposed for the 3D registration process [8]. However, the determination of shape points requires the results of object segmentation.
The automatic reconstruction process will be even more complicated if a large population of objects should be treated and for each object high resolution images should be used to get accurate measurement results. In such a case, very expensive computation efforts would be expected. We have presented a two-step reconstruction method for resolving such a problem [9]. Two types of reconstruction are split: one is symbolic reconstruction, which involves identifying and linking the separated profiles of an object embedded in a series of sections; another is pictorial reconstruction, which consists of grouping consecutive profiles belonging to the same objects, and reconstructing these objects in the 3D space as they were before being cut. Correspondingly, two levels of registration are also distinguished. We call them global level registration and local level registration, respectively. In the former one, all pairs of corresponding object profiles in two consecutive sections are to be identified. That is, for each profile of an object in one section, to decide if it is continued in the adjacent section, and if it continues, to find the corresponding one in this section (this may be accomplished with some lower resolution images). Then, with the help of this global result, we can geometrically correct the difference in orientation and position (also possibly in scaling and/or grey level) of these corresponding profiles with full resolution images to get a precise registration result at each local level.
As we can see, the correspondence finding is a mandatory step in 3D object reconstruction. This is because the relation between two profiles of one object in two sections is first established, and then it becomes possible to perfectly register these two profiles. The above mentioned two classes of registration techniques can be used in the second level of the registration. However, they are not suitable for the first level of registration when the sections are individually deformed during section preparation (this is often the case especially with very thin sections, when each section is treated separately).
The basic assumption of most existing registration techniques is that the set of structures of an object can be considered as a rigid body, and when the whole set changes its position and orientation in space, all structures belonging to this set always keep their original relative positions with regard to each other. On the basis of such an assumption, the registration can be achieved by a global transformation of the whole set. However, when the sections are deformed, the above assumption may be violated. Since the geometric relationship among different profiles in one section would be changed due to distortion, the relative positions of corresponding profiles in adjacent sections will also be altered. In such a situation, a good one-to-one match cannot be obtained solely by a global transformation of the whole section. Because of the considerable disparity from section to section, perfectly registering a few pairs of corresponding object profiles could not ensure the correspondence between other pairs of object profiles. A direct one-to-one object registration becomes necessary in this case.
In section 2, an automatic technique for the one-to-one correspondence finding in deformed serial sections using local coherent information will be introduced. We call it pattern matching as it is based on pattern recognition principles and consists of dynamical pattern formation and geometric pattern testing. Section 3 will give a real example of the use and practical implementation of the pattern matching technique for the quantitative analysis of 3D megakaryocyte cells in bone marrow tissue. Finally, some considerations for improving the performance of pattern matching and using this technique to meet other applications are discussed in section 4.
2. Pattern matching
2.1 General description
In this section we will present an automatic correspondence finding technique for the registration of profiles in adjacent sections. It is convenient to represent every profile by a point located at its center of gravity when all profiles under consideration are rather dispersed in a large area, which is often the practical situation. With such a representation, one useful piece of information which can still be exploited is the local relationship among vicinity profiles, and such information would be less influenced by the section distortion. It is implied here that the deformation normally occurs in places other than where the profiles themselves lie. The block diagram of the whole procedure of pattern matching is shown in Figure 1.
Each time, two consecutive sections are taken into account for the correspondence finding. The section used as the reference one is called the matched section, whereas the section to be matched is called the matching section. When we use the point representation, one point is called a matched point or a matching point according to the section it belongs to.
Figure 1. Block diagram of pattern matching procedure.
For each matched point in the matched section, the objectives are to determine whether or not it has a corresponding matching point in the matching section, and in the former case, to identify its correspondent.
The first task is the matched point selection in the matched section, for which a corresponding point in the matching section will be looked for. For each selected matched point, a pattern is constructed by using the information concerning this point and several surrounding points. The information related to one pattern can be grouped into a pattern vector. The dimension of the pattern vector depends on the quantity of information used to identify the pattern. The pattern vector thus formed has a one-to-one correspondence with this point.
Next comes the selection of potential matching points in the matching section. The region of the section in which the selection is taking place is called the search region. This region can be restricted to some parts of the section if a priori knowledge about the point distribution is available. Otherwise, the search region may be the whole section. For each candidate matching point, a corresponding pattern is also constructed similarly as for the matched point.
A number of pattern similarity tests are then performed, each time between the selected matched pattern and one of the potential matching patterns. The decision about whether any matching pattern matches the given matched pattern is made according to these tests. If a matching result between the two patterns is obtained, the relation between the two center points will be identified. This three step procedure is performed, to some extent, as a human being would do such work [10]. This method shares some
common properties with template matching techniques (see, e.g., [11]); however, some essential differences exist. Some detailed discussions will be given in the following subsections.
2.2 Pattern construction

The patterns are constructed dynamically, as opposed to the previously fixed template of template matching techniques. For each matched point, the m nearest surrounding points are chosen for the construction of the related matched pattern. This pattern is composed of these m points together with the center point. The pattern is specified by the center point coordinates (x₀, y₀), the m distances measured from the m surrounding points to the center point (d₁, d₂, ..., d_m) and the m angles between the m pairs of distance lines (θ₁, θ₂, ..., θ_m). A pattern vector can therefore be written as:

P_l = (x₀, y₀, d₁, d₂, ..., d_m, θ₁, θ₂, ..., θ_m)    (1)
where l is the label of the center point. Each element of the pattern vector can be considered a feature of the center point. The maximum of the m distances is called the diameter of the pattern. For each potential matching point, a similar pattern needs to be constructed. All surrounding points falling into a circle (around the matching point) which has the same diameter as that of the matched pattern are taken into account. These pattern vectors can be written in a similar manner (primes denoting the matching section):

P_l′ = (x₀′, y₀′, d₁′, d₂′, ..., d_n′, θ₁′, θ₂′, ..., θ_n′)    (2)
The number n may be different from m (and can differ from one pattern to another) because of the deformation of the section and/or the termination of objects. The above pattern formation process is automatic. Moreover, this pattern formation method is rather flexible, as no specific size is imposed in advance on the constructed patterns, and the constructed patterns are allowed to have different numbers of points. The pattern thus constructed is called an "absolute pattern", because its center point has absolute coordinates. One example of an absolute pattern is illustrated in Figure 2A. Each pattern thus formed is unique in the section. Every pattern vector belongs to a fixed point and presents specific properties of this point. To match corresponding points in two adjacent sections by means of their patterns, translation and rotation of patterns are necessary. The absolute pattern formed above is invariant to rotation around the center point because it is circularly defined, but it is not invariant to translation (see Fig. 2B). To overcome this inconvenience, we further construct, from each absolute pattern, a corresponding "relative pattern".
Figure 2. Absolute pattern. (A) The pattern around its center point (x₀, y₀); (B) the same pattern after a translation.
The relative pattern is obtained by discarding the coordinates of the center point from the pattern vector. One such pattern is shown in Figure 3A. The relative pattern vector corresponding to (1) can be written as:

P_l^r = (d₁, d₂, ..., d_m, θ₁, θ₂, ..., θ_m)    (3)
The relative pattern is not only invariant to rotation around the center point, but also invariant to translation (see Fig. 3B). The absolute pattern has a one-to-one correspondence with its center point. The uniqueness of a relative pattern is related to the number of surrounding points in the pattern and also depends on the distribution of objects in the sections. This is because one point set has a unique pattern description, but several sets do not necessarily have different pattern descriptions. Intuition tells us that when the number of points is increased and the distribution of points is more random, the uniqueness of the description becomes more sufficient. The appropriate value for m should therefore be determined from a learning set.
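As an illustration of this construction step, the following is a minimal sketch of how relative patterns might be assembled from centre-of-gravity points. It is only one reading of the description above: the function names and the use of NumPy are ours, and the angles are taken between successive distance lines after sorting the neighbours by direction, which the original text does not spell out.

```python
import numpy as np

def matched_pattern(points, i, m):
    """Relative pattern of point i: distances to its m nearest surrounding
    points and the angles between successive distance lines.
    `points` is an (N, 2) array of profile centres of gravity."""
    offsets = np.delete(points, i, axis=0) - points[i]
    dists = np.linalg.norm(offsets, axis=1)
    nearest = np.argsort(dists)[:m]
    d = dists[nearest]
    theta = np.arctan2(offsets[nearest, 1], offsets[nearest, 0])
    order = np.argsort(theta)              # sort by direction so that the angle
    d, theta = d[order], theta[order]      # differences are between neighbouring lines
    angles = np.diff(np.append(theta, theta[0] + 2 * np.pi))
    return d, angles

def matching_pattern(points, i, diameter):
    """Relative pattern of a candidate matching point: every surrounding point
    inside a circle with the matched pattern's diameter is included."""
    offsets = np.delete(points, i, axis=0) - points[i]
    dists = np.linalg.norm(offsets, axis=1)
    keep = np.flatnonzero(dists <= diameter)
    theta = np.arctan2(offsets[keep, 1], offsets[keep, 0])
    order = np.argsort(theta)
    d, theta = dists[keep][order], theta[order]
    angles = np.diff(np.append(theta, theta[0] + 2 * np.pi))
    return d, angles
```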
Figure 3. Relative pattern. (A) Distances, angles and diameter defining the pattern; (B) the pattern after translation and rotation.
2.3 Pattern similarity test and decision making

The pattern similarity can be treated at only a few specific positions and orientations, as opposed to systematic calculation over every possible translation and rotation of a template relative to the reference image. The relative patterns proposed above are used in the similarity test. As patterns may contain various numbers of points, the dimensions of their pattern vectors will differ, and term-by-term matching (e.g., [12] p. 428) cannot be performed. A geometrical technique has been used to assist the similarity test. Two patterns under test (one from each section) can differ in position and orientation. We first translate the matching pattern so that its center overlaps the center of the matched pattern. Since relative patterns are under consideration, this simply means placing both patterns at the center of the coordinate system. Then we rotate the matching pattern until two distance lines (one from each pattern) coincide. This produces one combined configuration of the two patterns. Continuing to rotate the matching pattern, we obtain all combined test configurations. At each test configuration the similarity of the two patterns is investigated.

Different criteria of similarity are available here. We have used the absolute difference measure between the two patterns under test, as its cumulative nature allows it to be incorporated into fast matching algorithms [13]. For every point of the matched pattern, its nearest neighbor (in the Euclidean sense) in the matching pattern is first looked for. If the two center points are correspondents, there will be one test configuration where these distances are rather small, and this indicates the match position. This is predictable because the matching points, after translation and rotation, will fall into the vicinity of the related matched points. If the two patterns have different numbers of points, some points will have no correspondent in the adjacent section (this is possible even when the two patterns have the same number of points). If a point has no correspondent in the adjacent section, its nearest neighbor will be a point which has no relation to it. In this case the measured distance will be rather large, and taking such a distance into consideration would make the test give erroneous results. To solve this problem, a threshold is used to eliminate these nearest neighbors from further calculation if their minimum distances to points of the matched pattern exceed a given level. The distance to the nearest neighbor gives an index of the similarity. This measure can be used directly to calculate the goodness of the match. However, a merit function based on a weighted distance measure can also be defined to measure the quality of the match. On the one hand, it permits some a priori knowledge about the point set to be incorporated and gives an accurate judgement of the match; on the other hand, it can help in the choice of the threshold value mentioned above. This function should be inversely related to the distance between a pair of compared profiles: the higher the function value, the better the match.
Figure 4. Flowchart of the pattern similarity test (relative pattern formation, relative pattern rotation, distance measurement, distance threshold test, calculation of merit function, calculation of RMS of merit values, search for the maximum RMS, matching point pair or no match).
The pattern tests are performed for all potential matching patterns. The final decision about the goodness of match is made after all these pattern similarity tests. The root mean square (RMS) of all merit function values can be used for the final decision. Using the RMS has the advantage of being more accurate than the ordinary mean, especially when the number of measurements is small [14]. The potential matching pattern with the maximum RMS of merit function values is considered to be the corresponding pattern of the matched pattern. The flowchart of the pattern similarity test and decision making is shown in Figure 4. Once a match has been found for two profiles, they can be labeled accordingly.
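To make the test and decision procedure concrete, here is a minimal sketch under the same assumptions as the sketch in section 2.2: relative patterns are laid out as points around the origin, the candidate is rotated so that one of its distance lines coincides with the first distance line of the matched pattern (the full procedure tries every pair of lines), nearest-neighbour distances are thresholded, and the candidate with the largest RMS merit is retained. The helper names and the `merit` callable are our own additions.

```python
import numpy as np

def to_points(d, angles):
    """Lay a relative pattern (distances, angles between successive lines)
    out as 2-D points around the origin."""
    theta = np.concatenate([[0.0], np.cumsum(angles[:-1])])
    return np.column_stack([d * np.cos(theta), d * np.sin(theta)])

def rms_merit(matched, candidate, merit, threshold):
    """Best RMS of merit values over the test configurations obtained by
    rotating the candidate so that one of its distance lines coincides with
    the first distance line of the matched pattern."""
    ref = to_points(*matched)
    ref_dir = np.arctan2(ref[0, 1], ref[0, 0])
    cand = to_points(*candidate)
    best = 0.0
    for k in range(len(cand)):
        rot = ref_dir - np.arctan2(cand[k, 1], cand[k, 0])
        c, s = np.cos(rot), np.sin(rot)
        pts = cand @ np.array([[c, s], [-s, c]])     # rotate row vectors by rot
        dists = np.linalg.norm(ref[:, None, :] - pts[None, :, :], axis=2).min(axis=1)
        kept = dists[dists <= threshold]             # drop points with no correspondent
        if kept.size:
            best = max(best, np.sqrt(np.mean(merit(kept) ** 2)))
    return best

# decision for one matched pattern: keep the candidate with the largest RMS
# merit, or report "no match" if none reaches an acceptance level.
```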
3. Real application
3.1 Material and method

Here we present a real experiment using the technique proposed above. For the quantitative analysis of megakaryocytes in bone marrow tissue, a serial sectioning technique has been used in which the tissue is cut into a number of consecutive 2-μm thick sections to reveal the inner structure under the light microscope [9]. The 3D reconstruction of megakaryocyte cells was first hampered by the serious deformation of the sections caused by a relatively complex section preparation procedure (some deformation examples can be found in [15]).
Figure 5. Point representation examples of deformed sections (group 33): section 107 with 112 cells and section 108 with 137 cells.
A point representation example of megakaryocyte cells in tissue is given in Figure 5, where, for two consecutive sections, not only the numbers of points (profiles) are different, but also the arrangements of the points are rather dissimilar. Secondly, the expected computational effort is also a serious concern. The average surface of one section is about 6×10⁷ μm². A sampling rate of 5 pixels/μm has been employed for measurement accuracy, so at full resolution the image area is 1.5×10⁹ pixels (corresponding to about six thousand 512×512 images) per section. The reconstruction approach indicated in the first section has been used to reduce the storage and computational requirements. The above pattern matching technique has been employed to find the corresponding profiles in these deformed serial sections. The process is carried out as described in the last section. In the beginning, the first section in a group is chosen as the matched section and, for each profile in this section, we look for its correspondent in the second section with the pattern matching method. Then the next pair of sections is taken into account, and this process continues to the last section of the group.
3.2 Merit function

For this application, a merit function indicating the quality of match is derived from the statistical characteristics of the mismatch error in the section and calculated according to probability theory. As described and tested in [15], the dissimilarities caused mainly by the section distortion along both axes of a rectangular Cartesian system are normally distributed. If X and Y are the random variables denoting the mismatch along the x and y axes, respectively, their probability density functions can be written as follows (see, e.g., [16]):
p(x) = (1/(√(2π) σ_x)) exp(−x²/(2σ_x²)),    p(y) = (1/(√(2π) σ_y)) exp(−y²/(2σ_y²))
where σ_x and σ_y are the standard deviations of X and Y, respectively. As mentioned above, the Euclidean distance has been used in the similarity tests. This distance is a function of X and Y and is also a random variable. If we denote it by R, then:

R = √(X² + Y²)
The probability density function of R can be written as [17]:

p(r) = ∬ δ(r − √(x² + y²)) p(x, y) dx dy    (7)
where p(x, y) is the joint probability density function of X and Y, and δ(·) is the impulse function. It is reasonable to assume that X and Y are independent random variables and that σ = σ_x = σ_y, because the distortion is random and has no dominant orientation. Under this condition we have:

p(x, y) = (1/(2πσ²)) exp(−(x² + y²)/(2σ²))    (8)
Substituting (8) into (7), we finally get:

p(r) = (r/σ²) exp(−r²/(2σ²)),    r ≥ 0    (9)
The distribution of R is a Rayleigh distribution, which attains its maximum at r = σ. Normalizing equation (9) so that this maximum equals one yields:

p_n(r) = (r/σ) exp((σ² − r²)/(2σ²))    (10)
The merit function is then defined as:

M(r) = 1                                   if 0 ≤ r ≤ σ
M(r) = (r/σ) exp((σ² − r²)/(2σ²))          if r > σ
The curve of the merit function is shown in Figure 6, where x = r/σ. This function is a single-valued function of r; a good match receives more reward than a bad one.

Figure 6. Curve of the merit function (abscissa x = r/σ).

For x ≥ 1, the function value is proportional to the probability of R and thus takes into account the statistical characteristics of the mismatch caused by the distortion of the section. From Figure 6 one will notice that as the curve recedes from x = 1 (r = σ), it descends faster and faster until it reaches x = √3 (r = σ√3). Beyond that point the descent becomes ever slower; it can further be shown that this point is an inflexion point. The value r = σ√3 is chosen as the threshold mentioned in the last section. Only the distances between a point of the matched pattern and its nearest neighbor which do not exceed this threshold are taken into account for decision making.
3.3 Results

In one of our experiments on the 3D reconstruction of megakaryocyte cells, a total of 1926 megakaryocyte profiles selected from 15 consecutive sections were treated. The pattern matching procedure was applied to the 14 section pairs (the two sections shown in Figure 5 are one such pair) to register these profiles. The automatically established relations between adjacent profiles were compared with the operator's observations under the microscope for verification. The results are listed in Table 1. The first column gives the labels of the sections (matched/matching section). The second column shows the numbers of cells in the respective matched and matching sections; the sum given below it is the number of cells to be matched. The results are rather satisfactory, since 95.63% of the profiles were correctly assigned by the automatic pattern matching procedure (range 90.30% to 99.27%). Only 4.37% (range 0.73% to 9.70%) of the assignments were incorrect. In another trial, in which only a global transformation was used, the correspondence finding error rate reached 33.21% to 41.97% for different sections. The accuracy of our technique is thus quite satisfactory. Based on the registration results, 51 complete megakaryocyte cells in these sections have been symbolically reconstructed. More details can be found in [15].
4. Discussion

The performance of the above correspondence finding technique can be improved in several ways in practical applications. When most of the objects extend through several consecutive sections, the profile distributions in those sections retain some relationship even if the sections are deformed. This likeness may often be perceived once the difference in orientation among sections has been corrected. This implies that, as in two-stage template matching techniques [18], a two-stage pattern matching procedure can be employed to take advantage of this likeness. At the first stage, a rough transformation (translation and/or rotation of the whole section) applying the least squares criterion (such as that proposed in [6], but determined automatically as described above) can be employed after the registration of several pairs of corresponding
TABLE 1. Pattern matching results.

Section pair    # of cells    Correct       %     Error       %
102/103         133/134           124   93.23         9    6.77
103/104         134/129           121   90.30        13    9.70
104/105         129/137           126   97.67         3    2.33
105/106         137/137           130   94.90         7    5.10
106/107         137/112           130   94.90         7    5.10
107/108         112/137           104   92.86         8    7.14
108/109         137/139           136   99.27         1    0.73
109/110         139/117           135   97.12         4    2.88
110/111         117/136           112   95.73         5    4.27
111/112         136/116           134   98.53         2    1.47
112/113         116/131           114   98.28         2    1.72
113/114         131/108           121   92.37        10    7.63
114/115         108/120           102   94.44         6    5.56
115/116         120/140           119   99.17         1    0.83

Sum             1786             1708                78
Average                                 95.63              4.37
profiles in adjacent sections. This transformation reduces the search region because it establishes a preliminary correspondence between the two sections. Only those matching points falling into the search region need be considered as potential matching points. In many cases this region is quite small compared with the whole section area. As a result, only a small number of patterns have to be constructed and tested, and the amount of calculation is greatly reduced. Another possible improvement is to reduce the computational effort spent on testing no-match patterns. Since a cumulative difference measurement is employed, a judgement can sometimes be made before all the matching patterns are completely tested. For example, when the difference between two patterns under test is already so large that continuing the calculation would be worthless, the test for the current patterns can be stopped. In our experience the number of potential matching points can be much larger than the number of matched points, so the computational saving is considerable. Statistically, the larger the mismatch between two patterns, the shorter the computation. The pattern matching technique has a certain generality because the patterns for matching are dynamically constructed; a pattern can be of any size, so the technique is suitable for cases where there is great variation among the patterns to be matched. The principal idea behind this technique is more general, i.e., to extract characteristic descriptions for the registration and classification of patterns. From this point of view, straightforward modifications to adapt this technique to different situations can easily be made. For example, we can introduce the texture information of the image into the pattern vector, when such information is available, to match texture images. Also, the basis of the match need not be limited to
spatial patterns; it can be any group of attributes associated with the perception of patterns. Finally, we want to point out that symbolic reconstruction based on correspondence finding is a very useful process in 3D image analysis. On the one hand, a subsequent 3D pictorial reconstruction becomes possible, based on the relationships among different parts of the same objects. On the other hand, certain analysis tasks, such as 3D object counting and 3D object volume measurement, become possible even before 3D pictorial reconstruction. In many 3D image analysis applications this provides enough information and saves a great deal of processing effort. We can say that the two-level registration approach proposed earlier is not only a short-cut procedure for the registration of serial sections, but also a useful contribution to the whole 3D image analysis task.
Acknowledgements

We would like to express our heartfelt thanks to Professor G. Cantraine and the members of the electronics group, as well as Dr. J. M. Paulus and the members of the hematology group, of Liège University, Belgium. The support of the Netherlands' Project Team for Computer Application (SPIN), Three-dimensional Image Analysis Project, is also gratefully appreciated.
References

1. Johnson EM, Capowski JJ. Principles of Reconstruction and Three-Dimensional Display of Serial Sections Using a Computer. In: RR Mize, ed. The Microcomputer in Cell and Neurobiology Research. Elsevier, 1985.
2. Pratt WK. Digital Image Processing. John Wiley & Sons, 1978.
3. Gentile AN, Harth A. The Alignment of Serial Sections by Spatial Filtering. Comput Biomed Res 1978; 11: 537-551.
4. Barnea DI, Silverman HF. A Class of Algorithms for Fast Digital Image Registration. IEEE Trans Computers 1972; C-21(2): 179-186.
5. Rosenfeld A. Image Pattern Recognition. Proceedings of IEEE 1981; 69(5): 596-605.
6. Chawla SD, Glass L, Freiwald S, Proctor JW. An Interactive Computer Graphic System for 3-D Stereoscopic Reconstruction from Serial Sections: Analysis of Metastatic Growth. Comput Biol Med 1982; 12(3): 223-232.
7. Kropf N, Sobel I, Levinthal C. Serial Section Reconstruction Using CARTOS. In: RR Mize, ed. The Microcomputer in Cell and Neurobiology Research. Elsevier, 1985.
8. Merickel M. 3D Reconstruction: The Registration Problem. Comput Vision Graphics Image Process 1988; 42(2): 206-219.
9. Zhang YJ. A 3-D Image Analysis System: Quantitation of Megakaryocytes. Submitted to Cytometry, 1990.
10. Tou JT, Gonzalez RC. Pattern Recognition Principles. Reading, MA: Addison-Wesley, 1974.
11. Rosenfeld A, Kak AC. Digital Picture Processing. Academic Press, 1976.
12. Gonzalez RC, Wintz PA. Digital Image Processing (second edition). Reading, MA: Addison-Wesley, 1987.
13. Aggarwal JK, Davis LS, Martin WN. Correspondence Processes in Dynamic Scene Analysis. Proceedings of IEEE 1981; 69(5): 562-572.
14. Wang LX et al. Handbook of Mathematics. Peking, 1979.
15. Zhang YJ. Development of a 3-D computer image analysis system, application to megakaryocyte quantitation. Ph.D. dissertation, Liège University, Belgium, 1989.
16. Zar JH. Biostatistical Analysis. Prentice-Hall, 1984.
17. Pugachev VS. Theory of Random Functions. Reading, MA: Addison-Wesley, 1965.
18. Vanderbrug GJ, Rosenfeld A. Two-Stage Template Matching. IEEE Trans Computers 1977; C-26(4): 384-393.
CHAPTER 6
Biological Applications of Computer Graphics and Geometric Modelling

A.N. Barrett¹ and D. Summerbell²

¹Department of Computer Science, Brunel University, Uxbridge, Middlesex, UB8 3PH and ²National Institute for Medical Research, The Ridgeway, Mill Hill, London, NW7 1AA, UK
Abstract

This paper describes an application of geometric modelling to morphogenetic behaviour in order to simulate and so quantify the processes responsible for growth and development. The organism selected for these studies was the developing chick wing bud, and the modelling process was carried out in both two and three dimensions. The 3D simulation of growth during the early stages results in irregularities of shape at the surface of the organism. The representation of these surfaces using standard graphics packages presents certain difficulties, and we consider alternatives for the display of these and similar models.
1. Introduction

Since the publication of the theory of positional information [1, 2], there has been increasing interest in theoretical models that attempt to explain one or other aspect of pattern formation. Wolpert's original formulation described essentially one-dimensional fields and utilised a source-sink diffusion model. However, it was extendable to two or three dimensions using independent variables for each orthogonal axis and was not limited to the source-sink diffusion mechanism. Numerous papers have subsequently extended the ideas in the direction of both other coordinate systems [3] and other mechanisms [4], but the conceptual framework remains that of positional information. The theory has also catalysed interest in quite different approaches to pattern formation, such as prepattern mechanisms [5]. Surprisingly, there has not been a comparable interest in other aspects of development. In particular, the development of form has been largely ignored, though recently there have been exciting developments in this area [6, 7]. They show, using quite different
approaches, how complex changes in shape can be modelled using algorithms that simulate physical interactions between cells. In this paper we too are interested in shape rather than pattern formation, and we present a mathematical model based on a finite element analysis approach, which describes the changes with time needed to deform first a boundary and then a surface so as to produce a solid resembling the wing bud.

There are several important advantages to be gained by mathematically defining the changes in shape resulting from the morphogenetic behaviour. In particular, we can express the relationship between growth rate and position as measured from a specific origin within the system. This will enable us subsequently to evaluate the extent to which the observed shape changes can be correlated with the biological mechanisms to be incorporated into any future model. Furthermore, it enables us to make specific predictions about the cell behaviour involved, such as rates of cell division, cell dilation or cell change of shape.

The control of form in the limb bud has been modelled on two previous occasions [8, 9]. Both simulations worked in two dimensions, the antero-posterior (AP) and the proximo-distal (PD), and treated the bud as a set of cells occupying points on a regular matrix. Following uniform cell division, one daughter cell remained in place and the other shifted one row of the matrix one position outwards. Arbitrary rules were formulated to control the direction in which the disturbance occurred so as to simulate limb-bud-like changes in overall shape. We call these "pushing" models, in which shape changes are controlled by cell division in the interior or mesoderm. We have decided to explore a contrasting approach. We look at shape changes in the boundary and treat the interior as a homogeneous medium that serves only to apply a uniform outward pressure on the boundary. This we term a "pulling" model, in which changes in shape are controlled by changes in cell division, cell growth or cell shape in the boundary or ectoderm. We expect that this will lead to a less arbitrarily structured model that can be used to study the control of growth and ultimately to provide a better framework within which to examine models for the control of pattern formation.
2. The anatomy

The chick limb bud develops from a morphologically undefined oval patch in the somatopleure of the lateral plate (Fig. 1a). This thin two-layered sheet of cells, the limb primordium, consists of mesoderm with a covering of ectoderm. The cells in the somatopleure are dividing rapidly, but the rate of division declines except in the presumptive limb area, so that a low swelling is formed [10]. The mesenchymal swelling approximates to a biparabolic solid with two planes of symmetry: the major antero-posterior (AP) axis of approximately 1200 μm and the minor dorso-ventral (DV) axis of approximately 500 μm (Fig. 1b).
Figure 1. (a) View from dorsal surface of 2 day chick embryo showing area of lateral plate from which limb bud will develop. (b) Tangential section (to whole body) through A-A', showing the areas that will form the limb bud. (c) View from dorsal surface of limb bud of 3+ day chick embryo showing orientation and size of x-axis (proximo-distal) and y-axis (anterior-posterior). (d) Tangential section (to whole body) through A-A', showing orientation of z-axis (dorso-ventral) and x-axis (proximo-distal). Perversely this plane of section is commonly styled "longitudinal" (to limb bud) by those working on the limb. The apical ectodermal ridge is a prominent structure running the length of the AP axis along the distal tip. It has no demonstrable structural role but there is some evidence for a signalling or organising function.
The remaining proximo-distal (PD) axis is approximately 300 μm long at the apex. The limb bud is now at approximately stage 18-19 [11], and the succeeding pattern of growth leads to asymmetry in all three axes. The bud is still composed basically of two layers: a thin (effectively one cell thick) outer ectoderm surrounding a mesodermal stuffing. There is one major regional feature. The ectoderm at the distal tip forms a pseudostratified epithelium, the apical ectodermal ridge (AER). The ridge is only a few cells (approximately 50 μm) wide along the DV axis, with a nipple-like profile, but runs for most of the length of the AP axis (approximately 1,000 μm). No structural role has been demonstrated for the AER, but it does have a role in pattern formation, showing many of the characteristics expected of an embryonic organiser [12], boundary region [1], or reference point [13]. Our simulation starts at this stage of development (t = 0 h).
3. The 2D model

At the start of limb outgrowth (t = 0 h) the shape of the bud, as viewed from the dorsal surface, is symmetrical about an axis normal to and bisecting the base of the limb. It then grows out in an essentially proximo-distal direction (Fig. 2a). We define the principal axes of the limb, consistent with Cartesian geometry, as coincident with the base (Y axis, approximating to the embryo AP axis) and a normal medio-lateral bisectrix (X axis, approximating to the limb proximo-distal axis); see Figure 2b. Symmetry considerations suggested that fitting a simple conic section such as an ellipse or parabola to the apical ridge would give a suitable mathematical description of the shape at this stage of growth. Of the two, ellipse or parabola, the latter (y² = 4ax) is preferred, since the only parameter to be determined is the focal length a. The ellipse ([x²/a²] + [y²/b²] = 1) requires the determination of both a and b. The parabola has been used successfully as a curve representing shapes in many biological and engineering situations [14, 6]. Figure 2b shows the fit of the parabola (y² = 4a[c − x]) to the limb bud at time t = 0 in the growth pattern. The constant c is present only to position the origin of the parabola (y² = 4ax) at a point distance c from the coordinate origin; it does not affect the shape.

Having found a simple geometric representation of the shape of the bud at an early stage, it is now necessary to relate this to subsequent growth patterns. This was achieved by defining an origin at the intersection of the major and minor axes, coinciding with the midpoint of the base of the limb (Fig. 2a), and then constructing a series of radial vectors to intersect the boundary at successive increments along the X-axis (Fig. 2c). The next step was to associate a particular scaling coefficient (s) or "growth rate" with each of the radial vectors (Fig. 2c). From Figure 3 it is clear that a constant scaling coefficient or "growth rate" would result in a uniform expansion symmetric about the X axis. This would clearly contradict the type of patterns seen at later stages, where there is a strong component in the negative Y direction (Fig. 2a). This implies that the growth coefficients associated with the radial vectors must be greater for radii associated with negative Y than for those associated with positive Y for the same value of X. (It is important to emphasise at this stage that the increment along X represents the perpendicular distance from the anterior-posterior axis to the point of intersection of the vector with the boundary.) The model initially selected for controlling growth rates was chosen as

s = A e^{−B(c − x)²}    (1)
where A and B are constants and x is the distance from the anterior-posterior axis. That is, for each stage of growth we have to multiply each radial vector by an amount s proportional to A e^{−B(c − x)²}.
Figure 2. (a) Digitised outlines of the developing right chick limb bud seen from the dorsal surface. The first outline is at 72 h post incubation (t = 0 h); subsequent outlines, left to right (6, 12, 22 and 30 h), show the time course of development. (b) The basic parabola used to represent the initial shape of the bud and the radial vectors whose extensions represent the development at a given time. (c) Scaling coefficient applied to four of the radial vectors from t = 0 to t = 6.
Since the growth of the bud is asymmetric about the Y axis, it is clear that the model must reflect this, and it can only do so by appropriately varying the coefficients A and B between the growth stages. This is consistent with saying that the rates of growth in the upper part of the bud (Y > 0) are different from those in the lower part (Y < 0). Thus at any given growth stage the model is finally represented by

s₁ = A e^{−B₀(c − x)²}    for Y > 0, r = 1, 2, ..., 6

and

s₂ = A e^{−B₁(c − x)²}    for Y < 0, r = 7, 8, ..., 11    (2)
Since the model is changing with time (t), we replace A, B₀ and B₁ above by A(t), B₀(t) and B₁(t). This process was found to give a fit of the simulated shape to the observed shape of better than 10% over the time period shown [15]. The simulation is continued up to t = 48 h. By this stage the geometry of the limb outline is visibly growing more complex. The forces modelling the limb are certainly more varied and include both rapid elongation of proximal skeletal elements [16, 17] and complex morphogenetic movements involving limb and flank [18]. The conclusion that may be drawn from the model at this stage is that the rate of expansion is exponential, with the central section constrained by the parameters given in equation 2. This confirmation now enables us to consider an extension of the model to a 3D situation.
4. The 3D model

In three dimensions the shape of the bud is most closely represented by a semi-ellipsoid (Fig. 3), whose basic equation is

x²/a² + y²/b² + z²/c² = 1,    z ≥ 0    (3)
Notice that we can select sections through this model which closely resemble those of the 2D model (see above). We again construct a series of radial vectors with reference to a specified central origin, and from our previous model we conclude that appropriate functions for initiating the 3D growth simulation should be of the form

s = A e^{−B(x² + y²)}    (4)
We also concluded that the simulation and any subsequent analysis would be helped by adopting a finite element approach. For our case a finite element is defined as a 'patch' on the surface of the semi-ellipsoid, defined by a plane through an arbitrary number of points on the surface (Fig. 3). Our approach is to determine the normal to the plane and associate the functions defined above [4] with each normal. At each stage of growth we therefore extend the normals by an amount calculated from the function [4]. The extension of each of these normals gives a set of points which defines a new surface. To facilitate a rapid calculation of this surface, we use a least squares fitting procedure applied to a variable number of neighbouring points and repeat the process over the growth period. The algorithm for this process can be expressed as a four-stage procedure.

1. Generate a mesh of points on the surface of the semi-ellipsoid [Equation 3].
Figure 3. The semi-ellipsoid, radial vector and surface patch with corresponding surface normal used as controlling parameters in the simulation process.
2. Determine the equations of the planar elements, together with their normals, through an arbitrary number of neighbouring points. (The arbitrariness is determined by the user requirements in terms of the required resolution and accuracy.)
3. Establish a set of radial vectors from a specified origin to each normal position at the centre of the planar elements and apply the set of exponential growth functions [Equation 4] to each vector to generate new vectors.
4. Reconstitute the new surface from the ends of the vectors generated in 3, using the least squares fitting procedure.
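A sketch of one cycle of steps 2-4 follows. It reflects our reading of the algorithm: a plane is fitted through each point's neighbours, the radial vector to the patch centre is scaled by the growth function of Equation 4, and the least squares smoothing of step 4 is only indicated. All names and the NumPy implementation are our own; the text also describes extending the surface normals, which the patch normal computed here could equally be used for.

```python
import numpy as np

def growth_cycle(mesh, neighbours, A, B):
    """One simplified pass of steps 2-4.  `mesh` is an (N, 3) array of surface
    points and `neighbours[i]` an index array of the points defining patch i."""
    new_pts = np.empty_like(mesh)
    for i, nb in enumerate(neighbours):
        patch = mesh[nb]
        centre = patch.mean(axis=0)                  # centre of the planar element
        # step 2: plane fit; the smallest right singular vector vt[-1] is the
        # patch normal, available if growth is applied along the normal instead
        _, _, vt = np.linalg.svd(patch - centre)
        s = A * np.exp(-B * (centre[0]**2 + centre[1]**2))   # Equation 4
        new_pts[i] = centre * s                      # step 3: extend the radial vector
    # step 4 would reconstitute a smooth surface from new_pts by a least squares
    # fit over neighbouring points; that smoothing pass is omitted here.
    return new_pts
```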
Steps 2-4 are repeated until the end of the simulation period. Figure 4 shows the initial surface generated using step 1 of the algorithm and 2,000 vectors. To test the robustness of the model, we allowed the algorithm to cycle between steps 2 and 4 over a period of 4 h and, to account for the fact that variations in the growth rate occur at different parts of the surface, we modify the exponential growth parameters according to the values of x and y in a given quadrant, i.e., we vary the rate according to position. Figure 5 shows biphasal growth during the first 2 hours and Figure 6 shows the result of introducing quadriphasal growth up to 10 hours. The plotting software used was the standard GINO package installed on a SUN workstation. From Figure 6 we can see that the surface shows two sharp "ridge" type features across the centre. Attempts to extend the growth period beyond this cause folding along the ridges. This results in a non-single value for the plotting function in these regions, causing difficulties in producing output, and consequently the standard wire frame model had to be abandoned. The fold running in the long axis has a real anatomical correlate, and any representative model must be able to cope with this type of situation, so our requirement was for suitable plotting software to handle these situations. The most obvious approach is either to section the model into two parts with the division along the fold line or to reconstitute the surface using hidden surface algorithms.
Figure 4. The initial surface generated by the computer as a wire frame representation.
Figure 5. The representation of early growth with a symmetric biphasal variation.
Before continuing the discussion, however, we show how the modelling process, even at its present stage, helps us to formulate an analysis for monitoring cellular behaviour. Of particular interest to biologists are the types of forces acting on the cells responsible for growth. Based on our current observations, we are able to formulate a set of equations describing the growth and combine these within a Newtonian framework to derive an expression for the surface forces relative to a specific origin. Our basic equations may be written as:

s = A e^{−B(x² + y²)},    [A = at, B = bt, M = mt]    (5)
where a, b and m are constants. Initial observations show that M varies linearly with time. We may write the force acting on a finite element over a time t as

F = (d/dt)(Mv)
where v is the velocity of the element. We have averaged this value to the speed of the centre of each finite element with respect to time; then

F = (dM/dt) v + M (dv/dt) = m v + m t (dv/dt)

Figure 6. As Figure 5 with an additional growth variation. (Note the longitudinal ridge development, which has an observed anatomical correlate.)
From (5),

v = ds/dt = a e^{−bt(x² + y²)} {1 − bt(x² + y²)}

∴ dv/dt = −ab(x² + y²) e^{−bt(x² + y²)} {2 − bt(x² + y²)}

thus

F = ma e^{−bt(x² + y²)} [1 − 3bt(x² + y²) + b²t²(x² + y²)²]
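The reconstructed expression for F can be checked symbolically; the following short sketch assumes only equation (5) with M = mt and treats v as ds/dt, as in the derivation above.

```python
import sympy as sp

a, b, m, t, x, y = sp.symbols('a b m t x y', positive=True)
u = x**2 + y**2
s = a * t * sp.exp(-b * t * u)            # equation (5): A = a*t, B = b*t
v = sp.diff(s, t)                          # v = ds/dt
F = sp.diff(m * t * v, t)                  # F = d/dt(M v) with M = m*t
expected = m * a * sp.exp(-b * t * u) * (1 - 3*b*t*u + b**2 * t**2 * u**2)
assert sp.simplify(F - expected) == 0      # agrees with the expression above
```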
The force has therefore been expressed in terms of the positional values x and y and the parameters a, b and m. Values for these parameters will be established by matching the model against the experimental observations, i.e., we adjust the parameter values until a representative model is found for all values of t. We will also be able to determine the rates of change of the forces over the surface from the expressions for ∂F/∂x and ∂F/∂y respectively.

Clearly the graphics representation of the model, which enables us to evaluate hypotheses on growth rates by comparing the simulated shape with that observed experimentally, is a major feature of the modelling process. Aside from the hidden surface approach (Fig. 7) for graphics display of the model, we are also investigating an alternative approach which has proved successful in representing biological type structures and would appear to offer considerable potential for our own studies. This approach, based upon the use of hyperquadrics, has been extensively researched by Hanson [19], and we present a brief outline of the method together with some initial results. Hyperquadrics may be considered as roughly analogous to splines in that they are able to render a high resolution surface fitted to a set of points mapped in three dimensions, just as splines give a high resolution line fitted to a set of points mapped in two dimensions. The basic hyperquadric equation is

Σ_{a=1}^{N} δ_a |H_a(x)|^{γ_a} = 1

with

H_a(x) = r_a · x + d_a

where x is a D-dimensional vector, r_a and d_a are constants, δ_a = ±1 (depending on whether the required hyperquadric is elliptic or hyperbolic in nature) and γ_a is a user-specified parameter.
Figure 7. A preliminary 3D reconstruction using hidden surface methods.
In the 3D case we may write the equation as

Σ_{a=1}^{N} δ_a |A_a x + B_a y + C_a z + D_a t|^{γ_a} = 1

or in parametric form:

x = r(θ, φ) cos θ cos φ cos ψ
y = r(θ, φ) sin θ cos φ cos ψ
z = r(θ, φ) sin φ cos ψ
t = sin ψ

and solve for r(θ, φ).
For a more detailed description of the algorithms for solving the above equations, see Hanson [19]. Figure 8 shows a number of 3D shapes based upon the above equations for varying γ_a. The figure shows a variety of shapes which strongly resemble those of certain types of virus and approximate to the type of representation we require for our own model.
Figure 8. An example of shapes closely resembling those seen in certain virus representations, generated using the method of hyperquadrics [19].
Figure 9 shows a number of surfaces generated using the 3D representations of the equation for various values of the parameters δ_a and γ_a. We are currently seeking to reconcile the parameter values with those of the model so that a meaningful display of the modelling process can be achieved.
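As a small illustration of how such surfaces can be generated numerically, the sketch below evaluates a 3D hyperquadric of the form given above on a grid. The coefficients are purely illustrative, the homogeneous t coordinate is fixed at 1, and extracting and rendering the level set f = 1 would be handed to a standard contouring or hidden-surface routine.

```python
import numpy as np

def hyperquadric(coeffs, gammas, deltas, pts):
    """f(x, y, z) = sum_a delta_a * |A_a*x + B_a*y + C_a*z + D_a|**gamma_a.
    `coeffs` is an (N, 4) array of (A_a, B_a, C_a, D_a); the surface is f = 1."""
    planes = pts @ coeffs[:, :3].T + coeffs[:, 3]
    return np.sum(deltas * np.abs(planes) ** gammas, axis=1)

# three orthogonal terms: gamma well above 2 gives a rounded cube, gamma near 1
# an octahedron-like solid -- a simple member of the family the method produces
coeffs = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
gammas = np.array([4.0, 4.0, 4.0])
deltas = np.array([1.0, 1.0, 1.0])

g = np.linspace(-1.5, 1.5, 41)
X, Y, Z = np.meshgrid(g, g, g, indexing='ij')
pts = np.column_stack([X.ravel(), Y.ravel(), Z.ravel()])
f = hyperquadric(coeffs, gammas, deltas, pts).reshape(X.shape)
inside = f <= 1.0    # voxels enclosed by the hyperquadric surface f = 1
```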
5. Conclusions

The 2D modelling has provided suitable information for initiating a 3D simulation of growth of the chick wing bud over a 30 hour period. We have been able to derive a number of basic equations which will enable us to model the forces shaping the surface at any stage of the growth process. Promising methods for these and other biological modelling exercises are being investigated.
Acknowledgements

We thank Mr. Karl Matthews, a Computer Science project student, for developing the 3D plotting software used to produce the results shown in Figure 7.
Figure 9. Preliminary surfaces generated on a microcomputer using hyperquadric methodology.
References

1. Wolpert L. Positional information and the spatial pattern of cellular differentiation. J theoret Biol 1969; 25: 1-47.
2. Wolpert L. Positional information and pattern formation. Curr Top Dev Biol 1969; 6: 183-224.
3. French V, Bryant PJ, Bryant SV. Pattern regulation in epimorphic fields. Science 1976; 193: 969-981.
4. Gierer A, Meinhardt H. A theory of biological pattern formation. Kybernetik 1972; 12: 30-39.
5. Murray JD. A prepattern formation mechanism for animal coat markings. J theoret Biol 1981; 88: 161-199.
6. Gordon R. Computational embryology of the vertebrate nervous system. In: Geisow MJ, Barrett AN, eds. Computing in Biological Science. Amsterdam: Elsevier, 1983: 23-70.
7. Odell G, Oster G, Burnside B, Alberch P. A mechanical model for epithelial morphogenesis. J math Biol 1980; 9: 291-295.
8. Ede DA, Law JT. Computer simulation of vertebrate limb morphogenesis. Nature (Lond) 1969; 221: 244-248.
9. Mitolo V. Un programma in Fortran per la simulazione dell'accrescimento e della morfogenesi. Boll Soc ital Biol sper 1971; 41: 18-20.
10. Searls RL, Janners MV. The initiation of limb bud outgrowth in the embryonic chick. Devl Biol 1971; 24: 198-213.
11. Hamburger V, Hamilton HL. A series of normal stages in the development of the chick embryo. J Morph 1951; 88: 49-92.
12. Spemann H. Embryonic development and induction. New Haven: Yale University Press, 1938; reprinted Hafner, New York.
13. Goodwin BC, Cohen MH. A phase-shift model for the spatial and temporal organisation of developing systems. J theoret Biol 1969; 25: 49-107.
14. Barrett AN, Burdett IDJ. A three-dimensional model reconstruction of pole assembly in Bacillus subtilis. J theoret Biol 1981; 92: 127-139.
15. Barrett AN, Summerbell D. Mathematical modelling of growth processes in the developing chick wing bud. Comput Biol Med 1984; 14: 411-418.
16. Summerbell D. A descriptive study of the rate of elongation and differentiation of the skeleton of the developing chick wing. J Embryol exp Morph 1976; 35: 241-260.
17. Archer CW, Rooney P, Wolpert L. The early growth and morphogenesis of limb cartilage. Prog clin Biol Res 1983; 110: 267-278.
18. Searls RL. Shoulder formation, rotation of the wing, and polarity of the wing mesoderm and ectoderm. Prog clin Biol Res 1983; 110: 165-174.
19. Hanson AJ. Hyperquadrics: smoothly deformable shapes with polyhedral bounds. Computer Vision, Graphics and Image Processing 1988; 44: 191-210.
Statistics
CHAPTER 7
Experimental Optimization for Quality Products and Processes

S.N. Deming

Department of Chemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204-5641, USA
Adapted, in part, from Stanley N. Deming, "Quality by Design", CHEMTECH 1988; 18(9): 560-566.
1. Introduction

Over half a century ago, in 1939, William Edwards Deming wrote in the introduction to Walter A. Shewhart's famous book on quality control [1], "Most of us have thought of the statistician's work as that of measuring and predicting and planning, but few of us have thought it the statistician's duty to try to bring about changes in the things that he measures. It is evident, however, ... that this viewpoint is absolutely essential if the statistician and the manufacturer or research worker are to make the most of each other's accomplishments." Although Shewhart's methods were taught before and during World War II, after the war the methods were largely ignored in the west [2]. In postwar Japan, however, the methods were taught by W. E. Deming and others, and were responsible, in part, for re-establishing the Japanese economy. Genichi Taguchi, a Japanese engineer, was especially successful in using fractional factorial designs in innovative ways to reduce variation in a product or process at the design stage [3, 4]. After almost half a century, western industry finally seems ready to accept the statistician's long-standing offer of help. "There is much to be done if [industries are] to survive in the new economic age. We statisticians have a vital role to play in the transformation that is needed to make our industry competitive in the world economy" [5].
2. Quality

The locus of the current industrial transformation is "quality," a word that has several meanings.
Figure 1. Percent impurity vs. batch number for a chemical process.
Two meanings which are critical for both quality planning and strategic business planning relate to (a) product or process advantages: features such as "inherent uniformity of a production process," "fuel consumption of an engine," and "millions of instructions per second (MIPS) of a computer" determine customer satisfaction; and (b) product or process deficiencies: "field failures," "factory scrap or rework," and "engineering design changes" lead to customer dissatisfaction [6]. It is clear from these determinants of quality that "[the] key to improved quality is improved processes. ... Processes make things work. Thousands of processes need improvement, including things not ordinarily thought of as processes, such as the hiring and training of workers. We must study these processes and find out how to improve them. The scientific approach, data-based decisions, and teamwork are key to improving all of these processes" [5]. In the past, some industries achieved quality not by making the process produce good product but rather by separating the good from the bad based on mass inspection of the outgoing product. Only in rare situations is mass inspection the appropriate response to deficient quality. In general, "Inspection is too late, ineffective, costly. ... Scrap, downgrading, and rework are not corrective action on the process. Quality comes not from inspection, but from improvement of the process" [7].
3. Statistical process control

Consider a chemical process that produces not only the desired compound in high yield but also a relatively small amount of an undesirable impurity. Discussions between the producer and the consumer of this material suggest that an impurity level of up to 2.0% can be tolerated. A higher impurity level is unacceptable. By mutual consent, a specification level of ≤ 2.0% is set. Figure 1 plots the percent impurity vs. batch number for this chemical process [8]. Most of the time the percent impurity is < 2.0%, but about one batch in five is outside the specification. This can be costly for the manufacturer if there is no other customer willing to purchase "out of spec" material. These out-of-spec batches might be kept in a holding area until they can be reworked, usually by blending with superior grade material. But storage and rework are costly and ultimately weaken the competitive position of the manufacturer.

Figure 1 is a way of letting the process talk to us and tell us how it behaves [9]. It seems to be saying that: on the average, the impurity level is below 2.0%; there is some variation from batch to batch; and the process behaves consistently: a moving average wouldn't appear to go up or down much with time, and the variation seems to be fairly constant with time. These ideas are confirmed in the statistical "x-bar" and "r" charts shown in Figures 2 and 3, respectively. To construct these charts, the group of 100 batches has been subdivided into 25 sequential subgroups of four. For each subgroup, the average, x̄ (pronounced "x-bar"), and the range (r = greatest reading minus least reading in the subgroup) have been calculated. The resulting values are plotted as a function of subgroup number. (Subgroups are necessary, in part, to obtain estimates of the range.) As expected, the average percent impurity doesn't go up or down very much with time, and the range (a measure of variation) is fairly constant with time.

Based on these observations, we make the assumption that the process is stable and place the middle, unlabeled dashed line in Figure 2, the "grand average" or "average of averages." We also estimate the standard deviation (s) and plot the three-sigma limits. The subgroup averages will lie within these limits approximately 99.7% of the time if the process is stable. Only rarely would a subgroup average lie outside these limits if the process is "in statistical control" (that is, if the process is stable). Similar three-sigma limits can be placed on the range. In Figure 3 only the upper control limit is labeled; the lower control limit is zero. The unlabeled dashed line represents the average subgroup range. Only very rarely (about 0.3% of the time) would a subgroup range be greater than the upper control limit if the process is in statistical control. It is absolutely essential to understand that these control limits are a manifestation of the process speaking to us, telling us how it behaves. These control limits do not represent how we would like the process to behave.
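A minimal sketch of the chart construction just described: subgroup the batches in fours, compute x̄ and r for each subgroup, and set three-sigma limits from the average range. The use of the classical factors A2, D3 and D4 for subgroups of four is an assumption on our part; the chapter does not state which formula it uses.

```python
import numpy as np

def xbar_r_chart(impurity, subgroup_size=4):
    """Subgroup the measurements, compute subgroup means and ranges, and derive
    three-sigma control limits from the average range using the classical
    factors A2, D3, D4 for subgroups of four."""
    A2, D3, D4 = 0.729, 0.0, 2.282          # standard constants for n = 4
    x = np.asarray(impurity, dtype=float)
    x = x[: len(x) // subgroup_size * subgroup_size].reshape(-1, subgroup_size)
    xbar = x.mean(axis=1)                   # subgroup averages
    r = x.max(axis=1) - x.min(axis=1)       # subgroup ranges
    grand_mean, rbar = xbar.mean(), r.mean()
    limits = {
        "xbar": (grand_mean - A2 * rbar, grand_mean + A2 * rbar),
        "r": (D3 * rbar, D4 * rbar),
    }
    return xbar, r, limits
```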
Figure 2. X-bar (mean) control chart from the data in Figure 1.
Figure 3. R (range) control chart from the data in Figure 1.
Figure 4. Effect of an out-of-control situation on the x-bar control chart.
It is common but misguided practice to draw on control charts lines that represent our wishes. These lines can have no effect on the behavior of the process. Control charts are useful because they offer a way of letting the process tell us when it has changed its behavior. In Figure 4 it is clear that something significant happened at subgroups 27-29. The process has clearly "gone out of control." So many excursions so far away from the control limit in such a short time would be highly unlikely from a statistical point of view if the process were still operating as it was before. Such excursions suggest that there is some assignable cause (Shewhart) or special cause (W. Edwards Deming) for the observed effect. Because these excursions are undesirable in this example (most of the individual batches produced would probably be unfit for sale), it is economically important to discover the assignable cause and prevent its occurrence in the future. One of the most powerful uses of control charts is their ability to tell us when the process is in statistical control so we can leave it alone and not tamper with it. Another use of control charts is to show when special causes are at work so that steps can be taken to discover the identity of these special causes and use them to improve the process. However, there are two difficulties with this second use of control charts. First, we have to wait for the process to speak to us. This passive approach to process optimization isn't very efficient. If the process always stays in statistical control, we won't learn anything and the process can't get better. It is not enough just to be in statistical
control; our product must become "equal or superior to the quality of competing products" [6]. We probably can't wait for the process to speak to us. We must take action now. Second, when the process does speak to us (when it goes out of control), it tantalizes us by saying "Hey! I'm behaving differently now. Try to find out why." Discovering the reason the process behaves differently requires that we determine the cause of a given effect. Discovering which of many possible causes is responsible for an observed effect, an activity that continues to puzzle philosophers, is often incredibly difficult.
4. Experimental optimization

In an excellent paper on cause-and-effect relationships, Paul W. Holland [10] concludes that "[the] analysis of causation should begin with studying the effects of causes rather than ... trying to define what the cause of a given effect is." That is a powerful conclusion. It is a recommendation that the technologist intentionally produce causes (changes in the way the process is operated) and see what effects are produced (changes in the way the process behaves). With such designed experiments, information can be obtained now. We can make the process talk to us. We can ask the process questions and get answers from it. We don't have to wait.

Figure 5 contains the results of a set of experiments (open circles) designed to discover the effect of temperature on impurity. The right side of Figure 5 shows the presumed "cause and effect" relationship between impurity level and temperature. From the shape of the fitted curve it would appear that, at least insofar as impurity level is concerned, our current operating temperature of 270 is not optimal. But there are two reasons why it is not optimal: not only is the level of impurity relatively high, but also the amount of variation of impurity level with temperature is relatively high. "Set-point control" is almost never set-point control. We might set the controller to maintain a temperature of 270.0, but time constants within the control loop and variations in mixing, temperature, flow rates, or sensors prevent the controller from maintaining a temperature of exactly 270.0. In practice, the temperature of the process will fluctuate around the set point. This variation in temperature is represented by the black horizontal bar along the temperature axis in Figure 5. Variations in temperature will be transformed by the process into variations in impurity level. The relationship between percent impurity and temperature is rather steep in the region of temperature = 270. When the temperature wanders to lower levels, the percent impurity will be high. When the temperature wanders to higher levels, the percent impurity will be low. This variation in impurity is represented by the black vertical bar along the impurity axis in Figure 5.
Figure 5. Results of a set of experiments designed to determine the influence of temperature on percent impurity.
The left side of Figure 5 suggests that variation in temperature will, over time, be transformed into variations in percent impurity and will result in a run chart similar to that shown in Figure 1 when the process is operated in the region of temperature = 270.

How could we decrease the variation in percent impurity? One way would be to continue to operate at temperature = 270 but use a controller that allows less variation in the temperature. This would decrease the width of the black horizontal bar in Figure 5 (variation in temperature), which would be transformed into a shorter black vertical bar (variation in impurity). The resulting run chart would show less variation. But this is a "brute force" way of decreasing the variation. There is another way to decrease the variation in this process. Other effects being equal, it is clear from the right side of Figure 5 that we should change our process's operating conditions to a temperature of about 295 if we want to decrease the level of impurity. In this example, there is an added benefit from working at these conditions. This benefit is shown in Figure 6. Not only has the level of impurity been reduced, but the variation in impurity has also been reduced! This is because the relationship between impurity and temperature is not as steep in the region of temperature = 295. When the process is operated in this region, it is said to be "rugged" or "robust" with respect to changes in temperature: the process is relatively insensitive to small changes in temperature [11]. This principle of ruggedness is one aspect of the Taguchi philosophy of quality improvement [12]. We make the process insensitive to expected random variations.
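This argument can be stated as a first-order propagation of error: the standard deviation of the impurity is roughly the local slope of the response curve times the standard deviation of the temperature. The response curve in the sketch below is hypothetical, chosen only to be steep near 270 and flat near 295 in the spirit of Figure 5.

```python
import numpy as np

def impurity(T):
    """Hypothetical response curve: impurity falls steeply near T = 270 and
    flattens out near T = 295, qualitatively like Figure 5."""
    return 0.5 + 3.0 * np.exp(-(T - 255.0) / 12.0)

def impurity_sd(T, sd_T, h=0.01):
    """First-order propagation of error: sd(impurity) ~ |d impurity/dT| * sd(T)."""
    slope = (impurity(T + h) - impurity(T - h)) / (2 * h)
    return abs(slope) * sd_T

for T in (270.0, 295.0):
    print(T, impurity(T), impurity_sd(T, sd_T=2.0))
# the same +/- 2 degree wander in temperature produces a much smaller spread in
# impurity at 295 than at 270, which is the "ruggedness" argument in numbers.
```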
Figure 6. Improved control limits for the process operated near temperature = 295.
If the process were operated at this new temperature, the corresponding control chart would be similar to that shown in Figure 7, and the corresponding run chart would look like Figure 8. Some persons criticize run charts like Figure 8 as belonging to "gold-plated processes," processes that are "better than they need to be." In some cases the criticism might be justified, for example if the economic consequences of operating at the higher temperature were not justified by the economic consequences of producing such good product. But in many cases there are three arguments that speak in favor of these gold-plated processes. First, management doesn't have to spend time with customer complaints, and no one wastes time on nonproductive "fire-fighting." Second, improvement of product often opens new markets. And third, there is a lot of "elbow room" between the percent impurity produced by the improved process and the original specification limit in Figure 8. If the process starts to drift upward (perhaps a heat exchanger is fouling and causing the impurity level to increase), within-spec material can still be produced while the special cause is discovered and eliminated.
5. Quality by design

Consideration of quality in manufacturing should begin before manufacturing starts. This
Figure 8. Percent impurity vs. batch number for a chemical process operated near temperature = 295.
is "quality by design" [13]. Just as there is a producer-consumer relationship between manufacturing and the customer, so, too, is there a producer-consumer relationship between R&D and manufacturing. The manufacturing group (the consumer) should receive from R&D (the producer) a process that has inherent good quality characteristics. In particular, R&D should develop, in collaboration with manufacturing, a process that is rugged with respect to anticipated manufacturing variables [14]. Experimentation at the manufacturing stage is orders of magnitude more costly than experimentation at the R&D stage. As Kackar [13] has pointed out, "[It] is the designs of both the product and the manufacturing process that play crucial roles in determining the degree of performance variation and the manufacturing cost."

Data-based decisions, whether at the R&D level or at the manufacturing level, often require information that can be obtained most efficiently using statistical design of experiments. Creating such designs requires teamwork among researchers and statisticians. Researchers would agree that it is important for statisticians to understand the fundamentals of the production process. Statisticians would agree that it is important for researchers to understand the fundamentals of experimental design. As Box has stated [15], "If we only follow, we must always be behind. We can lead by using statistics to tap the enormous reservoir of engineering and scientific skills available to us. ... Statistics should be introduced ... as a means of catalyzing engineering and scientific reasoning by way of [experimental] design and data analysis. Such an approach ... will result in greater creativity and, if taught on a wide enough scale, could markedly improve quality and productivity and our overall competitive position." The statistical literature is filled with information about experimental design and optimization [16-76] and can be consulted for details.
References

1. Shewhart WA. Statistical Method from the Viewpoint of Quality Control. The Graduate School, The Agriculture Department, Washington, DC, 1939.
2. Godfrey AB. The History and Evolution of Quality in AT&T. AT&T Technical Journal 1986; 65(2): 4-20.
3. Bendell A, Disney J, Pridmore WA, Eds. Taguchi Methods: Applications in World Industry. IFS Publications, Springer-Verlag, London, 1989.
4. Ross PJ. Taguchi Techniques for Quality Engineering: Loss Function, Orthogonal Experiments, Parameter and Tolerance Design. McGraw-Hill Book Company, New York, NY, 1988.
5. Joiner BL. The Key Role of Statisticians in the Transformation of North American Industry. Am Stat 1985; 39: 224-227.
6. Juran JM. Juran on Planning for Quality. Macmillan, New York, NY, 1988, pp. 4-5.
7. Deming WE. Quality, Productivity, and Competitive Position. Center for Advanced Engineering Study, Massachusetts Institute of Technology, Cambridge, MA, 1982, p. 22.
8. Grant EL, Leavenworth RS. Statistical Quality Control. 6th ed., McGraw-Hill, New York, NY, 1988.
9. Box GEP, Hunter WG, Hunter JS. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York, NY, 1978, pp. 1-15.
10. Holland PW. Statistics and Causal Inference. J Am Stat Assoc 1986; 81: 945-960. (See also "Comments" and "Reply," pp. 961-970.)
11. Deming SN. Optimization of Experimental Parameters in Chemical Analysis. In: De Voe JR, Ed. Validation of the Measurement Process. ACS Symposium Series No. 63, American Chemical Society, Washington, DC, 1977, pp. 162-175.
12. Taguchi G. Introduction to Quality Engineering: Designing Quality into Products and Processes. Asian Productivity Organization, Kraus International Publications, White Plains, NY, 1986.
13. Kackar RN. Off-Line Quality Control, Parameter Design, and the Taguchi Method. J Qual Technol 1985; 17: 176-209.
14. Walton M. The Deming Management Method. Dodd, Mead & Co., New York, NY, 1986, pp. 131-157.
15. Box GEP. Technometrics 1988; 30: 1-18.
16. Anderson VL, McLean RA. Design of Experiments: A Realistic Approach. Dekker, New York, NY, 1974.
17. Anon. ASTM Manual on Presentation of Data and Control Chart Analysis. Committee E-11 on Statistical Methods, ASTM Special Technical Publication 15D, American Society for Testing and Materials, 1916 Race Street, Philadelphia, PA 19103, 1976.
18. Barker TB. Quality by Experimental Design. Dekker, New York, NY, 1985.
19. Bayne CK, Rubin IB. Practical Experimental Designs and Optimization Methods for Chemists. VCH Publishers, Deerfield Beach, FL, 1986.
20. Beale EML. Introduction to Optimization. Wiley, New York, NY, 1988.
21. Beveridge GSG, Schechter RS. Optimization: Theory and Practice. McGraw-Hill, New York, NY, 1970.
22. Box GEP, Draper NR. Empirical Model-Building and Response Surfaces. Wiley, New York, NY, 1987.
23. Box GEP, Draper NR. Evolutionary Operation: A Method for Increasing Industrial Productivity. Wiley, New York, NY, 1969.
24. Cochran WG, Cox GM. Experimental Designs. Wiley, New York, NY, 1950.
25. Cornell J. Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data. Wiley, New York, NY, 1981.
26. Daniel C, Wood FS. Fitting Equations to Data. Wiley-Interscience, New York, NY, 1971.
27. Daniel C. Applications of Statistics to Industrial Experimentation. Wiley, New York, NY, 1976.
28. Davies OL, Ed. Design and Analysis of Industrial Experiments. 2nd ed., Hafner, New York, NY, 1956.
29. Davis JC. Statistics and Data Analysis in Geology. Wiley, New York, NY, 1973.
30. Deming SN, Morgan SL. Experimental Design: A Chemometric Approach. Elsevier, Amsterdam, The Netherlands, 1987.
31. Deming WE. Out of the Crisis. Center for Advanced Engineering Study, Massachusetts Institute of Technology, Cambridge, MA, 1986.
32. Deming WE. Some Theory of Sampling. Dover, New York, NY, 1950.
33. Deming WE. Statistical Adjustment of Data. Dover, New York, NY, 1943.
34. Diamond WJ. Practical Experiment Designs. 2nd ed., Van Nostrand Reinhold, New York, NY, 1989.
35. Draper NR, Smith H. Applied Regression Analysis. 2nd ed., Wiley, New York, NY, 1981.
36. Duncan AJ. Quality Control and Industrial Statistics. Revised ed., Irwin, Homewood, IL, 1959.
37. Dunn OJ, Clark VA. Applied Statistics: Analysis of Variance and Regression. 2nd ed., Wiley, New York, NY, 1987.
38. Fisher Sir RA. Statistical Methods for Research Workers. Hafner, New York, NY, 1970.
39. Fisher Sir RA. The Design of Experiments. Hafner, New York, NY, 1971.
40. Fletcher R. Practical Methods of Optimization. 2nd ed., Wiley, New York, NY, 1987.
41. Hacking I. The Emergence of Probability. Cambridge University Press, Cambridge, England, 1975.
42. Havlicek LL, Crain RD. Practical Statistics for the Physical Sciences. American Chemical Society, Washington, DC, 1988.
43. Himmelblau DM. Process Analysis by Statistical Methods. Wiley, New York, NY, 1970.
44. Hunter JS. Applying Statistics to Solving Chemical Problems. CHEMTECH 1987; 17: 167-169.
45. Ishikawa K. Guide to Quality Control. Asian Productivity Organization, 4-14, Akasaka 8-chome, Minato-ku, Tokyo 107, Japan, 1982. Available from UNIPUB, Box 433 Murray Hill Station, New York, NY 10157, (800) 521-8110.
46. Juran JM, Editor-in-Chief. Juran's Quality Control Handbook. 4th ed., McGraw-Hill, New York, NY, 1988.
47. Khuri AI, Cornell JA. Response Surfaces: Designs and Analyses. ASQC Quality Press, Milwaukee, WI, 1987.
48. Kowalski BR, Ed. Chemometrics: Theory and Application. ACS Symposium Series 52, American Chemical Society, Washington, DC, 1977.
49. Mallows CL, Ed. Design, Data and Analysis: by Some Friends of Cuthbert Daniel. Wiley, New York, NY, 1987.
50. Mandel J. The Statistical Analysis of Experimental Data. Wiley, New York, NY, 1963.
51. Massart DL, Kaufman L. The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, NY, 1983.
52. Massart DL, Vandeginste BGM, Deming SN, Michotte Y, Kaufman L. Chemometrics, a Textbook. Elsevier Science Publishers, Amsterdam, 1987.
53. Mendenhall W. Introduction to Linear Models and the Design and Analysis of Experiments. Duxbury Press, Belmont, CA, 1968.
54. Miller JC, Miller JN. Statistics for Analytical Chemistry. Wiley, New York, NY, 1988.
55. Montgomery DC. Design and Analysis of Experiments. 2nd ed., Wiley, New York, NY, 1984.
56. Moore DS. Statistics: Concepts and Controversies. Freeman, San Francisco, CA, 1979.
57. Natrella MG. Experimental Statistics, National Bureau of Standards Handbook 91. Washington, DC, 1963.
58. Neter J, Wasserman W. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs. Irwin, Homewood, IL, 1974.
59. Norman GR, Streiner DL. PDQ Statistics. B.C. Decker, Toronto, 1986.
60. Scherkenbach WW. The Deming Route to Quality and Productivity: Road Maps and Roadblocks. ASQC Quality Press, Milwaukee, WI, 1986.
61. Scholtes PR. The Team Handbook: How to Use Teams to Improve Quality. Joiner Associates, Inc., 3800 Regent St., P.O. Box 5445, Madison, WI 53705-0445, (608) 238-4134, 1988.
62. Sharaf MA, Illman DL, Kowalski BR. Chemometrics. Wiley, New York, NY, 1986.
63. Shewhart WA. Economic Control of Quality of Manufactured Product. Van Nostrand, New York, NY, 1931.
64. Small BB. Statistical Quality Control Handbook. Western Electric, Indianapolis, IN, 1956.
65. Snedecor GW, Cochran WG. Statistical Methods. 7th ed., The Iowa State University Press, Ames, IA, 1980.
66. Stigler SM. The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, Cambridge, MA, 1986.
67. Taylor JK. Quality Assurance of Chemical Measurements. Lewis Publishers, Chelsea, MI, 1987.
68. Wernimont GT (Spendley W, Ed.). Use of Statistics to Develop and Evaluate Analytical Methods. Association of Official Analytical Chemists, Washington, DC, 1985.
69. Wheeler DJ. Keeping Control Charts. Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN 37919, (615) 584-5005, 1985.
70. Wheeler DJ. Tables of Screening Designs. 2nd ed., Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN 37919, (615) 584-5005, 1989.
71. Wheeler DJ. Understanding Industrial Experimentation. Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN 37919, (615) 584-5005, 1987.
72. Wheeler DJ, Chambers DS. Understanding Statistical Process Control. Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN 37919, (615) 584-5005, 1986.
73. Wheeler DJ, Lyday RW. Evaluating the Measurement Process. Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN 37919, (615) 584-5005, 1984.
74. Wilson EB, Jr. An Introduction to Scientific Research. McGraw-Hill, New York, NY, 1952.
75. Youden WJ. Statistical Methods for Chemists. Wiley, New York, NY, 1951.
76. Youden WJ, Steiner EH. Statistical Manual of the Association of Official Analytical Chemists. Association of Official Analytical Chemists, Washington, DC, 1975.
CHAPTER 8
Experimental Design, Response Surface Methodology and Multi Criteria Decision Making in the Development of Drug Dosage Forms

D.A. Doornbos, A.K. Smilde, J.H. de Boer, and C.A.A. Duineveld
University Centre for Pharmacy, University of Groningen, The Netherlands
1. Introduction

If a patient consults a doctor he/she will have a fair chance to leave the doctor's office with a prescription. Even in The Netherlands (where the consumption of medicines is relatively low) that chance is about 50%. So almost everybody will know that a drug (the active substance) is not administered in a pure form but has been formulated into a dosage form. This may be a fast disintegrating tablet, a sustained release formulation, a suppository, an ointment, an injectable, and there are many more. These dosage forms have in common that they are prepared with several excipients, that during the production process many process variables influence the properties of the end-product and, most important, that their quality must be excellent and constant. This not only means that their chemical composition must conform to specifications, but more specifically that those properties that influence the response of the organism or the target organ must meet criteria of constant value: they must be rugged. Frequently several criteria must be met simultaneously and often some of these criteria are conflicting.

To sum up: drug dosage forms are produced from a mixture of one or more drugs and some additives, many process variables influence the quality of the dosage form, and criteria for that quality are set and must be maintained. As Stetsko [1] stated: "Pharmaceutical scientists are often confronted with the problem of developing formulations and processes for difficult products and must do so in spite of competing objectives. Pressures, placed on the scientist to balance variables and meet these objectives, can be compounded when limited funds, time and resources require rapid and accurate development activities. Statistical experimental design provides an economical way to efficiently gain the most information while expending the least amount of experimental effort."
TABLE 1 Factors and optimization criteria for tablets (direct compression).
Compositional factors (quantitative and/or qualitative): filler, binder, disintegrant, lubricant

Process variables: tablet machine, compression force, mixing time, relative humidity, glidant (y/n)

Criteria: crushing strength, friability, disintegration time, dissolution rate, tablet weight variance, keepability, robustness of these properties to variation of compositional and process variables
2. Drug dosage forms, factors and criteria

For each type of dosage form a list of the most important composition and process variables can be composed. Moreover, for each dosage form optimization criteria can be given. Some examples are given in Table 1. It can be expected that some of the factors, or all of them, interact. This means that the magnitude of the response of a criterion value to a change of the level of a certain factor depends on the level of one or more other factors (a second, third or higher order interaction). In such a case a univariate search for an optimum level of the (compositional or process) factors does not guarantee a real optimum. A multivariate search is indicated: one does not optimize the factors separately one after the other, but the factor levels are varied according to a pre-planned design in a sequential or in a simultaneous mode.

In pharmaceutical literature some examples can be found of the application of the sequential Simplex Method, as advocated by Spendley et al [2], and the Modified Simplex Method as proposed by Nelder and Mead [3]. Although this is an efficient method if factors are optimized for one single criterion, provided that there are no multiple optima, it is not first choice if the optimization aims at a number of criteria simultaneously, as is often the case with drug dosage forms. In High Performance Liquid Chromatography it was tried to solve the multicriteria problem by combining single criteria like Resolution, Analysis time and Number of peaks into a composite criterion, e.g., the well known Berridge Criterion [4], but even combination of so few single responses gives rise to an ambiguous solution, as has been shown by Debets et al. [5]. This will be even more the case in drug formulation studies.
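For readers who want to experiment with the sequential simplex idea mentioned above, the Nelder-Mead algorithm is available in standard numerical libraries. The two-factor response below (with an interaction term) is purely hypothetical; it only illustrates how a simplex search homes in on a single-criterion optimum, not how the cited formulation studies were performed.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-factor response with an interaction term, to be minimized
# (e.g., disintegration time as a function of compression force and binder level).
def response(x):
    force, binder = x
    return (force - 12.0) ** 2 + (binder - 3.0) ** 2 + 1.5 * (force - 12.0) * (binder - 3.0)

# Start the simplex search from an arbitrary point in factor space.
result = minimize(response, x0=np.array([8.0, 1.0]), method="Nelder-Mead")
print("optimum factor levels:", result.x, "  criterion value:", result.fun)
```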
3. Examples of the use of sequential simplex, factorial design, and mixture design techniques in drug formulation studies
3.1 The sequential simplex

An example of the use of a combined criterion, used in a Modified Simplex optimization of a tablet formulation, can be found with Bindschaedler and Gurny [6]. They studied the effect of compression force and Avicel/Primojel ratio on a criterion that was a weighted sum of tablet hardness and disintegration time. Zierenberg [7] used the Modified Simplex method to study the release of clenbuterole (a single criterion) from a polyacrylate system for transdermal application. The factors were concentration of clenbuterole and film thickness. In both studies a good formulation was achieved, but it must be recognized that they worked with simple two-factor systems, the Simplex being a triangle that can be followed visually in two-dimensional space, while in most formulation studies far more factors are involved.
3.2 Factorial design and fractional factorial design

Quite a few studies on the use of factorial design in formulation research have been reported. In the period 1981-1989 we counted ca 60 papers. In Tables 2, 3 and 4 the objects of these studies are summarized. Among the designs used are full factorial designs with 2, 3, 4, 5 or even 6 factors, mostly at 2 levels, but incidentally at 3 or 4 levels. There were some designs with 3, 4 or 5 factors at 2 levels and another factor at 3 levels. The smallest design was a 2² factorial design; the largest number of experiments was 96 in a 2⁵·3 design. In some studies a factorial design was augmented with a Star Design to give a Central Composite Design.

TABLE 2
Dosage forms optimized in papers 1981-1989: fast disintegrating tablets, effervescent tablets, slow-release tablets, sintered tablets, lozenges, capsules, solid dispersions, suspensions, solutions.

TABLE 3
Unit operations studied in papers 1981-1989: tabletting by direct compression, granulation, microencapsulation, filmcoating.

TABLE 4
Optimization criteria for tablets: crushing strength, friability, disintegration time, weight variation, dissolution behaviour, bioavailability, chemical stability, solubility.

There are two papers that deserve special mention. In a recently published study Chowhan et al [8] evaluated the effect of 4 process variables at 3 levels each on friability, dissolution, maximum attainable crushing strength and tablet weight variation. In a full 3⁴ factorial design 81 experiments must be performed, but with the aid of a computer program for Computer Optimized Experimental Design (COED) the 22 most informative factor combinations were selected. Quadratic response surface models were calculated for each criterion. Optimum regions were found by pairwise overlaying the individual contour plots. The other one is the recent paper by Chariot et al [9] in which they describe the way they selected a set of experiments from a factorial lay-out according to a D-optimal design, using the program NEMROD. Instead of the 72 experiments according to a 2³·3² factorial design, 11 experiments were selected, allowing the estimation of a regression equation with 5 linear, 2 quadratic and 3 two-factor interaction terms.

In some papers factorial designs have been fractionated to 2⁴⁻¹, 2⁵⁻² or 2⁶⁻³ designs, thereby strongly reducing the experimental effort at the expense of information about higher order interactions. In most studies the factorial designs were only used to identify the factors that most significantly influenced the response under study. But in a number of cases a mathematical model for each of the responses was postulated and the model fitted to the data, thus applying Response Surface Methodology. A response surface depicts one response, e.g., tablet crushing strength, as a function of some independent variables, e.g., compression force and concentration of binder, in general compositional and process variables. The goal of response surface studies is to obtain a regression model that provides a means of mathematically evaluating changes in the response due to changes in the independent variables. Mostly experimental models are used, polynomials of first but preferably second or third order to describe curvature of the (hyper)surface. From the response surface (for two factors a three-dimensional surface results, mostly depicted as a two-dimensional contour plot), optimal regions for the responses can be predicted. The prediction error will depend on the chosen design, on the error in the measured response at the design points and on the quality of the fit of the postulated model.
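As a minimal sketch of the Response Surface Methodology step described above, the following fits a full second-order polynomial to a hypothetical 3x3 factorial experiment and scans a grid for the optimal region. The factor names, levels and response values are invented for illustration and are not taken from the cited studies.

```python
import numpy as np

# Hypothetical 3x3 factorial data: crushing strength (kg) at three compression
# forces (kN) and three binder concentrations (%).  Values invented.
force  = np.array([10, 10, 10, 15, 15, 15, 20, 20, 20], dtype=float)
binder = np.array([ 1,  2,  3,  1,  2,  3,  1,  2,  3], dtype=float)
y      = np.array([4.1, 5.0, 5.3, 5.8, 7.1, 7.4, 6.2, 7.8, 8.5])

# Full second-order model: b0 + b1*F + b2*B + b3*F^2 + b4*B^2 + b5*F*B
X = np.column_stack([np.ones_like(force), force, binder,
                     force**2, binder**2, force * binder])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict over a grid to locate the optimal region (the contour-plot step).
F, B = np.meshgrid(np.linspace(10, 20, 21), np.linspace(1, 3, 21))
grid = np.column_stack([np.ones(F.size), F.ravel(), B.ravel(),
                        F.ravel()**2, B.ravel()**2, F.ravel() * B.ravel()])
pred = grid @ coef
best = pred.argmax()
print("fitted coefficients:", np.round(coef, 3))
print("predicted optimum near force =", F.ravel()[best], "kN, binder =", B.ravel()[best], "%")
```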
3.3 Mixture designs

The above mentioned studies with factorial and fractional factorial designs have in common that, with some exceptions, only the effects of process variables and qualitative compositional variables have been studied. However, in drug formulations in most cases quantitative compositional variables have significant effects. In those cases where quantitative compositional variables have been studied they were treated as factorial variables,
Figure 1. Contour plot and levels of crushing strength (kg) of placebo tablets containing sodium starch glycolate as a disintegrant, compressed at 10 kN (left) or 20 kN (right).

Figure 2. Contour plot and levels of disintegration time (s) of placebo tablets containing sodium starch glycolate as a disintegrant, compressed at 10 kN (left) or 20 kN (right).
thereby neglecting the advantages of the mixture design approach. In Mixture Design Methodology use is made of the fact that the fractions of the components sum up to one:

    φ1 + φ2 + φ3 = 1

Given a polynomial of a certain degree, fewer experiments than would be necessary with a factorial design suffice for the estimation of the coefficients of that polynomial, using this mixture constraint.
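A small sketch of how the mixture constraint is exploited in practice: for a three-component mixture, a Scheffé-type quadratic canonical polynomial needs only six coefficients, which can be estimated from a simplex-centroid design. The components and response values below are invented for illustration.

```python
import numpy as np

# Hypothetical three-component filler mixture (fractions sum to one);
# the response could be tablet crushing strength.  Values are invented.
phi = np.array([            # phi1, phi2, phi3
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5],
    [1/3, 1/3, 1/3],
])
y = np.array([4.0, 6.5, 5.0, 6.8, 4.2, 7.1, 6.0])

# Scheffe quadratic canonical polynomial: because phi1 + phi2 + phi3 = 1,
#   y = b1*phi1 + b2*phi2 + b3*phi3 + b12*phi1*phi2 + b13*phi1*phi3 + b23*phi2*phi3
# needs no separate intercept or squared terms.
X = np.column_stack([phi[:, 0], phi[:, 1], phi[:, 2],
                     phi[:, 0] * phi[:, 1], phi[:, 0] * phi[:, 2], phi[:, 1] * phi[:, 2]])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted Scheffe coefficients:", np.round(coef, 2))
```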
Figure 3. Contour plot and levels of friability (%) of placebo tablets containing sodium starch glycolate as a disintegrant, compressed at 10 kN (left) or 20 kN (right).

Figure 4. Combined contour plot of crushing strength, disintegration time and friability of placebo tablets containing sodium starch glycolate as disintegrant, compressed at 10 kN (left) or 20 kN (right).
Up till now ca 10 papers have appeared on the use of mixture designs in drug formulation studies and in studies into solubility of drugs in mixed solvents; this is relatively few if compared with the large number of publications on applications of mixture designs in chromatography, in particular liquid chromatography. An example can be found with Van Kamp et al [10, 11], who studied several combinations of disintegrants, filler-binders and fillers as pseudocomponents with drugs and who added compression force as a process variable.
In our research group we developed software for the optimization of liquid- and thin-layer chromatographic separations, the program POEM [12]. In cooperation with the department of Pharmaceutical Technology of our University Centre for Pharmacy we developed at the same time the program OMEGA [13] for optimization of drug formulations using the mixture design technique for binary, ternary or quaternary mixtures or mixtures with pseudocomponents. In both programs several statistical criteria can be selected to evaluate the quality of the models that can be chosen: the programs offer the choice between linear, quadratic, special cubic and cubic models. Moreover one can choose between the use of one single criterion, combined criteria (only for POEM) or the use of the MCDM technique (see below).
4. Sequential designs versus simultaneous designs

One must bear in mind that in Response Surface Methodology based on simultaneous designs, each response will be represented by a separate response surface. If more criteria are deemed important, then in each design point all criteria should be measured and modelled, if necessary with models of different order. This will result in a number of regression equations and the corresponding response surfaces. These equations describe the whole factor space studied. The advantages of the use of Response Surface Methodology over sequential methods like Simplex are:
- knowledge of the dependence of criteria on factors will be obtained over the whole factor space studied
- without extra experimental effort as many criteria can be studied as deemed important
- optimal design theory allows an optimal spread of design points over the factor space.
5. Combinations of mixture and factorial designs: sparse designs

If not only process variables but also compositional variables influence the responses, with the ultimate possibility of all variables interacting, a combined design must be used, as shown in Figure 5. This will increase the experimental effort considerably, unless efficient fractionation can be accomplished. So far in the literature on formulation research only factorial designs have been fractionated; we thought it would be challenging to develop fractionated combined mixture-factorial designs with a concomitant hierarchy of polynomial models.

Figure 5. Combined mixture-factorial design for a three-component mixture with two process variables. Extra design points are used to judge quality of fit.
TABLE 5
Simple scheme for model choice with mixture factorial designs.
Mixture terms (rows): 1, m1, m2, m3, m1*m2, m1*m3, m2*m3, m1*m2*m3; factorial terms (columns): 1, f1, f2, f1*f2. Crosses mark which products of mixture and factorial terms enter each model.
Figure 6. Starting point for projection and rotation to find sparse designs. A 2³ factorial design is projected onto the plane containing the mixture triangle.
Figure 7. Contraction of design points outside the mixture triangle to the boundaries of the triangle.
Figure 8. The combination design resulting from a 2⁵⁻¹ design, using the contraction pictured in Figure 7. Left-right and lower-upper triangles represent the factor levels -1 and +1.
Figure 9. Rotation and contraction of all design points to the boundaries of the triangle to construct a sparse design.
From Table 5 the models can be constructed. Optimality of the developed designs should be judged; we will use the measures G, V and D optimality. We will restrict our research to four-component mixtures and three process variables. In Figures 6 to 10 two suggested strategies are shown for a three-component mixture and two process variables. Figure 6 shows the starting point, Figure 7 a successive projection and Figure 8 the resulting sparse design. The second strategy is shown in Figure 6 and Figure 9, a rotation over 30° followed by a projection. Figure 10 shows the resulting sparse design.
Figure 10. The combination design resulting from a 2⁵⁻¹ design, using the rotation and contraction pictured in Figure 9. Left-right and lower-upper triangles represent the factor levels -1 and +1.
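The D-optimality measure mentioned above can be sketched as follows for a combined mixture-factorial candidate design. The model (linear Scheffé mixture terms crossed with one two-level process variable) and the design points below are illustrative assumptions, not the sparse designs of Figures 6 to 10.

```python
import numpy as np
from itertools import product

# Model matrix for an assumed combined model: Scheffe linear mixture terms
# (m1, m2, m3) crossed with one two-level process variable f.
def model_matrix(points):
    rows = []
    for m1, m2, m3, f in points:
        rows.append([m1, m2, m3, m1 * f, m2 * f, m3 * f])
    return np.array(rows)

# Candidate design: the three mixture vertices plus the centroid, each run at
# both levels of the process variable (8 runs in total).
mixture = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1/3, 1/3, 1/3)]
design = [(m1, m2, m3, f) for (m1, m2, m3), f in product(mixture, (-1, +1))]

X = model_matrix(design)
p = X.shape[1]
d_value = np.linalg.det(X.T @ X) ** (1.0 / p)   # D-criterion (larger is better)
print(f"{len(design)} runs, D-criterion per model term: {d_value:.3f}")
```

Competing fractionations of the same candidate set can be ranked by recomputing this quantity for each of them.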
6. Criteria; multi criteria decision making

As was said in one of the preceding sections, in optimization studies of drug formulations often several criteria must be met simultaneously. Combined criteria do not offer an unambiguous solution to the multicriteria problem; the same value of the criterion can be found with an indefinite number of combinations of the controllable factors. The solution we have chosen for chromatography [Smilde et al. 14] as well as for drug formulation studies [de Boer et al. 15] is Multi Criteria Decision Making, based on the concept of Pareto-optimality. The MCDM method does not make preliminary assumptions about the weighting factors; the various responses are considered explicitly. MCDM makes provisions about mixtures in the whole factor space, therefore it cannot be used in combination with sequential optimization methods. It can easily be understood for two criteria and a three-component mixture. After the selection of models for the criteria and regression on the values of the criteria in the design points, the factor space, a triangle, is scanned with a predefined step-size, e.g., 2%. In each scan point both criteria are calculated and their values used as coordinates in a two-dimensional graph. From the resulting set of points (each of them representing a mixture composition) the Pareto-optimal points are selected: a point is Pareto-optimal if there exists no other point in the design space which yields an improvement in one criterion without causing a degradation in the other. By evaluating quantitatively the pay-off between the two criteria a choice can be made between the PO points and the mixture compositions belonging to them. The MCDM method was implemented in the programs POEM and OMEGA. Using the technique then proved to have a slight drawback. In the conventional MCDM method
Figure 11. Target-MCDM: target value for disintegration time set at 200 s, crushing strength maximized.

Figure 12. Tolerance-MCDM: the MCDM plot.
are not PO but have a predefined maximum deviation in the criterion values compared to the PO points.
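The Pareto-optimal selection at the heart of the MCDM step can be written down compactly. In the sketch below the two criteria and their values at the scan points are invented; the first criterion is to be maximized and the second minimized, as for crushing strength and disintegration time.

```python
import numpy as np

# Hypothetical scan of a mixture triangle: each row holds two criterion values
# for one candidate composition, e.g. (crushing strength, disintegration time).
criteria = np.array([
    [5.2, 310.0],
    [6.1, 250.0],
    [6.0, 180.0],
    [4.8, 150.0],
    [6.4, 400.0],
    [5.9, 175.0],
])

def is_pareto_optimal(points):
    """points: rows of (maximize, minimize) criteria; returns a boolean mask."""
    flags = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        for q in points:
            # q dominates p if it is at least as good on both criteria
            # and strictly better on at least one of them.
            if (q[0] >= p[0] and q[1] <= p[1]) and (q[0] > p[0] or q[1] < p[1]):
                flags[i] = False
                break
    return flags

mask = is_pareto_optimal(criteria)
print("Pareto-optimal points:\n", criteria[mask])
```

The compositions belonging to the retained points are then compared by inspecting the pay-off between the two criteria, as described above.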
7. Conclusion

Sequential optimization strategies for drug dosage forms have found limited applicability; factorial and mixture designs however have successfully been used. Promising new efficient combined factorial and mixture designs are being developed. The
MCDM to be used with simultaneous optimization strategies allows decisions to be made based on the pay-off between optimization criteria. Quality of drug formulations will be improved using these techniques.
Acknowledgement

We are indebted to G.E.P. Box and J.S. Hunter for providing us with the idea of projection designs (Fig. 6).
References

1. Stetsko G. Drug Dev Ind Pharm 1986; 12: 1109-1123.
2. Spendley W, Hext GR, Himsworth FR. Technometrics 1962; 4: 441.
3. Nelder JA, Mead R. Comput J 1965; 7: 308.
4. Berridge JC. J Chromatogr 1982; 244: 1.
5. Debets HJG, Bajema RL, Doornbos DA. Anal Chim Acta 1983; 151: 131.
6. Bindschaedler C, Gurny R. Pharm Acta Helv 1982; 57(9): 251-255.
7. Zierenberg B. Acta Pharm Techn 1985; 31(1): 17-21.
8. Chowhan ZT, Amaro AA. Drug Dev Ind Pharm 1988; 14(8): 1079-1106.
9. Chariot M, Lewis GA, Mathieu D, Phan-Tan-Luu R, Stevens HNE. Drug Dev Ind Pharm 1988; 14(15-17): 2535-2556.
10. Van Kamp HV. Thesis, Groningen, 1987.
11. Van Kamp HV, Bolhuis GK, Lerk CF. Pharm Weekblad Sci Ed 1987; 9: 265-273.
12. Predicting Optimal Eluens Composition (POEM), Copyright Research Group Chemometrics (Head: Prof DA Doornbos), University Centre for Pharmacy, Groningen, The Netherlands. (Demo available.)
13. Optimal Mixture Evaluation with Graphical Applications (OMEGA), Copyright Research Group Chemometrics (Head: Prof DA Doornbos), University Centre for Pharmacy, Groningen, The Netherlands. (Demo available.)
14. Smilde AK, Knevelman, Coenegracht PMJ. J Chromatogr 1986; 369: 1.
15. De Boer JH, Smilde AK, Doornbos DA. Acta Pharm Technol 1988; 34(3): 140-143.
CHAPTER 9
The Role of Exploratory Data Analysis in the Development of Novel Antiviral Compounds P.J. Lewi, J. Van Hoof, and K. Andries Janssen Research Foundation, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium
1. Introduction

Exploratory data analysis is a part of the inductive approach to scientific discovery. Where deductive methods proceed from established models to planned experiments and the collection of new experimental facts, induction works the other way around. Here data are collected and analysed in order to obtain a tentative working hypothesis. Induction and deduction are complementary. Once a working hypothesis has been obtained by induction, it must be verified and confirmed by independent investigators. Eventually it may gain the status of an established model and become part of the deductive process. This has been described as the 'arch of knowledge' [Oldroyd 1986]. Induction as a scientific method was founded by Francis Bacon [1620] almost at the same time when René Descartes developed his method of deduction [1637].

It appears that inductive methods are applied most profitably in those areas where formal models are lacking. This is the case in many fields of biology and medicine, especially in the search for novel therapeutic agents. The development of a new medicine requires the synthesis of some 4,000 new synthetic chemical compounds yearly, each of which has to be tested in numerous batteries of screening tests. This is essentially a Baconian approach to scientific discovery. Incidentally, it was Francis Bacon who first proposed to collect experimental observations into 'tables', which are then to be explored systematically in order to yield a 'first vintage of a law'. Although Bacon was a practising lawyer, his method led him to remarkable insights, for example that 'heat is related to motion' [Quinton 1980]. The exploratory method is also referred to as the Edisonian approach. Indeed, the development of a practical incandescent lamp by T.A. Edison seems to have been the result of several thousands of trials, rather than of theoretical considerations.

Multivariate data analysis as a method of exploration depends on the tabulation of adequate data, as well as on the visualization of relevant relationships in these data. Often
these relationships are not apparent to the unaided eye. For this reason, we need an instrument for looking into tabulated data. Such an instrument may be called a 'datascope', by analogy with the microscope and telescope. A drop of water may appear clear and limpid to the naked eye no matter how long we look at it. But under the microscope, a whole microcosmos is revealed, enough to have allowed its inventor Antoni Van Leeuwenhoek (1632-1723) to write more than 500 communications to the Royal Society in London. A datascope may be thought of as a personal computer equipped with appropriate software for multivariate data analysis. In this paper we describe how the 'datascope' has led to a relevant discovery in the search for effective antiviral drugs. The exploratory data analysis described in this paper has been performed by means of the SPECTRAMAP program. SPECTRAMAP is a trademark of Janssen Pharmaceutica NV. Information about this software can be obtained from the first author.
2. Method

The method of exploratory data analysis that is discussed here has been called spectral map analysis (SMA). Originally, SMA was developed for the visualization (or mapping) of activity spectra of chemical compounds that have been tested in a battery of pharmacological assays [Lewi 1976, 1989]. Activity spectra are represented in our laboratory in the form of bar charts representing the effective doses of a given compound with respect to the individual tests [Janssen, 1965]. Some compounds possess very similar spectra of activity, even if they differ in average potency. Other compounds may have widely dissimilar activity spectra, even when they have the same average potency. The problem of classifying compounds with respect to their biological activity spectra is a multidimensional problem, which can be solved by factor analytic methods.

Basically, SMA involves the following steps: (1) logarithmic transformation of a data table X with n rows and m columns, which produces Y; (2) subtraction of the corresponding row- and column-means from each element in the transformed table Y (double-centering), yielding Z; (3) calculation of the variance-covariance matrix V from Z; (4) extraction of orthogonal factors F from the variance-covariance matrix V; (5-6) calculation of the coordinates of the rows and the columns along the computed factor axes, which are represented by the factor scores S and the factor loadings L; and finally (7) biplot of the rows and columns in a plane spanned by the first two factors. In algebraic notation, the procedure can be written as:

    Y = log X                                   (1)

    Z_ij = Y_ij - Y_i. - Y_.j + Y_..            (2)

    V = (1/n) Z'Z                               (3)

    Λ = F'VF    with    F'F = I                 (4)
where I is the unit matrix of factor space and where Λ is a diagonal matrix which carries the factor variances (eigenvalues) on its principal diagonal. We assume that factors are arranged in decreasing order of their corresponding factor variances. The above algorithm is equivalent to a singular value decomposition (SVD) of the double-centered logarithmic matrix Z, as it can be shown that the table Z can be reconstructed from the factor coordinates S and L [Mandel 1982]:

    Z = S L'
The biplot of SMA is a representation of the rows and columns of the data table in a plane diagram spanned by the first two columns of S and L [Gabriel 1971]. Note that the scaling of factor scores S and of factor loadings L in steps (5-6) is symmetrical, in the sense that their variances are equal to the square roots of the factor variances (singular values).
SMA is only different from logarithmic principal components analysis [PCA, Hotelling 1933] in the second step (2) of the algorithm (Fig. 1). Ordinary PCA uses column-centering,

    Z_ij = Y_ij - Y_.j
instead of the double-centering which is applied in SMA. Although the algorithmic distinction between PCA and SMA may seem trivial, its implication is far-reaching [Schaper and Kaliszan 1987]. In SMA we obtain that both the representations of the rows and the columns of the data table are centered about the origin of factor space. In column-centered PCA we generally find that only the representations of the rows are centered about the origin of factor space. In terms of effects of compounds observed in a battery of tests, SMA corrects simultaneously for differences of average potency between compounds and for differences of average sensitivity between tests. Hence, in SMA all absolute aspects of the data are removed as a result of double-centering. What remains are differential aspects, which can be expressed in terms of ratios (as a result of the preliminary logarithmic transformation). These differential aspects are called contrasts. They refer to the specificities or preferences of the various compounds for the different tests. Vice versa, contrasts can also be understood as specificities or preferences of the
Figure 1. Schematic diagram of principal component analysis (PCA) and of spectral map analysis (SMA). The distinction, which lies in the type of centering (column-wise or both row- and column-wise), has far-reaching implications, as is explained in the text.
various tests for the different compounds. Stated otherwise, SMA analyses interactions, which are always mutual, between compounds and tests [Lewi 1989]. It is erroneous to maintain that the second and third factors of PCA are identical to the first and second factors of SMA. Indeed, the factors extracted from column-centered PCA usually contain a mixture of the size component and of the contrasts (Fig. 1). The size component accounts for differences in average potency of the compounds. In PCA, this size component cannot be readily separated from the components of contrasts, although the first component usually expresses the largest part of the size component.
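The seven SMA steps listed above translate almost directly into code. The sketch below follows those steps via a singular value decomposition; the symmetric scaling shown is one reasonable reading of steps (5-6) and may differ from the exact constants used in SPECTRAMAP, and the 6 x 4 table of inhibitory concentrations is invented.

```python
import numpy as np

def spectral_map(X, n_factors=2):
    """Spectral-map-style analysis of a strictly positive activity table X
    (rows = compounds, columns = tests), following the steps described above."""
    Y = np.log(X)                                       # (1) logarithms
    Z = (Y - Y.mean(axis=1, keepdims=True)              # (2) double-centering
           - Y.mean(axis=0, keepdims=True) + Y.mean())
    U, d, Wt = np.linalg.svd(Z, full_matrices=False)    # (3-4) factors via SVD
    # (5-6) symmetric scaling of row scores S and column loadings L
    S = U[:, :n_factors] * np.sqrt(d[:n_factors])
    L = Wt[:n_factors].T * np.sqrt(d[:n_factors])
    return S, L                                         # (7) biplot coordinates

# Example: 6 compounds x 4 tests of invented inhibitory concentrations;
# reciprocals are taken first, as in the rhinovirus application below.
rng = np.random.default_rng(1)
ic50 = rng.lognormal(mean=2.0, sigma=1.0, size=(6, 4))
scores, loadings = spectral_map(1.0 / ic50)
print("row scores:\n", scores.round(2))
print("column loadings:\n", loadings.round(2))
```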
3. Application

Common cold or influenza is caused by rhinovirus infection. There are 100 different types of rhinoviruses, each with its own antigenic characteristics. In our laboratory all 100 serotypes of rhinoviruses have been tested against a panel of 15 antiviral compounds. This resulted in a table with 100 rows and 15 columns. The values in this table express the concentration of a particular substance required to inhibit half of the viral particles in a culture of a given serotype [Andries et al. 1990]. Inhibitory concentrations are inversely related to the potency of a compound in a given test or, alternatively, to the sensitivity of a test for a given compound. From these data it appeared that the compounds differ strongly
Figure 2. Spectral map derived from a table of 100 viral serotypes (hexagons) and 15 compounds (squares). Areas of hexagons and squares are proportional to the average sensitivity of the serotypes and to the average potency of the compounds. Serotypes and compounds that are at a distance from the center and in the same direction show high contrasts as a result of their specific interaction. The separation of the 100 serotypes into two distinct groups formed the basis of a new hypothesis for the mode of interaction of antiviral compounds with the rhinoviruses.
in their activity spectra against the 100 serotypes, and this irrespective of their average potency. Looking at the data, it was not clear why some of the compounds were active against a particular group of serotypes while leaving the other serotypes intact, and why some other compounds had no effect on the former group while being active against the latter. This, of course, presents an ideal problem for SMA, since the interest is in specific interactions of therapeutic agents with biological systems, independently of the average potency of the compounds and independently of the average sensitivity of the viral serotypes. The data were analyzed according to the SMA method described above after transformation of the original data into reciprocal values. This is required by the fact that inhibitory concentrations of the compounds are inversely related to their antiviral activity. The resulting biplot is shown in Figure 2.

Three reading rules apply to this SMA biplot. First, hexagons refer to the 100 rows (serotypes) of the table, while the squares identify the 15 columns (compounds). Second, areas of the hexagons are proportional to the average sensitivity of the viruses, while the areas of the squares are proportional to the average potency of the compounds. Most importantly, the third rule defines the positions of the hexagons and squares in the biplot.
Those serotypes that are selectively destroyed by a particular compound will be attracted by it. Those serotypes that are left untouched by the compound will be repelled by it. Similarly, because of the mutuality of interactions, compounds are attracted on the biplot by a serotype against which they are selectively active, while they are repelled by a serotype that is not inhibited by them. The center of the biplot is indicated by a small cross near the middle of the plot and represents the point which is devoid of contrast. Compounds that are close to the center are active against most of the 100 serotypes. The further away from the center, the greater the specificity of compounds for the various serotypes, and vice versa. Compounds that are at a great distance from one another show large contrasts. Conversely, serotypes that are far apart also exhibit large contrasts. The two-dimensional biplot of Figure 2 accounts for 70 percent of the total variance in the logarithmically transformed and double-centered data table.
4. Result and discussion

The most striking feature of the biplot of Figure 2 is the separation of the 100 serotypes into two distinct groups. This immediately suggested the existence of two classes of serotypes. Left from the center we find a group of serotypes which is more sensitive to the compounds displayed at the left (including the WIN compound). On the right from the center we observe a larger group of serotypes which is more sensitive to compounds on the right (among which the MDL and DCF compounds). When the chemical structures of the individual compounds were compared with their position on the biplot, it appeared that molecules on the left contain long aliphatic chains, while those on the right possess polycyclic structures. Molecules that were active against most or all serotypes and which appeared near the center of the map possess both features, i.e., at the same time an aliphatic and a cyclic part.

It has been established that antiviral compounds bind to a hydrophobic pocket inside the protein envelope of the virus [Andries 1990]. From our exploratory analysis it was induced that there are two different types of drug-binding pockets which have evolved from a common ancestor. One type of pocket is more elongated and is present in the leftmost group of serotypes. The other type of pocket is wider and appears in the rightmost group of serotypes. This working hypothesis is supported by differences among the amino-acid sequences of the proteins that line the walls of the two pockets. It is also strengthened by differences between the clinical symptoms that are associated with common cold infections produced in the two groups of serotypes. All these observations tend to confirm the existence of two drug-defined groups of rhinoviruses [Andries 1990]. A practical implication of the two-group hypothesis is the clear directions that can be given to organic chemists for synthesis of appropriate compounds that can bind to both types of rhinoviruses. Another benefit lies in the rational selection of a small subset of
rhinoviruses which can serve as an effective reduced screening panel. This greatly simplifies the work involved in screening newly synthesized antiviral compounds. The inductive exploratory approach also demonstrated that synthetic drugs can produce relevant insight into the structure of large proteins.
References

Andries K, Dewindt B, Snoeks J, Wouters L, Moereels H, Lewi PJ, Janssen PAJ. Two groups of rhinoviruses revealed by a panel of antiviral compounds present sequence divergence and differential pathogenicity. J Virology 1990; 64: 1117-1123.
Bacon F. Novum Organon Scientiarum. 1620. Modern edition: London: William Pickering, 1899.
Descartes R. Discours de la méthode pour bien conduire sa raison et pour découvrir la vérité dans les sciences. La dioptrique, les météores et la géométrie. Jan Maire, Leyden, 1637.
Gabriel KR. The biplot graphic display of matrices with applications to principal components analysis. Biometrika 1971; 58: 453-467.
Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933; 24: 417-441.
Janssen PAJ, Niemegeers CJE, Schellekens KHL. Is it possible to predict the clinical effects of neuroleptic drugs (major tranquilizers) from animal data? Arzneim Forsch (Drug Res) 1965; 15: 104-117.
Lewi PJ. Spectral Map Analysis. Analysis of contrasts, especially from log-ratios. Chemometrics Intell Lab Syst 1989; 5: 105-116.
Lewi PJ. Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim Forsch (Drug Res) 1976; 26: 1295-1300.
Mandel J. Use of the singular value decomposition in regression analysis. The Am Statistician 1982; 36: 15-24.
Oldroyd D. The arch of knowledge. An introductory study of the history of philosophy and methodology of science. New York, NY: Methuen, 1986.
Quinton A. Francis Bacon. Oxford: Oxford University Press, 1980.
Schaper K-J, Kaliszan R. Application of statistical methods to drug design. In: Mutschler E, Winterfeldt E, eds. Trends in Medicinal Chemistry. Proc 9th Int Symp Med Chem, Berlin. Weinheim, FRG: VCH, 1987.
CHAPTER 10
Some Novel Statistical Aspects of the Design and Analysis of Quantitative Structure Activity Relationship Studies

D.M. Borth and M.A. Dekeyser
Research Laboratories, Uniroyal Chemical Ltd, P.O. Box 1120, Guelph, Ontario, N1H 6N3, Canada
Abstract

A case study in developing Quantitative Structure Activity Relations (QSAR) is presented. (QSAR involves using statistical techniques to correlate molecular structure with biological activity.) Two novel statistical aspects of the study are emphasized: (i) the use of a new technique for selecting molecules to synthesize and test in order to maximize the structure vs. activity information for a given amount of chemical synthesis effort, (ii) the use of a statistical technique called censored data regression to include information from molecules for which only a lower bound on the measure of biological activity (ED50) was available.
1. Introduction

2,4-Diphenyl-1,3,4-oxadiazin-5-ones have good acaricidal activity [Dekeyser et al., 1987], especially against the twospotted spider mite, Tetranychus urticae, which is a serious plant pest. This paper is concerned with the development of the quantitative structure-activity relationships (QSAR) required to obtain the most acaricidally active member in 2-(4-methylphenyl)-4-(substituted)phenyl-1,3,4-oxadiazin-5-ones (Fig. 1). Acaricidal activity in this series was first discovered with the compound in Figure 1, with R = H.

Figure 1. Chemical structure to be optimized by choice of substituent R.

A previous QSAR study revealed
that replacement of the methyl (CH3) group resulted in greatly reduced acaricidal activity; thus, our attention was devoted to introducing various substituents, R, in place of H. The number of possible substituents is very large and it would be impractical to synthesize and test them all. This was the basic reason for adopting the QSAR approach. The strategy in the QSAR study reported in this paper was (i) to select a statistically meaningful yet synthetically practical subset of substituents, (ii) synthesize the various analogues, (iii) biologically test the compounds made, (iv) statistically analyze the data to develop an equation relating biological activity to the physicochemical parameters of the substituents, (v) use the equation to predict the activity of the analogues not yet made, and (vi) synthesize and test a number of analogues (including those predicted to be most active) to validate the predictions. The emphasis in this paper is on some novel aspects of items (i) and (iv). More details of the entire study are reported by Dekeyser and Borth [1990].
2. Substituent selection

A list of 433 substituents of known physicochemical parameter values were rated according to difficulty of synthesis. Of the 433 potential analogues, 78 were judged to be impractical to synthesize. The remaining 355 compounds were rated for difficulty on a scale from 1 to 8, with a 1 rating indicating a relatively easy analogue to synthesize. (A rating of 8 indicates that a compound is expected to take about 8 times as long to synthesize as a compound with a rating of 1.) The primary sources for the physicochemical parameters were Hansch and Leo [1979] and Exner [1978]. For some substituents, values not previously reported were obtained by interpolation and extrapolation from values for closely related substituents [Relyea, 1989].

From this list a set of 20 compounds were chosen for synthesis, using a method of selection developed by Borth et al. [1985]. The selection method took into account: (i) the estimated difficulty of synthesis for each potential analogue, (ii) the amount of information (defined as expected change in statistical entropy) which each potential analogue provides for making predictions of activity over the whole set of 355 practical mono-substituted analogs. Essentially, the selection method is based on balancing statistical efficiency against synthesis difficulty, yielding the most information for a given cost. That is, some selections are better statistically, in that they will provide better coverage of the parameter space, and will allow more accurate predictions of activity for the 355 - 20 = 335 analogues not actually synthesized. However, a selection based on statistical efficiency alone may result in a heavy weighting towards compounds which are difficult to synthesize. The set of analogues chosen for synthesis consists of the first twenty in Table 1. This table also gives the relative synthesis difficulties for each compound as well as the QSAR parameters and the ED50's. (Of course, the ED50's were not known at the time that the substituents were selected.) The average difficulty rating is 1.5, and the maximum is 4.
The average difficulty rating over the entire set of feasible compounds was 6.32. Also, for the sake of comparison, a selection was made assuming equal synthesis difficulty for all analogues. This selection is given in Table 2. The average difficulty rating for the compounds chosen in this way is 6.25, which is more than 4 times the average for Table 1.

TABLE 1
Physical data, synthesis difficulty rating and efficacy data for compounds.
Columns: No. (1-37), Substituent, Difficulty rating, Π, F, R, MR, Hd, ED50.

Substituent: m-NO2, p-NO2, p-OCH3, o-SO2C6H5, o-F, p-F, o-Br, p-Br, m-OCH2C6H5, p-OCH2C6H5, p-C2H5, o-NH2, m-NH2, p-OH, m-OCH3, p-SO2CH3, m-F, o-Cl, o-C2H5, m-C2H5, m-SCH3, p-C6H5, p-COOC2H5, p-COOH, m-CH2C6H5, m-C6H5, p-OCONHCH3
Difficulty rating: 1 1 1 1 2 1 1 1 1 2 2 2 2 2 8 1 4 1 1 2 2 2 2 8 8 8 8 8
Π: 0.00 0.84 0.52 0.60 2.01 2.39 2.51 -1.30 -5.96 0.11 0.11 0.22 -0.03 0.27 0.00 0.15 0.84 1.19 1.66 1.66 1.10 -1.40 -1.29 -0.61 0.12 -1.20 0.22 0.76 1.39 0.99 0.64 1.74 0.46 -0.32 2.01 1.92 -0.42
F: 0.00 -0.05 -0.04 -0.04 -0.07 0.10 -0.17 0.02 0.87 0.84 0.66 0.67 0.26 0.71 0.54 0.43 0.55 0.44 0.21 0.21 -0.05 0.02 0.02 0.29 0.25 0.54 0.43 0.51 -0.06 -0.05 0.19 0.08 0.33 0.33 -0.05 0.08 0.41
R: 0.00 -0.11 -0.04 -0.13 0.00 -0.07 0.04 -0.68 -0.02 0.09 0.04 0.11 -0.53 0.12 -0.32 -0.37 -0.18 -0.21 -0.15 -0.43 -0.10 -0.59 -0.24 -0.66 -0.18 0.18 -0.13 -0.16 -0.09 -0.04 -0.07 -0.09 0.12 0.12 0.00 -0.03 -0.15
MR: 0.0 4.7 4.7 4.7 29.0 24.3 25.7 4.2 20.2 6.0 6.0 6.0 6.5 32.2 4.4 -0.4 7.6 7.6 30.7 30.7 9.4 4.2 4.2 1.5 6.5 12.5 -0.4 4.8 9.4 9.4 13.0 24.3 16.2 5.9 29.0 24.3 15.3
Hd: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1
ED50: 152 >10,000 33 123 66 25 905 45 3 >1,000 >10,000 >10,000 >823 29 655 917 425 >1,000 171 31 826 151 >1,000 >1,000 >1,000 728 714 1,000 >1,000 120 116 117 902 >2,251 >4,000 30 34 >1,000
The statistical efficiency of the two selections can be evaluated in various ways. One criterion is the relative average standard error of prediction over the set of 355 feasible compounds. For the selection in Table 1 this is 0.76, vs. 0.51 for the selection in Table 2. Another criterion is the maximum relative standard error of prediction over the set of 355 feasible compounds. For the selection in Table 1 this is 1.45, vs. 0.700 for the selection in Table 2. Thus if the selection in Table 2 were used, the average synthesis difficulty would increase by a factor of 4, while the average standard error of prediction would decrease by 33% and the maximum standard error of prediction would decrease by 52%. Hence the loss in statistical efficiency is much less than the gain in ease of synthesis. Note, also, that as shown in Borth et al. [1985], the same statistical efficiency with less total synthesis effort could have been achieved by adding more easy-to-synthesize compounds. This was not done because the statistical efficiency of the 20 compounds selected was deemed to be adequate.

TABLE 2
Selection and actual difficulty ratings which result from basing selection on statistical efficiency alone. (Columns: Substituent, Difficulty rating.)

The following is a brief description of the mathematical method used for substituent selection. Let x_i be the row vector (corresponding to the ith substituent) such that the proposed mathematical model relating structure to activity is Log(1/ED50_i) = x_i β, where β is the vector of unknown coefficients to be estimated. (For a more specific definition of x_i, see Equation 3 and the following discussion in the next section.) The selection process is iterative, as shown in Figure 2, and is based on the following criterion for adding a substituent to the selection:

    (expected entropy change due to adding substituent i) / D_i        (1)
and the following closely related criterion for deleting a compound from the selection:
    (expected entropy change due to deleting substituent i) / D_i      (2)

where D_i is the difficulty rating for the ith substituent and X_i is a matrix whose rows are the vectors x_1, x_2, ..., x_i corresponding to the substituents currently selected. The numerators of the above expressions are equal to the statistical information (specifically the expected entropy change) due to adding or deleting the ith
substituent to or from the group of substituents already selected. Note that the quantity

    x_i (X_{i-1}' X_{i-1})^{-1} x_i'

is proportional to the standard error of prediction for the ith substituent [Draper and Smith, 1981], assuming data is available for substituents 1, 2, ..., i-1. The constant of proportionality is σ, the residual standard deviation from the fitted model. Of course this is unknown at the substituent selection stage, but is assumed to apply equally to all substituents. Thus the addition criterion, i.e., Equation 1, has the intuitively appealing property that in considering substituents of equal synthesis difficulty, the one for which the uncertainty of prediction is the greatest, given the substituents already in the selection, will be chosen. As discussed by Borth [1975], this is related to a principle of the statistical design of experiments which applies quite generally. Similar considerations apply to the deletion criterion, i.e., Equation 2. As shown in Figure 2, the selection process is an iterative one in which the selection is improved by alternately adding and deleting substituents. The starting point for this process was generated by clustering the 355 feasible substituents into 20 (the number of substituents to be selected) based on the elements of the vector x_i for each substituent. The initial selection was the substituent in each cluster with the lowest difficulty rating. In the case of a tie, the substituent closest to the cluster center was chosen. The SAS procedure FASTCLUS [SAS Institute, 1988] was used for this purpose. Full details of the statistical justification of the selection method are given by Borth et al. [1985].

Figure 2. Flow diagram of substituent selection process (cluster analysis for the initial selection; then add the analogue with maximum, and delete the analogue with minimum, value of the criterion).
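A sketch of one pass of the addition step is given below. The exact expression used by Borth et al. [1985] is not reproduced here; the information gain is written in the common log-determinant (entropy-change) form as an assumption, and the candidate pool, model columns and difficulty ratings are invented.

```python
import numpy as np

def info_per_difficulty(x_new, X_current, difficulty):
    """Assumed form of the addition criterion: expected entropy change from
    adding the candidate row x_new to the current design, divided by its
    synthesis difficulty rating (constants may differ from the cited method)."""
    XtX_inv = np.linalg.inv(X_current.T @ X_current)
    gain = 0.5 * np.log(1.0 + x_new @ XtX_inv @ x_new)
    return gain / difficulty

# Invented candidate pool: 8 candidates, 3 model columns, difficulty ratings 1-8.
rng = np.random.default_rng(2)
candidates = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
difficulty = rng.integers(1, 9, size=8).astype(float)

selected = [0, 1, 2, 3]                      # some current selection
remaining = [i for i in range(8) if i not in selected]
scores = [info_per_difficulty(candidates[i], candidates[selected], difficulty[i])
          for i in remaining]
best = remaining[int(np.argmax(scores))]
print("next substituent to add:", best)
```

A deletion pass would score the selected rows analogously and drop the one contributing the least information per unit of difficulty.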
I-[
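The add and delete criteria themselves (Equations 1 and 2) are given in full by Borth et al. [1985] and are not reproduced here. As a rough MATLAB illustration of the add step only, the sketch below greedily picks the candidate whose prediction variance xi (X'X)^-1 xi', scaled by its difficulty rating, is largest given the current selection; this is a simplified stand-in for the published entropy-per-difficulty criterion, and all variable names are hypothetical.

    % Simplified greedy "add" step for substituent selection.
    % Xcand : m x p matrix of candidate x_i row vectors
    % D     : m x 1 synthesis difficulty ratings
    % sel   : indices of substituents already selected
    function best = add_substituent(Xcand, D, sel)
      Xs    = Xcand(sel, :);                 % rows already in the selection
      XtXi  = pinv(Xs' * Xs);                % (X'X)^-1, pseudo-inverse for safety
      m     = size(Xcand, 1);
      score = -Inf(m, 1);
      for i = setdiff(1:m, sel)
        v = Xcand(i, :) * XtXi * Xcand(i, :)';   % prediction variance factor
        score(i) = v / D(i);                     % crude information per difficulty
      end
      [~, best] = max(score);                % substituent to add next
    end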
3. Statistical analysis

Statistical analyses were carried out to relate Log(1/ED50) to the following physicochemical parameters: π (lipophilicity), F (inductive effect), R (resonance effect), I_HD (a zero-one indicator variable describing hydrogen-donating effect) and MR (molar refractivity). The initial data analyzed consisted of the first 28 compounds in Table 1. These compounds are the twenty selected as described in the previous section, plus 8 compounds which were made prior to the selection. However, the o-CH3 analogue was found to be
extremely inactive and did not fit the regression model; it was therefore deleted from the analysis. The remaining compounds in Table 1 were not used in developing the initial regression model, but were used to validate the model. Subsequently the model was refitted using all of the data (with the exception of the o-CH3 analogue). Some of the data in Table 1 consist of a lower bound for the ED50 rather than a specific value, because less than 50% control was exhibited at all rates tested. Statistically, such data are referred to as censored. Rather than attempting to obtain ED50's for these very inactive compounds, the data were analyzed using a special regression technique which allows for censored observations [Aitkin, 1981 and Wolynetz, 1979]. The computations were carried out using the SAS procedure LIFEREG [SAS Institute Inc., 1988] on a Compaq 386/20 computer. This technique recovers a substantial amount of the information which would be lost if the censored observations were merely deleted from the analysis. However, the reporting of the statistical analysis results differs from that of standard regression packages. Initially, the mathematical model considered for the data was as follows:
Log(1/ED50) = β0 + β1 π' + β2 (π')² + β3 F' + β4 R' + β5 (OMR)' + β6 (MMR)' + β7 (PMR)'     (3)
where the prime indicates that the variable has been standardized by subtracting the mean and dividing by the standard deviation calculated over the compounds included in the analysis. The variables OMR, MMR and PMR take on the MR value for the substituent when the substituent is in the corresponding (ortho, meta or para) position and take on the value 0 otherwise. (Equation 3 was used at the substituent selection stage. Specifically, xi = (1, πi, πi², Fi, Ri, OMRi, MMRi, PMRi) was used in Equations 1 and 2. Standardization of the variables was not required at the selection stage, although it is desirable at the analysis stage.) After preliminary data analysis, it was discovered that π and π² were not significant, but that the inclusion of the I_HD parameter as well as a term involving (PMR)'² significantly improved the fit. Thus the final model is

Log(1/ED50) = β0 + β1 F' + β2 R' + β3 (OMR)' + β4 (MMR)' + β5 (PMR)' + β6 ((PMR)')² + β7 (I_HD)'     (4)
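As an illustration of the coding used above, the following MATLAB fragment builds the position-specific MR columns and standardizes each predictor before assembling the design matrix of Equation 4. It is a sketch of the data preparation only, not the original SAS analysis, and the variable names are hypothetical.

    % pos : n x 1 substituent position code (1 = ortho, 2 = meta, 3 = para)
    % MR  : n x 1 molar refractivity of the substituent; F, R, IHD : n x 1 predictors
    OMR = MR .* (pos == 1);              % MR counted only for ortho substituents
    MMR = MR .* (pos == 2);              % ... meta
    PMR = MR .* (pos == 3);              % ... para
    Z   = [F R OMR MMR PMR IHD];         % raw predictors
    Zs  = (Z - mean(Z)) ./ std(Z);       % primed (standardized) variables
    X   = [ones(size(Z,1),1) Zs(:,1:5) Zs(:,5).^2 Zs(:,6)];   % design matrix of Eq. 4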
Table 3 summarizes the statistical analysis of compounds 1, and 3 to 28, of Table 1, as fitted to Equation 4. The mean and standard deviation are provided for converting from actual to coded values, e.g., F' = (F - 0.304)/0.305 and (PMR)'² = [(PMR - 4.014)/7.821]². Note that n = 27 and s = 0.34 (maximum likelihood estimate). Although r² is undefined for censored data regression, it is possible to estimate the value which would have been obtained without censoring. This may be done by calculating 1 minus the residual variance from Equation 4 divided by the residual variance from the model Log(1/ED50) = constant, i.e., 1 - 0.34²/0.91² = 0.86. Therefore, we estimate r = 0.93.
TABLE 3
Summary of statistical analysis of compounds 1, and 3 to 28, of Table 1, based on Equation 4.

Variable   Mean     Std. Dev.   i   βi      Std. Err.   Chi Square   P
Constant   -        -           0   -2.52   0.10        637.7        0.0001
F          0.304    0.305       1   -0.59   0.09         45.7        0.0001
R         -0.171    0.240       2   -0.48   0.11         18.3        0.0001
OMR        3.989    9.143       3    0.33   0.08         15.6        0.0001
MMR        2.663    7.009       4    0.25   0.08          9.3        0.0023
PMR        4.014    7.821       5    0.55   0.19          8.3        0.0039
PMR²       -        -           6   -0.28   0.07         16.2        0.0001
I_HD       0.148    0.362       7   -0.70   0.14         26.3        0.0001
Equation 4, with the parameter estimates given in Table 3, was validated by predicting the ED50 of 9 additional compounds which were subsequently synthesized and tested. A comparison, indicating good agreement between predicted and actual values for these compounds, is shown in Table 4. Subsequently, the equation was refitted to take all of the available data (with the exception of the o-CH3 analogue) into account. These results are shown in Table 5. The final equation has n = 36, s = 0.32 (maximum likelihood estimate) and an estimated r = 0.94. A plot of actual vs. predicted values for these 36 compounds is shown in Figure 3. Note that points plotted as + represent censored data (upper bounds on actual Log(1/ED50)). When these censored data points fall above the line they do not indicate lack of correlation with the predicted values, since the true value is known to fall below the plotted point.

TABLE 4
Results of validating Equation 4 on 9 additional compounds.

Substituent    Log(1/ED50)   Predicted Log(1/ED50)   ED50      Predicted ED50
o-C2H5         -2.08         -1.95                     120         89
m-C2H5         -2.06         -2.07                     116        117
m-CH2C6H5      -1.48         -1.45                      30         28
m-C6H5         -1.53         -1.81                      34         64
p-C6H5         -2.96         -2.64                     902        439
p-COOH         <-3.60        -4.94                  >4,000     86,151
p-COOC2H5      <-3.35        -2.92                  >2,250        839
m-SCH3         -2.07         -2.34                     117        220
p-OCONHCH3     <-3.00        -4.45                  >1,000     28,069

The following is a brief description of the methodology of censored data regression. In the absence of censored data, multiple regression analysis involves determining the values of the vector of unknown constants, β, which minimize the sum of squares between the observed data and the values predicted from the equation. In censored data regression, the parameters β and σ are chosen to maximize the logarithm of the likelihood function. Let yi = Log(1/ED50i), with the data ordered so that the nc censored observations come first, followed by n - nc non-censored observations. Also, let μi = xi β. Then the natural logarithm of the likelihood function is, up to a constant independent of β and σ, given by Equation 5 below.
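A reconstruction of Equation 5, assuming the standard form of the log-likelihood for left-censored normal observations (it reproduces the limiting behaviour described in the following paragraph), is:

$$
\ln L(\beta,\sigma) \;=\; \sum_{i=1}^{n_c} \ln \Phi\!\left(\frac{y_i-\mu_i}{\sigma}\right)
\;-\;(n-n_c)\ln\sigma\;-\;\frac{1}{2\sigma^2}\sum_{i=n_c+1}^{n}(y_i-\mu_i)^2
\qquad (5)
$$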
where Φ(z) is the left tail area of the standardized normal distribution. Examination of Equation 5 indicates that when nc equals zero the log likelihood function is maximized with respect to β by minimizing the sum of squared differences between the observed and predicted values, i.e., censored data regression is equivalent to ordinary regression, as we would expect. When nc is not zero the censored observations do play a role in determining the parameter estimates. The actual amount of information contributed by particular censored observations will vary. In general, the censored data points will contribute most to the regression when they do not fall too far above the line depicting actual vs. fitted points in Figure 3. In ordinary regression analysis, the overall significance is evaluated via the F-ratio, which may be related to the level of statistical significance by reference to tables of the F distribution. In censored data regression the analogous statistic is computed from the logarithm of the likelihood function. (See, for example, Schneider [1986]. Note, however, that his equation for the logarithm of the likelihood function relates to right-censored data, whereas we are dealing with left-censored data.)

TABLE 5
Summary of statistical analysis of compounds 1, and 3 to 37, of Table 1, based on Equation 4.
Variable   Mean     Std. Dev.   i   βi      Std. Err.   Chi Square   P
Constant   -        -           0   -2.54   0.09        718.4        0.0001
F          0.263    0.286       1   -0.55   0.07         64.3        0.0001
R         -0.134    0.221       2   -0.48   0.09         25.8        0.0001
OMR        3.253    8.125       3    0.28   0.07         18.0        0.0001
MMR        4.100    8.540       4    0.33   0.07         21.8        0.0001
PMR        4.725    8.183       5    0.39   0.15          8.3        0.0108
PMR²       -        -           6   -0.27   0.07         15.9        0.0001
I_HD       0.167    0.377       7   -0.75   0.13         32.0        0.0001
Figure 3. Actual vs. predicted activity of compounds 1, and 3 to 37, based on Equation 4 with parameter estimates given in Table 5. Note that points plotted as + which fall above the line do not indicate lack of fit, since they represent upper bounds on activity for the respective compounds.
[Figure 3 plots Actual Log(1/ED50) (y-axis, -6 to -1) against Predicted Log(1/ED50) (x-axis, -6 to -1); + indicates an upper bound on actual Log(1/ED50).]
Specifically, to assess the overall significance of parameters β1 to β7 in Equation 4, we compute the difference between the log likelihood maximized for the full model and that maximized for the restricted model Log(1/ED50) = β0. Twice this quantity is asymptotically distributed as chi-squared with 7 degrees of freedom under the null hypothesis β1 = β2 = ... = β7 = 0. For the analysis of the 36 compounds referred to in Table 5, the computed chi-square value is 64.3. The probability of obtaining so large a value under the null hypothesis is less than 0.00001, so the null hypothesis is rejected. Similarly, for the analysis of the 27 compounds referred to in Table 3, the computed chi-square value is 43.8, for which the probability value is less than 0.00001, so the null hypothesis is also rejected on this smaller data set. Note that the p values are approximate [Schneider, 1986]. It is interesting to compare the chi-square values with and without the censored data included, since this gives us some indication of the information provided by these observations. For the data analyzed in Table 3 the chi-square value with the censored data
omitted is 24.5, vs. 43.8 with the censored observations included. For the full data set of 36 observations, including the 12 censored observations increases the chi-square value from 37.5 to 64.3. The chi-square statistic has an expected value of about 7 under the null hypothesis, and the larger the chi-square value, the stronger the evidence against the null hypothesis.
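The analysis above was run with the SAS LIFEREG procedure; as a minimal illustration of the underlying computation, the MATLAB sketch below evaluates the left-censored normal log-likelihood of Equation 5 and the likelihood-ratio chi-square against an intercept-only model. The optimizer call and variable names are assumptions, not part of the original analysis.

    % y    : observed Log(1/ED50) (for censored rows this is the upper bound)
    % cens : logical vector, true where the observation is censored
    % X    : design matrix of Equation 4 (first column all ones)
    Phi   = @(z) 0.5 * erfc(-z / sqrt(2));          % standard normal left tail area
    negLL = @(th, X, y, cens) -( ...
        sum(log(Phi((y(cens) - X(cens,:)*th(1:end-1)) / exp(th(end))))) ...
        - sum(~cens) * th(end) ...
        - sum((y(~cens) - X(~cens,:)*th(1:end-1)).^2) / (2 * exp(2*th(end))) );

    % maximum likelihood fits; the last parameter is log(sigma)
    thFull = fminsearch(@(t) negLL(t, X, y, cens), [X \ y; log(std(y))]);
    thNull = fminsearch(@(t) negLL(t, ones(size(y)), y, cens), [mean(y); log(std(y))]);

    % likelihood-ratio statistic, asymptotically chi-squared with size(X,2)-1 d.f.
    chi2 = 2 * (negLL(thNull, ones(size(y)), y, cens) - negLL(thFull, X, y, cens));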
4. Results and discussion

The acaricidal activity of 37 2,4-diphenyl-4H-1,3,4-oxadiazin-5(6H)-ones was varied with the nature of the 4-phenyl ring substituents (Table 1). The structure-activity correlation for the 4-substituted phenyl derivatives was represented by Equation 4 and described in Table 5. One substituent, o-CH3, was omitted from the analysis because it showed unexpectedly poor activity. Deleting observations when fitting a QSAR equation, though not uncommon, is statistically undesirable in that it indicates the existence of some phenomenon not explained. However, the credibility of the equation for predicting the activity of new compounds is confirmed by the predictions in Table 4. It is worth noting that the o-CH3 compound has an unusually weak band at 250 cm-1 in its UV spectrum in comparison with a sample of other compounds from this study. This suggests an unusual conformation or π-electron distribution in the oxadiazine ring.
5. Conclusions

A QSAR study was carried out to optimize acaricidal activity in 2,4-diphenyl-4H-1,3,4-oxadiazin-5(6H)-ones, resulting in a six-fold increase in activity over the lead compound. A new substituent selection technique was used, resulting in a selection which was statistically efficient while minimizing the required synthesis effort. The censored data regression technique successfully recovered useful information from compounds for which only a lower bound on the ED50 was available.
Acknowledgements

We are indebted to A. Mishra of our Synthesis Department for preparing some of the compounds and R.C. Moore of our Biology Department for carrying out the acaricide tests, and for their invaluable suggestions and encouragement. R.J. McKay developed the original computer program for the substituent selection.
References

Aitkin M. A Note on the Regression Analysis of Censored Data. Technometrics 1981; 23: 161-163.
Borth DM. A Total Entropy Criterion for the Dual Problem of Model Discrimination and Parameter Estimation. Journal of the Royal Statistical Society, Series B 1975; 37: 77-87.
Borth DM, McKay RJ, Elliott JR. A Difficulty Information Approach to Substituent Selection in QSAR Studies. Technometrics 1985; 27: 25-35.
Dekeyser MA, Mishra A, Moore RC. Substituted Oxadiazinones. USP 4,670,555, 1987.
Dekeyser MA, Borth DM. Quantitative Structure-Activity Relationships in 4H-1,3,4-Oxadiazin-5(6H)-ones. Submitted for publication in: J Agric Food Chem 1990.
Draper NR, Smith H. Applied Regression Analysis. New York: Wiley, 1981: 83-85.
Exner O. A Critical Compilation of Substituent Constants. In: Chapman NB, Shorter J (Eds). Correlation Analysis in Chemistry, Recent Advances. New York: Plenum Press, 1978: 439-540.
Hansch C, Leo A. Substituent Constants for Correlation Analysis in Chemistry and Biology. New York: Wiley, 1979.
Relyea DI. Personal Communication. Uniroyal Chemical, Naugatuck, CT 06770 USA, 1989.
SAS Institute Inc. SAS/STAT User's Guide, Release 6.03 Edition. Cary, NC, USA: SAS Institute Inc., 1988: 493-518 and 641-666.
Schneider H. Truncated and Censored Samples from Normal Populations. New York: Dekker, 1986: 188.
Wolynetz MS. Maximum Likelihood Estimation in a Linear Model from Confined and Censored Data. Applied Statistics 1979; 28: 195-206.
CHAPTER 11
Site-Directed Computer-Aided Drug Design: Progress towards the Design of Novel Lead Compounds using 'Molecular' Lattices

R.A. Lewis and I.D. Kuntz
Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94143-0446, USA
Abstract

The object of this paper is to describe a novel approach for designing molecules by computer. The goal is to provide an interactive system that will use the forces of molecular recognition and expert input from an organic chemist to design lead compounds that are synthetically and biologically viable. When the binding site of a receptor is known through pharmacophoric mapping or x-ray crystallography, rational computer-aided design of novel lead compounds offers the hope of avoiding the shortcomings associated with the present methods of discovery by chance or mass screening, as well as using any knowledge of mechanism that might be available.
1. Introduction

The foundation of our understanding of the specificity of biological function is based on the principles of molecular recognition [1]. The binding and action of a drug is controlled by the patterns of molecular forces found at the contact surface of the receptor. The objective is to produce molecular graphs that match the unique spatial patterns of recognition forces within the binding site. These graphs may then be synthesised and assayed as lead compounds and optimised by classical structure-activity model-refinement studies to suggest further candidates for testing (Fig. 1). Many drug-receptor interactions are controlled by a few key receptor groups, termed site points, that are found at the contact surface of the receptor binding site. These give rise to regions of space, called ligand points, in which ligand atoms that interact with the site may be located. The receptor binding site must be surveyed to locate these 'hot spots'; this is an important first step towards the design of novel lead compounds.
Figure 1. A general scheme for computer-aided drug design. [Flow chart panels: identification of receptor target; model of active site (x-ray, NMR, pharmacophore); design of lead compounds; synthesis and biological testing; optimisation of lead compounds; analysis of data (QSAR, modelling); model of active site/ligand complex; QSAR, metabolism, delivery, toxicity studies; clinical trials; viable drug.]
As Cohen et al. [2] have observed, it is still a very difficult task to join these hot spots together to generate conformationally sensible, synthetically accessible target molecules. It is assumed that a map of the site is available and that at least some of the molecular recognition forces produced by the site can be computed and displayed. The site is assumed to be relatively rigid; large scale conformational shifts would present an entirely new geometry and should be treated as a new site. This kind of 'snapshot' approach is reasonable if the conformation of the site is relatively stable kinetically, or if a large number of snapshots can be taken, so that a larger sample of the conformational space of the receptor can be examined.
It is frequently observed in x-ray complexes that the ligand is partially enveloped by the receptor; this can only happen in regions of the surface with negative curvature, that is, in clefts or grooves. The overall shape of the receptor may influence the rate of successful receptor/ligand collisions [3] via perturbation of the local electric field, but this will not be enough for detailed discrimination. The receptor atoms are modelled by the hard sphere approximation. The shell formed by the atoms in the binding site defines a steric mould or keyhole into which the ligand must fit. If any ligand atom is forced to lie beneath the accessible surface of the site, it will experience very strong repulsive forces that will dominate all other types of interaction and force the complex apart. Methods have been developed for determining the active site from the coordinates of a receptor [4, 5] and from ligand binding data [6].

The binding site will have a distinct electrostatic profile due to the differing electronegativities and bonding environments of the receptor atoms. The electrostatic profile of a binding site may be calculated by various methods, but the results are dependent on the model used for the dielectric [7]. The affinity of the ligand will be enhanced if the pattern of ligand residual charges can be made to complement that of the receptor. Groups of opposite charge should be placed near to each other, whereas groups of similar potential should be separated. In regions of low polarity, the drug-receptor interaction is influenced more by entropic and weak dispersive effects. The work of Eisenberg and McLachlan [8] has provided some approximate means of quantitating the free energy of hydrophobic interactions. Complementarity is achieved by placing non-polar regions of the ligand and receptor adjacent to each other.

There will also be groups at the surface of the binding site that can participate in hydrogen-bonding interactions with the ligand. These groups can be donors, acceptors or both and should be matched by a complementary acceptor or donor, to stabilise the binding. The geometry of the hydrogen bond is constrained by the geometries of the participating molecular orbitals, so that a hydrogen-bonding site point will give rise to a small ligand point region. The ligand points can be visualised with energy contours from the GRID program [9] or with probability maps from the HSITE program [10], using statistics derived from the Cambridge Structural Database [11]. The spatial pattern of the hydrogen-bonding ligand points forms a tight geometric arrangement of hooks and eyes that defines the binding orientation of the drug. Hydrogen bonds do not contribute greatly to the affinity of a ligand: the energy derived from bond formation in the complex is balanced by the cost of solvent removal and rearrangement [12]. However, the complex will be destabilised if a hydrogen-bonding site on the ligand or the receptor is left unfilled.

There is often a complicated interplay between the demands of the various intermolecular interactions. Regions of high electrostatic complementarity between the ligand and the receptor may dominate areas of poor or negatively correlated hydrophobicity, and vice versa. The energetics of hydrogen-bond formation in the complex must be weighed
against interactions with the surrounding solvent. The need for steric complementarity dominates all the other molecular recognition forces. Any ligand must fit neatly into its receptor if it is to form a complex with reasonable affinity. All these molecular recognition forces must be satisfied during the course of drug design if the postulated ligands are to be viable.

There are three basic approaches towards site-directed drug design: molecular graphics, database searching [13, 14, 15], and automated structure generation [16]. There are weaknesses in each of these methods: this project will seek to unify the approaches and hopefully use their strengths to provide a more powerful drug design tool.

Molecular graphics has proved to be a useful aid to the medicinal chemist in the search for novel lead compounds. However, the process of manually building structures within a binding site is a time-consuming process that is wholly dependent on the skill and creativity of the human operator. The operator must assimilate the shape and the electrostatic, hydrogen-bonding and hydrophobic profiles of the site. The construction of appropriate molecules is like solving a multidimensional jigsaw puzzle, and there is a wide selection of atom and bond pieces to choose from. The designer is likely to take shortcuts based on a subjective interpretation of the information contained in the site surveys. There is a natural human resistance to constructing other skeletons once one acceptable structure has been found. The number of structures or molecular skeletons that could be generated rises exponentially with the size and complexity of the site: a human designer could not hope to explore all the possibilities methodically and exhaustively. However, there is an underlying rationale to the way in which a medicinal chemist tackles the problems of structure design; an understanding of this rationale will provide the rules for a system that seeks to perform an automatic, exhaustive and unbiased search for novel leads.

The DOCK program of Kuntz et al. [13] uses the site as a steric mould to generate a negative image of the site. In terms of the lock-and-key analogy of drug-receptor interactions, this is like generating a keyhole from the lock. A database composed of the three-dimensional coordinates of a set of chemical compounds can be searched to find those structures that have the best steric match to the negative image of the site. The database can be made up of either multiple conformations of the same compound or of a variety of distinct chemical entities. The structures that do best in the steric screen will have good shape complementarity and few, if any, van der Waals overlaps. These structures can be used as the basis set for a program of drug discovery: other molecular recognition forces can be visually inspected to assess the complementarity between the docked structures and the site and possibly to select a lead compound for synthesis and further testing.

The rigid-body fitting procedure used in most database searches is not completely exhaustive. For example, the conformational spaces of the ligand and the receptor are not explored. A database of multiple conformers of the same compound is only a partial solution. The scoring system used in the DOCK program is a simple function based only on
steric fit [17]. It would take much longer to evaluate a rigorous energy potential for each potential ligand-receptor match. Neither of these shortcomings is serious and they are compensated for by shorter computer run times. It is also intended that the molecules with the best steric scores be used as templates for further design, by suitable small scale alterations of the atomic structure, to produce molecules with a higher degree of chemical and steric complementarity. The database strategy is mainly bounded at the highest level, the human interface. The number of matched molecules that can be examined thoroughly at a graphics screen is limited by operator time; the assessment of synthetic accessibility is a long, difficult task. It would be better to use the information from a database search and to design some sort of interface to allow a human to construct a molecular skeleton that is synthetically attractive and complementary to the site.

The strategy for automated structure generation is based on a division of the drug design process into two subproblems. The specificity of a drug-receptor interaction is driven principally by the shape of the binding site and by the geometric arrangement of the hydrogen-bonding and salt-bridge ligand points. The first goal of a structure generator is to link up the ligand points to form a molecular skeleton without incurring steric penalties and without creating high-energy, strained structures. Molecular skeletons are also called molecular graphs: a molecule or a molecular skeleton can be thought of as the physical embodiment of a mathematical graph, where the atoms correspond to nodes and the bonds to edges. The affinity of a ligand for a receptor is the sum of all intermolecular interactions across all the atoms involved in complex formation. The profile of the site is known, as are the positions of the ligand atoms. The types of the atoms are largely unknown. The right atom types must be chosen to optimise the intra- and inter-molecular forces and to create a high affinity ligand. This is the atom assignment problem; like the structure generation problem, it is combinatorial in nature and so heuristics must be used in its solution.

Structure generation can be performed in 2 or 3 dimensions. It is simpler to look at the 2-dimensional case first. It has been noted that a great majority of drug molecules contain at least one ring and that the cyclic part of the structure is often important to the biological action [18]. A survey of the Cambridge Structural Database was used to create a library of planar ring fragments; these fragments were fused into larger templates for use in structure generation [19]. This implies that any generated structures will be flat, but this consideration is outweighed by the synthetic, entropic and combinatorial advantages of creating ring-based molecular templates. The templates are fitted to a planar array of ligand points and sterically clipped in only a few seconds of computer time [20]. Structure generation can also be used to create 3-dimensional molecular templates by the use of a face-centred cubic (diamond) lattice [21]. The diamond lattice can be used to generate aliphatic chains and rings.
Figure 2. Structure generation in 3 dimensions using a regular diamond lattice. [Labels: target region; accessible surface.]
The generator is given a seed and a target region to connect; it moves through the lattice, avoiding steric clashes with the site, creating molecular skeletons that connect the two regions with as few atoms as possible (Fig. 2). This method can be used for de novo generation and for extending existing molecular graphs. It executes very quickly and so can be used interactively. This algorithm has been used to generate mimics of the p-aminobenzoyl glutamate tail of methotrexate [22].

The structure generation methods try to create a general template that links all the ligand points by moving through regular lattices of atoms placed in the space between all the site points. At each atom, a choice is made of the next vertex (atom) to visit; this decision is based on heuristic rules. Structure generation is particularly prone to combinatorial explosion and so certain simplifying assumptions have to be made, in particular, about the types of moves the generator is allowed to make. Once again, the idea is to suggest lead compounds for further optimisation. The generator will run along allowed pathways contained within a lattice or framework of bonds, always trying to fill the site in an energetically sensible way. So far only regular lattices have been used; the case against using random irregular lattices is that the process of generation would be dependent on the starting point chosen; there is also an infinite number of such lattices, most of them useless.

We have recently conceived of a lattice-based design strategy that might overcome some of these problems. It uses the results from a database searching algorithm to create an irregular lattice and the techniques of structure generation to explore that lattice. An irregular lattice is defined as a composite chemical graph of objects, made up of atoms, rings and molecules, all linked by edges. The lattice should be viewed as a hierarchy. At the lowest level, it is a chemical graph or a collection of atoms and rings linked by chemically allowed edges or bonds (Fig. 3). At the next level, it is a collection of the molecules that were obtained from a database search.
Figure 3. An overview of the procedure for forming an irregular molecular lattice. [Panels: dock into site; fuse into lattice; trace paths. Legend: intermolecular bond; intramolecular bond.]
Each molecule is a separate entity, or discrete subset of the lattice, linked to other molecules in the lattice. Atoms within each molecule are linked by intramolecular edges or bonds. At the highest level, the molecules are joined by intermolecular edges, that is, the atoms at either end of the edge belong to different parent molecules. Intramolecular bonds are defined less stringently than their intermolecular counterparts. Each molecule is derived from an x-ray structure, so that atoms that are close together must be bonded or the structure would not be
physically realistic. In this way, each x-ray structure is accorded a unique status within the hierarchy of the lattice. There are several geometric and chemical rules that are used to define the lattice; a sketch of the distance and angle screens follows the list.

(i) Bond lengths. A bond is deemed to be sensible if the distance between atoms i and j, Dij, falls within the statistically determined limits for a bond formed from the atom types of i and j. These limits may be narrowed if the hybridisation of either atom has been previously defined. There are two types of bonds that can be present in the lattice. An intramolecular bond is a direct copy of a bond present in an original molecule from the basis set. An intramolecular bond is defined as Dij < Ri + Rj + Tolerance, where Ri and Rj are the half-bond radii of atoms i and j. This test ensures that the original molecular bonding pattern is maintained. This upper bound implies that bonds that are unreasonably short will not be detected. Intermolecular bonds must be defined more stringently, and upper and lower bounds are imposed on each X-Y atom pair. The limits are obtained by allowing a spread of three standard deviations either side of the observed mean bond lengths [23].

(ii) Bond angles and dihedrals. An angle is taken as sensible if the angle between three bonded atoms i, j and k falls within 5 degrees of the ideal bond angle defined by the hybridisation of j, which in turn depends on the type of the bonds formed between atoms i and j and between j and k. For example, if j is an sp2-hybridised atom, then the bond angle must be approximately 120 degrees. All intramolecular triples are allowed. A dihedral angle is sensible if the angle between the bonded atom quartet ijkl is allowed by the hybridisation types of j and k. For example, if j and k are sp2-hybridised then the dihedral angle must be 0 or 180 degrees, corresponding to syn and anti isomers.

(iii) Chemical sense. Certain atom types do not form stable combinations and so bonds between them are not constructed. A list of disallowed combinations has been given by Lindsay et al. [24]. The valence of an atom, together with the hybridisation, will determine how many bonds that atom can form. For example, an oxygen atom ought to form at most two bonds during the generation process.
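The following MATLAB fragment is a minimal sketch of rules (i) and (ii), assuming a lookup of mean bond lengths and standard deviations by atom-type pair; the helper tables and variable names are hypothetical and are not part of the published program.

    % ri, rj, rj1, rj2 : 1x3 atom coordinates; ti, tj : atom-type indices
    % meanLen, sdLen   : tables of observed mean bond lengths and std deviations [23]
    Dij   = norm(ri - rj);
    lower = meanLen(ti,tj) - 3*sdLen(ti,tj);
    upper = meanLen(ti,tj) + 3*sdLen(ti,tj);
    interBondOK = (Dij > lower) && (Dij < upper);   % rule (i), intermolecular case

    % Rule (ii): angle at atom i for the triple j1-i-j2, screened against the
    % ideal angle for the assumed hybridisation of i (sp = 180, sp2 = 120, sp3 = 109.5)
    v1      = rj1 - ri;   v2 = rj2 - ri;
    theta   = acosd(dot(v1, v2) / (norm(v1) * norm(v2)));
    ideal   = [180 120 109.5];                      % indexed by hybridisation 1..3
    angleOK = abs(theta - ideal(hyb_i)) < 5;        % 5-degree tolerance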
2. Procedure and results

A negative sphere image of the enzyme dihydrofolate reductase [25] (4DFR from the Brookhaven database [26]) was generated and clustered using programs from the DOCK package version 1.1 [13]. The largest cluster, with 643 spheres, corresponded to the binding site of the known inhibitor, methotrexate. The cluster was filtered to produce a smaller representation of the site; only those sphere centres that were within 5 Å of an inhibitor atom were retained. This reduced the number of spheres to 47 (Fig. 4). This last step was performed for two reasons: (i) the shape of the binding site is poorly defined around the mouth region where the L-glutamate tail of the inhibitor is thought to bind. The spheres describe a Y-shape: the ligand lies in the stem and one of the arms whilst the other arm is empty.
Figure 4. A relaxed stereo plot of the negative sphere image of the active site of 4DFR. The inhibitor methotrexate is also shown.
The searching program will give equal weight to all regions, so that the true binding site in the deep cleft might not be explored so fully, giving misleading results. (ii) The combinatorial docking algorithm has a smaller number of spheres to match and so the computer program runs much faster. Automatic reclustering programs that do not use ad hoc filters are being developed (Shoichet et al., in preparation).

The DOCK program was used to match 10,000 structures selected [Seibel et al., in preparation] from the Cambridge Structural Database against the reduced sphere cluster. The 300 best scoring structures were produced in the correct orientation for fitting into the site (Fig. 5). The ranking was done solely on the basis of a shape complementarity evaluation. These structures formed the basis set for generation of an irregular lattice. Apart from the filtering of the initial sphere cluster, no special knowledge about the binding site was used to bias the types of structures produced. Other methods could also be used to generate the basis set, provided that all the structures fit within the binding site. High affinity for the site is not a condition for membership.

An adjacency or edge list was produced for each atom in the database by looking for other atoms that formed valid inter- and intra-molecular bonds using the criteria discussed above.
Figure 5. A relaxed stereo plot showing 50 structures docked to the active site of 4DFR. The L-glutamate fragment of methotrexate is shown in bold.
Figure 6. A histogram of the number of intermolecular links made by atoms in the lattice.
A histogram of the number of intermolecular lattice edges that each non-hydrogen atom can form (Fig. 6) was taken from a lattice of 147 molecules and 5,176 atoms. The graph shows that the lattice is very well connected. Some atoms formed more than 200 separate links; the average number of links was 42. About 10,000 intermolecular links were found, showing that there will be many new ways of connecting the atoms to form novel structures. If more stringent bond length criteria were to be applied, the number of links would of course decrease. These finer filters should be applied at a later stage, when the particular local environment constructed by the designer will determine what is allowed. We believe that it is better to construct a lattice that is a superset of possibilities and to prune it at a local level.

Bond angle cover was assessed by taking triples from the adjacency list and putting them through an angle screen. For each atom i, triples are made by taking two atoms j1 and j2 from the adjacency list of i. The bond angle is then computed from the dot product of the vectors j1->i and j2->i. All intramolecular triples are immediately passed, but if either j1 or j2 originates in a different parent structure to i, an intermolecular screen is applied. This allows a deviation of 5 degrees from ideal sp, sp2, sp3 bond angles.
This strategy permits the presence of strained intramolecular rings but prevents unusual intermolecular links. On average, each atom was the centre of 250 triples, which is a success rate of 15%. Only 6% of the atoms failed to make any intermolecular triples.

The number of molecule-molecule links was computed. If two molecules are to be joined sensibly, there must be a valid pathway between the two molecules. For a pair of linked atoms, i from molecule A and j from molecule B, there must be two valid atom triplets: i1.i.j and i.j.j1. This checks that i can be a sensible extension of molecule B and j of molecule A. This step was taken to ensure that different parent structures within the lattice could be linked in a conformationally sensible manner. For the 4DFR lattice of 147 structures, only 80 pairs of molecules could not be sensibly joined together; this represents a failure rate of 0.74%. The average number of good molecule-molecule links is 9 per pair.

The site was divided into boxes to look at the average densities of atoms, intermolecular links and triples. Each box was a cube of side length 1.5 Å. The average density of atoms within the 4DFR site was 3.4 atoms per cubic Å. The density of links was about the same. The highest density of atoms was observed to be within one box (or about one bond length) of the site surface. This may be rationalised in terms of the scoring scheme used in DOCK. Atoms that are close to the contact surface of the receptor are given positive scores; these sum up to give the ranking score. Thus, molecules that hug the surface of the site have higher scores. The lattice is composed of the structures with the best scores, so that this observation is not unexpected.

The usefulness of the lattice can be assessed by trying to trace out known ligands using the atoms and bonds of the lattice. The lattice derived from the methotrexate site of 4DFR was examined to see whether it could be used to construct the methotrexate molecule in its complexed conformation. Neither methotrexate nor any of its derivatives were present in the original database of structures that the lattice was formed from. Two experiments were performed. The first looked for a simple coordinate match of a lattice atom to the atoms of methotrexate; the second experiment required that the coordinate match be supplemented by a match of atom types, so that the lattice atom type was the same as the type of the nearby inhibitor atom.
Figure 7. A relaxed stereo plot of the coordinate match superposed on the structure of methotrexate bound to 4DFR.
Figure 8. A relaxed stereo plot of the atom match superposed on the structure of methotrexate bound to 4DFR. Missing atom matches are augmented by coordinate matches.
Only lattice atoms within 0.5 Å of an inhibitor atom were considered as a match. The results are presented in Figures 7 and 8 and in Table 1. The cover is quite good around the pteridine ring system of the inhibitor. The rms deviation of the coordinate matches is 0.21 Å. The atom matches do not cover the heteroatom positions very well; missing atom matches have been augmented by coordinate matches in Figure 8 to give the best mixed match to methotrexate. This is not so important, as the designer will be able to position those atoms from the hydrogen-bonding ligand points that lie near to N1, NA2, NA4 and N8. The benzoylamide fragment is less well covered and the glutamate sidechain not at all. This can be explained by looking at the relative positions of the lattice atoms and the drug molecule (Fig. 5). It can be seen that there are few docked structures in the upper portion of the site. The structures that are embedded in the cleft have greater steric complementarity than those in the mouth of the site. Thus, the lattice has poor cover on the external boundary of the site. This effect might be overcome by building the lattice by random selection from a set of molecules that have a threshold shape match to the site. The best match to each inhibitor atom is scored only by a distance criterion. Better looking structures could be obtained by choosing lattice atoms with a poorer match but with improved bonding relationships to their neighbours. However, even these elementary experiments show that there is adequate richness of cover contained within the lattice to mimic many of the features of the known ligand. Further studies are being performed. The results are encouraging but not yet convincing; they can easily be improved by using a larger, more evenly distributed basis set.
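As an illustration of the coordinate-match experiment, a minimal MATLAB sketch is given below; the variable names are hypothetical and the 0.5 Å cut-off is the one quoted above.

    % inhib : m x 3 coordinates of the inhibitor (methotrexate) atoms
    % latt  : k x 3 coordinates of the lattice atoms
    cutoff = 0.5;                        % Angstrom cut-off for a coordinate match
    best   = nan(size(inhib,1), 1);      % distance of the best match per inhibitor atom
    for a = 1:size(inhib, 1)
        d = sqrt(sum((latt - inhib(a,:)).^2, 2));   % distances to every lattice atom
        if min(d) < cutoff
            best(a) = min(d);
        end
    end
    hits    = ~isnan(best);
    pctHits = 100 * mean(hits);                     % percentage of inhibitor atoms matched
    rmsDev  = sqrt(mean(best(hits).^2));            % rms deviation of the matches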
3. Discussion

The investigations into the active site of 4DFR showed that the lattice contained atoms that could be joined into a methotrexate-like structure. The methods for using the irregular lattice as a design tool are still being developed, but a prototype strategy has been devised (Fig. 9). In certain parts of the site, there will be hot spots of binding activity, or ligand points, possibly at sites where salt bridges or H-bonds can form.
TABLE 1
The coordinate and atom matches between the lattice and methotrexate. [For each of the 29 methotrexate atoms (N1, C2, NA2, N3, C4, NA4, C4A, N5, C6, C7, N8, C8A, C9, N10, CM, C11, C12, C13, C14, C15, C16, C, O, N, CA, CT, O1, O2, CB) the table lists the match distance (Å) and the number of lattice atoms within the cut-off, for the coordinate match and for the atom match; the individual entries are not legible in this reproduction. No matches to CG, CD, OE1, OE2. RMS deviations: 0.30 Å (coordinate match) and 0.33 Å (atom match); percentage hits: 82% and 55%, respectively.]
The medicinal chemist selects some ligand points of interest within the binding site, and the lattice is searched for atoms or links to the ligand points. A selection of structures is produced and examined graphically. The designer could then select another ligand point and a structure generation program would walk through the lattice to build several structures that can be modified, ranked and filtered by the designer on the basis of synthetic accessibility. The operator would drive the program interactively to build the structures.
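A sketch of one way such a walk could be organised is given below (MATLAB): a breadth-first search over the precomputed adjacency list from a seed atom towards a chosen ligand point. This is an illustration of the idea only, with steric screening against the site omitted; the data structures and names are assumptions, not the authors' implementation.

    % adj    : cell array, adj{a} lists the lattice atoms bonded (intra or inter) to atom a
    % coords : k x 3 lattice atom coordinates; seed : starting atom index
    % target : 1 x 3 coordinates of the chosen ligand point; tol : capture radius (Angstrom)
    function path = walk_lattice(adj, coords, seed, target, tol)
      prev  = zeros(size(coords,1), 1);            % breadth-first search predecessors
      queue = seed;  prev(seed) = seed;
      path  = [];
      while ~isempty(queue)
        a = queue(1);  queue(1) = [];
        if norm(coords(a,:) - target) < tol        % reached the ligand point
          path = a;
          while path(1) ~= seed, path = [prev(path(1)); path]; end
          return
        end
        nb = adj{a}(prev(adj{a}) == 0);            % unvisited neighbours
        prev(nb) = a;  queue = [queue; nb(:)];
      end
    end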
Figure 9. A general strategy for lattice-based drug design. [Flow chart panels: model of active site (x-ray, NMR, pharmacophore); perform a DOCK run; fuse the molecules into a molecular lattice; look for a good interaction site; choose a starting point; choose a molecule, or part of a molecule, from the paths allowed by rules; lead compounds that are more synthetically accessible.]
It is intended that, initially, the design process be driven primarily by a synthetic impetus; however, complementarity should not be neglected, especially when the designer is selecting new atoms to add to the growing structure. There should be some way of assessing complementarity a priori as a score. This would encourage the selection of high-scoring atoms but not prevent other considerations from overriding this. At the end of the design process, a force field minimisation of the entire ligand-binding site complex is required. In this way, the lattice strategy seeks to combine the expert's synthetic knowledge with the chemical sense of database searching and the exhaustive nature of structure generation.
4. Conclusion

A database of structures is compared to the binding site and the members of the database are ranked according to steric complementarity. The superposition of the best structures within the binding site provides a forest of atoms and bonds that, when combined in a chemically sensible way, form an enormous irregular lattice. Several rules used to define chemical sense within the lattice have been described. The results obtained from spatial analysis of a lattice obtained from the methotrexate binding site of dihydrofolate reductase show a surprising richness of cover of atoms, bonds and vicinal triples throughout the site. This implies that there are many paths that could be traced through the lattice that would connect the component atoms of the lattice in new and interesting ways. Examples of paths that correspond to a known ligand have been found. Novel algorithms are being developed that will allow an expert synthetic chemist to move through the lattice and to trace out molecules that interact strongly with the receptor and that are reasonable synthetic targets. The development of this approach will be a significant step towards the integration of both biological and chemical knowledge into the rational drug design process.
Acknowledgements

RAL would like to thank the Royal Commission of 1851 for a Research Fellowship and the Fulbright Commission for a Senior Scholarship. This work was partially supported by grants GM-31497 and GM-39552 (George Kenyon, Principal Investigator). We thank the UCSF Computer Graphics Laboratory (RR-01081, Robert Langridge, Principal Investigator) for helpful discussions.
References

1. Dean PM. Molecular Foundations of Drug-Receptor Interactions. Cambridge, UK: Cambridge University Press, 1987.
2. Cohen NC, Blaney JM, Humblet C, Gund P, Barry DC. J Med Chem 1990; 33: 883-894.
3. Sharp K, Fine R, Honig B. Science 1987; 236: 1460-1463.
4. Lewis RA. J Comp-Aided Mol Design 1989; 3: 133-147.
5. Lewis RA. Meth Enzymol, in press, and references therein.
6. Marshall GR. Ann Rev Pharm Tox 1987; 27: 193-213 and references therein.
7. Fersht AR, Sternberg MJE. Prot Eng 1989; 2: 527-530.
8. Eisenberg D, McLachlan AD. Nature 1986; 319: 199-203.
9. Boobbyer DN, Goodford PJ, McWhinnie PM, Wade RC. J Med Chem 1989; 32: 1083-1094.
10. Danziger DJ, Dean PM. Proc Roy Soc, London Ser B 1989; 236: 115-124.
11. Taylor R, Kennard O, Versichel W. Acta Cryst 1984; B40: 280-288.
12. Fersht AR. Proc Roy Soc, London Ser B 1984; 187: 397-407.
13. Kuntz ID, Blaney JM, Oatley SJ, Langridge R, Ferrin TE. J Mol Biol 1982; 161: 269-288.
14. Van Drie J, Weininger D, Martin YC. J Comp-Aided Mol Design 1989; 3: 225-251.
15. Jakes SE, Watts N, Willett P, Bawden D, Fisher JD. J Mol Graph 1987; 5: 41-48.
16. Lewis RA. J Mol Graph 1988; 6: 215-216.
17. DesJarlais RL, Sheridan RP, Seibel GL, Dixon JS, Kuntz ID, Venkataraghavan R. J Med Chem 1988; 31: 722-729.
18. Adamson GW, Creasey SE, Eakins JP, Lynch MF. J Chem Soc Perkin Trans I, 1973: 2071-2076.
19. Lewis RA, Dean PM. Proc Roy Soc, London Ser B 1989; 236: 125-140.
20. Lewis RA, Dean PM. Proc Roy Soc, London Ser B 1989; 236: 141-162.
21. Lewis RA. J Comp-Aided Mol Design 1990; 4: 205-210.
22. Lewis RA. Paper presented at the Sixth European Seminar on Computer-Aided Molecular Design, London, 1989.
23. Allen FH, Kennard O, Watson DG, Brammer L, Orpen AG, Taylor R. J Chem Soc Perkin Trans II 1988: S1-S19.
24. Lindsay RK, Buchanan BG, Feigenbaum EA, Lederberg J. Applications of Artificial Intelligence for Organic Chemistry: the DENDRAL Project. New York: McGraw-Hill, 1980.
25. Filman DJ, Bolin JT, Matthews DA, Kraut J. J Biol Chem 1982; 257: 13663-13672.
26. Bernstein FC, Koetzle TF, Williams GJB, Meyer EF, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M. J Mol Biol 1977; 112: 535-542.
CHAPTER 12
A Chemometrics / Statistics / Neural Networks Toolbox for MATLAB

H. Haario¹, V.-M. Taavitsainen¹ and P.A. Jokinen²
¹Kemira Research Center, P.O. Box 44, SF-02271 Espoo, Finland and ²NESTE Technology, P.O. Box 310, SF-06101 Porvoo, Finland
1. Background

Several well-written software packages already exist for statistical analysis and handling of the data produced in various fields of science. In practical work, however, one often encounters certain problems when using the software.

Data manipulation. In many cases the most time consuming part of the work can be the preparation of the data: to get the raw data, obtained, e.g., from laboratory equipment, into the form required by the software. Here the nuisance often is twofold. On the one hand, the original data often are written in files of non-standard (non-ASCII!) form. The equipment may be sold with accompanying software. This software usually is designed for a limited purpose only and often is scarcely documented, so the user may eventually not really know how the analysis is done. To get the data into a form suitable for external software, some format-transforming auxiliary program may be needed. On the other hand, during the analysis various manipulations of the data matrices are necessary: one has to remove missing values and outliers, split or concatenate matrices, etc. Often this is not as easy as it should be.
Generality. Software packages tend to be oriented towards some fields of data analysis. One package may have an extensive library of ANOVA or factor analysis routines, another contains nice tools for generalized models (quantitative and qualitative factors), a third specializes in logistic regression. As a side effect, some other areas may be lacking: a nice statistics package might not be comprehensive enough for experimental design, missing, e.g., mixture designs. Consequently one is forced to use several packages. This in turn leads to the mess of saving one's files in one format, transforming them, loading them again into another program, etc. A preferable situation would be to work in one environment only and be able to use tools from external software as subroutines.
Flexibility, extendability. Real-life problems often are not of fixed standard forms. A data analyst thus faces the need of developing tailored solutions for specific tasks. In this situation the efforts required by coding and documentation should be kept to a minimum. These requirements are best met in a working environment which contains a matrix language with a natural syntax and built-in functions for basic manipulations. In addition, the environment must be fully open for the user's expansions and must contain the basic programming structures.

In what follows we discuss our toolbox for data analysis, written in the MATLAB matrix language. The present state of the toolbox is the result of an 'evolution' started some years ago. New functions added to the library have been determined by the needs met in practical work. The driving forces in the process have thus been the demand to solve various problems in the chemical industries, and the ease of writing the solutions with MATLAB.
2. MATLAB

MATLAB [10] is an interactive matrix manipulation package which contains a comprehensive collection of numerical linear algebra functions. In addition, it provides a programming language with the basic control structures. The first versions of MATLAB were written for mainframe computers to give an interactive environment for the use of the classical Fortran matrix algebra subroutine collections LINPACK and EISPACK. With the new, enhanced versions the popularity of MATLAB has strongly increased in recent years. The present versions of MATLAB work, with identical syntaxes, both in micros (PC-MATLAB), several workstations, and mainframes (PRO-MATLAB for VAX machines).

MATLAB includes the basic matrix manipulation routines as built-in functions. Consequently matrix operations can be programmed in a very compact way. The notation resembles the way one writes matrix computations as mathematical formulas. Thus, in many cases formulas written in the MATLAB language are rather self-documenting. In addition, a built-in help system facilitates the interactive documentation of user-written functions.

As an example, consider the basic regression problem of fitting the coefficients bi of the model

y = b0 + b1 x1 + ... + bn xn

given a data vector y corresponding to an experiment matrix X, as well as an estimate experr of the experimental error. The estimates for the coefficients, their standard deviations and t-values, the residuals and the estimated y-values are expressed with mathematical formulas and in the MATLAB language as

    b = (X'X)^-1 X'y                                  b = x\y;
    yest = X b                                        yest = x*b;
    res = y - yest                                    res = y - yest;
    stdbi = sqrt([(X'X)^-1]ii) experr, i = 1, ..., n  stdb = sqrt(diag(inv(x'*x)))*experr;
    ti = bi / stdbi, i = 1, ..., n                    t = b./stdb;
The very same lines work also in a multiresponse case, where y is a matrix. The coding of a user-written function to perform the regression is equally straightforward. Below are the first lines of the function, to which we give the name 'reg':

    function [b,stdb,t,yest,res,experr]=reg(x,y,experr,intcep)
    % function [b,stdb,t,yest,res,experr]=reg(x,y,experr,intcep)
    %
    % the function computes the regression coefficients and their
    % t-values for a given design matrix and data.
    %
    % INPUT:   x        the design matrix
    %          y        the data
    %          experr   the estimated std of the experimental error.
    %                   In case of a multiresponse matrix 'y', 'experr' must be a vector.
    %                   OPTIONAL (default: computed from the residuals 'res')
    %          intcep   intercept option, intcep=1: model contains an intercept,
    %                   any other value: no intercept.
    %                   OPTIONAL (default: intcep = 1)
    %
    % OUTPUT:  b        the estimated regression coefficients
    %          stdb     the estimated std of the coefficients 'b'
    %          t        the t-values of 'b'
    %          yest     the values of the response computed by the model
    %          res      the residual: 'y - yest'
    %          experr   the estimated std of the experimental error.
    %                   OPTIONAL, computed only if 'experr' not given as input.
The first line gives the input and output parameter lists. All the lines immediately following it and beginning with a comment character % give the interactive help text, appearing as the response to the command 'help reg'. Then follows the code itself, the core of which was given above. Additional control structures are added to take care of the default values: the parameters described as OPTIONAL in the help text may simply be omitted or given the value zero, in which case the default values are used.
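The body of the function is not reproduced in full here; a minimal sketch of how the core lines and the default handling could be combined is shown below (an illustration only, not the authors' original code).

    function [b,stdb,t,yest,res,experr] = reg_sketch(x,y,experr,intcep)
    % hypothetical sketch of the body of 'reg': default handling plus the core lines
    if nargin < 4 || isempty(intcep) || intcep == 0, intcep = 1; end
    if nargin < 3 || isempty(experr), experr = 0; end
    if intcep == 1, x = [ones(size(x,1),1) x]; end   % prepend a column of ones
    b    = x\y;                                      % least squares coefficients
    yest = x*b;
    res  = y - yest;
    if all(experr == 0)                              % default: estimate from residuals
      experr = sqrt(sum(res.^2) / (size(x,1) - size(x,2)));
    end
    stdb = sqrt(diag(inv(x'*x))) * experr;
    t    = b ./ stdb;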
In addition to MATLAB itself, several 'MATLAB toolboxes' for different fields of science have already been written and are commercially available: a Signal Processing Toolbox, a System Identification Toolbox, Control and Robust Control toolboxes, etc. Recently a Chemometrics Toolbox has also been published [5]. In our opinion, however, this collection, containing the basic multiple linear regression, principal component and PLS regressions, is rather insufficient, so we have kept using and expanding our own collection. Taking into account all the built-in facilities of MATLAB for matrix algebra one would expect a comprehensive Statistics Toolbox to be available, too. For some reason this is not the case. Thus, we have coded the most necessary statistical methods and distributions using MATLAB. Since we have tried to avoid 're-coding the wheel', more sophisticated statistical routines are included in the toolbox from the IMSL/STAT library via the MATLAB/IMSL interface. At present this interface only works in the VAX environment.

More discussion about the properties of MATLAB with examples can be found in [11], where the use of MATLAB for chemometrics was dealt with, too. The paper [7] contains an analysis of ecological data done with the aid of our toolbox.
3. The contents of the Toolbox
3.1 The basic functions

All the MATLAB functions work identically in both the PC and VAX environments, so the user may run the routines of the toolbox without distinction in both settings. However, some of the functions use the IMSL subroutines via the interface only available on the VAX. So we first describe the functions valid in both environments. Below is a list of the main areas covered, together with the basic functions.

UTILITY FUNCTIONS
 - data manipulation
 - statistical distributions
EXPERIMENTAL DESIGN
 - fractional factorial designs
 - composite designs
 - mixture designs
REGRESSION
 - multiple linear regression
 - backward multiple linear regression
CHEMOMETRICS
 - principal component analysis
 - principal component regression
 - PLS
 - crossvalidation etc. validation methods
BIOMETRICS
 - ordination methods
 - mpn estimation for dilutions
 - dose-response models
NEURAL NETWORKS
 - networks with none/one hidden layer

The data manipulation functions include tools like removing rows or columns (or both) from a matrix, expanding a data matrix with new variables obtained as interaction terms of original variables (i.e., as products of given columns of the matrix), centering or scaling matrices, etc. The regression functions are restricted to the basic linear regression and the backward regression stepwise eliminating statistically insignificant factors from the model. The reason is that the more comprehensive routines are available from the IMSL library in the VAX version. For the same reason, e.g., ANOVA is missing here. The principal component and PLS routines are accompanied by graphical utility functions for plotting the scores/loadings or biplot figures. The ordination methods in the biometrics department contain principal component type routines such as correspondence analysis. The mpn (most probable number) and dose-response utilities compute the maximum likelihood estimates for the parameters of the models in question and plot the likelihood functions or their contours.
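As an illustration of the kind of one-line data manipulation meant here, the following fragment (hypothetical variable names, plain MATLAB rather than the toolbox functions themselves) removes rows with missing values, appends an interaction column and autoscales the matrix.

    % X : n x p data matrix with missing values coded as NaN
    X  = X(~any(isnan(X), 2), :);           % remove rows containing missing values
    X  = [X X(:,1).*X(:,2)];                % append the interaction of columns 1 and 2
    Xs = (X - mean(X)) ./ std(X);           % center and scale (autoscale) each column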
3.2 The interface to external software

MATLAB provides the tools for the user to call his own C or Fortran subroutines from MATLAB, as if they were built-in MATLAB functions. These MEX-files serve several purposes, for example:
- Pre-existing codes can be used as such.
- Some computations (notably for-loops) are slow in MATLAB. They may be re-coded for efficiency.
- Direct access to external hardware.
MEX-files are executable images, produced from compiled C or Fortran code and linked into the MATLAB system. In principle, each subroutine should be linked separately. Using the facilities of the VAX/VMS operating system, an interface has been developed which enables large mathematical and statistical subroutine libraries like NAG or IMSL to be linked at one time.
To give an idea of the interface we consider an example from the IMSL/STAT library. The subroutine GDATA [6] provides the user of IMSL with several classical test data sets used in the statistical literature. To run the routine one must write a Fortran main program which defines the types, dimensions etc. of the variables used and calls the subroutine. The Fortran call reads as

call dgdata(idata, iprint, nobs, nvar, x, ldx, ndx),

where the input parameters idata and iprint specify the data set and print options, and the output parameter x contains the data. The rest of the parameters denote several matrix dimensions. Using the interface, the same subroutine can be called directly from MATLAB by typing

imsl('dgdata idata:i iprint:i nobs:i nvar:i x ldx:i ndx:i')
Here the suffix 'i' indicates which parameters are of the type 'integer'. Even the above line can easily be hidden from the user by writing a short MATLAB function containing it together with some parameter initializations and a few appropriate help lines, e.g.,
function x = gdata(idata)
% the function returns the following example data sets:
%
% idata  nobs  nvar  Data set
%   1      16    7   Longley (1967)
%   2     176    2   Wolfer sunspot (Anderson 1971, p. 660)
%   3     150    5   Fisher Iris (Mardia Kent Bibby 1979, Table 1.2.2)
%   4     144    1   Box & Jenkins Series G (1976, p. 531)
%   5      13    5   Draper & Smith (1981, p. 629-630)
%   6     197    1   Box & Jenkins Series A (1976, p. 525)
%   7     296    2   Box & Jenkins Series J (1976, p. 532-533)
%   8     100    4   Robinson Multichannel Time Series (1967, p. 204)
%
% INPUT    idata:  cf. above
% OUTPUT   x:      the data matrix
telling the user how to call the function, the meaning of the input and output, as well as where to find the references. Note that the parameter for printing options becomes obsolete, since the user may easily screen the data in MATLAB. Nor does he need to care about any matrix dimensions. The advantages of using external routines in the above manner, via the interface and embedded in a MATLAB function, are obvious:
- a large Fortran subroutine library can be used without any code written by the user;
- the use is interactive;
- the use is simplified: one only has to specify relevant information (no definitions of data types, leading dimensions of matrices, etc.);
- on-line help serves as a guide.
Of course the manual of the external library is still needed for the use of more involved routines, the MATLAB help being intended for quick guidance. In addition to the statistical tools some useful mathematical routines are included in the toolbox, notably linear and quadratic optimization and (smoothing) splines. Below the main areas presently covered are given.

REGRESSION: most IMSL routines, e.g.,
- GLM for generalized models (including qualitative factors)
- RLEQU for models with linear constraints
- RBEST for best subset regression
ANOVA: most IMSL routines, e.g.,
- ALATN for Latin squares
- ABALD for balanced designs
- ANWAY for balanced n-way classification, etc.
FACTOR ANALYSIS: a part of the IMSL routines, e.g.,
- FACTR for initial factorial loadings
- FROTA for orthogonal rotations
- FOPCS for Procrustes rotations with a target matrix
MATHEMATICAL ROUTINES
- DLPRS for linear programming
- QPROG for quadratic programming
- CSSMH, CSSCV, etc. for spline smoothing

The interface also gives the possibility of adding Fortran code, for instance public domain or user-written routines, to the external library. However, the need for such applications is mainly encountered in the field of nonlinear parameter estimation for mechanistic models, which are not included in the present toolbox.
4. Quadratic optimization

As an application of the MATLAB/IMSL interface we present a method which combines
quadratic optimization with the information given by a principal component type data analysis. Suppose we want to model the connection between data matrices X and Y, where the independent variables in the X matrix represent, e.g., experimental conditions and the dependent variables of Y give the quality of the product. The first idea might be to try a usual multiple linear regression or perhaps PLS regression. But we may end up with a regression model with no statistically significant factors, or a cross-validation may show a PLS model totally inadequate. The failure might be due to some nonlinearities, discontinuities etc. in the relation between X and Y. However, a principal component (PCA or PLS) score plot of the observations might reveal an interesting trend between X and Y anyway. Figure 1 gives a real data example of this situation. We have plotted the observations in the X matrix as the scores on the first two PLS components. Each observation is labelled with the corresponding Y value. In this case the Y values vary between 0 and 100, the best quality being denoted by 0 and the worst by 100. We can see that the 'good' products are mapped to the lower left corner of the plot, the 'bad' ones to the upper right corner. If the chemist now would like to make a new experiment, how should he, missing a model between X and Y, choose the experimental conditions, given as the components of a vector x = (x_1, ..., x_n)^T? One feels that the information given by the score plot should be taken into account, i.e., one would like to predetermine the place of the new point in the score plot. Denote the principal component vectors by p_i, i = 1, 2. The score components of the observation x are given as the projections t_i = x^T p_i.
As projections from a typically higher dimensional space (n > 2) into 2 dimensions, the pair t_1, t_2 by no means determines the vector x uniquely. To restrict the search we can use our a priori knowledge of the possible values by giving lower and upper bounds a <= x <= b for the components of x, or impose some linear constraints, Ax = c. A unique solution x can be obtained by formulating the search for x as an optimization problem with the constraints given above. We require that the solution is near a group of observations with some desirable properties. For instance, denote by x_0 the mean value of the observations labelled as 'good' in the score plot. We then would like to minimize the Euclidean distance ||x - x_0|| between x and x_0. Denote further by T a polygonal target domain in the score plot. The point x is now obtained as the solution of the quadratic programming problem
Minimize ||x - x_0||^2
with the constraints
x^T p_i inside T, i = 1, 2,   and   a <= x <= b.

Figure 1. Constrained projection on a PLS plot: the PLS scores of the data set for the quadratic programming example.
In Figure 1 the target domain of our example is shown. The score point of the solution inside the target domain is labelled as -1. The IMSL routine QPROG [5] was used to solve the above task. Again, the routine was imbedded in a MATLAB function to guarantee a fluent, interactive use.
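For illustration, the following MATLAB lines sketch how such a quadratic program could be set up and solved; they are not the routine used in the paper. The Optimization Toolbox function quadprog stands in for the IMSL QPROG call, and the loading matrix P (n x 2), the target point x0, the bounds a and b, and the matrices At, bt describing the (here assumed convex) polygon T in score coordinates are assumed to be available.

n = size(P, 1);
H = 2 * eye(n);                   % ||x - x0||^2 = x'*H*x/2 + f'*x + constant
f = -2 * x0;
A = At * P';                      % scores t = P'*x must satisfy At*t <= bt (inside T)
[x, fval] = quadprog(H, f, A, bt, [], [], a, b);
t = P' * x;                       % score point of the solution, as plotted in Figure 1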
5. Neural networks as nonlinear models

5.1 Overview

Artificial neural networks have received much attention in recent years due to the hope of building intelligent systems that can learn. This MATLAB toolbox includes an important subset of neural networks. The emphasis here is to use them for 'nonlinear statistics'. We describe neural networks as a tool for nonlinear (non-polynomial) modelling, extending the scope of the more traditional linear or polynomial models of statistics and chemometrics. An artificial neural network consists of simple processing elements that can be thought of as neurons. A simple example of an artificial neuron is shown in Figure 2.
Figure 2. An artificial neuron: inputs x_i, output y, weights w_i, bias w_0, the link function f.
The output y of the neuron is given as

y = f(w^T x + w_0) = f(Σ_i w_i x_i + w_0),     (1)

where x = (x_1, ..., x_n)^T is the input to the neuron, the vector w = (w_1, ..., w_n)^T gives the weights, and f(·) is some link function at the output. The 'intercept' term w_0 is here called the bias. The link function is usually of a sigmoidal form, for example the logit function

f(z) = 1 / (1 + e^(-z))     or     f(z) = tanh z.
If the link function is linear, this model with one neuron coincides with an ordinary multiple linear regression model and the estimation of the weights w_i can be done by the well known matrix formulas. Other interesting special cases are obtained with, e.g., the logit or probit link functions. In fact, the functions so constructed coincide with those used in statistics as 'generalized linear models' for analyzing, e.g., binary data or random variables from the exponential family of distributions [9]. The estimation of the parameters w_i now requires an iterative algorithm. The software package GLIM for these models is described in [1]. Due to the nonlinearity of the model, the parameters w_i cannot be estimated in closed form. We outline here, in the simple case of the model (1), the method of steepest descent type used for the iterative procedure in backpropagation. Denote by X and Y the 'training set', the input data matrix and the output data. The aim is to build a function f mapping
the input vectors x^(i) to the respective components of Y. The method of steepest descent tries to minimize a given performance function E with respect to the parameters w_i using

w(t+1) = w(t) - µ ∂E/∂w,     (2)

where the 'time' t denotes the iteration steps. The derivative of the performance function depends on the particular form of the model to be identified; µ is a small positive constant. The error between the true output and the calculated output f(w^T x^(i)) is denoted by

ε^(i) = y^(i) - f(w^T x^(i)).

In the case of a least squares solution the performance function is E = Σ_i (ε^(i))^2 and the derivative of the performance function becomes

∂E/∂w = -2 Σ_i ε^(i) f'(w^T x^(i)) x^(i).
In most implementations of the backpropagation algorithm, the derivative in Equation 2 is indeed computed as above, taking the sample vectors from the matrix X one by one. The iteration stops when the overall error reaches a prescribed level or when the number of iteration steps allowed is exceeded. The initial values for the weights are usually given as random numbers. The neurons can be connected into more complex feed-forward structures that contain a hidden layer between input and output layers, see Figure 3. In the case of one hidden layer and linear link functions in all neurons of the network, the model tends to a linear principal component regression model, PCR [2]. The addition of nonlinearities to the neurons increases the family of models that can be expressed using the same connection topology. It has been shown that a feed-forward neural network with one hidden layer and sigmoidal nonlinearities can approximate any linear or nonlinear continuous function with arbitrary accuracy, although an estimate for the number of hidden neurons needed is not known theoretically, cf. [3, 14]. The estimation of feed-forward neural network models is also done using an iterative algorithm. The algorithm has been derived independently by Werbos [15], Parker [12] and Rumelhart & Hinton [13] and it is currently known as the backpropagation algorithm. The only addition to the previous algorithm is a method to propagate the error from the network output to the hidden neurons of the network. The weight update is done for each neuron of the network using an expression similar to (2), only the calculation of the derivative is now different. For the details of the algorithm see for example [13]. The backpropagation algorithm usually converges slowly. This might discourage the use of it, especially in a PC environment. It is thus advisable to pay attention to an
Figure 3. A network of artificial neurons with one hidden layer.
effective implementation of the algorithm. In our case, a natural solution is to code the backpropagation not as a MATLAB function but as an external subroutine written in the C language, and use the MEX-interface facility provided by MATLAB. The convergence (the 'training' phase) can also be considerably speeded up by a proper selection of the initial values of the weights and other parameters of the backpropagation algorithm. Keeping in mind our principal use of the neural networks, nonlinear extensions of linear models, it is tempting to take the initial values so as to construct a network whose output approximates that given by the best linear model found. In addition to specifying the initial values of the parameters one has to fix the structure of the network: the number of layers and the number of neurons in each layer. The simple structure with one layer only, cf. Figure 2 for a scalar output case, may be employed if one, e.g., wants to take into account a saturation type nonlinearity or just wants to scale the output into a fixed interval. In the case of a network with one hidden layer the number of neurons in the hidden layer bears no direct relation to the dimension of the input or output. The choice of the number of neurons of the hidden layer might be compared to the choice of dimensions in a PCA or PLS model. In fact, using the linearization indicated above for choosing the initial values one is led to choose the weights in the first layer by the principal component loading vectors of the training matrix X [4]. Thus one may start by fitting a principal component or PLS model to the training data, take the number of hidden neurons identical to the dimension of the principal component model and continue by training the network so specified. Typically, in problems of interpreting spectral data in chemometrics, the relation between the spectrum and dependent variables (e.g., concentrations) is theoretically linear, nonlinearities being due to phenomena like saturation, instrument effects, etc. In such, not too severely nonlinear situations an initialization by linearization may drop the
iteration time to a tenth of that required by the standard way of starting with random numbers. For more details of the initialization, see [4]. As described above, the feed-forward connected neural networks provide a natural nonlinear extension for linear models. As special cases they contain the linear and generalized linear models. As with traditional statistical models, polynomial nonlinearities can be modelled by forming new interaction variables. An advantage of the neural network models as compared to polynomial ones is that the output can be scaled to remain bounded in a given interval for all values of the input.
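To make the single-neuron case concrete, the MATLAB lines below are a minimal sketch, not the toolbox code, of the model (1) with a logit link trained by the per-sample steepest descent update (2). The training data X (m x n) and y (m x 1, scaled into (0,1)), the learning rate mu and the number of sweeps are assumptions, and the factor 2 of the derivative is absorbed into mu.

[m, n] = size(X);
w  = 0.1 * randn(n, 1);  w0 = 0;          % random initial weights and bias
mu = 0.05;
for sweep = 1:200
  for i = 1:m
    z  = X(i,:) * w + w0;
    yi = 1 / (1 + exp(-z));               % logit link f(z), equation (1)
    e  = y(i) - yi;                       % error for this sample
    g  = e * yi * (1 - yi);               % e * f'(z) for the logistic link
    w  = w  + mu * g * X(i,:)';           % steepest descent step (2) on the weights
    w0 = w0 + mu * g;                     % and on the bias
  end
end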
5.2 An application example

This example describes the results of a chemometric problem with real data. A NIR (near infrared) spectrum is used for prediction of the concentration of one chemical substance of the sample. The spectra used in the estimation are shown in Figure 4. A part of the data is used to build the model, the rest is used to test the prediction power of the model. In the following we test two model combinations: a PLS regression followed by a neural network with one hidden layer, and a multiple linear regression followed by a network with one output layer only. Figure 5 shows the prediction results using the best PLS model (best with respect to the number of principal dimensions) and a feed-forward neural network with one hidden layer. The true concentration is shown with a continuous line, the PLS model prediction with a dotted line and the neural network prediction with a dashed line.
Figure 4. The NIR spectra.
Figure 5. Predictions: the concentration data (continuous line), a neural network with one hidden layer (dashed line) and a PLS model (dash-dotted line).
The number of hidden neurons of the feed-forward neural network was 5, the same as the number of factors in the corresponding PLS model. The number of inputs, the spectral wavelengths, was 130 and the number of outputs one. The neural network model was able to predict slightly better than the linear PLS model. This can be attributed to the nonlinearity of the instrument response. The inherently nonlinear neural network model is able to compensate slight distortion better than the PLS model. The test data set contained three spectra that were outliers for both models. Figure 6 shows the relative prediction error of both models. The outliers can be identified clearly from the continuous error curve that corresponds to the PLS model. The outliers correspond to the spectra numbers 3, 7 and 10. This small prediction data set already shows some interesting properties of the nonlinear neural network model. Especially some of the outlier spectra are predicted clearly better by the nonlinear model than by the PLS model. This behaviour seems to be related to the same ability to compensate the nonlinearities of the data. The mean relative prediction error here is 2.8% for the PLS model and 1.6% for the feed-forward neural network model. So we conclude that the overall behaviour of the neural network model is better here. This conclusion has been confirmed by very analogous results from similar but independent real data examples. The same problem was also solved using multiple linear regression and the simple neural network model with one neuron, cf. Figure 2. As mentioned earlier, this model is
Figure 6. The relative prediction errors: a PLS model (continuous line) and the neural network with one hidden layer (dashed line).

Figure 7. Predictions: the concentration data (continuous line), a neural network with one neuron (dashed line) and a MLR model (dash-dotted line).
equivalent to a generalized linear model with logistic nonlinearity. The prediction results with these models using the same data sets are shown in Figure 7. The mean relative prediction error for this data set is 2.25% for multiple linear regression and 1.1% for the one-neuron model. An increasing order of the prediction performance of the different methods in this example is thus PLS, MLR, network with one hidden layer, and the 'network' with one neuron as the best. Recalling the analogy between neurons and principal components we might anticipate that the data may have one strong eigenvector that accounts for most of the useful information. In fact the largest eigenvalue of the training data set is nearly 1,000 times larger than the second largest, giving an explanation for the good performance of the simplest neural model.
References

1. Aitkin M, Anderson D, Francis B, Hinde J. Statistical Modelling in GLIM. Oxford: Clarendon Press, 1989.
2. Baldi P, Hornik K. Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks 1989; 2: 53-58.
3. Hornik K, Stinchcombe M, White H. Multilayer Feedforward Networks are Universal Approximators. Neural Networks 1989; 2: 359-366.
4. Haario H, Jokinen P. Increasing the learning speed of backpropagation by linearization. Submitted.
5. IMSL Math/Library. FORTRAN Subroutines for Mathematical Applications. IMSL Inc., January 1989.
6. IMSL Stat/Library. FORTRAN Subroutines for Statistical Applications. IMSL Inc., January 1989.
7. Kaitala S, Haario H, Kivi K, Kuosa H. Effects of Environmental Parameters on Planktonic Communities. Chemometrics and Intelligent Laboratory Systems 1989; 7: 153-162.
8. Kramer R. Chemometrics Toolbox User's Guide. The MathWorks Inc., February 24, 1989.
9. McCullagh P, Nelder JA. Generalized Linear Models. London: Chapman and Hall, 1989.
10. PC-MATLAB User's Guide. The MathWorks Inc., 1989.
11. O'Haver TC. Teaching and Learning Chemometrics with MATLAB. Chemometrics and Intelligent Laboratory Systems 1989; 6: 95-103.
12. Parker D. Learning-Logic. Invention Report S81-64, File 1. Office of Technology Licensing, Stanford University, Stanford, CA, October 1982.
13. Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL, eds. Parallel Distributed Processing. Volume 1, Chapter 8. Cambridge, MA: The MIT Press, 1986.
14. Stinchcombe M, White H. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. Proceedings of IJCNN 1989; 1: 613-617.
15. Werbos P. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis. Cambridge, MA: Harvard University, August 1974.
Data Analysis and Chemometrics
CHAPTER 13
Neural Networks in Analytical Chemistry

G. Kateman and J.R.M. Smits

Laboratory for Analytical Chemistry, Catholic University of Nijmegen, Toernooiveld, 6525 ED Nijmegen, The Netherlands
Neural networks seem to be a hot topic today. Newspapers bring new applications of neural networks nearly daily, and computers dedicated to neural networks are announced regularly. But in chemistry or chemometrics new announcements are scarce. There are a number of explanations for that phenomenon. One reason is that chemometrics is used to algorithmic solutions. Analytical chemistry has always been more an art than a science, and analytical chemists do not like that. They have always been trying to incorporate science in their trade: chemistry, physical chemistry, physics and lately mathematics. However, analytical chemistry is not that exact, as it depends on choices (of methods and strategies), interpretations (of not always clearly understood physical or chemical phenomena) and noise removal, without being able to define the type of noise properly. Statistics, applied mathematics, became quite popular during the last decades. Noise removal, pattern recognition and confidence estimation use statistical techniques, often completed with fuzzy techniques. The introduction of artificial intelligence, however, brought new possibilities. Alongside the algorithmic techniques used in mathematics, heuristics was used. Manipulation of heuristics, knowledge without apparent mathematical theory, asked for new instruments. Among the most successful techniques that entered analytical chemistry are the expert systems. These systems consist of a set of rules, a set of facts and a logical inference machine. Although these tools can handle symbolic knowledge, they still use mathematics: the IF.. THEN.. rules are derived from formal logic. So the knowledge engineer has to collect the expertise in a logical way, not quite as strictly as the algorithmic parameters should be collected, but still a gap in the knowledge of the expert causes a gap in the knowledge of the machine. Chemometricians have been searching for techniques that could supplement the algorithmic and logic tools. Among the possible candidates are genetic algorithms, clever brute force methods that use noise to find their way, and neural networks.
Neural networks are computer realizations of a model of the (human) brain. In 1943 McCulloch and Pitts [1] postulated a model of nerve action that depended on the interaction of neurons. With this model they could explain some peculiarities in the way frogs handled visual information. McCulloch also stated that his model could be simulated in a Turing machine, the theoretical precursor of the computer. The availability of real computers made the development of a working model of neural action possible in about 1957 [2]. This "perceptron", as it was called, stimulated research in the field, but a very critical theoretical treatment by Minsky and Papert in 1969 [3] drove all research underground. It was not until about 1985 that new developments were published, but since then research has been increasing very fast. Most research is in the biological and psychological field: trying to understand the way the brain works. Part of the research aims at practical applications in control (of processes and movements), image and speech processing and pattern recognition. Neurons are cells in the nerve system. They are connected to many other neurons and they are supposed to act only if they get a combination of signals from other neurons or detector cells that is above a certain threshold level. The computer model is built in the same fashion. A number of "neurons" act as input units. These units are connected to output units via a "hidden" layer of intermediate cells. All units are independent; there is no hierarchy. The network of connections between the units in the input layer, the hidden layer(s) and the output layer brings structure to the system. These networks can be altered by the user or even automatically or autonomously. A unit works as follows:
1. Incoming signals from other units (or the outside world) are weighted with a factor.
2. The signals are combined to the net unit input. Usually this is a summation of all incoming signals, multiplied by the weight factor. A threshold can be built in.
3. The new activity of the unit is some function of the net input and the old activity.
4. The output signal is some function of the activity of the unit. It can be a sigmoid function, a threshold or some statistically altered function.
5. The output signal acts as the input signal for the units that are connected with the unit that delivers the signal. The signal can also act as an output signal for the output units, and give a connection to the outside world.
In this system the weights can be altered by training. The activities and weights contain in fact the knowledge of the system. Training can be performed by applying certain rules to the adaption of the weights when a given input expects a certain output. These rules are simple and stable and do not depend on the status of the system. The adaption rules force the system iteratively to a stable situation. The whole system is thus defined by the number of units, the connection network, activity rules and weight adaption rules, and is independent of the problem that is offered to the input units, once the system has been defined. Training of the system with problems and their solutions is therefore independent
of the connection between problem and solution. This connection can be totally heuristic. Problems that could be solved algorithmically can be offered also, but usually an algorithmic solution is faster and more accurate. After the training is (nearly) completed (for instance when the difference between the expected output and the learned output becomes stable) the weighting process can be frozen. Application of a problem then results in the learned output. As the activities of the units are simple functions, the output response can be very fast. Re-starting the training is possible, though at the expense of more time. However, the training action, the adaption of the weights, is a parallel action. Newer computers that work in parallel mode can make the training much faster. Applications for neural networks can be divided in four groups:
1. Memory. Neural networks act in fact as non-organized memories. After a successful training every trained input can be recovered, without use of a systematic recovery system. Therefore the access time is short. The network is relatively insensitive to noise.
2. Pattern recognition. Contrary to statistical pattern recognition, neural networks can handle symbolic knowledge. They are fast and the training is easy.
3. Generalization. When a problem that has not been included in the training set is presented to the trained network, the network will always provide an answer. When the answer is an interpolation between the problems used for the training the result is probably quite good. The system is insensitive to exact matching. Extrapolation also gives answers but here the results are usually worse; only nice linear extrapolations give good results, but in that case extrapolations by algorithmic calculations are more precise.
4. Processing of incomplete data. The property of generalization can be used to compensate for missing data in the problem set. This property, combined with the possibility
Figure 1a & b. Network able to solve the XOR problem: a) 1 hidden unit, b) 2 hidden units (right).
Figure l c & d. Two possible solutions (left column (c) and right column ((1)) fowid by nctwork lb. (The tlucc squares in one column represent Lhc two hidden units and thc output unit from top to bottom, respectively, as function of the received input signals).
of continuous training, makes the system an ideal instrument for on-line decisions, e.g., in robotics.
The working of neural networks can be explained with some simple problems [4].
1. The XOR problem (exclusive or). The XOR problem as stated in Table 1 can be successfully trained in the network shown in Figure 1: two input units, one output unit and one hidden unit. Note that a more complicated network, using two hidden units, can fail by getting stuck in a local minimum. Although the practical use of the XOR configuration as such is small, many problems consist of XOR situations. It can be useful to know how these sub-problems can be solved (a minimal training sketch follows the tables below).
2. The encoding problem. This problem (Table 2, Fig. 2) can also be used as a basic unit, for instance as a first step in defining problems.
3. The addition problem. Although computations can be done more efficiently and faster in an algorithmic way, neural networks can be trained to learn some simple arithmetic rules, for instance adding binary numbers (Table 3, Fig. 3). For some applications this can be useful.

TABLE 1. The XOR problem.
Input patterns   Output patterns
00               0
01               1
10               1
11               0

TABLE 2. The encoding problem.
Input patterns   Output patterns
10000000         10000000
01000000         01000000
00100000         00100000
00010000         00010000
00001000         00001000
00000100         00000100
00000010         00000010
00000001         00000001
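The following MATLAB lines are a minimal, illustrative backpropagation sketch for the XOR problem of Table 1 using the two-hidden-unit network of Figure 1b; the learning rate and the number of iterations are arbitrary assumptions and, as noted above, a run may converge or end in a local minimum.

X = [0 0; 0 1; 1 0; 1 1];  T = [0; 1; 1; 0];
W1 = randn(2, 2);  b1 = randn(2, 1);      % input -> hidden weights and biases
W2 = randn(1, 2);  b2 = randn(1, 1);      % hidden -> output weights and bias
mu = 0.5;
for it = 1:20000
  for i = 1:4
    x  = X(i, :)';
    h  = 1 ./ (1 + exp(-(W1 * x + b1)));  % hidden layer activities
    o  = 1 ./ (1 + exp(-(W2 * h + b2)));  % network output
    e  = T(i) - o;
    do = e .* o .* (1 - o);               % output delta
    dh = (W2' * do) .* h .* (1 - h);      % error propagated back to the hidden units
    W2 = W2 + mu * do * h';  b2 = b2 + mu * do;
    W1 = W1 + mu * dh * x';  b1 = b1 + mu * dh;
  end
end
H = 1 ./ (1 + exp(-(W1 * X' + b1 * ones(1,4))));
Y = 1 ./ (1 + exp(-(W2 * H + b2 * ones(1,4))))    % should approximate [0 1 1 0]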
Figure 2. Network able to solve the encoding problcm.
Figure 3. Network able to solve the addition problem.
4. The negation problem. Suppose that a number of patterns has to be copied. There is only one problem. One of the bits that is part of the pattern negates the output if it is 1. The system does not "know" which bit is the negator, so it has to be trained to discover that bit and apply it correctly. This problem, as described in Table 4, can be solved by the set-up of Figure 4.
TABLE 3. The addition problem.
Input pattern   Output pattern
00 + 00         000
00 + 01         001
00 + 10         010
00 + 11         011
01 + 00         001
01 + 01         010
01 + 10         011
01 + 11         100
10 + 00         010
10 + 01         011
10 + 10         100
10 + 11         101
11 + 00         011
11 + 01         100
11 + 10         101
11 + 11         110

TABLE 4. The negation problem.
Input patterns   Output patterns
0000             000
0001             001
0010             010
0011             011
0100             100
0101             101
0110             110
0111             111
1000             111
1001             110
1010             101
1011             100
1100             011
1101             010
1110             001
1111             000

Figure 4. Network able to solve the negation problem.

Some of the described problems can be solved algorithmically, e.g., the adding problem. Others can be solved by algorithmic pattern recognition (the negation problem) or expert systems (the XOR problem). Neural networks are not the best solution for all problems, but can be compared with other ways of problem solving for certain applications. Some applications in chemistry that could not be solved by classical techniques have been subjected to neural network techniques. Elrod, Maggiora and Trenary [5] described an application in the prediction of chemical reaction product distribution. They fed their network with facts about 42 substituted benzenes and trained the network on the distribution of meta and combined ortho and para nitration products. After the training they used the frozen system to predict the behavior of 13 other compounds. The system gave 10 good answers, as good as an expert chemist.
TABLE 5a. Problems.
acid, base, Fe2+, Fe3+, Cu2+, Na+, Mg2+, Ca2+
strong, weak
high concentration (> E-2 %), medium concentration (> E-4, <= E-2 %), low concentration (<= E-4 %)
TABLE 5b. Solutions.
acid titration, base titration, phenolphthalein indicator, methyl orange indicator, gravimetry, electrogravimetry, redox titration, complexometric titration, redox indicator, metallochromic indicator, visual endpoint detection, self-indicating endpoint, photometric endpoint detection, potentiometric endpoint detection, amperometric endpoint detection, spectrophotometry, potentiometry, coulometry, voltammetry, spectrography, flame AES, ICP-AAS, AAS, high precision buret, pre-oxidation, pre-reduction
One of the main problems, which has also been identified in other uses of artificial intelligence in chemistry, is the representation of chemical information, especially molecular structures. Mostly, connection tables are used, but a system capable of representing formulas as used by organic chemists is urgently needed. Stubbs [6] described the application of a neural network to the prediction of adverse drug reactions. Applications that touch analytical chemistry are the prediction of protein structures from data on amino acids (Qian and Sejnowski [7] and Holley and Karplus [8]). Liebman [9] used a library of protein substructures to predict protein structure. He succeeded in 65 to 70%. He now tries to assign Fourier transform infrared spectral bands to specific substructures and then to predict structure and properties of unknown proteins from the spectra alone. Bos et al. [10] applied a neural network to the processing of ion selective electrode array signals. We tried the application of a neural network in a typical symbolic environment, the choice of a method of analysis using purely heuristic reasons. A trial set of 8 compounds in 3 concentration ranges was used as the problem set and 26 methods of analysis and special types of those analyses were used as the solution set (Table 5). The 23 problem/solution couples were offered in a random way, and 60 training rounds were used. Figure 5 shows that even after such a short training period many solutions are already trained sufficiently; Figure 6 shows the training speed of one problem for this application. Of course, a long way has to be gone yet before all analytical methods can be chosen with a neural network, if there is any need at all. In our laboratory Smits [11] developed a system for the analytical characterization of algae. Algae are introduced in a flow cytometer and six parameters (Table 6) are measured.
Figure 5. Results of the pattern recognition network.
Training of a neural network with these parameters and the classes of the algae used made the on-line identification of 8 algae contained in mixtures possible, with a score of about 97% in the training set and 95.5% in the test set. This result compares favourably with, e.g., the k-Nearest Neighbour pattern recognition technique. Application of neural network technology, for instance a hardware configured network, however, allows the fast and cheap application of the technique to other flow-cytometer problems.
Figure 6. Training speed of pattern recognition network.
TABLE 6. Features.
Excitation 488 nm:
1. Fluorescence 515-600 nm
2. Fluorescence 650-750 nm
Excitation 633 nm:
3. Fluorescence 650-750 nm
Light scatter:
4. Perpendicular
5. Forward
6. Time-of-flight

Another topic presently being studied in our laboratory is the application of neural networks to the structure elucidation of IR spectra. Here again the problem of structure representation hampers the on-line application.
It will be clear that neural networks do not yet give solutions to all analytical problems and will not give answers to some or many of these problems. Some reasons can be traced already. The method is by nature not exact, so the solution of problems in the arithmetic field can better be left to algorithmic programs. However, many problems in analytical chemistry are not exact: method choice, method optimization, structure elucidation, spectrum recovery, classification of results, etc. Hybrid systems can be of much use but are not yet developed for practical use. What is considered a severe problem is the "black box" nature of neural networks. Not only is there no theory that explains the behavior of neural networks, but from a psychological point of view the lack of an explanation system is much worse. Experience from the field of expert systems taught us that explanation facilities are a prerequisite to get the expert system accepted on the work bench. The user wants to know how the system came to the advice, even
if the advice is better than can be expected from the user. A neural network will never be able to give an explanation of how it came to its advice, by the nature of its construction. The advance of parallel computers, hardware or software implemented, makes it plausible that research in neural networks will grow and that applications in analytical chemistry will be found.
References

1. McCulloch WS, Pitts WH. Bull Math Biophys 1943; 5: 115-133.
2. Rosenblatt F. Principles of Neurodynamics. Washington DC: Spartan Books, 1962.
3. Minsky M, Papert S. Perceptrons: An Introduction to Computational Geometry. Cambridge: MIT Press, 1969.
4. Rumelhart DE, McClelland JL, PDP Research Group. Parallel Distributed Processing. Chapter 8. Cambridge: MIT Press, 1986.
5. Elrod DW, Maggiora GM, Trenary RG. C&EN, April 24, 1989, 25.
6. Stubbs DF. C&EN, April 24, 1989, 26.
7. Qian N, Sejnowski TJ. J Mol Biol 1988; 202: 865.
8. Holley LH, Karplus M. Proc Natl Acad Sci 1989; 86(1): 152.
9. Liebman MN. C&EN, April 24, 1989, 27.
10. Bos M, Bos A, van der Linden WE. Anal Chim Acta, in press.
11. Smits JRM, Breedveld LW, Derksen MWJ, Kateman G. Applications of neural networks in chemometrics. Proc Int Neural Netw Conf, Paris, 1990.
CHAPTER 14
Electrodeposited Copper Fractals: Fractals in Chemistry

D.B. Hibbert

Department of Analytical Chemistry, University of New South Wales, P.O. Box 1, Kensington, New South Wales, 2033, Australia
1. Introduction

A fractal (a term coined by B.B. Mandelbrot [1]) is a geometrical figure having a dimension greater than its topological dimension. Thus a wiggly protein molecule fills more space than would be calculated by assuming a one-dimensional piece of string; the roughness of a catalytic surface may be interpreted as a fractal of dimension greater than the value of two usually reserved for flat sheets. Indeed all of nature becomes populated with these fractal objects, and nothing more so than chemistry. This paper will contain a brief introduction to fractals and will review the impact they have had on chemistry. I shall also describe in more detail our work on electrodeposited copper trees and their relations to fractals.
2. Fractal geometry

A fractal is characterised by a dimension. This number is a property of the system, and also a property of the way it is measured. The most quoted dimension, known as the fractal dimension or Hausdorff-Besicovitch dimension (D), is related to the way the object fills its embedding space. The spectral or fracton dimension D_s is a quantity that depends on the connectivity of a system and thus on the branching found in the fractal. A consequence of the notion of fractal dimension is that of self similarity. A fractal is self similar if its shape is the same whatever the overall size. As the scale of the object contracts or expands its shape maintains the same changes. A self affine fractal does this by different amounts in different directions. A popular example of a natural fractal is the answer to the question "how long is the coastline of Australia?". The length of the coastline is measured by taking a ruler round a map of Australia and multiplying the ruler length by the scale of the map. However, as larger and larger scale maps are used, longer and longer coastlines are measured. This occurs because at higher scales more features can be measured. For example a river reaching the sea will be seen on a scale of 100 km/cm, but not on 1,000 km/cm. The
astounding thing is that if the logarithm of the length is plotted against the logarithm of the scale of the map, a straight line is obtained over many orders of magnitude of scale. As with all real examples the range over which the object is fractal is finite with cutoffs at large scales (comparable with the size of Australia) and small scales (a few metres).
2.1 The measurement of D

The box counting method is widely used in determining the fractal dimension. In it a measure is made of the number of squares needed to cover an area as a function of the size of a single square. The object is covered with N(ε) squares of side ε and it is found that N(ε) scales as ε^(-D). If D is the same as the topological dimension the system is known as Euclidean and is not considered to be fractal. There is also an upper bound on D, the embedding dimension, which for a real object is the three dimensional Euclidean space we find ourselves in. In a refinement the squares may be replaced by circles, when we find that the number scales as r^(-D) where r is the radius of the circle. In the mass-radius method a series of circles of increasing radius is constructed and the amount of the object falling within a circle determined. It is found that the amount M(R) within radius R scales as R^D. This latter method is used for determining the fractal dimension of copper electrodeposits later in this paper. The spectral dimension D_s can be defined in terms of a random walk on the object, when the probability of returning to the origin in time t decreases as t^(-D_s/2).
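As an illustration of the mass-radius method, the MATLAB lines below estimate D from a list of deposit point coordinates; the coordinate matrix xy (centred on the cathode) and the number of radii are assumptions, not data from this work.

r = sqrt(xy(:,1).^2 + xy(:,2).^2);                        % distance of each point from the centre
R = logspace(log10(min(r(r>0))), log10(max(r)), 15)';     % radii spanning the growth
M = zeros(size(R));
for k = 1:length(R)
  M(k) = sum(r <= R(k));                                  % "mass" inside radius R(k)
end
p = polyfit(log(R), log(M), 1);                           % straight line on the log-log plot
D = p(1)                                                  % slope = mass-radius fractal dimension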
3. Fractals in chemistry

It is not possible to cover all aspects of chemistry that have been studied in terms of their fractal properties in the space of this review, and the examples given should be seen as a non-representative sample. The references should lead to further reading in areas of interest [2].
3.1 Fractal materials - polymers, aggregates and colloids
Properties of polymers have long been shown to exhibit power-law dependences which in recent times have been reinterpreted in terms of fractals. The process of the growth of a polymer may be seen as a self-avoiding random walk in the medium of the monomer. The number of units within a radius R scales according to the fractal dimension D as R^D. Flory [3] proposed a simple free energy model that predicted D to be 5/3, with reasonable agreement to experiment. From this approach other properties may be determined. For example the osmotic pressure of a solution of a linear polymer scales as the length of the polymer to the power 4/5 [4].
A particular example of the influence of fractal dimension on properties of polymers and gels is in chromatography, where pore size distribution and related properties may be probed by chromatographic techniques. Proteins have also been shown to have fractal regimes, both of the protein itself [5] and of its surface [6]. A diffusion limited aggregate (DLA) has a fractal dimension of 1.7 and is one of the most ubiquitous objects in nature. The field in which diffusion limited aggregation occurs is Laplacian, and this is common to a variety of situations that lead to tree-like forms. Computer simulations [7] have proved very effective in probing this phenomenon. Typically a random walker of a finite size is started some distance from a central point. When it reaches the point it sticks and a second walker starts. The structure that grows is a tree. The arms of the tree harvest walkers and screen the interior, leaving holes in the object and giving rise to the familiar shape. Experiment and simulation coincide for a number of realisations of DLA such as fluid displacement in thin cells [8] and in porous media [9], dielectric breakdown [10], dissolution of porous materials [11] and electrodeposition [12, 13]. Typical fractal relations that are found are that the radius of gyration scales as the cluster size to the power 1/D, and the correlation function scales as r^(D-d), where d is the embedding dimension. Interesting effects arise when the sticking rules for a walker are changed. A noise reduction parameter (m) is introduced, that is, walkers must strike a given site a certain number of times before one is allowed to stick [14]. This creates so-called lattice anisotropy, with growth in one direction being favoured. By m = 3 an obvious direction to the growths is observed. It is proposed that the noise reduction parameter has an effect similar to that of increasing the size of the cluster [15]. Ultimately a snowflake-like pattern can be produced [16].
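For readers who wish to reproduce the qualitative behaviour, the following MATLAB lines give a minimal on-lattice DLA sketch in the spirit of the simulations cited [7]; the lattice size, launch radius and number of walkers are arbitrary assumptions.

L = 201;  c = 101;                      % lattice size and its centre
A = zeros(L);  A(c,c) = 1;              % seed particle at the centre
for p = 1:800
  th = 2*pi*rand;
  x = round(c + 80*cos(th));  y = round(c + 80*sin(th));   % launch circle
  while 1
    s = ceil(4*rand);                   % random step: +x, -x, +y or -y
    if s == 1, x = x + 1; elseif s == 2, x = x - 1;
    elseif s == 3, y = y + 1; else, y = y - 1; end
    if x < 2 | x > L-1 | y < 2 | y > L-1, break, end       % walker lost, launch a new one
    if A(x+1,y) | A(x-1,y) | A(x,y+1) | A(x,y-1)           % neighbour occupied?
      A(x,y) = 1;  break                                   % stick, start the next walker
    end
  end
end
spy(A)                                  % display the aggregate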
3.2 Fractals of surfaces

Avnir and Pfeifer [17] showed that the measurement of the surface of a solid by molecules of increasing size is a process akin to the box counting method of measuring fractal dimension. The number of molecules (N) of cross-sectional area S needed to cover the surface goes as N = k S^(-D/2), where D is the surface fractal dimension. In terms of the apparent surface area, A = L_A N S ∝ S^((2-D)/2), where L_A is Avogadro's constant. Adsorbing series of hydrocarbons on a variety of surfaces have found values of D ranging from near 2 (graphitised carbon blacks) to almost 3 (silica-60). The fractal concept has been extended to chemical reactions on surfaces. For dispersed metal particles it is found that the activity per particle scales with the radius of the particle (R) as R^D, where D is a reaction dimension [18]. D < 2 is interpreted in terms of reaction at fractally distributed active sites.
3.3 Reaction kinetics

Kopelman [19] has investigated the kinetics of reactions on a fractal surface. Simulations of the reactions A + B and A + A on a Sierpinski gasket (D_s = 1.365) show anomalous reaction orders at the steady state. For A + A the rate is k[A]^h (where h = 1 + 2/D_s), which has been confirmed by Monte Carlo simulation [20]. The rate coefficient becomes time dependent, k = k_0 t^(-h). As a realisation of this scheme the photodimerisation of naphthalene has been studied in porous media [21]. The reaction A + B shows segregation on a fractal lattice [22].
3.4 Fractal electrodes

Le Méhauté [23] has addressed the problem of fractal electrodes and their electrochemistry by the TEISI (Transfert d'Energie sur Interface à Similitude Interne) model. The system is probed by ac impedance methods. Battery electrodes are often fractal and this is seen in the available power and in self-affine discharge curves [24].
4. Electrodeposition

Electrodeposition has been of much current interest as an experimental example of growth governed by a field that may be controlled [12, 13, 25-31]. At low voltages diffusion governs growth and in two dimensions it is possible to realise the DLA limit of D = 1.7 [12, 28] in agreement with computer simulations [7]. At higher voltages migration in the electric field becomes important and the growth changes from an open DLA-like structure to a dense radial form having more arms but less branching [26, 27, 13]. The first report of a fractal deposit [12] inferred D from measurements of current and voltage. The observed deposit in three dimensions was a spongy growth with no obvious scaling structure. This work was soon followed by planar deposits in thin Hele-Shaw cells [25] where the tree-like pattern of dendrites could easily be observed. One drawback of this method lies in the extreme fragility of the deposits. Their form may be observed during growth, but the mechanical stability is poor. Electron micrographs of a deposit are not possible, nor would it be possible to replace the electrolyte and perform chemical experiments on the fractal. We have overcome this problem by supporting the growth in filter paper [13, 30, 31].
4.1 Experimental method of paper support

In a typical experiment laboratory filter paper is soaked in an acidified solution of a salt of the metal to be deposited (e.g., 0.75 M CuSO4, 1 M H2SO4), with a reservoir to keep the paper moist. A cathode wire is placed centrally to an anode ring and a voltage,
Figure 1. A growth on filter paper from 0.75 M copper sulphate, 1 M sulphuric acid and applied potential 0.7 V. The fractal dimension of the deposit is 1.8.
typically 0.5 V to 25 V, applied. By positioning copper reference electrodes in contact with the paper it is possible to monitor the potential and resistance of the growing deposit and thus determine factors due to migration and electron transfer. If a filter paper with growth is removed from the apparatus and immersed in a solution of sodium hydroxide, copper ions in the paper are immobilised as blue copper hydroxide. We have used this technique to develop the diffusion layer around the growth [30].
4.2 Morphology of electrodeposits

The basic tree-like shape of a deposit is shown in Figure 1. At the lowest voltages and concentrations of metal ions the deposit is open and has a fractal dimension of 1.7, consistent with the DLA form. As the voltage is increased the deposit becomes more compact with fewer arms and D increases to 2 [13]. At very high voltages a stringy deposit grows. Microscopically (one to one hundred microns) three forms are generally seen. At low voltages a mossy, granular deposit is formed (Fig. 2) with occasional regular dendrites (Fig. 3). As the voltage is raised, microscopic trees become more bushy in an irregular microcrystalline habit (Fig. 4). At the highest voltages, when hydrogen evolution is vigorous, the deposit becomes porous due to the disruption of the growth by gas evolution. It is interesting that the growth form that gives the most open structure at the cm scale is the most compact microscopically.
Figure 2. Electron micrograph of a mossy deposit grown at 0.5 V.
Figure 3. Electron micrograph of a dendrite grown at 0.5 V.
4.3 Current, voltage, radius, time relations

An ohmic model of the growth may be used to express the evolution in time of the current and diameter. The diameter of the envelope of a growth varies with time as d = a t^b where b = 1/D. Experimentally, for low applied voltages, b was found to be near 0.6, which is consistent with a DLA fractal. The implication that the current is constant [25] does not hold and it is necessary to take account of the difference between the resistance of the growth and that of the electrolyte-soaked paper, ignoring the overpotential at the cathode. It is possible to obtain the expression for the current at growth radius L, I(L):
Figure 4. Electron micrograph of an irregular microcrystalline deposit grown at 5.0 V.
1 / I(L) = A (1 - r_g/r_p) ln(L_a/L) + B     (1)
A and B are constants, r_g and r_p are the resistivities of growth and paper respectively, L_0 is an initial radius of the cathode and L_a is the radius of the anode. Equation 1 may be tested, and Figure 5 shows a plot of 1/I(L) against ln(L). The ratio r_g/r_p determined from Figure 5 is consistent with measured resistances of paper and growth.
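A straight-line fit of this kind can be sketched in MATLAB as follows; the vectors L and I of measured radii and currents are assumed to be available from an experiment, and the ratio r_g/r_p then follows from the fitted slope once the cell constants are known.

x = log(L(:));  y = 1 ./ I(:);
p = polyfit(x, y, 1);                  % p(1) = slope, p(2) = intercept of the fit
plot(x, y, 'o', x, polyval(p, x))      % data and fitted line on the 1/I versus ln(L) plot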
Figure 5. Graph of 1/I(L) against ln(L) for a growth at 15 V showing the fit to equation (1).
Figure 6. Graph of radius (L) of a growth against time (points). Applied voltage = 25 V in a Hele-Shaw cell. Solid line is a fit to the model given in the text.
Transport determines the change in concentration of species in the vicinity of each electrode and thus the velocity of the growing interface. We propose that the rate of advance of the deposit equals the rate of transport of anion, from which the interfacial radial velocity v(L) at radius L is given by

v(L) = t_- I(L) / (2π L δ z_+ F C)     (2)
where t_- is the transport number of the anion, C is the concentration of metal ion of charge z_+, δ is the cell thickness, and F is Faraday's constant. It may also be established that v(L) = u_- E(L), where u_- is the mobility of the anion and E(L) the electric field at L. Figure 6 is a plot of L against time for a copper sulphate solution compared to a theoretical line from equations 1 and 2. The ratio r_g/r_p has been used as the only variable in the model, and t_- was calculated from mobility values. A better fit is obtained if t_- is corrected for the ionic strength of the solution by the Debye-Hückel-Onsager equation. The above discussion rests on the assumption that the overpotential contribution to the overall voltage is constant. We have measured the potential of the growth against a copper reference electrode (Fig. 7). As the growth approaches the reference electrode the potential falls as the lower resistance growth replaces the higher resistance electrolyte. As it contacts the reference electrode there is a drop that is ascribed to the overpotential. Over a wide range of applied voltages this overpotential is 0.3 to 0.6 V. The overpotential enters the current equation in an exponential and thus only small increases in overpotential are sufficient to cope with the observed increase in current.
Figure 7. Voltage (versus copper reference electrode) against time for a deposit grown at 1 V in filter paper support. The sharp fall occurs when the growth contacts the reference electrode.
4.4 Resistance and diffusion layer

The dense radial form, seen at higher voltages, has a nearly circular envelope which prompts theories of the stability of such a structure. Grier et al. [32] proposed that the resistance of the growth provides the stability. The inference that the resistivity of the growth increases with applied voltage was in part borne out by our work [31] for voltages up to 2 V. Beyond 2 V, however, the resistance of the growth fell. The resistance may be expected to rise with voltage as the resistivity of the growth increases, going from the mossy and dendritic forms at low voltage to the irregular microcrystalline form at higher voltages. However, as the voltage rises still further, the proliferation of arms and other extensive factors cause a fall in resistance. It must be noted that any measurement of resistance is done in contact with the electrolyte and so there is always a solution component. Observation of the concentration gradient of copper ions around the growth shows that the layer touches the growing tips at high voltages, but there is a discernible region in which there is an absence of copper at lower voltages. There are no copper ions within the growth.
4.5 Morphological transitions with changing voltage

A curious metamorphosis occurs when the voltage of a growing deposit is switched down
from a high voltage (e.g., 3 V) to a low voltage. Above a threshold value for the lower voltage of about 0.8 V the growth continues by adopting the habit of the new voltage,
leaving the more dense core intact. Below 0.8 V the growth reorganises from the interior, dissolving and replating until the entire deposit is of the more open form. The threshold voltage coincides with the onset of evolution of hydrogen, which is proposed to be of importance in stabilising the growth form. Current still flows during this time but is now distributed both within the growth and at the outer interface. Some branches become cut off during this process and we observe that the maximum dissolution occurs some way inside the growth. The use of sodium hydroxide to develop the presence of copper ions reveals that the interior contains copper ions. This is in contrast to normal growth at any potential, when copper ions are restricted to the paper between the growth and the anode. SEM analysis of the interior of the growth shows mossy growths typical of low voltage deposits on and around the microcrystalline trees of the higher voltage form. These are not directed outwards and we conclude they result from copper ions released during dissolution of the interior. The electrolyte is at near pH 0 and hydrogen evolution would be expected for potentials of the cathode less than about -0.34 V. Hydrogen evolution may be directly observed by placing a sheet of glass beneath the filter paper. Bubbles of gas are observed trapped between paper and glass. The threshold for such an observation is 0.75 V, which coincides with the threshold for the dissolution of the 3.0 V growth. The high resistance of the growth (typically a few hundred ohms) causes a voltage drop along a growing arm. There is therefore the possibility of a local cell in which copper corrodes from the ends of arms and is plated nearer the cathode connection. The presence of hydrogen evolution from the interior of the growth could prevent this process.
4.6 The Hecker transition
Recently a new phenomenon, sometimes called the Hecker transition, has been observed in Hele-Shaw type cells [33, 34]. In the dense radial regime, at about half way to the anode, the growth shows a clear change to a form with a greater density of branches. The transition is independent of applied voltage and concentration of salt. We have repeated this work and confirm the invariance of the transition [35]. If the cell is placed in a bath of electrolyte and the anode removed into the bath, no transition is seen even at the half way point. An acid front moves out from the anode that may be associated with Cu2+ production (Cu(H2O)4^2+ -> Cu(H2O)3OH^+ + H^+). The transition may be induced by the addition of pH 3 acid. An estimate of the point at which the growth meets the incoming acid front (L_H) may be made in terms of the anode radius L_A and the mobilities of the proton (u_H) and the anion (u_-). For the sulphate anion we calculate L_H = 0.45 L_A, in agreement with experiment. The
front may be made visible by adding methyl orange or bromophenol blue and it is confirmed that the transition occurs as the growth meets the incoming acid front.
5. Conclusion
Fractals are found throughout chemistry, in materials, in the kinetics of reactions and in the growth and movement of molecules. Much of the work has been descriptive, but it is now becoming of use in predicting behaviour and in helping our understanding of chemistry.
References
1. Mandelbrot BB. The Fractal Geometry of Nature. San Francisco: Freeman, 1982.
2. Avnir D, ed. The Fractal Approach to Heterogeneous Chemistry. Chichester: J. Wiley, 1989.
3. Flory PJ. Principles of Polymer Chemistry. Ithaca: Cornell University Press, 1953.
4. Stanley HE. Cluster shapes at the percolation threshold: an effective cluster dimensionality and its connection with critical point exponents. J Phys A 1977; 10: L211-L220.
5. Elber R, Karplus M. Low frequency modes of proteins. Use of the effective medium approximation to interpret the fractal dimension observed in electron spin relaxation measurements. Phys Rev Lett 1986; 56: 394-397.
6. Pfeifer P, Weiz U, Wippermann H. Fractal surface dimension of proteins: lysozyme. Chem Phys Lett 1985; 113: 535-545.
7. Witten TA, Sander LM. Diffusion limited aggregation. Phys Rev B 1983; 27: 5686-5697.
8. Kadanoff LP. Simulating hydrodynamics: a pedestrian model. J Stat Phys 1985; 39: 267-283.
9. Patterson L. Diffusion limited aggregation and two fluid displacements in porous media. Phys Rev Lett 1984; 52: 1621-1624.
10. Niemeyer L, Pietronero L, Wiesmann AJ. Fractal dimension of dielectric breakdown. Phys Rev Lett 1984; 52: 1033-1036.
11. Daccord G. Chemical dissolution of a porous medium by a reactive fluid. Phys Rev Lett 1987; 58: 479-482.
12. Brady RM, Ball RC. Fractal growth of copper electrodeposits. Nature (Lond.) 1984; 309: 225-229.
13. Hibbert DB, Melrose JR. Copper electrodeposits in paper support. Phys Rev A 1988; 38: 1036-1048.
14. Nittmann J, Stanley HE. Tip splitting without interfacial tension and dendritic growth patterns arising from molecular anisotropy. Nature (Lond.) 1986; 321: 663-668.
15. Meakin P. Computer simulation of diffusion limited aggregation processes. Faraday Discuss Chem Soc 1987; 83: 113-124.
16. Sander LM. Fractal growth. Scientific American 1987; 256: 82-88.
17. Avnir D, Farin D, Pfeifer P. Chemistry in non-integer dimensions between two and three. III. Fractal surfaces of adsorbents. J Chem Phys 1983; 79: 3566-3571.
18. Farin D, Avnir D. The reaction dimension in catalysis on dispersed metals. J Am Chem Soc 1988; 110: 2039-2045.
19. Kopelman R. Fractal reaction kinetics. Science 1988; 241: 1620-1626.
20. Kopelman R. Rate processes on fractals: theory, simulation and experiment. J Stat Phys 1986; 42: 185-200.
21. Klymko PW, Kopelman R. Fractal reaction kinetics: exciton fusion on clusters. J Phys Chem 1983; 87: 4565-4567.
22. Anacker LW, Kopelman R. Steady state chemical kinetics on fractals: segregation of reactants. Phys Rev Lett 1987; 58: 289-291.
23. Le Mehaute A, Crepy G. J Solid State Ionics 1983; 9/10: 17.
24. Fruchter L, Crepy G, Le Mehaute A. Batteries, identified fractal objects. J Power Sources 1986; 18: 51-62.
25. Matsushita M, Sano M, Hayakawa Y, Honjo H, Sawada Y. Fractal structures of zinc metal leaves grown by electrodeposition. Phys Rev Lett 1984; 53: 286-289.
26. Grier D, Ben-Jacob E, Clarke R, Sander LM. Morphology and microstructure in electrodeposition of zinc. Phys Rev Lett 1986; 56: 1264-1267.
27. Sawada Y, Dougherty A, Gollub JP. Dendritic and fractal patterns in electrolytic metal deposits. Phys Rev Lett 1986; 56: 1260-1263.
28. Kaufmann JH, Nazzal AI, Melroy O, Kapitulnik A. Onset of fractal growth: statics and dynamics of diffusion controlled polymerisation. Phys Rev B 1987; 35: 1881-1890.
29. Ball RC. In: Stanley HE, Ostrowsky N, eds. On Growth and Form. Dordrecht: Martinus Nijhoff, 1986: 69.
30. Hibbert DB, Melrose JR. Electrodeposition in support: concentration gradients, an ohmic model and the genesis of branching fractals. Proc Roy Soc London 1989; A423: 149-158.
31. Melrose JR, Hibbert DB. The electrical resistance of planar copper deposits. Phys Rev A 1989; 40: 1727-1730.
32. Grier DG, Kessler DA, Sander LM. Stability of the dense radial morphology in diffusive pattern formation. Phys Rev Lett 1987; 59: 2315-2319.
33. Sander LM. In: Guttinger W, Dangelmayr G, eds. The Physics of Structure Formation. Berlin: Springer-Verlag, 1987.
34. Garik P, Barkey D, Ben-Jacob E, Bochner E, Broxholm N, Miller B, Orr B, Zamir R. Laplace and diffusion field growth in electrochemical deposition. Phys Rev Lett 1989; 62: 2703-2705.
35. Melrose JR, Hibbert DB, Ball RC. Interfacial velocity in electrochemical deposition and the Hecker transition. Phys Rev Lett, in press, 1990.
CHAPTER 15
The Use of Fractal Dimension to Characterize Individual Airborne Particles
P.K. Hopke1, G.S. Casuccio2, W.J. Mershon2, and R.J. Lee2
1Department of Chemistry, Clarkson University, Potsdam, NY 13699-5810, USA and 2R.J. Lee Group, 350 Hochberg Road, Monroeville, PA 15146, USA
Abstract
The computer-controlled scanning electron microscope has the ability to characterize individual particles through the fluoresced x-ray spectrum for chemical composition data and through image analysis to provide size and shape information. This capability has recently been enhanced by the addition of the capacity to capture individual particle images for subsequent digital processing. Visual texture in an image is often an important clue to the experienced microscopist as to the nature or origin of the particle being studied. The problem is then to provide a quantitative measure of the observable texture. The use of fractal dimension has been investigated to provide a single number that is directly related to observed texture. The fractal dimension of the object can easily be calculated by determining cumulative image properties of the particle such as perimeter or area as a function of magnification. Alternative methods that may be computationally simpler have also been explored. The fractal dimension for a variety of particles of varying surface characteristics has been determined using these different computational methods. The results of these determinations and the implications for the use of fractal dimension in airborne particle characterization are presented.
1. Introduction
Computer-Controlled Scanning Electron Microscopy (CCSEM) with its associated x-ray fluorescence analysis system is capable of sophisticated characterization of a statistically significant number of individual particles from a collected particulate matter sample. This characterization includes elemental analysis from carbon to uranium and, along with scanning and image processing, can analyze an individual particle in less than 2 seconds, although longer analysis times are often needed for more complete particle analysis. This
automated capability greatly enhances the information that can be obtained on the physical and chemical characteristics of ambient or source-emitted particles. Recently, improved data analysis methods have been developed to make use of the elemental composition data obtained by CCSEM [1, 2]. However, we are currently not taking full advantage of the information available from the instrument. In addition to the x-ray fluorescence spectrum and the size and shape data available from the on-line image analysis system, it is now possible to store the particle image directly. Major improvements in automated particle imaging by the RJ Lee Group now allow the automatic capture of single particle images. Thus, 256 gray level, 256 pixel by 256 pixel images of a large number of particles can be easily obtained for more detailed off-line analyses of shape and particle texture. These analyses may provide much clearer indications of the particle's origin and what has happened to it in the atmosphere. In this report, one approach, fractal analysis, will be applied to a series of images of particles with known chemical compositions in order to explore the utility of the fractal dimension for characterizing particle texture.
2. Fractal analysis
The texture of a surface is produced by its material components and the process by which it is formed. A technique called fractal analysis permits detailed quantification of surface texture comparable to the detail perceived by the human eye. Although the fractal technique is mathematically complex, it is simpler than a Fourier analysis and it is well suited to CCSEM. The fractal dimension of the surface is characteristic of the fundamental nature of the surface being measured, and thus of the physics and chemistry of that formation process. It is the objective of this feasibility study to determine whether or not this concept is applicable to the characterization of individual particles and to their classification into groups that can further be examined. Fractal analysis is based on the fact that as a surface is examined on a finer and finer scale, more features become apparent. In the same way, if an aggregated property of a surface such as its surface area is measured, a given value will be obtained at a given level of detail. If the magnification is increased, additional surface elements can be viewed, adding to the total surface area measured. Mandelbrot [3] found, however, that the property can be related to the measurement element as follows
where P(E) is the value of a measured property such as length, area, volume, etc., E is the fundamental measurement element dimension, k is a proportionality constant, and D is the fractal dimension. This fractal dimension is then characteristic of the fundamental
nature of the surface being measured, and thus of the physics and chemistry of that formation process.
A classic example of understanding the fractal dimension is that of determining the length of a boundary using a map. If one starts with a large scale map and measures, for instance, the length of England's coastline, a number is obtained. If a map of finer scale is measured, a larger number will be determined. Plotting the log of the length against the log of the scale yields a straight line. Mandelbrot has demonstrated how one can use the idea of fractal dimensionality in generating functions that can produce extremely realistic looking "landscapes" using high resolution computer graphics. The reason these pictures look real is due to their inclusion of this fractal dimensionality, which provides a realistic texture. Thus, it can be reasonably expected that from appropriate image data incorporating the visual texture, a fractal number can be derived that characterizes that particular textural pattern [4].
The fractal dimension of the surface can be determined in a variety of ways. For example, the length of the perimeter can be determined for each particle by summing the number of pixels in the edge of the particle at each magnification. The fractal dimension can then be obtained as the slope of a linear least-squares fit of the log(perimeter) to the log(magnification). Alternatively, the fractal dimension of each image could be calculated by determining the surface area in pixels at each magnification and examining the log(area) against the log(magnification). In addition, the fractal dimension can be determined from the distributions of the intensities of pixels a given distance away from each given pixel as described by Pentland [4].
Another approach for the surface fractal dimension determination has been suggested by Clarke [5]. Clarke's method is to measure the surface area of a rectangular portion of an object by approximating the surface as a series of rectangular pyramids of increasing base size. First, the pyramids have a base of one pixel by one pixel with the height being the measured electron intensity. The sum of the triangular sides of the pyramid is the surface area for that pixel. Then a two pixel by two pixel pyramid is used, followed by a four by four pixel pyramid, until the largest 2^n size square is inscribed in the image. The slope of the plot of the log of the surface area versus the log of the area of the square is 2 - D, where the slope will be negative. Clarke's program was written in the C computer language. It has been translated into FORTRAN and tested successfully on the data sets provided in Clarke's paper. This approach appears to be simple to use and computationally efficient and will be the primary method used for this study.
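The final step of the Clarke procedure reduces to a simple least-squares fit. The free-form Fortran sketch below is not the authors' translated program; the surface-area values are hypothetical and serve only to show how log(surface area) is regressed on log(unit-pyramid base area) to give D = 2 - slope.

  program fractal_slope
    implicit none
    integer, parameter :: n = 5
    ! Hypothetical measurements: area of the unit-pyramid base (pixels**2)
    ! and the corresponding total surface area of the inscribed square.
    real :: base(n) = (/ 1.0, 4.0, 16.0, 64.0, 256.0 /)
    real :: area(n) = (/ 9800.0, 9100.0, 8300.0, 7600.0, 7000.0 /)
    real :: x(n), y(n), sx, sy, sxx, sxy, slope, d
    x = log10(base)
    y = log10(area)
    ! Ordinary least-squares slope of log(area) against log(base area).
    sx = sum(x); sy = sum(y); sxx = sum(x*x); sxy = sum(x*y)
    slope = (n*sxy - sx*sy) / (n*sxx - sx*sx)
    ! Clarke's relation: the slope equals 2 - D (and is negative), so
    d = 2.0 - slope
    print '(a,f8.4)', 'slope               = ', slope
    print '(a,f8.4)', 'fractal dimension D = ', d
  end program fractal_slope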
3. Results and discussion
To test the ability of fractal dimension to distinguish among several different particle compositions, secondary electron images were obtained for several particles each of sodium chloride, sodium sulfate, and ammonium sulfate. In addition, images of several
particles whose composition was not provided to the data analyst were also taken. The particles analyzed are listed in Table 1. To illustrate the Clarke method, Figure 1 shows a 16 level contour plot of the secondary electron intensities from an image of an ammonium bisulfate particle observed at a magnification of 180. The rectangle inscribed in the particle is the area over which the fractal dimension was determined. The log(surface area) is plotted against the log(area of the unit pyramid) for the four particle types in Figures 2-5. The results of the fractal dimension analysis are also given in Table 1. The "unknown" compound was (NH4)HSO4.

Figure 1. Image of an NH4HSO4 particle (left) showing the inscribed rectangle within which the surface area is determined (right).

TABLE 1
Particles analyzed for fractal dimension.

No.  Particle Type  Magnification  Fractal Dimension  Average Value
1    Na2SO4         180            2.41 +/- 0.01
1    Na2SO4         1,000          2.44 +/- 0.01
1    Na2SO4         5,000          2.33 +/- 0.02
2    Na2SO4         250            2.42 +/- 0.01
2    Na2SO4         1,000          2.42 +/- 0.01
3    Na2SO4         100            2.36 +/- 0.02
3    Na2SO4         1,000          2.43 +/- 0.01      2.412 (a)
4    (NH4)2SO4      150            2.28 +/- 0.02
5    (NH4)2SO4      330            2.23 +/- 0.02
5    (NH4)2SO4      3,500          2.19 +/- 0.02
6    (NH4)2SO4      270            2.26 +/- 0.02
6    (NH4)2SO4      1,000          2.20 +/- 0.02      2.233 (b)
7    NaCl           330            2.27 +/- 0.01
7    NaCl           1,000          2.27 +/- 0.01
8    NaCl           330            2.29 +/- 0.01
8    NaCl           2,500          2.13 +/- 0.01
9    NaCl           350            2.27 +/- 0.02
9    NaCl           3,000          2.15 +/- 0.01      2.275 (c)
10   Unknown 1      275            2.36 +/- 0.02
10   Unknown 1      8,500          2.25 +/- 0.02
11   Unknown 2      1,600          2.38 +/- 0.04
11   Unknown 2      7,000          2.30 +/- 0.03
12   Unknown 3      2,000          2.37 +/- 0.03
12   Unknown 3      7,000          2.29 +/- 0.03      2.370 (d)

a) Excludes Particle 1 at 5,000x and Particle 3 at 100x.
b) Excludes Particle 5 at 3,500x and Particle 6 at 1,000x.
c) Excludes Particle 8 at 2,500x and Particle 9 at 3,000x.
d) Excludes Particle 10 at 8,500x and Particles 11 and 12 at 7,000x.

It can be seen that there is good agreement of the fractal dimensions determined for a
series of similar composition particles, with some notable exceptions. Most of the high magnification views show dimensions significantly lower than those obtained at lower magnification. It is not yet clear why this result has been obtained, but there appears to be a potential systematic variation related to image magnification. However, for the set of images taken at a magnification such that they fill approximately 60% of the screen area, there is a consistent pattern of fractal dimension for each chemical compound. There is a sufficiently small spread in fractal dimension values that might be useful in conjunction
Figure 2. Plot of log(surface area) against log(unit pyramid) for the three Na2SO4 particles at each magnification.

Figure 3. Plot of log(surface area) against log(unit pyramid) for the three (NH4)2SO4 particles at each magnification.

Figure 4. Plot of log(surface area) against log(unit pyramid) for the three NaCl particles at each magnification.

Figure 5. Plot of log(surface area) against log(unit pyramid) for the three (NH4)2SO4 particles at each magnification.
with the fluoresced x-ray intensities in classifying the particles into types for use in a particle class balance [2] where additional resolution is needed beyond that provided by the elements alone. These results certainly provide a stimulus for further study. It is clear that there is a relationship between the measured fractal dimension and the texture observed in the images. Further work is needed to determine a method for routinely obtaining reliable fractal dimensions from secondary electron images.
Acknowledgements
The work at Clarkson University was supported in part by the National Science Foundation under Grant ATM 89-96203.
References
1. Kim DS, Hopke PK. The classification of individual particles based on computer-controlled scanning electron microscopy data. Aerosol Sci Technol 1988; 9: 133-151.
2. Kim DS, Hopke PK. Source apportionment of the El Paso aerosol by particle class balance analysis. Aerosol Sci Technol 1988; 9: 221-235.
3. Mandelbrot B. The Fractal Geometry of Nature. San Francisco, CA: W.H. Freeman and Co., 1982.
4. Pentland AP. Fractal-based description of natural scenes. SRI Technical Note No. 280, SRI International, Menlo Park, CA, 1984.
5. Clarke KC. Computation of the fractal dimension of topographic surfaces using the triangular prism surface area method. Computers & Geosci 1986; 12: 713-722.
CHAPTER 16
Use of a Rule-Building Expert System for Classifying Particles Based on SEM Analysis
P.K. Hopke1 and Y. Mi2
1Department of Chemistry, Clarkson University, Potsdam, NY 13699-5810, USA and 2Environmental Engineering Program, Department of Civil Engineering, University of Southern California, Los Angeles, CA 90089-0231, USA
Computer-controlled scanning electron microscopy (CCSEM) permits the characterization of size, shape, and composition of individual particles and thus provides a rich source of data to identify the origin of ambient airborne particles. The problem is how to make best use of this information. The general procedure has been to assign each particle to a class of similar particles based on its x-ray fluorescence spectrum. The initial efforts developed the class characteristics and classification rules in an empirical fashion. Recent studies have suggested that greater specificity and precision in the subsequent class balance analysis can be obtained if particle classes are more homogeneous. To obtain the classification of large numbers of particles in an efficient manner, the x-ray intensity data have been subjected to an agglomerative hierarchical cluster analysis. The resulting groups of samples are then used as candidate examples for a rule-building expert system that provides a decision tree that can be used for subsequent particle classification. An error analysis approach based on jackknifing has also been developed. This approach was employed to characterize particle source emissions from a large coal-fired electric generating station. It was found that the samples collected in the stack using a dilution sampler could be distinguished from the particles collected in the plume emitted by the plant. It was also found that the plume could be readily distinguished from the particles in the ambient air upwind of the plant.
1. Introduction
In the past two decades, efforts at reducing the total suspended particulate (TSP) concentrations in the atmosphere have resulted in significant reductions in the annual average TSP levels. For example, from 1970 to 1986 in the State of Illinois, the annual TSP concentration was reduced from 99 to 53 µg/m3 [Illinois EPA, 1986]. However, reductions in TSP may not be effective in protecting public health because there is only a weak
relationship between health effects and the TSP concentrations. Since only a portion of TSP, the smaller particles, can be inhaled into the respiratory tract, only the particles smaller than about 10 µm may be harmful to the public's health. The new National Ambient Air Quality Standard for Particulate Matter, 10 µm (PM10), requires control of sources of smaller particles.
In order to plan effective and efficient control strategies, studies have been conducted to examine airborne particle transport and chemical composition, and from these data infer the source contributions to the measured particle concentrations. The contribution of a source at the receptor site is calculated with statistical methods based on the measurable features of the sampled particles, including particle size, shape, mass, and chemical composition [Gordon, 1988]. Sometimes, available wind trajectories are used to locate particle sources [Rheingrover and Gordon, 1988]. The principles and some applications of receptor models have been described by Hopke [1985]. Their fundamental assumption is that properties of material collected in the environment can be used to infer their origins.
Receptor models can be classified according to the analysis methods. Among these analysis approaches, methods based on the chemical analysis of particle samples are the most commonly used techniques. The chemical mass balance (CMB) and factor analysis are two well-developed techniques of this type. The basic idea of the CMB is to identify particle source contributions to the measured concentrations of species in samples collected at receptor sites, provided that the emission source profiles are sufficiently different. A CMB analysis assumes that the measured concentration pattern of an ambient sample is a linear sum of the contributions of independent particle sources. Following this basic assumption, there are two further suppositions that sometimes make the CMB calculation difficult. First, a CMB analysis requires that the number of sources is known and that the source profiles of elemental compositions are available. Second, all the sources are supposed to be linearly independent. However, collinearity often exists between the profiles in the actual analysis and limits the accuracy and resolution of the CMB method. One approach to the lack of linear independence is to increase the number of species included in the source profiles. In earlier studies, the CMB was widely used in identifying particle sources [Dzubay et al., 1982; Cheng and Hopke, 1986]. More recent work has applied the CMB to recognize gaseous or volatile material sources in Olympia, Washington [Khalil and Rasmussen, 1988] and Chicago, Illinois [O'Shea and Scheff, 1988].
Another method that has been applied to identifying sources and apportioning particulate mass contributions is computer-controlled scanning electron microscopy (CCSEM). The analysis of microscopic features of individual particles, such as their chemical compositions, provides much more information from each sample than can be obtained from bulk analysis. Therefore, the ability to perform microscopic analyses on a number of samples permits the use of CCSEM techniques in receptor models.
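As a concrete reminder of the CMB assumption described above, the model is commonly written in the following generic form; the symbols here are illustrative and are not taken from the papers cited above:

    C_i = \sum_{j=1}^{p} F_{ij} S_j , \qquad i = 1, \ldots, n

where C_i is the measured concentration of species i in the ambient sample, F_{ij} is the fraction of species i in the emissions of source j (the source profile), S_j is the mass contribution of source j, p is the number of sources, and n is the number of measured species; the S_j are typically estimated by weighted least squares.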
This study focuses on the impact of a power plant on air quality. A necessary step is to determine that the collected particles were emitted from the plant and can be distinguished from the background aerosol particles.
2. Computer-controlled scanning electron microscopy
The use of CCSEM in receptor models is an extension of individual particle characterization by optical microscopy and scanning electron microscopy (SEM). The microscope has long been employed to determine those characteristics or features that are too small to be detected by the naked eye. The use of optical microscopy in receptor models has been described by Crutcher [1983]. Optical microscopic investigation of particle samples and its application to source apportionment have been illustrated by Hopke [1985]. The ability of the scanning electron microscope equipped with X-ray detection capabilities (SEM/XRF system) to provide size, shape, and elemental compositional data extends the utility of microscopic examinations. Several studies have used the SEM in analysis of samples of coal-fired power plant ash [Fisher et al., 1976; Carpenter et al., 1980] and volcanic ash [Rothenberg et al., 1988]. However, these studies are limited in the number of particles detected, since the SEM is time consuming when particles are examined manually. Casuccio et al. [1983] and Hopke [1985] have surveyed the application of CCSEM to particle elemental investigation and its ability to identify particle sources in receptor model studies. A number of previous studies have shown that CCSEM is capable of detecting the characteristics of individual particles [e.g., Kim and Hopke, 1988a; Bernard et al., 1986]. The significant improvement of CCSEM is the coupling of a computer to control the SEM. Hence, three analytical tools are under computer control in the CCSEM: 1) the scanning electron microscope, 2) the energy dispersive X-ray spectrometry analyzer, and 3) the digital scan generator for image processing [Casuccio et al., 1983]. CCSEM rapidly examines individual particles in samples and provides their elemental constitutions as well as their aerodynamic diameter and shape factors. Based on these characteristics, each particle can be assigned to one of a number of well-defined classes. These particle classes become the basis for characterizing sources, so that accurate particle classification becomes a key step in using CCSEM data in receptor modeling. The particle classification can be accomplished by agglomerative, hierarchical cluster analysis along with rule-building expert systems. The particles with similar composition are grouped by the cluster analysis. It is assumed that a source emits various types of particles. However, the mass fractions of particles in the various particle classes will differ from source to source and are the fingerprint for that source. The rule-building expert system can help automate the particle class assignments. This idea has been confirmed on samples collected in El Paso, TX [Kim and Hopke, 1988b] and
particles from a coal-fired power plant [Kim et al., 1989]. CCSEM analysis of individual particles can apportion the mass of particles to different sources in the airshed.
This study examines the composition differences between particle samples collected from various locations. The particles are classified by an analysis based on individual particle characteristics. CCSEM is capable of rapidly providing the individual particle's characteristics for a large number of particles. Therefore, the combination of these two techniques, CCSEM analysis and pattern recognition, can provide a useful approach to particle classification in terms of particle type and abundance. This analysis can provide sufficient information to examine the sample-to-sample differences.
3. Data set description
Among various man-made particle sources, large coal-fired electric generating stations are major stationary sources of particulate matter. Although most of the ash produced in the combustion process is generally removed by control devices such as an electrostatic precipitator (ESP) or a venturi scrubber, some fly ash is emitted with the plume and may cause health effects to the public as well as other problems in the environment. Previously, a number of studies have been made of the pollution problems caused by power plants. Part of these studies have focussed on particle movement in the air. In general, the concentrations of airborne particles are related not only to the particle emission rate, but also to the stack height, the dispersion conditions, and the aerodynamic behavior of the emitted particles. While the emitted particles from a coal-fired power plant are moving in the air, their composition can change. For example, when the stack temperature is much higher than the ambient temperature, the plume will cool as it moves in the air, and volatile materials in the plume will condense onto the particles.
In previous studies of the health effects of fly ash arising from a coal-fired power plant's emissions, substantial physical and chemical data have been obtained on particles sampled from inside the power plants. Because of in-plume modifications to the particles, such data may not fully represent the characteristics of plume particles. Plume fly ash may be different from the stack and ESP hopper ash in terms of chemical composition, particularly at the surface of the particles. Research suggests that plume particles may carry more adsorbed compounds than particles from inside the power plant [Natusch et al., 1975]. The adsorption onto in-plume particles leads to a question: how different are the in-plant particles and the in-plume particles in terms of their biologic activity?
In order to determine the biological activity of fly ash emitted from coal-fired power plants, the Electric Power Research Institute (EPRI) initiated a research program to examine the nature of plume fly ash. No previous studies have dealt with the problem, and the potential health effects of plume fly ash from a coal-fired power plant remain unknown. The overall objective of the Biologic Effects of Plume Fly Ash (BEPFA)
Figure 1. Map showing the area in which the Tennessee Valley Authority operates electric generating stations and the names and locations of those stations.
project was to develop the data needed to evaluate the biologic effects of plume fly ash [Battelle, 1988]. One aspect of the evaluation is correctly describing the biologic, chemical, and physical differences between samples collected from a coal-fired power plant in the stack and at locations directly in the plume. To study the difference between in-plant particles and in-plume particles, a major effort was made to develop an in-plume sampling technique that would permit the collection of a sufficient quantity of material for toxicological testing. It is also important to determine whether the in-plume particles are distinguishable from the background airborne particles, because samples of in-plume particles are always mixed with background aerosols.
The Tennessee Valley Authority's (TVA's) Cumberland Steam Plant was chosen as the field site for the research project. The location of the Cumberland Steam Plant is shown in Figure 1. The plant is in the TVA's network of electric generating stations. In order to obtain the samples needed to examine the differences between in-plant particles and in-plume particles, samples were collected from the plume, from a simulated plume (stack emissions diluted in a sampler), and from a standard stack sampler. Plume samples were obtained with a helicopter and a specially designed high-volume particle collector [Battelle, 1988]. Because the plume particles need to be distinguished from the ambient aerosols, simultaneous upwind samples of ambient aerosols were also collected by a small fixed-wing aircraft. Diluted stack sampling was employed to test whether the dilution
TABLE 1
Information included in the CCSEM output data.

Particle No.
4-Element Type No.
4-Element Type Name
Magnification
Average Diameter (µm)
Maximum Measured Chord Length (µm)
Minimum Measured Chord Length (µm)
Total X-ray Counts
Size Bin
X Coordinate of Particle Center
Y Coordinate of Particle Center
Field No.
Presumed Density (Specific Gravity)
26 Elemental Relative Concentrations (%): C, O, Na, Mg, Al, Si, P, S, Cl, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn, As, Se, Br, Cd, Sb, Ba, Pb
Area (µm2)
Perimeter (µm)
Secondary Type No.
Sample No.
sampler technique can provide samples that are acceptable surrogates for plume samples.
CCSEM was used as the primary analytical method to characterize the particles. This technique enables a large number of particles to be analyzed and provides detailed information on the size, shape and elemental composition associated with individual particles. The image of each particle detected during the analysis is digitized and stored by the CCSEM software. The automatic process of collecting and storing a digitized image during the CCSEM analysis is called microimaging [Casuccio et al., 1989]. All samples for CCSEM analysis were collected on 0.4 µm pore size polycarbonate filters. The particle loading for CCSEM is important. Heavily loaded samples will result in incorrect sizing of individual particles because of overlapping particles. Very lightly loaded samples will take a much longer time for CCSEM to analyze because so much time will be used to find the few particles present. Size data for each particle detected in the CCSEM analysis are expressed in two ways: 1) physical diameter, and 2) aerodynamic diameter. Chemical information on individual particles is provided in the form of X-ray intensities for 26 elements. The output data from the CCSEM analysis are summarized in Table 1.
Two sets of CCSEM analysis data were provided for analysis. The two sets are the results of CCSEM analyses performed in January and April of 1988, respectively. The same samples were used for both analyses. The identification information for the two sets of CCSEM analyses is listed in Table 2 and Table 3.
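The record listed in Table 1 maps naturally onto a simple data structure. The free-form Fortran sketch below is illustrative only; the field names and types are assumptions made for this example and are not part of the CCSEM software.

  program ccsem_record_sketch
    implicit none
    ! One record per analyzed particle, mirroring the fields of Table 1.
    type :: ccsem_particle
       integer :: particle_no
       integer :: type4_no                 ! 4-element type number
       character(len=16) :: type4_name     ! 4-element type name
       real    :: magnification
       real    :: avg_diameter             ! average diameter (micrometres)
       real    :: max_chord, min_chord     ! measured chord lengths (micrometres)
       real    :: total_xray_counts
       integer :: size_bin
       real    :: x_center, y_center       ! coordinates of the particle centre
       integer :: field_no
       real    :: density                  ! presumed specific gravity
       real    :: elem_conc(26)            ! relative concentrations, C ... Pb (%)
       real    :: area, perimeter          ! area (µm**2) and perimeter (µm)
       integer :: secondary_type_no
       integer :: sample_no
    end type ccsem_particle

    type(ccsem_particle), allocatable :: particles(:)
    allocate(particles(450))               ! roughly 450 particles were analyzed per sample
    particles(1)%particle_no = 1
    particles(1)%elem_conc   = 0.0
    print *, 'allocated records:', size(particles)
  end program ccsem_record_sketch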
TABLE 2
Samples in the CCSEM data set measured in January, 1988.

Sample No.  EPRI ID No.          Collection Method                   Collection Date
7635        588-N1-A/NEA         Diluted Stack                       9/19/87
7636        590-N1-A/BCD         Stack                               9/19/87
7648        526-N1-A/NEA         Diluted Stack                       9/15/87
7655        528-N2-B/BCD         Stack                               9/15/87
7693        548-N2-A/BCD         Stack                               9/16/87
7696        546-N1-A/NEA         Diluted Stack                       9/16/87
7877        118-N1-A/Upwind      Upwind Background (Helicopter)      10/1/87
7878        124-N1-A/Upwind      Upwind Background (Air Craft)       10/1/87
7881        140-N1-A/Upwind      Upwind Background                   10/2/87
7884        155-N1-A/Upwind-X    Upwind Background                   10/2/87 (P.M.)
7886        165-N1-A/Plume       Downwind Plume                      10/3/87
7887        171-N1-A/Upwind      Upwind Background                   10/3/87
7889        182-N1-A/Plume       Downwind Plume                      10/4/87
7890        188-N1-A/Upwind      Upwind Background                   10/4/87
7895        214-N1-A/Plume       Downwind Plume                      10/5/87
7898        230-N1-A/Plume       Downwind Plume                      10/6/87
7899        236-N1-A/Upwind      Upwind Background                   10/6/87
7902        252-N1-A/Upwind      Upwind Background                   10/6/87 (P.M.)
7908        676-N1-A/NEA         Diluted Stack                       10/2/87
7910        688-N1-A/NEA         Diluted Stack                       10/2/87 (P.M.)
7911        695-N1-A/NEA         Diluted Stack                       10/3/87
7913        709-N1-A/NEA         Diluted Stack                       10/4/87
7915        717-N1-A/NEA-X       Diluted Stack                       10/4/87 (P.M.)
7916        724-N1-A/NEA         Diluted Stack                       10/5/87
7918        738-N1-A/NEA         Diluted Stack                       10/6/87
7920        746-N1-A/NEA         Diluted Stack                       10/6/87 (P.M.)
4. Data treatment and analysis
4.1 Data screening and transformation
Data screening and transformation are the initial steps in the analysis procedure. Noise reduction is the first step in analyzing the CCSEM data. Determination of elemental chemistry for individual particles is accomplished by collection of characteristic X-rays. The X-ray fluorescence peaks are obtained as a result of a photon counting process having a Poisson distribution. However, there are a number of observed "peaks" in the X-ray spectrum that arise from statistical fluctuations in the detector background in the particular energy region characteristic of an element. Therefore, those peaks that are identified but do not have enough intensity to meet a minimum detection criterion will be
TABLE 3
Samples in the CCSEM data set measured in April, 1988.

Sample No.  EPRI ID No.          Collection Method                   Collecting Date
7389        090-N1-A/WC          Worst Case
7410        093-N1-A/WC          Worst Case
7413        096-N2-A/WC          Worst Case
7522        090-N1-C/WC          Worst Case
7525        090-N1-B/WC          Worst Case
7531        090-N1-B/WC          Worst Case
7534        090-N1-B/WC          Worst Case
7635        588-N1-A/NEA         Diluted Stack                       9/19/87
7636        590-N1-A/BCD         Stack                               9/19/87
7647        508-N1-A/NEA         Diluted Stack                       9/11/87
7648        526-N1-A/NEA         Diluted Stack                       9/15/87
7649        502-N2-A/BCD         Stack                               9/11/87
7655        528-N2-B/BCD         Stack                               9/15/87
7693        548-N2-A/BCD         Stack                               9/16/87
7696        546-N1-A/NEA         Diluted Stack                       9/16/87
7877        118-N1-A/Upwind      Upwind Background (Helicopter)      10/1/87
7878        124-N1-A/Upwind      Upwind Background (Air Craft)       10/1/87
7881        140-N1-A/Upwind      Upwind Background                   10/2/87
7884        155-N1-A/Upwind-X    Upwind Background                   10/2/87 (P.M.)
7886        165-N1-A/Plume       Downwind Plume                      10/3/87
7887        171-N1-A/Upwind      Upwind Background                   10/3/87
7889        182-N1-A/Plume       Downwind Plume                      10/4/87
7890        188-N1-A/Upwind      Upwind Background                   10/4/87
7892        198-N1-A/Plume-X     Downwind Plume                      10/4/87 (P.M.)
7895        214-N1-A/Plume       Downwind Plume                      10/5/87
7896        220-N1-A/Upwind      Upwind Background                   10/5/87
7898        230-N1-A/Plume       Downwind Plume                      10/6/87
7899        236-N1-A/Upwind      Upwind Background                   10/6/87
7902        252-N1-A/Upwind      Upwind Background                   10/6/87 (P.M.)
7908        676-N1-A/NEA         Diluted Stack                       10/2/87
7910        688-N1-A/NEA         Diluted Stack                       10/2/87 (P.M.)
7911        695-N1-A/NEA         Diluted Stack                       10/3/87
7913        709-N1-A/NEA         Diluted Stack                       10/4/87
7915        717-N1-A/NEA-X       Diluted Stack                       10/4/87 (P.M.)
7916        724-N1-A/NEA         Diluted Stack                       10/5/87
7918        738-N1-A/NEA         Diluted Stack                       10/6/87
7920        746-N1-A/NEA         Diluted Stack                       10/6/87 (P.M.)
considered as noise and are eliminated. In this analysis, the X-ray count was reassigned to zero if the count for an element was less than a criterion value, Nc. This value, Nc, is given by
in which NT is the total X-ray counts in the spectrum for a given particle. After pretreatment of the raw data for noise reduction, the cluster patterns provided much more reasonable particle classifications [Kim et al., 1987].
Examination of the distribution of X-ray intensities has shown that there is a strong positive skewness. The highest values in the distribution again cause problems with classification. Thus, a logarithmic transformation is made for the X-ray intensities of the particles. However, CCSEM elemental data include many variables with a value of zero. In some cases, the intensities were measured to be zero. For others, they are assigned zero values in the noise reduction step. A simple logarithmic transformation, log(x), cannot be directly used for these variables. An alternative approach is to perform a logarithmic transformation after adding 1 count to all values:

    X_{ij} = \log_{10}(1 + x_{ij}), \qquad i = 1, 2, \ldots, 26; \; j = 1, 2, \ldots, N \qquad (2)

where x_{ij} is the ith variable of the jth particle, and X_{ij} is the transformed value of x_{ij}. The log(1 + x) transformation avoids the problem of zero values. The addition of a single count to the original counts, x_{ij}, makes a very small perturbation, and values that were initially zero remain zero. Kim [1987] applied the log(1 + x) transformation to CCSEM data in his cluster analysis.
The quantitative measure of sample dissimilarity used in these analyses was the Euclidean distance (ED). From the CCSEM analysis, the chemistry characteristics of each particle are described by 26 variables that are the X-ray intensities for these 26 elements. Before the ED was calculated, each X-ray intensity variable was standardized so that it has a zero mean and unit variance. The result of the ED calculation is an N x N symmetric matrix with zero-valued diagonal elements.
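Under the standardization just described, the dissimilarity between particles j and k takes the usual Euclidean form. As a sketch, writing z_{ij} for the standardized, log-transformed intensity of element i in particle j (a notation introduced here for illustration),

    z_{ij} = \frac{X_{ij} - \bar{X}_i}{s_i}, \qquad d_{jk} = \left[ \sum_{i=1}^{26} \left( z_{ij} - z_{ik} \right)^2 \right]^{1/2}

where \bar{X}_i and s_i are the mean and standard deviation of the transformed intensities of element i over the N particles in the sample.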
5. Particle classification
Particle classification was obtained by a cluster analysis of the dissimilarities between every pair of particles. In this study, classification proceeded in two steps. The first step was to initially identify particle classes in individual samples using hierarchical cluster analysis. The second step was to develop general classification rules by which particles from any sample could be assigned to any of the particle classes identified in any of the samples, using a rule-building expert system.
5.1 Hierarchical clustering analysis
In hierarchical clustering analysis, the initial particle class patterns were identified for
each sample. The hierarchical clustering begins with N single-particle clusters, merges clusters repeatedly, and finally ends with only a single N-particle cluster. The analysis was performed using the agglomerative hierarchical clustering program AGCLUS [Oliver, 1973]. The input data were the Euclidean distance (ED) values for the N particles in a sample. The output of AGCLUS was represented in the form of a tree-shaped dendrogram. An example of a portion of a dendrogram is presented in Figure 2. The hierarchical method is flexible in determining the particle assignments to clusters by simply shifting the cluster level in the same dendrogram. From the dendrogram, the possible particle patterns at any level were readily identified by finding clusters. However, difficulties exist in selecting the significant clusters from the dendrogram.
The initial class assignments were made based on the dendrogram. A program was employed to give each particle in these initially defined particle classes an identification number, and to arrange the particles in the dendrogram in the order in which each particle was associated with a class number. The original particle-by-particle data were arranged in order of particle identification number and class number, and these output data were imported into LOTUS 1-2-3 in sequential columns. The nature of each of the identified classes in a sample could then be ascertained.

Figure 2. A portion of a dendrogram showing outliers and initially defined particle classes based on the hierarchical agglomerative cluster analysis.

5.2 Homogeneous particle classes
A class developed from the cluster analysis may not represent a fully homogeneous group because the class may include some outlier particles. A particle class was determined to be homogeneous only when all particles in the class
contained only the same elements. For example, an Al-Si-Fe class includes particles showing only Al, Si, and Fe X-ray intensities. In order to create a homogeneous particle class, outlier particles were removed from each initial particle class obtained by the hierarchical cluster analysis. The process was accomplished using LOTUS 1-2-3. After deleting outlier particles from each class, the remaining particles in the class were potential "representative" examples defining the properties of that class.
5.3 General classification rules
In order to assign particles from any given sample to any of the classes identified in any of the samples, general classification rules for the assignments needed to be developed. A rule-building expert system, EX-TRAN [Hassan et al., 1985], was employed to develop the classification rules into a computer program that operated on the particle-by-particle data sets. EX-TRAN is a series of programs designed to generate a set of rules in the form of a decision tree, based on examples for which the various variables are known and which have known classes. The main program that can produce a self-contained FORTRAN program of decision rules is called the Analog Concept Learning Translator (ACLTRAN). The program searches the elemental concentrations one at a time to identify the one on which it can "best" separate one class from the others. The attribute selection is based on the maximum decrease in entropy, derived from information theory, between the undivided and the split classes. Hunt et al. [1966] described the information theoretical approach used to determine the entropy of each state. For a given variable, the entropies of the whole and divided states are calculated, and the separation is made based on the maximum entropy decrease obtained. The decision rule is then formulated for the attribute with the largest entropy change. The rule splits the data such that attribute values greater than or equal to the value midway between the closest points of the two data sets are assigned to one group, and those less than that critical value are assigned to the other class. The program systematically divides the data set until all of the particles are separated into single class subsets. The use of this rule-building expert system approach has been described in detail by Kim and Hopke [1988a].
EX-TRAN generates a set of rules based on an example data set. The example data set was composed of data selected from the defined particle classes. Among the particles in a class, there is a particle that has the maximum concentration of any single element among the 26 elements, and a particle that has the minimum concentration. Such particles are called boundary particles. The example data set for EX-TRAN consisted of all the boundary particles collected from every particle class.
As shown in Figure 3, ACLTRAN produced FORTRAN 77 code that was then incorporated into a classification program that can assign each particle characterized in each sample to an appropriate particle class. Each FORTRAN code was treated as a
      CHARACTER*8 decisn
      REAL b, c, d, e, f, g, h
      REAL i, j, k, l, m, n, o
      REAL p, q, r, s, t, u
      IF (k .LT. 1.9625) THEN
        IF (l .LT. 1.9775) THEN
          IF (q .LT. 1.991) THEN
            decisn = 'cfgi'
          ELSE
            IF (m .LT. 1.991) THEN
              decisn = 'cfgiq'
            ELSE
              decisn = 'cfgimq'
            END IF
          END IF
        ELSE
          IF (m .LT. 1.985) THEN
            IF (q .LT. 1.9695) THEN
              decisn = 'cfgil'
            ELSE
              decisn = 'cfgilq'
            END IF
          ELSE
            decisn = 'cfgilmq'
          END IF
        END IF
      ELSE
        IF (l .LT. 1.966) THEN
          IF (q .LT. 2.0135) THEN
            decisn = 'cfgik'
          ELSE
            IF (m .LT. 2.018) THEN
              decisn = 'cfgikq'
            ELSE
              decisn = 'cfgikmq'
            END IF
          END IF
        ELSE
          IF (m .LT. 1.967) THEN
            IF (q .LT. 1.9845) THEN
              decisn = 'cfgikl'
            ELSE
              decisn = 'cfgiklq'
            END IF
          ELSE
            decisn = 'cfgiklmq'
          END IF
        END IF
      END IF
      END
Figure 3. An example of FORTRAN-77 code for decision rule.
subroutine of a main program. The complete computer program containing the general classification rules was created by incorporating these subroutines.
Among the particle classes in the whole sample set, a number of classes were labeled by the same elements although these classes belonged to different samples. Within the concept of the homogeneous class, those common classes were combined into one particle class. Thus, the total number of particle classes decreases, and one class in the whole sample set corresponds to one specific particle group in which the particles contain the same elements. Particles in the two sets of results were reassigned to classes by the general particle classification program built from these FORTRAN code segments. Thus, class assignments were made for all particles.
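The entropy criterion used in building these rules (section 5.3) can be sketched in its standard information-theoretic form; the exact weighting used by ACLTRAN may differ in detail:

    H(S) = -\sum_{c} p_c \log_2 p_c, \qquad \Delta H = H(S) - \frac{|S_{<t}|}{|S|} H(S_{<t}) - \frac{|S_{\geq t}|}{|S|} H(S_{\geq t})

where p_c is the fraction of particles in the set S belonging to class c, and S_{<t}, S_{\geq t} are the subsets produced by splitting on a candidate attribute at a threshold t taken midway between the closest points of the two classes; the attribute and threshold giving the largest decrease \Delta H are selected for the decision rule.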
6. Calculation of mass fractions and their uncertainties
6.1 Calculation of mass fraction
Particle classes were labeled by their elemental compositions. The fractional mass of each class for each sample was calculated based on the aerodynamic diameters of the particles in the class provided by the CCSEM analysis. The class total mass is calculated by summing the mass of every particle in the class, m_j. If there are N_k particles in the kth particle class, the class total mass, M_k, is calculated by

    M_k = \sum_{j=1}^{N_k} m_j \qquad (3)

If there are K identified particle classes in a sample and N_k particles in the kth class, the total number of classified particles in the sample is

    N_g = \sum_{k=1}^{K} N_k \qquad (4)

If N_0 represents the number of outlier particles in the sample, the total number of particles, N, is equal to

    N = N_g + N_0 \qquad (5)

The total mass of the sample, M, is then obtained by summing the mass of every particle class in the sample as well as the mass of all the outlier particles included in the sample

    M = \sum_{k=1}^{K} M_k + \sum_{j=1}^{N_0} m_{0j} \qquad (6)

In equation 6, m_{0j} denotes the mass of the jth outlier particle. The fractional mass of a particle class is the ratio of the class mass to the total sample mass. The mass fraction of the kth class, f_k, is given by

    f_k = M_k / M \qquad (7)
6.2 Calculation of uncertainties
The uncertainties in the mass fractions were evaluated by jackknifing. Mosteller and Tukey [1977] described the principle of jackknifing in detail. The two steps of the jackknife are: 1) making the desired calculation for all the data, and 2) dividing the data into groups and making the calculation for each of the slightly reduced bodies of data obtained by leaving out just one of the groups. Kim [1987] applied the technique to estimate the uncertainties in the mass fraction. For the qth particle class, which includes n_q particles, the reduced-body average after leaving out the ith particle is defined as

    \bar{m}_{q,i} = \frac{1}{n_q - 1} \sum_{j \neq i} m_{q,j}, \qquad j = 1, 2, \ldots, i-1, i+1, \ldots, n_q \qquad (8)

Then the class average is given by

    \bar{m}_q = \frac{1}{n_q} \sum_{i=1}^{n_q} \bar{m}_{q,i}, \qquad i = 1, 2, \ldots, n_q \qquad (9)

The jackknife estimate of the standard deviation is calculated by the following equation:

    s_q = \left[ \frac{n_q - 1}{n_q} \sum_{i=1}^{n_q} \left( \bar{m}_{q,i} - \bar{m}_q \right)^2 \right]^{1/2} \qquad (10)

However, it must be noted that if a class includes only one or two particles, 100 percent of the mass fraction will be arbitrarily assigned as its uncertainty, because jackknifing cannot be applied to these two cases. The determination of uncertainty is essential in the sense of providing a measure of the reliability of the fractional mass for a class.
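A compact sketch of equations (3)-(10) for a single class is given below in free-form Fortran; the particle masses and the remaining sample mass are hypothetical, and the standard-deviation line assumes the standard jackknife form quoted in equation (10).

  program class_mass_fraction
    implicit none
    integer, parameter :: nq = 6
    ! Hypothetical particle masses (arbitrary units) for one class, and the
    ! mass of the rest of the sample (all other classes plus outlier particles).
    real :: m(nq) = (/ 0.12, 0.30, 0.22, 0.08, 0.41, 0.19 /)
    real :: rest_mass = 10.0
    real :: mk, mtot, fk, mbar(nq), mbarq, s
    integer :: i
    mk   = sum(m)                      ! class mass M_k, eq. (3)
    mtot = mk + rest_mass              ! total sample mass M, eq. (6)
    fk   = mk / mtot                   ! mass fraction f_k, eq. (7)
    do i = 1, nq                       ! reduced-body averages, eq. (8)
       mbar(i) = (mk - m(i)) / real(nq - 1)
    end do
    mbarq = sum(mbar) / real(nq)       ! class average, eq. (9)
    ! Jackknife standard deviation, eq. (10) (standard form assumed here).
    s = sqrt(real(nq - 1) / real(nq) * sum((mbar - mbarq)**2))
    print '(a,f8.4)', 'class mass fraction f_k           = ', fk
    print '(a,f8.4)', 'jackknife s.d. of class mean mass = ', s
  end program class_mass_fraction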
7. Results and discussion of the analysis
7.1 Results of the cluster analysis
Both the January and the April sample sets were subjected to the analysis procedures described above. ACLTRAN was used to generate FORTRAN code for the classification rules. The ACLTRAN program requires that a name be given to each particle class, where a class name must be composed of 8 or fewer characters. In order to represent the class designations with fewer than 9 characters, class names were created instead of directly using the elemental symbols. The names were built from a code of lower case letters corresponding to the various observed elements. Some of the defined classes include particles with more than 8 elements in their X-ray spectrum. Thus, some specific multielemental codes were established.
From the 26 samples in the January sample set, a total of 257 particle classes were defined. The particles in the 37 samples of the April sample set were also assigned to the 257 particle classes. In each of the samples, the mass of the classified particles represents over 85 percent of the total sample mass. The particle class mass fractions were calculated for each sample. These results are given by Mi [1989]. The mass fraction of each identified particle class follows its corresponding class name code in these tables, and only the mass fractions of identified particle classes are listed. There are some particles in each sample that do not belong to any of the identified classes, but there are not sufficient numbers of any one type to define a new class. Such particles are the outlier particles in the cluster analysis and represent a "miscellaneous" group of particles. The mass fractions of the miscellaneous class are not used in further analysis of the results, but their mass is included in the calculation of total sample mass. Thus, the summation of the mass fractions listed in each table is less than 1.00. There are also many classes that contribute a mass fraction of less than 0.001%. Entries of 0.000% indicate that particles of that class were observed in the sample but do not represent a significant fraction of the sample mass. The uncertainties in the mass fraction calculations were also estimated.
7.2 Comparison of the two sample sets
The two sets of CCSEM analytical results were obtained by two independent scans of the collected filters, in January and April of 1988. Comparison of these pairs of samples provides a limited test of the reproducibility of the CCSEM and data analysis procedures. There are some significant differences in the samples between the January and April
results. Nine carbon-containing classes were identified in the January data, but only two such classes appeared in the April data. Moreover, in the January data, 10.1% of the mass was found to be in the carbon-containing particle classes, while in the April data the mass fraction in such classes is less than 0.001%. Class cfgiklmq (O-Al-Si-S-K-Ca-Ti-Fe) represents 32.2% of the particle mass in the January data set while it is only 4.7% of the April set. Even combining all of the O-Al-Si containing particle classes (cdfgikq to cfgmq), the January data show 53.1% of the particle mass in these classes while the April data show only 39.9%. The values are subject to uncertainty and the potential ranges of these sums do overlap. However, these results do indicate considerable variability in the analysis. The problem is considered to be caused by the limited number of particles in each sample (approximately 450), which is not a sufficiently large sample of the population on the filter to be statistically reliable. An alternative approach is to combine the data from both CCSEM analyses to create a combined file with twice the total number of particles. These combined data sets might provide a better sample of the population of particle types on the filter, but there is no way to confirm this hypothesis without additional CCSEM analyses. For the following section, the more complete April analyses will be used.
7.3 Statistical analysis
In order to test the differences between in-plant particles and in-plume particles, the vectors of particle class mass fractions for the two kinds of samples that contain the different particles are subjected to a statistical analysis. The similarity of the vectors of mass fractions must be investigated to evaluate these differences quantitatively. One measure that can represent the similarity between vectors is the correlation coefficient. The correlation coefficients between each pair of samples within the set of samples taken at a specific time were calculated. The results of these calculations are presented in Table 4. It can be seen from this table that the correlation coefficient is quite high for the "worst case" storage samples. In this case, different subsamples of the same sample are analyzed at different times. The observed high correlation coefficient values suggest that the sampling procedure and the CCSEM methodology do adequately reflect the nature of these samples. The two types of stack samples, the NEA dilution samples and the BCL modified Method 5 train, show considerable similarity in most cases. There are relatively low correlation coefficients between the stack and downwind plume samples. Although there should be some effect of the upwind background particles on the plume sample, it was anticipated that, with the high number concentration of power-plant particles in the plume, it should more closely reflect the stack sample than the upwind sample. The correlation coefficients between the different types of samples are generally statistically insignificant. Thus, in general, it appears that the dilution samples are not good
TABLE 4
Correlation coefficients of pairs of samples from the same sampling period using the April CCSEM data (sampling periods 9/11 to 9/19 and 10/1 to 10/6 (P.M.)). Each entry gives the sample pair followed by the correlation coefficient in parentheses.

Dilution/Method 5:      7647,7649 (0.516); 7648,7655 (0.446); 7696,7693 (0.823); 7635,7636 (0.130)
Upwind (H)/Upwind (A):  7877,7878 (0.029)
Upwind/Dilution:        7881,7908 (0.148); 7910,7884 (0.094); 7911,7887 (0.078); 7890,7913 (0.030); 7915,7893 (0.128); 7916,7896 (0.048); 7918,7899 (0.108); 7920,7902 (0.135)
Plume/Dilution:         7886,7911 (0.358); 7889,7913 (0.011); 7915,7892 (0.178); 7895,7916 (0.058); 7898,7918 (0.522)
Plume/Upwind (A):       7886,7887 (0.162); 7889,7890 (0.058); 7892,7893 (0.152); 7895,7896 (0.052); 7898,7899 (0.092)
Worst Case:             7389,7525 (0.980); 7410,7531 (0.508); 7413,7534 (0.975); 7389,7522 (0.987); 7525,7522 (0.980)
substitutes for plume samples, at least with respect to the particle class mass fractions as measured in this study.
Also in Table 4, the correlation coefficients between the plume and the upwind samples are very low. These two kinds of sample were taken at the same time. If the particle class pattern of the plume samples were not significantly different from that of the upwind samples, there would be higher correlation coefficient values than are observed. These results support the assumption that the plume particles can be distinguished from the background particles. The same conclusion may be drawn from the low correlation coefficients between the dilution and the upwind samples.
To test these conclusions, it is possible to examine the variability of specific sample types from sampling period to sampling period. These correlation coefficients are given in Table 5. In this table, it can be seen that there is a reasonable correlation between
TABLE 5
Correlation coefficients of pairs of samples of the same sample type for successive sampling intervals using the April CCSEM data.

Diluted Stack           Stack                 Upwind                Downwind
Sample Pair    r        Sample Pair    r      Sample Pair    r      Sample Pair    r
7635,7647   0.795       7636,7649   0.711     7877,7881   0.466     7878,7886   0.088
7647,7648   0.916       7649,7655   0.615     7881,7884   0.047     7886,7889   0.304
7648,7696   0.747       7655,7693   0.727     7884,7887   0.229     7889,7892   0.051
7696,7908   0.658                             7887,7890   0.269     7892,7895   0.273
7908,7910   0.199                             7890,7896   0.493     7895,7898   0.771
7910,7911   0.699                             7896,7898   0.123
7911,7913   0.123                             7898,7899   0.193
7913,7915   0.012                             7899,7902   0.730
7915,7916   0.015
7916,7918   0.153
7918,7920   0.268
samples of the same type between sampling periods. These results suggest that the power plant operating under steady-state conditions emits particles of relatively constant composition and that the background of particle types is also fairly constant. The consistency of these results again suggests that the CCSEM subsampling, analysis and subsequent classification of the characterized particles are adequate analytical methods to address the question of correctly describing the differences between samples collected from a coal-fired power plant in the stack and at locations in the plume.
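As an illustration of how such pairwise comparisons can be computed, the short Python sketch below evaluates the Pearson correlation coefficient between the particle-class mass-fraction vectors of two samples. The numerical values in the example are hypothetical and are not taken from the BEPFA data.

```python
import numpy as np

def class_correlation(mass_frac_a, mass_frac_b):
    """Pearson correlation between two vectors of particle-class mass fractions.

    Both vectors must list the mass fractions of the same particle classes in
    the same order (one entry per class, e.g., 257 classes in this study).
    """
    a = np.asarray(mass_frac_a, dtype=float)
    b = np.asarray(mass_frac_b, dtype=float)
    return np.corrcoef(a, b)[0, 1]

# Hypothetical mass-fraction vectors for two samples (5 classes only, for illustration)
stack_sample = [0.32, 0.10, 0.05, 0.40, 0.13]
plume_sample = [0.05, 0.30, 0.20, 0.15, 0.30]
print(f"r = {class_correlation(stack_sample, plume_sample):.3f}")
```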
8. Conclusions

In order to evaluate the biological, chemical, and physical differences between samples collected from a coal-fired power plant in the stack and at locations directly in the plume, it is necessary to develop a proper method that can correctly describe the changes in the physico-chemical properties from in-plant particles to in-plume particles. This study provides a technique whereby the differences between in-plant and in-plume particles can be described by the changes in the particle type and abundance patterns obtained from the particle classification of CCSEM data. In the BEPFA study, a total of 257 particle classes were defined. From the results of pairwise comparisons of the samples, it appears that there are some significant differences between the January and the April results. The problem is considered to be caused by the limited number of particles analysed. The stack samples are significantly different from the plume samples. This suggests that the physico-chemical properties of in-plant particles are substantially changed when the particles cool in the ambient air. Therefore, it appears that the diluted stack sample cannot be substituted for the plume sample when evaluating environmental effects. The correlation coefficients between the plume and the upwind samples are very low. These results support the hypothesis that the plume particles can be distinguished from the background aerosol from sampling period to sampling period, and suggest that the power plant emits particles of relatively constant composition when it operates under steady-state conditions.
Acknowledgement

This work was supported in part by the U.S. National Science Foundation under grant ATM 89-96203. We would like to thank Dr. George Sverdrup of Battelle Columbus Laboratory, the Principal Investigator of the BEPFA, for his efforts in coordinating the project and providing the samples for analysis, and Gary Casuccio of the R.J. Lee Group for providing the CCSEM data used in this study.
References
Battelle. Biologic Effects of Plume Fly Ash. Research Project PR2482-5, 1988.
Bernard PC, van Grieken RE. Classification of Estuarine Particles Using Automated Electron Microprobe Analysis and Multivariate Techniques. Environ Sci Technol 1986; 20: 467-473.
Carpenter RL, Clark RD, Su YF. Fly Ash from Electrostatic Precipitators: Characterization of Large Spheres. J Air Pollut Contr Assoc 1980; 30: 679-681.
Casuccio GS, Janocko PB, Lee RJ, Kelly JF, Dattner SL, Mgebroff JS. The Use of Computer Controlled Scanning Electron Microscopy in Environmental Sciences. J Air Pollut Contr Assoc 1983; 33: 937-943.
Casuccio GS, Schwoeble AJ, Henderson BC, Lee RJ, Hopke PK, Sverdrup GM. The Use of CCSEM and Microimaging to Study Source/Receptor Relationships. 82nd Annual Meeting, Air & Waste Management Association, Anaheim, CA, June, 1989.
Cheng MD, Hopke PK. Investigation on the Use of Chemical Mass Balance Receptor Model: Numerical Computations. Chemometrics and Intelligent Laboratory Systems 1986; 1: 33-50.
Crutcher ER. Light Microscopy as an Analytical Approach to Receptor Modeling. In: Dattner SL, Hopke PK, eds. Receptor Models Applied to Contemporary Pollution Problems. Air Pollution Control Association, Pittsburgh, PA, 1983: 266-284.
Dzubay TG, Stevens RK, Lewis CW, Hern DH, Courtney WJ, Tesch JW, Mason MA. Visibility and Aerosol Composition in Houston, Texas. Environ Sci Technol 1982; 16: 514-525.
Fisher G, Chang DPY, Brummer M. Fly Ash Collected from Electrostatic Precipitators: Microcrystalline Structures and the Mystery of the Spheres. Science 1976; 192: 553-555.
Gordon GE. Receptor Models. Environ Sci Technol 1988; 22: 1132-1142.
Hassan T, Razzak MA, Michie D, Pettipher R. Ex-Tran 7: A Different Approach for An Expert System Generator. 5th International Workshop for Expert Systems & Their Applications, Avignon, France, 1985.
Hopke PK. Receptor Modeling in Environmental Chemistry. New York: John Wiley & Sons, Inc., 1985.
Hunt EB, Marin J, Stone PT. Experiments in Induction. New York: Academic Press, Inc., 1966.
Illinois Environmental Protection Agency, Division of Air Pollution Control. Illinois Annual Air Quality Report, 1986. Springfield, Illinois, 1987.
Khalil MAK, Rasmussen RA. Carbon Monoxide in an Urban Environment: Application of a Receptor Model for Source Apportionment. J Air Pollut Contr Assoc 1988; 38: 901-906.
Kim DS. Particle Class Balance for Apportioning Aerosol Mass. Ph.D Thesis. University of Illinois at Urbana-Champaign, 1987.
Kim D, Hopke PK. Classification of Individual Particles Based on Computer-Controlled Scanning Electron Microscopy Data. Aerosol Sci Technol 1988a; 9: 133-151.
Kim D, Hopke PK. Source Apportionment of the El Paso Aerosol by Particle Class Balance Analysis. Aerosol Sci Technol 1988b; 9: 221-235.
Kim DS, Hopke PK, Casuccio GS, Lee RJ, Miller SE, Sverdrup GM, Garber RW. Comparison of Particles Taken from the ESP and Plume of a Coal-fired Power Plant with Background Aerosol Particles. Atmos Environ 1989; 23: 81-84.
Kim DS, Hopke PK, Massart DL, Kaufman L, Casuccio GS. Multivariate Analysis of CCSEM Auto Emission Data. Sci Total Environ 1987; 59: 141-155.
Mi Y. Source Identification of Airborne Particles Derived from Scanning Electron Microscopy Data. Environmental Science in Civil Engineering, University of Illinois, Urbana, IL, August 1989.
Mosteller F, Tukey JW. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Pub. Co., 1977.
Natusch DFS, et al. Characterization of Trace Elements. In: Fly Ash, Proceedings, International Conference on Trace Metals in the Environment. Toronto, Canada, October, 1975.
Oliver DC. Aggregative Hierarchical Clustering Program Write-up. Preliminary Version, National Bureau of Economic Research, Cambridge, MA, 1973.
O'Shea WJ, Sheff PA. A Chemical Mass Balance for Volatile Organics in Chicago. J Air Pollut Contr Assoc 1988; 38: 1020-1026.
Rheingrover SW, Gordon GE. Wind-trajectory Method for Determining Compositions of Particles from Major Air-pollution Sources. Aerosol Sci Technol 1988; 8: 29-61.
Rothenberg SJ, Denee PB, Brundle CR, Carpenter RL, Seiler FA, Eidson AF, Weissman SH, Fleming CD, Hobbs CH. Surface and Elemental Properties of Mount St. Helens Volcanic Ash. Aerosol Sci Technol 1988; 9: 263-269.
CHAPTER 17
Partial Least Squares (PLS) for the Prediction of Real-life Performance from Laboratory Results

P.J. Lewi1, B. Vekemans2, and L.M. Gypen1

1Janssen Research Foundation, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium and 2Dept. Wiskunde en Informatika, University of Antwerp, U.I.A., B-2610 Antwerpen, Belgium
1. Introduction

The origin of the Partial Least Squares method (PLS) is attributed to the work by H. Wold on path modeling [Geladi, 1988]. This paper describes an application of multiblock PLS [Gerlach, Kowalski and H. Wold, 1979]. We attempt to correlate the results from various batteries of laboratory screening tests with the outcome in real life, using a set of objects which have already been studied in both the laboratory and the real-life situations. The objective of our analysis is to predict the real-life result of new test objects from the corresponding laboratory data only. Our model involves two or more independent blocks of data which are to be correlated with one block of dependent data. It is assumed that each block describes the same number of objects, which are represented in identical order by the rows of the blocks. The variables are represented by the columns of the various blocks in the model. They usually differ in nature and in number from one block to another. The method is illustrated by means of an application from the development of novel pharmaceutical compounds in the field of psychiatry.
2. Method

The solution of the multiblock PLS model has been described by Wangen and Kowalski [1988]. This approach involves the calculation of latent variables, one for each independent block. This is illustrated by Figure 1 for the case of two independent blocks X1 and X2 which are to be correlated with a dependent block Xd. The vectors t1 and t2 represent score vectors for objects which are stored in the latent block Xl. A score vector tl for the same set of objects is computed from the latent block Xl, which is matched with the score vector td from the dependent block Xd. In order to compute the score vectors t1, t2, tl and td we have to define weight vectors: w1 and w2 for the different sets of independent
Figure 1. Multiblock PLS model with two independent blocks X1, X2 and one dependent block Xd. The latent block Xl is created during the analysis. It is assumed that the objects represented by the rows of the various blocks are identical and in the same order. The variables in the blocks may be different in nature as well as in number. Vectors t contain scores for the common objects, vectors w represent weights for the variables. The arrows indicate that the latent block Xl is constructed from the independent score vectors t1 and t2. The latent scores tl are matched against the dependent scores td. The weight vectors are intermediates in the calculation of the score vectors.
variables, wl for the two latent variables, and wd for the dependent variables. The calculation of score vectors t and weight vectors w is performed iteratively starting from an initial estimate for td. This initial estimate can be defined as the column in Xd with the largest variance. At convergence of the set of t and w vectors, a first PLS component is obtained. Object scores for each of the two independent blocks are then defined by the corresponding vectors t1 and t2. Object scores for the dependent blocks are defined by the latent scores tl. Variable loading vectors p are defined as follows:

p'_i = (t'_i · X_i) / (t'_i · t_i)    with i = 1, 2, ...    (1)

for the independent blocks, and

p'_d = (t'_l · X_d) / (t'_l · t_l)    (2)

for the dependent block. By convention, vectors are defined as column-vectors, transposition is indicated by the prime symbol and the dot represents the inner (matrix) product. Once a PLS component has been computed, one may derive residual blocks by
Figure 2. ‘Dance step’ diagram indicating a full cycle of the PLS iteration. The backward step generates the latent vectors from the independent blocks. The resulting latent block is matched in the forward step against the dependent block. The cycle starts with an initial estimate for the dependent score vector (e.g. the dependent column with the largest variance). The cycle ends with an update of the dependent score vector. Following this, a new cycle is initiated, until convergence is obtained. See text for more details.
removing the variance explained by the previously computed component:
X_i ← X_i − t_i · p'_i    with i = 1, 2, ...    (3)

for the independent blocks, and

X_d ← X_d − t_l · p'_d    (4)
for the dependent block. The process can be repeated for a specified number of components and until exhaustion of the information in the blocks. In order to understand the inner workings of the PLS algorithm we propose a diagram which contains what may appear as a sequence of 'dance steps' (Fig. 2). In step 1 we multiply the initial estimate for the dependent scores td with the first independent block X1 to produce the weight vector w1. The latter is again multiplied in step 2 with the same block X1 to yield the scores t1. Step 3 stores the scores t1 into a column of the latent block Xl. The same sequence is applied to the second independent block X2 in steps 1, 2 and 3, resulting in the storage of a second latent vector in Xl. Similar steps are executed for each additional independent block in the model, resulting in as many latent variables in Xl. This completes the backward phase of the cycle.
(We call it backward as we proceeded from the dependent block towards the independent ones.) The forward phase involves the latent block Xl and the dependent block Xd. In step 4 we once again multiply the initial estimate for the scores td with Xl to produce the latent weights wl. In a subsequent step 5 we multiply wl with Xl to yield the latent scores tl. Then tl is multiplied in step 6 with Xd to produce the dependent weights wd. Finally, in step 7 we multiply wd with Xd to obtain a new estimate for the dependent scores td. We are now brought to the situation before step 1, and we are ready to begin a new cycle. We complete as many cycles as required by the tolerance that has been defined for convergence of the components. Note that all weight vectors w are normalized within each cycle of the algorithm, as indicated by circular paths on Figure 2. This way, numerical values cannot become extremely small or extremely large during the numerical computations. 'Dance step' diagrams can be constructed for any type of path model in PLS. The idea has been inspired by a flow diagram used by Geladi [1988] for explaining the steps in the iteration of the NIPALS algorithm.
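To make the cycle concrete, the following Python sketch implements one component of the multiblock PLS iteration for an arbitrary number of independent blocks. It is a simplified reading of the algorithm as described here and in Wangen and Kowalski [1988], not the authors' own program; the convergence tolerance and the iteration limit are arbitrary choices.

```python
import numpy as np

def multiblock_pls_component(X_blocks, Xd, tol=1e-10, max_iter=500):
    """One multiblock PLS component following the 'dance step' cycle of Figure 2.

    X_blocks is a list of independent blocks, Xd the dependent block; all blocks
    must have the same number of rows (objects).
    """
    n = Xd.shape[0]
    # Initial estimate: the dependent column with the largest variance
    td = Xd[:, np.argmax(Xd.var(axis=0))].copy()

    for _ in range(max_iter):
        td_old = td.copy()
        # Backward phase: one latent score vector per independent block
        Xl = np.empty((n, len(X_blocks)))
        t_ind = []
        for i, Xi in enumerate(X_blocks):
            wi = Xi.T @ td
            wi /= np.linalg.norm(wi)          # weights are normalized each cycle
            ti = Xi @ wi
            t_ind.append(ti)
            Xl[:, i] = ti                     # store the score vector in the latent block
        # Forward phase: match the latent block against the dependent block
        wl = Xl.T @ td
        wl /= np.linalg.norm(wl)
        tl = Xl @ wl
        wd = Xd.T @ tl
        wd /= np.linalg.norm(wd)
        td = Xd @ wd                          # updated dependent scores
        if np.linalg.norm(td - td_old) < tol * np.linalg.norm(td):
            break

    # Loadings (equations (1) and (2)) and residual blocks (equations (3) and (4))
    p_ind = [(ti @ Xi) / (ti @ ti) for ti, Xi in zip(t_ind, X_blocks)]
    pd = (tl @ Xd) / (tl @ tl)
    X_res = [Xi - np.outer(ti, pi) for Xi, ti, pi in zip(X_blocks, t_ind, p_ind)]
    Xd_res = Xd - np.outer(tl, pd)
    return t_ind, tl, td, p_ind, pd, X_res, Xd_res
```

Further components would be obtained by calling the same routine on the residual blocks returned at the end.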
3. Application

The development of a novel therapeutic drug requires the synthesis of some 4,000 new chemical compounds, of which one eventually will be selected as a therapeutic drug. The selection process is complex and lengthy and takes on average about eight years to complete. For these reasons it is important to be able to make good predictions from available intelligence at an early stage. We show how PLS can be applied to the prediction of so-called neuroleptic activity. Neuroleptics are used in psychiatry for the control of severe mental disorders, such as schizophrenia and mania. The first neuroleptic compounds that were discovered by Delay and Deniker in 1952 belong to the class of phenothiazines. Subsequently, in 1958, Janssen found strong neuroleptic activity in the class of butyrophenones, later to be followed by other classes of chemical compounds. The search for the ideal neuroleptic has not ended yet. It must at the same time be highly potent, very specific, devoid of side effects, easy to administer and simple to produce. Presently, there are two basic approaches in the search for neuroleptic activity of newly synthesized compounds. The classical way is to observe behavioral effects in live animals such as rats in pharmacological screening. Neuroleptics are known to counteract agitation, stereotyped behavior, seizures, tremors, as well as the lethal effects produced by adrenalin. The numbers in Table 1 show the results in these pharmacological tests as observed for a set of 17 reference neuroleptics [Niemegeers and Janssen 1979]. The numbers in the table represent the doses required to produce a given effect in half of the number of animals. Note that the smaller numbers are associated with the more potent compounds. More recently, biochemical tests are used in which homogenized brain tissue from rats is incubated with a radio-active marker [Leysen et al. 1981]. These markers are known
TABLE 1
Pharmacological block of independent data. Median effective doses (mg/kg) after subcutaneous administration of 17 reference neuroleptics to rats as observed in 5 behavioral tests [Niemegeers C. 1979].

                    Agitation   Stereotypy   Seizures   Tremors   Mortality
BENPERIDOL             .01         .01          .33        .33       1.34
CHLORPROMAZINE         .25         .29          .88        .58       1.54
CHLORPROTHIXENE        .17         .38          .51        .77        .51
CLOTHIAPINE            .10         .13          .38       1.54       3.10
CLOZAPINE             6.15       10.70         3.07       4.66      14.10
DROPERIDOL             .01         .01          .51        .67        .77
FLUPHENAZINE           .06         .06          .58       1.77       2.33
HALOPERIDOL            .02         .02          .77       5.00       5.00
PENFLURIDOL            .22         .33       160.00     160.00     160.00
PERPHENAZINE           .04         .04          .51       1.34       3.08
PIMOZIDE               .05         .05        40.00      40.00      40.00
PIPAMPERONE           3.07        5.35          .58       2.03       6.15
PROMAZINE             3.07        4.66         9.33      18.70       4.06
SULPIRIDE            21.40       21.40       320.00     320.00   1,280.00
THIORIDAZINE          4.06        5.35        10.70      28.30       1.16
THIOTIXENE             .22         .22        21.40     320.00       4.06
TRIFLUOPERAZINE        .04         .06         1.77       4.67      40.00
to bind to so-called receptors, i.e., proteins within the membranes of nerve cells that mediate in the transmission of brain signals. Some compounds such as Haloperidol and Apomorphine attach to dopamine-sensitive receptors. An excess of dopamine is associated with psychotic symptoms such as hallucinations and mania. Compound WB4101 adheres to receptors that are specifically stimulated by so-called adrenergic substances, which are related to norepinephrine and which are involved in severe agitation and anxiety. Finally, Spiperone binds specifically to receptors that are sensitive to serotonin, which mediates in the regulation of mood and sleep. Results from biochemical binding studies on the set of 17 reference neuroleptics are presented in Table 2. The numbers in this table are related to the amount of test substance that is required in order to displace half of a marker substance which has bound specifically to a type of receptor. Note that the smaller numbers also indicate the more potent compounds. The clinical scores of Table 3 have been obtained from patients that have been treated with one of the 17 aforementioned reference neuroleptics [Bobon e.a. 1972]. These scores range from 0 to 5 and express various clinical properties of neuroleptics, including ataractic (tranquillizing), antimanic, antiautistic and antidelusional effects. In addition, two undesirable side effects of neuroleptics have been scored on the same 0 to 5 scale, namely stimulation of extrapyramidal nerve cells (which causes rigidity, restlessness
TABLE 2
Biochemical block of independent data. Inhibition constants (nM) derived from median inhibitory concentrations of 17 reference neuroleptics as observed in binding experiments on homogenized brain tissues of rats with 4 radioactively labeled markers [Leysen J. 1981].

                    Haloperidol   Apomorphine   Spiperone    WB4101
BENPERIDOL              .4            .2            6.6         2.3
CHLORPROMAZINE        49.6           4.5           20.2         1.7
CHLORPROTHIXENE       11.0           5.6            3.3         1.0
CLOTHIAPINE           15.7           5.6            6.0        14.4
CLOZAPINE            156.0          56.0           15.7         7.3
DROPERIDOL              .8            .4            4.1          .8
FLUPHENAZINE           6.2           2.2           32.8         8.9
HALOPERIDOL            1.2           1.3           48.0         3.1
PENFLURIDOL            9.9           8.9          232.0       363.0
PERPHENAZINE           3.9           3.6           33.0        91.3
PIMOZIDE               1.2           1.4           32.8        41.0
PIPAMPERONE          124.0          16.0            5.0        46.0
PROMAZINE             99.0          38.0          328.0         2.5
SULPIRIDE             31.3          44.7       26,043.0     1,000.0
THIORIDAZINE          15.7           8.9           36.0         3.2
THIOTIXENE             2.5           2.2            9.2        12.9
TRIFLUOPERAZINE        3.9           2.2           41.2        20.4
among other neural symptoms) and adrenolytic effects (which result in a decrease of blood pressure and other neurovegetative symptoms). Tables 1 and 2 document the two laboratory situations that are comparatively easy to assess, while Table 3 represents the real-life situation, which can only be evaluated after a laborious preclinical process. PLS allows one to find out which elements of the laboratory studies are most predictive for the clinical result. Vice versa, PLS also identifies those elements of the clinical study that correlate, or that lack correlation, with the laboratory observations. Prior to performing the PLS analysis, we have applied a number of preprocessing steps to the various blocks of data. The independent blocks (Tables 1 and 2) were transformed into negative logarithms and double-centered (by rows and columns). The dependent block (Table 3) was only double-centered. The effect of negative logarithms is to correct for severe positive skewness in the data and to transform dose values into suitable measures of therapeutic activity. Double-centering removes differences in sensitivity between tests and corrects for differences in potency between compounds. These preprocessing steps are also part of spectral map analysis, a variant of principal components analysis [Lewi 1976, 1989].
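A minimal sketch of these preprocessing steps is given below; the base-10 logarithm is an assumption, since the text does not specify the base.

```python
import numpy as np

def preprocess_independent(block):
    """Negative logarithms followed by double-centering (rows and columns),
    as applied to the pharmacological and biochemical blocks."""
    x = -np.log10(np.asarray(block, dtype=float))   # base 10 assumed; corrects positive skewness
    x = x - x.mean(axis=1, keepdims=True)           # center rows (compounds)
    x = x - x.mean(axis=0, keepdims=True)           # center columns (tests)
    return x

def preprocess_dependent(block):
    """Double-centering only, as applied to the clinical block."""
    x = np.asarray(block, dtype=float)
    x = x - x.mean(axis=1, keepdims=True)
    x = x - x.mean(axis=0, keepdims=True)
    return x
```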
TABLE 3
Clinical block of dependent data. Scores of 17 neuroleptics observed in psychiatric clinics on 0 to 5 scales according to 4 therapeutic effects and 2 side effects [Bobon J. 1972]. Ataractic
manic
3 4 3 3
4 2 2 4
4 3 2 2 1 2 2 4 4 1
3 5 4 4 2 3 2 2 1 0 1 1 4
BENPERIDOL CHLORPROMAZINE CHLORPROTHIXENE CLOTHIAPINE CLOZAPINE DROPENDOL FLUPHENAZINE HALOPERIDOL PENFLURIDOL PERPHENAZINE PIMOZIDE PIPAMPERONE PROMAZINE SULPIRIDE THIORIDAZINE THIOTIXENE TRIFLUOPERAZINE
4 1
2
Anti-
Antiautistic
Antidelusion.
Extrapyramid.
Adrenolytic
0 1 1 0 3 0 2 2 3 2 3 4 0 4 1 3 2
2 3 2 2 4 0 4 5
3 2 1
2 3 2 3 3 1 1 1 1 1
4
3 4
4 1 3 2 3 4
3 1
5 4 4 2 4 2 2 0 2 0 2 4
1
4 3 1 2 1 1 -
4. Result and discussion

The first two PLS components of the pharmacological situation are represented in the biplot of Figure 3. A biplot is a joint representation of the objects and the variables of a given data table by means of their corresponding scores and loadings [Gabriel, 1972]. The score and loading vectors for each PLS component have been obtained in the t and p vectors described above. The two components are represented by the horizontal and vertical dimensions of the biplot, respectively. The small cross at the center indicates the origin of PLS component space. Circles in this biplot represent the 17 reference compounds. Squares indicate the 5 different tests. Areas of circles reflect the average potency of the compounds. Areas of squares express the average sensitivity of the tests. Average potency and average sensitivity are defined here as the means of the numbers of Table 1, computed row-wise and column-wise respectively. The positions of the circles and the squares are defined by the score and loading vectors extracted by PLS from the pharmacological block of data. Looking at the squares on the pharmacological biplot of Figure 3 one observes immediately three polarizing groups of tests: agitation and stereotypy on the right, tremors and seizures in the upper left corner, and mortality in the lower left corner. Compounds displayed at the right, among which Penfluridol and Pimozide, are specific
Figure 3. PLS biplot of the pharmacological block of independent data (Table 1). Data have been transformed into negative logarithms and double-centered prior to PLS analysis. Circles represent compounds and squares represent tests. Areas of circles are proportional to the average potency of the compounds. Areas of squares are proportional to the average sensitivity of the tests. A three-fold interaction appears in this biplot. This is related to three types of receptors in the brain which mediate in the control of three neurotransmitter substances: dopamine, serotonin and norepinephrine (which is a synonym of adrenalin). See text for a discussion.
inhibitors of agitation and stereotypy. Those in the upper left area, including Pipamperone and Clozapine, are specific blockers of tremors and seizures. Finally, those at the lower left, especially Promazine and Thioridazine, counteract most specifically mortality induced by adrenalin. (Norepinephrine is a synonym of adrenalin.) On the biochemical biplot of Figure 4 we observe the same polarizing pattern. The reading rules for this biplot are the same as above. At the right we see selective binding to the dopamine receptor by radio-actively labeled Haloperidol and Apomorphine. At the upper left we have binding to the serotonin receptors which are specifically labeled by Spiperone. Finally, in the lower left corner we obtain binding to the adrenergic receptor which is specifically marked by compound WB4101. Among the specific dopamine-blocking neuroleptics we find Penfluridol and Pimozide. Specific serotonin antagonists in this set are Pipamperone and Clozapine. Finally, Promazine and Thioridazine are displayed as specific adrenergic blocking neuroleptics. The degree of correlation between the positions of the compounds on the two biplots of Figures 3 and 4 is remarkable. What emerges is a classification of these compounds according to a biological model of interactions with three receptors which has not been specified in advance. These two views of the set of neuroleptics have been produced through the eyes, so to speak, of the clinicians, the PLS biplot of which is shown in
Figure 4. PLS biplot of the biochemical (receptor binding) block of independent data (Table 2). The conventions and transformations relating to this biplot are the same as in Figure 3. The same polarizations appear as observed in the pharmacological biplot. They reflect binding of compounds with the three types of receptors that are sensitive to dopamine, serotonin and norepinephrine. See text for a discussion.
Figure 5. PLS biplot of the clinical block of dependent data (Table 3). Data have been double-centered prior to PLS analysis. The pattern of compounds appears in close correlation with those obtained from the pharmacological and biochemical results (Figs. 3 and 4). The main polarization produced by the clinical variables accounts for specificities either for the dopamine or the norepinephrine receptor. The specificities for the serotonin receptor are predicted from the pharmacological and biochemical data, although a correlating variable is lacking in the clinical data. See text for a discussion.
Figure 5. At the right of the dependent biplot we find typical symptoms of dopamine receptor blockade: antidelusional, anti-autistic, antimanic effects and side effects resulting from extrapyramidal stimulation. At the lower left we find the typical symptoms produced by blockade of the adrenergic receptor, i.e., ataractic (tranquillizing) effect and adrenolytic (neurovegetative) side effects. There is no clinical counterpart in this biplot, however, for the effects of serotonin receptor blockade which has been demonstrated on the pharmacological (Fig. 3) and biochemical (Fig. 4) biplots. Indeed, adequate clinical scores for this interaction of neuroleptics are lacking in this data set. (The relevance of serotonin receptors in the brain was not generally recognized by psychiatrists at the time when these clinical scores were compiled.) Notwithstanding this, Pipamperone and Clozapine stand out in the upper left corner of Figure 5, exactly at the place where we would expect them from the pharmacological and biochemical predictions of Figures 3 and 4. Penfluridol and Pimozide appear at their proper place on the far right. Promazine and Thioridazine are displayed at the lower left in agreement with their known pharmacological and receptor binding profiles.

Table 4 shows the percentages of variance explained in each of the three blocks (Tables 1, 2 and 3). While these percentages appear to be very high for the two independent blocks (96 and 98 percent for the pharmacological and biochemical blocks respectively), it is much lower for the dependent block (only 35 percent). Most of the independent variables are well-represented in the two-component PLS biplots of Figures 3 and 4. Some of the clinical variables, however, are poorly represented on the biplot of Figure 5, especially the anti-autistic effect for which there seems to be no good predictor variables in either the pharmacological or biochemical test batteries. Most strongly represented are ataractic effects and adrenolytic side effects at the adrenergic blocking side, and the antidelusional effect, antimanic effect and extrapyramidal side effects at the dopamine blocking side of the biplot.

TABLE 4
Percentages of variance explained by the first two PLS components of each of the three blocks of data and of the variables that appear in each of these blocks.

Pharmacological tests     96
  Agitation               99
  Stereotypy             100
  Seizures                89
  Tremors                 86
  Mortality               99

Biochemical tests         98
  Haloperidol             98
  Apomorphine             88
  Spiperone               99
  WB4101                 100

Clinical observations     35
  Ataractic               75
  Antimanic               12
  Antiautistic             7
  Antidelusional          14
  Extrapyramidal          47
  Adrenolytic             60
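Percentages of variance explained, such as those reported in Table 4, can be obtained by comparing each block with its residual after extraction of the PLS components. The helper below shows one common way of doing this; the exact formula used by the authors is not stated in the text.

```python
import numpy as np

def variance_explained(X_original, X_residual):
    """Percentage of variance explained per block and per variable, given the
    (double-centered) original block and its residual after component extraction."""
    ss_tot = (np.asarray(X_original, dtype=float) ** 2).sum(axis=0)
    ss_res = (np.asarray(X_residual, dtype=float) ** 2).sum(axis=0)
    per_variable = 100.0 * (1.0 - ss_res / ss_tot)
    per_block = 100.0 * (1.0 - ss_res.sum() / ss_tot.sum())
    return per_block, per_variable
```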
The result of multiblock PLS analysis can be applied in different ways. It can be used to optimize the variables within the independent blocks, by removing redundant ones and by adding new ones that improve the clinical prediction. This type of analysis also points the way in which the dependent real-life variables can be optimized in order to account for important phenomena that are discovered independently in the laboratory.
References
Bobon J, Bobon DP, Pinchard A, Collard J, Ban TA, De Buck R, Hippius H, Lambert PA, Vinar O. A new comparative physiognomy of neuroleptics: a collaborative clinical report. Acta Psychiat Belg 1972; 72: 542-554.
Delay J, Deniker P. Méthodes chimiothérapiques en psychiatrie. Paris: Masson, 1961.
Gabriel KR. The biplot graphic display of matrices with applications to principal components analysis. Biometrika 1971; 58: 453-467.
Geladi P. Notes on the history and nature of partial least squares (PLS) modelling. J Chemometrics 1988; 2: 231-246.
Gerlach RW, Kowalski BR, Wold HO. Partial least squares path modelling with latent variables. Anal Chim Acta 1979; 112: 417-421.
Janssen PAJ, Niemegeers CJE, Schellekens KHL. Is it possible to predict the clinical effects of neuroleptic drugs (major tranquillizers) from animal data? Arzneim Forsch (Drug Res) 1965; 15: 104-117.
Lewi PJ. Spectral Mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim Forsch (Drug Res) 1976; 26: 1295-1300.
Lewi PJ. Spectral Map Analysis. Analysis of contrasts, especially from log-ratios. Chemometrics Intell Lab Syst 1989; 5: 105-116.
Leysen JE. Review of neuroleptic receptors: specificity and multiplicity of in-vitro binding related to pharmacological activity. In: Usdin E, Dahl S, Gram LF, Lingjaerde O, eds. Clinical pharmacology in psychiatry. London: Macmillan, 1981: 35-62.
Niemegeers CJE, Janssen PAJ. A systematic study of the pharmacological activities of DA-antagonists. Life Sciences 1979; 24: 2201-2216.
Wangen LE, Kowalski BR. A multiblock partial least squares algorithm for investigating complex chemical systems. J Chemometrics 1988; 3: 3-10.
CHAPTER 18
Dynamic Modelling of Complex Enzymatic Reactions

E.C. Ferreira and J.C. Duarte

Laboratório Nacional de Engenharia e Tecnologia Industrial, Departamento de Tecnologia de Indústrias Químicas, 2745 Queluz de Baixo, Portugal
Abstract

The purpose of this work is to extend the application of non-linear modelling techniques to complex enzymatic systems, aiming for the optimization and the adaptive control of a process reactor. A state space model based on mass balance considerations is used for the design of observers as 'software sensors' for the on-line estimation of nonmeasured state variables from the measured ones, which constitutes a valuable alternative given the lack of reliable sensors for on-line measurement. The biochemical system under consideration is the enzymatic synthesis of an antibiotic (ampicillin) performed by an enzyme normally used as a hydrolase (P. acylase). The substrates are 6-aminopenicillanic acid (6-APA) and the methyl D-phenylglycine ester (MPG). Reactions were performed in a continuous stirred vessel reactor. Substrate and product concentrations were determined by HPLC.
1. Introduction

The utilization of enzymes for the production of fine chemicals is highly advantageous due to the very high chemical and stereospecificity of enzymatic reactions when compared with most chemical reactions. The use of enzymes for performing synthetic chemical reactions allows the use of mild environmental conditions of mixing, temperature and pressure, which are features of particular importance in systems with unstable substrates and/or products. Enzymatic reactions will normally be conducted in the range of 20 to 50°C and at atmospheric pressure. However, some limitations have hampered a more general use of enzymes: lower volumetric rates, problems with enzyme stability and reuse, the poor solubility of most organic chemicals in water, and the relative unfamiliarity of most chemists with enzyme kinetics. In spite of these limitations, enzyme technology has been used successfully in some important industries such as those for detergents and starch, where mainly hydrolytic enzymes, e.g., proteases and amylases, are required. However, in a growing number of cases enzymes are also being used for the synthesis of
Figure 1. Mechanism of a hydrolase-catalyzed group transfer reaction MPG + 6-APA -> Amp + MeOH via an acyl-enzyme that is deacylated by the nucleophiles H2O or 6-APA [Kasche et al., 1984].
antibiotics, steroid hormones and peptides. Improved application of enzymes in organic synthetic chemistry will come along as methods for enzyme immobilization and reuse are improved. More active enzymes (e.g., those from some thermophilic organisms) and screening of enzymes for use in organic solvents will also contribute to a more general use of enzymes in chemistry. Interesting applications are the enzyme-catalyzed kinetic resolutions of racemates. For example, hydrolysis of chiral esters will yield mainly optically active acids or alcohols. This method is particularly attractive when it is accompanied by a spontaneous in situ racemization of the two unwanted isomers. This is the case with the enzymatic resolution of several amino acids by acylases, aminopeptidases and esterases [Sheldon, 1990]. Another interesting application of synthetic enzyme catalysis is in the aspartame synthesis via an enantioselective enzymatic coupling of DL-phenylalanine with an aspartic acid derivative (DSM-Toyo Soda process). It is obvious that to run these complex enzymatic reactions in a reactor a good description of the reaction kinetics must be available. However, the classical enzyme kinetics approach is based mainly on the Michaelis-Menten approximation of initial rates, which can be of little utility in continuous and semi-continuous reactor regimes. Let us consider here the penicillin amidase catalyzed synthesis of semi-synthetic penicillins, as is the case under study of the ampicillin synthesis:

PG + 6-APA -> Amp + H2O
The hydrolase-catalyzed group transfer reaction may be described by more than one mechanism [Kasche et al., 1984]. In the case with only one acyl-enzyme intermediate (Fig. 1) the expressions for kcat and Km (constants of the Michaelis-Menten equation) for the formation of ampicillin and phenylglycine represented in Table 1 seem to be dependent on the rate constants of the hypothesized mechanism and on the concentration of
TABLE 1
Expressions for kcat and Km for the formation of PG and Ampicillin.
Product
PG
Amp.
Km
kcal
1'
k3[H2
k2
k2+ k3[H20]+ k;[6APAI k:J x kJ6AF'Al L
( k3[H20]+ kj[6AF'Al)( kl( k,+ k3[H20]
k-l
+ k,J( KL + [6APA I)
+ kj[6APAl)
x KL
k2 + k3[ H, 0 3 + kj[6AF'Al
6-APA. Furthermore, the integration of the Michaelis-Menten equation may become even more difficult and dependent on the instantaneous concentrations of the reactants. Otherwise, expressions for the enzyme reactor kinetics are not easily available. It is evident that, to describe the system dynamics and to be able to do reactor design and to implement control strategies, the use of methodologies other than traditional kinetics may be of advantage, mainly if they do not require a previous rigorous knowledge of the mechanism and kinetics of the system. This is exactly the objective of methodologies that approach the modelling of biological systems by the use of on-line estimation techniques, as described by Bastin and Dochain [1990]. In principle these methodologies may be extended to enzymatic reactions; this will be developed in the following sections, using the ampicillin enzymatic synthesis as a typical case study.
2. Description of the enzymatic process

The reaction scheme under consideration (Fig. 2) is based on the work of V. Kasche [1986]. The most important reaction of this scheme is the synthesis of ampicillin:

i)   MPG + 6-APA --φ1--> Amp + MeOH
In the case here considered the enzyme substrate, phenylglycine (PG), was substituted by its methyl ester (MPG), because when using MPG the reaction proceeds much more quickly. However MPG is also a substrate for the enzyme and we must consider its hydrolytic reaction which can also proceed chemically:
ii)   MPG + H2O --φ2--> PG + MeOH
To complete the description of the system we have yet to write a reaction for the
Figure 2. Reaction scheme for ampicillin synthesis. 6-APA = aminopenicillanic acid, Amp = ampicillin, MPG = methyl ester of phenylglycine, PG = phenylglycine, MeOH = methanol.
enzymatic deacylation of ampicillin by penicillin acylase:

iii)   Amp + H2O --φ3--> 6-APA + PG
The process could be operated in continuous mode in a continuous stirred tank type reactor with the enzyme penicillin amidase (E.C. 3.5.1.11, from E. coli) immobilized with glutaraldehyde (a gift from CIPAN-Antibioticos de Portugal). The two substrates MPG and 6-APA were added at various feed rates at constant initial concentration values (respectively 100 mM and 40 mM). The fact that PG is also present in the feed, due to the chemical hydrolysis of MPG in water solution, was also considered. Experimental conditions were: temperature 37°C, pH 6.5, enzyme activity 19.41 UI, reactor volume 50 ml.
3. Kinetics state space model

The design of the observer derived in section 4 is based on a kinetics state space model obtained from mass balance considerations. In stirred tank reactors (STR), the process is assumed to be in a completely mixed condition (compositions homogeneous in the reactor). Therefore the following standard continuous STR state space model can be written:
with:
S1 the 6-APA concentration
S2 the MPG concentration
P1 the PG concentration
P2 the ampicillin concentration
P3 the methanol concentration
φi the reaction rates
ki the yield coefficients
D the dilution rate, defined as the quotient between the influent volumetric flow rate and the volume of the medium
Fi the substrate feed rates.
4. Design of observers for biochemical processes Equations (1.a) to (1.e) may be seen as a special case of the following general nonlinear state space model, written in matrix form
whcrc: 5 (dimension n) is the state vector; K (dimension n x p ) is a matrix involving either yield coefficients or “0” or “1” entries; cp (dimension p ) is a vector of reaction rates; U (dimension n) is a vector representing inlet flow rates of substrates.
For the model (1), these notations correspond to the state vector ξ = [S1 S2 P1 P2 P3]', the reaction rate vector φ = [φ1 φ2 φ3]', the yield matrix K built from the coefficients k1 to k7 (together with "0" and "-1" entries) according to the reaction scheme, and the feed vector

U' = [F1 F2 F3 0 0]
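To make the structure of model (2) concrete, the following Python sketch integrates it with a forward Euler step. It is purely illustrative: the yield matrix K, the rate function phi and all numerical values must be supplied by the user and are not the ones identified for the ampicillin reactor in this study.

```python
import numpy as np

def simulate_general_model(K, phi, D, U, xi0, dt=0.1, n_steps=3000):
    """Forward-Euler integration of the general model (2): dxi/dt = K*phi(xi) - D*xi + U.

    K   : (n x p) matrix of yield coefficients (and 0/-1/1 entries)
    phi : callable returning the p reaction rates for a given state xi
    D   : dilution rate, either a constant or a callable of time
    U   : (n,) vector of inlet feed rates
    xi0 : (n,) initial state
    """
    xi = np.array(xi0, dtype=float)
    trajectory = [xi.copy()]
    for step in range(n_steps):
        d = D(step * dt) if callable(D) else D
        xi = xi + dt * (K @ phi(xi) - d * xi + np.asarray(U, dtype=float))
        trajectory.append(xi.copy())
    return np.array(trajectory)
```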
It is considered that the matrix K and the vector U are known, the vector φ is unknown, and p state variables are measured on-line (ξ1 denotes the vector of these measurements). The problem addressed here is to design an observer for the on-line estimation of the "n − p" nonmeasured state variables (ξ2 is the representative vector). A basic structural property of the state space model (2) [Bastin, 1988] allows the following rearrangement:

dξ1/dt = K1 φ − D ξ1 + U1    (3.a)
dξ2/dt = K2 φ − D ξ2 + U2    (3.b)

with obvious definitions of K1, K2, U1, U2. It must be noted that K1 is full rank (rank(K1) = p). There exists a state transformation:

Z = A ξ1 + ξ2    (4)

where A is a solution of the matrix equation

A K1 + K2 = 0    (5)

such that the state-space model (2) is equivalent to
dZ/dt = −D Z + (A U1 + U2)    (6)

and equation (3.a). Then the following asymptotic observer can be derived from (4) and (6):

dẐ/dt = −D Ẑ + (A U1 + U2)    (7.a)
ξ̂2 = Ẑ − A ξ1    (7.b)
where the symbol ^ denotes estimated values. The asymptotic convergence of this estimation algorithm may be found in Bastin and Dochain [1990]. The main advantages of this algorithm are its simplicity by comparison, for instance, with the extended Kalman filter, and its properties which allow the on-line estimation of state variables without the knowledge (nor the estimation) of reaction rates being necessary. These properties were previously shown to be very efficient for various fermentation processes [see e.g., Dochain et al., 1988].
5. Experimental validation

The application of the estimator algorithm defined above was considered for the case of ampicillin synthesis. As the on-line measurements of MPG and 6-APA (p = 2) are available by high performance liquid chromatography (HPLC), the state reconstruction is made possible.
Hence:

ξ1 = [S1 S2]',   ξ2 = [P1 P2 P3]'    (8)

and

A = [ −k3/k1        k3
       k5/k1         0
      (k6 − k7)/k1   k7 ]    (9)
The transformed state is then defined as follows:

Z1 = P1 − (k3/k1) S1 + k3 S2    (10.a)
Z2 = P2 + (k5/k1) S1    (10.b)
Z3 = P3 + [(k6 − k7)/k1] S1 + k7 S2    (10.c)

Then the estimator algorithm (7.a-b) is as follows:

dẐ1/dt = −D Ẑ1 − (k3/k1) F1 + k3 F2 + F3    (11.a)
dẐ2/dt = −D Ẑ2 + (k5/k1) F1    (11.b)
dẐ3/dt = −D Ẑ3 + [(k6 − k7)/k1] F1 + k7 F2    (11.c)
P̂1 = Ẑ1 + (k3/k1) S1 − k3 S2    (11.d)
P̂2 = Ẑ2 − (k5/k1) S1    (11.e)
P̂3 = Ẑ3 − [(k6 − k7)/k1] S1 − k7 S2    (11.f)
The computer implementation of the estimator requires a discrete-time formulation. This can be done simply by replacing the time derivative by a finite difference. A first order forward Euler approximation was used:

Ẑ(t_k+1) = Ẑ(t_k) + T [ −D(t_k) Ẑ(t_k) + A U1(t_k) + U2(t_k) ]

with T the sampling period.
The yield coefficients used are listed in Table 2. Coefficients k1 to k5 have been obtained from an identification study on various batch experiments, while k6 and k7 have been obtained from stoichiometric considerations.
TABLE 2
Yield coefficients.

k1 = 3.36   k2 = 0.250   k3 = 1.41   k4 = 12.0   k5 = 0.300   k6 = 0.300   k7 = 1.41
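As a rough illustration of how the discrete-time estimator can be implemented, the sketch below codes equations (11.a)-(11.f) with the Euler step described above. It is an assumed implementation, not the authors' program; in particular the initialization of the product estimates to zero and the default sampling period of 5 minutes are choices made here for illustration.

```python
import numpy as np

def ampicillin_observer(S1, S2, F1, F2, F3, D, k, T=5.0):
    """Discrete-time asymptotic observer for the ampicillin synthesis reactor.

    Implements equations (11.a)-(11.f) with a first-order forward Euler step of
    length T (minutes). S1, S2 are the measured 6-APA and MPG concentrations,
    F1, F2, F3 the inlet feed rates, D the dilution rate (all arrays of equal
    length, one entry per sampling instant); k = (k1, ..., k7) are the yield
    coefficients of Table 2. Initial product estimates are set to zero here
    (an assumption; the observer converges asymptotically in any case).
    """
    k1, k2, k3, k4, k5, k6, k7 = k
    n = len(S1)
    Z = np.zeros((n, 3))
    Z[0, 0] = -(k3 / k1) * S1[0] + k3 * S2[0]
    Z[0, 1] = (k5 / k1) * S1[0]
    Z[0, 2] = ((k6 - k7) / k1) * S1[0] + k7 * S2[0]
    P = np.zeros((n, 3))          # columns: PG, ampicillin, methanol estimates
    for t in range(n - 1):
        Z[t + 1, 0] = Z[t, 0] + T * (-D[t] * Z[t, 0] - (k3 / k1) * F1[t] + k3 * F2[t] + F3[t])
        Z[t + 1, 1] = Z[t, 1] + T * (-D[t] * Z[t, 1] + (k5 / k1) * F1[t])
        Z[t + 1, 2] = Z[t, 2] + T * (-D[t] * Z[t, 2] + ((k6 - k7) / k1) * F1[t] + k7 * F2[t])
    for t in range(n):
        P[t, 0] = Z[t, 0] + (k3 / k1) * S1[t] - k3 * S2[t]          # PG, eq. (11.d)
        P[t, 1] = Z[t, 1] - (k5 / k1) * S1[t]                       # ampicillin, eq. (11.e)
        P[t, 2] = Z[t, 2] - ((k6 - k7) / k1) * S1[t] - k7 * S2[t]   # methanol, eq. (11.f)
    return P
```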
The on-line state variables used by the observer are shown in Figure 3. The dilution rate and the inlet flow rates are shown in Figure 4 and Figure 5, respectively. The estimates of PG, ampicillin and methanol are shown in Figures 6, 7 and 8. A good agreement between the estimates and validation data (not used by the observer) has been observed. For the estimate of the methanol concentration a validation data set was not available.
6. Conclusions

The experimental validation of an asymptotic observer for the estimation of ampicillin, phenylglycine and methanol concentrations was presented in this paper. As was demonstrated, an important feature of this algorithm is to provide on-line estimates without requiring a previous rigorous knowledge of the system kinetics. Furthermore, the (specific) reaction rates, considered as time varying parameters, can be estimated on-line [Ferreira et al., 1990]. The main practical interest of these observers used as 'software sensors' is the fact that they constitute a valuable alternative given the lack of reliable sensors for on-line measurements of the main state variables. They are very cheap compared to the expensive and complex analytical methods usually used for the measurement of compounds like antibiotics. This software tool constitutes an important step towards the use of adaptive control methodologies for these enzymatic processes.
Figure 3. On-line measured state variables (T = 5 min).

Figure 4. Dilution rate profile.
Figure 5. Inlet flow rates.

Figure 6. Estimation of the nonmeasured state variable P1 (PG). (Off-line measured values not used by the estimator.)

Figure 7. Estimation of the nonmeasured state variable P2 (ampicillin). (Off-line measured values not used by the estimator.)

Figure 8. Estimation of the nonmeasured state variable P3 (methanol). (Off-line measured values not used by the estimator.)
Acknowledgements

This study has been supported by the Biotechnology Action Programme of the Commission of the European Communities and by Programa Mobilizador C&T, JNICT-Portugal. The authors thank E. Santos, L. Tavares and V. Sousa for experimental work on the ampicillin synthesis kinetics. We also thank G. Bastin for his advice and discussions on the work.
References
Bastin G. State estimation and adaptive control of multilinear compartmental systems: theoretical framework and application to biotechnological processes. In: New Trends in Nonlinear Systems Theory. Lecture Notes on Control and Information Science, no. 122. Springer Verlag, 1988: 341-352.
Bastin G, Dochain D. On-line estimation and adaptive control of bioreactors. Elsevier, 1990 (in press).
Dochain D, De Buyl E, Bastin G. Experimental validation of a methodology for on-line estimation in bioreactors. In: Fish NM, Fox RI, Thornhill NF, eds. Computer Applications in Fermentation Technology: Modelling and Control of Biotechnological Processes. Elsevier, 1988: 187-194.
Ferreira EC, Feyo de Azevedo S, Duarte JC. Nonlinear estimation of specific reaction rates and state observers for an ampicillin enzymatic synthesis. To be presented at the 5th European Congress on Biotechnology, Copenhagen, 1990.
Kasche V. Mechanism and yields in enzyme catalyzed equilibrium and kinetically controlled synthesis of β-lactam antibiotics, peptides and other condensation products. Enzyme Microb Technol 1986; 8, Jan: 4-16.
Kasche V, Haufler U, Zollner R. Kinetic studies on the mechanism of the penicillin amidase-catalysed synthesis of ampicillin and benzylpenicillin. Hoppe-Seyler's Z Physiol Chem 1984; 365: 1435-1443.
Sheldon R. Industrial synthesis of optically active compounds. Chemistry & Industry 1990; 7: 212-219.
CHAPTER 19
From Chemical Sensors to Bioelectronics: A Constant Search for Improved Selectivity, Sensitivity and Miniaturization

P.R. Coulet

Laboratoire de Génie Enzymatique, UMR 106 CNRS, Université Claude Bernard Lyon 1, 43 Boulevard du 11 Novembre 1918, F-69622 Villeurbanne Cedex, France
Abstract

The need in various domains for real-time information urgently requires the design of new sensors exhibiting a high selectivity and a total reliability, in connection with smart systems and actuators. This explains the strong interest dedicated to chemical sensors and particularly to biochemical sensors for such a purpose. These sensors mainly consist of a highly selective sensing layer capable of highly specific molecular recognition, intimately connected to a physical transducer. They can be directly used for the analysis of complex media. When the target analyte to be monitored is present and reaches the sensing layer, a physical or chemical signal occurs which is converted by a definite transducer into an output electrical signal. This signal, treated in a processing system, leads to a directly exploitable result. Enzyme electrodes are the archetype of the first generation of biosensors now commercially available. New generations are based on novel and promising transducers like field effect transistors or optoelectronic devices. Efforts are made to improve the selectivity and sensitivity of the sensing layer, to explore new concepts in transduction modes and to miniaturize both the probes and the smart signal processing systems. Groups including specialists of biomolecular engineering, microelectronics, optronics, computer sciences and automation, capable of developing a comprehensive interdisciplinary approach, will have a decisive leadership in challenging areas in the near future, especially in two of them where a strong demand exists: medical sensing and environmental monitoring.
1. Introduction

Improvements in the control and automation of industrial processes are urgently needed, especially in biotechnology processes [1], not only to increase both quality and productivity
but also to favor waste-free operations. However, analyses performed on line or very close to the process still remain difficult to perform. Process analytical chemistry, which is an alternative to time-consuming conventional analysis performed in central laboratories, is raising a lot of interest, and new generations of sensors, mainly chemical and biochemical, appear as promising tools in this field [2]. In the domain of health, monitoring vital parameters in critical care services, using for example implantable sensors, still remains a challenging goal. The growing consciousness throughout the world of our strong dependence on environmental problems appears also as a powerful stimulation for developing new approaches and new concepts. All these different factors are a real chance for boosting a fruitful interdisciplinary approach towards the "bioelectronics frontier".
2. Chemical sensors

A chemical sensor can be defined as a device in which a sensing layer is intimately integrated within, or closely associated to, a physical transducer able to detect and monitor specifically an analyte [3]. As a matter of fact, it is quite difficult to stick to a fully unambiguous definition, and we will consider a chemical sensor as a small probe-type device which can be associated with a more or less sophisticated signal processing system including for instance digital display of results, special outputs for computers, etc. This device must provide direct information about the chemical composition of its environment. Ideally, it must exhibit a large autonomy, be reagentless, and respond rapidly, selectively and reversibly to the concentration or activity of chemical species for long periods. Stability and sensitivity to interferent species are in fact the two main bottlenecks to overcome in designing chemical sensors and particularly biosensors.
3. Biosensors

Biosensors can be considered as highly sophisticated chemical sensors which incorporate in their sensing layer some kind of biological material conferring to the probe a very high selectivity.
Affinity and specific requirements for the biomolecules to be fully active

As a matter of fact, it is interesting to take advantage of the different types of biomolecules which are capable of molecular recognition and may present a strong affinity for other compounds. Among them, the most interesting couples are:
- enzyme / substrate,
- antibody / antigen,
- lectin / sugar,
- nucleic acids / complementary sequences,
to which we can add chemoreceptors from biological membranes. Microorganisms, animal or plant whole cells and even tissue slices can also be incorporated in the sensing layer. Up to now, the two main classes widely used in the design of biosensors are enzymes and antibodies. Enzymes are highly specialized proteins which specifically catalyze metabolic reactions in living organisms. They can be isolated, purified and used in vitro for analytical purposes in conventional methods. Antibodies are naturally produced by animals and human beings reacting against foreign substances. They can be obtained by inducing their production, for instance in rabbits or mice, and collected for use as analytical reagents in immunoassays. It must be kept in mind that most of these biological systems have extraordinary potentialities but are also fragile and must be used under definite conditions. For instance, most enzymes have an optimal pH range where their activity is maximal, and this pH zone has to be compatible with the characteristics of the transducer. Except for very special enzymes capable of undergoing temperatures above 100°C for several minutes, most of the biocatalysts must be used in a quite narrow temperature range (15°C-40°C). In most situations, an aqueous medium is generally required and this has to be taken into account when specific applications are considered in gaseous phases or organic solvents, for instance. Stability of the bioactive molecule is certainly the main factor to consider and is in most cases an intrinsic property of the biological material, very difficult to modify.
Biomolecular sensing and transducing mode

Two main phenomena acting in sequence have to be considered for designing a biosensor (Fig. 1):
- the selective molecular recognition of the target molecule, and
- the occurrence of a first physical or chemical signal consecutive to this recognition, converted by the transducer into a second signal, generally electrical, with a transduction mode which can be either
- electrochemical,
- thermal,
- based on a mass variation,
- or optical.
Molecular recognition and selectivity

Prior to examining the different possible combinations between bioactive layers and
Figure 1. Schematic configuration of a chemical or biosensor.
transducers, two points must be underlined. The first one concerns the intrinsic specificity of the biomolecule involved in the recognition process. For instance, if enzymes are considered, this specificity may strongly vary depending on the spectrum of the substrates they can accept. Urease is totally specific for urea; glucose oxidase is also very specific for β-D-glucose and oxidizes the alpha anomer at a rate lower than 1%, but other systems
like alcohol oxidases or alcohol dehydrogenases accept several primary alcohols as substrates, and amino acid oxidases will respond to a large spectrum of amino acids as well. For antibodies this specificity can be strongly enhanced by using monoclonal antibodies, now widely produced in many laboratories. The second important point is the degree of bioamplification obtained when molecular recognition occurs. If the bioactive molecule present in the sensing layer is a biocatalyst, a variable amount of product will be obtained in a short time depending on its turnover: this corresponds to an amplification at the step generating the physicochemical signal. By contrast, using antibodies to detect antigens, or vice versa, is not normally a biocatalysis phenomenon and this will have to be taken into account for the choice of the transducer.
Immobilization of the biological system in the sensing layer

The simplest way of retaining bioactive molecules in the immediate vicinity of the tip of a transducer is to trap them on its surface covered by a permselective membrane. This has been used in a few cases, but in most of the devices which have been described in the literature the bioactive molecules are immobilized in the sensing layer. Different methods of immobilization have been available for several years [4], derived from the preparation of bioconjugates [5] now widely used in enzyme immunoassays, and also from the development of heterogeneous enzymology [6]. The two main methods consist either in the embedding of the biomolecule inside the sensing layer coating the sensitive part of the transducer, or in its covalent grafting onto a preexisting membrane maintained in close contact with the transducer tip. Details will be given with the description of enzyme electrodes. Several classes of biosensors have been described based on the different types of transducers. A short description will be made for each of them, with some emphasis on electrochemical transduction and enzyme electrodes, historically the first type of biosensor now on the market.
3.1 Biosensors based on electrochemical transduction. Enzyme electrodes

Associating an enzyme with an electrochemical transduction, greatly increasing the selectivity of an amperometric electrode, was proposed by Clark and Lyons more than 25 years ago [7]. Since this first attempt, a large variety of enzyme electrodes have been described and a very abundant literature has been published on this subject, periodically reviewed [8, 9]. Glucose determination using glucose oxidase in the sensing layer is obviously the most popular system in the enzyme electrode field. Several explanations can be found for
this: a high selectivity and the fact that glucose oxidase contains FAD. This cofactor, involved in the oxidoreduction cycle, is tightly bound to the enzyme, which appears as a real advantage for designing a reagentless probe when compared to NADH-based reactions with specific dehydrogenases. In the latter case NADH acts as a cosubstrate which must be supplied in the reaction medium. Besides the wide demand for glucose tests, not only in clinical biochemistry but also in fermentations or cell cultures, the very high stability of this enzyme appears as the key factor for its wide use in the design of most of the biosensors described today. The principle of glucose determination using the enzyme glucose oxidase as biocatalyst is the following:

β-D-glucose + O2 + H2O --(glucose oxidase)--> gluconic acid + H2O2

When considering first order kinetics conditions, glucose oxidase activity is directly proportional to glucose concentration according to the simplified Michaelis-Menten model of enzyme kinetics. Theoretically this activity can be followed by either the consumption of O2, the appearance of H+ from gluconic acid, or H2O2. Practically, the appearance of H+ is very difficult to monitor due to the use of buffered media, so the systems which have been described, and which led to the design of commercially available instruments, are based on either O2 or H2O2 monitoring, or the use of mediators. Other oxidases specific for different metabolites and leading to hydrogen peroxide can also be used, making the system really versatile.
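As a small numerical illustration of this first-order argument (and only as an illustration: the function names and the one-point calibration scheme below are assumptions, not the procedure of any particular commercial instrument), the proportionality between substrate concentration and measured signal in the low-concentration limit can be sketched as follows:

```python
def michaelis_menten_rate(substrate, vmax, km):
    """Michaelis-Menten rate v = Vmax*S/(Km + S); for S << Km this reduces to the
    first-order form v ~ (Vmax/Km)*S, i.e., activity proportional to concentration."""
    return vmax * substrate / (km + substrate)

def glucose_from_current(i_sample, i_standard, c_standard):
    """Hypothetical one-point calibration in the linear range: the measured H2O2
    oxidation current is assumed proportional to the glucose concentration."""
    return c_standard * i_sample / i_standard

# Example with arbitrary numbers: a 5 mM standard giving 100 nA, a sample giving 62 nA
print(glucose_from_current(62.0, 100.0, 5.0))   # ~3.1 mM
```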
Enzyme electrodes based on O2 detection
The consumption versus time of O2, which is involved as a stoichiometric coreactant in the glucose oxidation process, can be measured with a pO2 Clark electrode. A very easy to prepare sensing layer comprises glucose oxidase immobilized by copolymerization with bovine serum albumin inside a gelatin matrix, using glutaraldehyde as cross-linking reagent. The bioactive layer is then used for coating a polypropylene gas-selective membrane covering a platinum cathode [10]. This type of detection has been extended to other oxidases for the measurement of various metabolites. Its main advantage is that it is insensitive to many interfering species.
Enzyme electrodes based on H2O2 detection
This type of detection generally leads to more sensitive devices but is subject to interferences. In our group we have chosen to use immobilized enzyme membranes closely associated with an amperometric platinum electrode with a potential poised at +650 mV vs the Ag/AgCl reference [11]. Hydrogen peroxide is oxidized at this potential and the current
Figure 2. Amperometric enzyme membrane electrode.
thus obtained can be directly correlated to the concentration of the target molecule, here the oxidase substrate (Fig. 2). Enzymes were first immobilized on collagen membranes through acyl-azide groups. Collagen is made of a triple-helix protein existing naturally in the form of fibrils which may be rearranged artificially into a film form. The activation procedure was performed as follows: lateral carboxyl groups from aspartate and glutamate residues involved in the collagen structure were first esterified by immersion in methanol containing 0.2 M HCl for one week at 20-22°C. After washing, the membranes were placed overnight in a 1% (v/v) hydrazine solution at room temperature. Acyl azide formation was achieved by dipping the membranes into 0.5 M NaNO2/0.3 N HCl for 3 min. A thorough and rapid washing with a buffer solution (the same as used for the coupling step) provides activated membranes ready for enzyme coupling. Two types of coupling could be performed: at random or asymmetric. For random immobilization, membranes are directly immersed in the coupling buffer solution in which one or several enzymes have been dissolved, and surface covalent binding occurs on both faces with randomly distributed molecules [12, 13]. It is also possible to immobilize different enzymes on each face using a specially designed coupling cell [14]. Before use, the enzymic membranes are washed with 1 M KCl for 15 min. Such a procedure allows the preparation of special sensing layers for bienzyme electrodes; for instance a maltose electrode could be prepared using glucoamylase bound onto one face and glucose oxidase on the other face. The hydrolysis of maltose, which is the target analyte, occurs on one face, leading to the formation of glucose which crosses the membrane and is oxidized by glucose oxidase bound on the other face, in contact with the platinum anode detecting hydrogen peroxide. With this approach, an improvement in sensitivity could be obtained with air-dried electrodes [15].
More recently, new types of polyamide supports from Pall Industry S.A., France, have been selected to prepare bioactive membranes efficiently. One, called Biodyne Immunoaffinity membrane, is supplied in a preactivated form allowing an easy and very fast coupling of enzymes. The different methods recommended by the supplier for coupling antigens and antibodies have been adapted in our group for designing tailor-made sensing layers [16]. Briefly, in the routine procedure, enzyme immobilization is achieved by simple membrane wetting: 20 microliters of the enzyme solution are applied on each side of the membrane and left to react for 1 min. Prior to their use, the membranes are washed with stirring in the chosen KCl-containing buffer.
Measurements in real samples. The problem of interferences
A lot of papers deal in fact with laboratory experiments in reconstituted media, and in this case enzyme electrodes work quite well. However, when immersed in complex mixtures of unknown composition, which is the normal situation for operation, many problems will occur. For instance, at the fixed potential, other substances like ascorbic acid, uric acid, etc. can be oxidized, thus yielding an undesirable current and biased results. More drastic situations can be encountered with rapid inhibition or inactivation of the enzyme by undesirable substances like heavy metal ions, thiol reagents, etc. In this case only a pretreatment of the sample will be efficient to protect the bioactive probe. Let us focus on the problem of electrochemical interferences, which can be overcome without pretreatment of the sample by different approaches. As already mentioned, O2 electrodes are not subject to interferences, but their detection limit is rather high, which may be a real drawback in many cases.
Use of a differential system. When using hydrogen peroxide detection, a differential measurement with a two-electrode system for the automatic removal of interfering currents can be used. The two electrodes are equipped with the same type of membrane, but the active electrode bears the grafted glucose oxidase whereas a plain membrane without enzyme is associated with the compensating electrode. The current at the active electrode is due to the oxidation of hydrogen peroxide generated by the enzyme and possibly to the presence of electroactive interfering species. The current at the compensating electrode is only due to the interferences and can be continuously subtracted, leading to a signal directly correlated to the actual analyte concentration. A microprocessor-based analyzer using this principle is now on the French market [17]. Beside glucose, using the same approach, different analytes, like L-lactate for instance, could be assayed in complex mixtures with lactate oxidase leading to hydrogen peroxide production [18]. The bioactive tip of such enzyme electrodes can be considered as a compartmentalized enzyme system where membrane characteristics and hydrodynamic conditions play a prominent role in the distribution of the product, here the electroactive species, on both sides of the membrane.
This can be of fundamental interest for improving the performance of these sensors, and models have been explored in this direction [19].
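In symbols (generic notation introduced only for this illustration, not taken from the chapter), the differential compensation described above rests on the assumption that interfering species contribute equally to both electrodes:

$$ i_{\mathrm{active}} = i_{\mathrm{H_2O_2}} + i_{\mathrm{interf}}, \qquad i_{\mathrm{comp}} = i_{\mathrm{interf}}, \qquad i_{\mathrm{active}} - i_{\mathrm{comp}} \;\approx\; i_{\mathrm{H_2O_2}} . $$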
Multilayer systems. Interferences can also be removed using a multilayer membrane associated with hydrogen peroxide detection. This has been described for glucose and also extended to other analytes like L-lactate, with an analyzer based on the work of L. C. Clark Jr. and his group [20] using membranes prepared by sandwiching a glutaraldehyde-treated lactate oxidase solution between a special cellulose acetate membrane and a 0.01 micron pore layer of polycarbonate Nuclepore membrane. The main role of this membrane is to exclude proteins and other macromolecules from passing into the bioactive layer. The cellulose acetate membrane allows only molecules of the size of hydrogen peroxide to cross and contact the platinum anode, thus preventing interferences by ascorbic acid, uric acid, etc. Glucose oxidase can be used as the final enzyme of a multienzyme sequence for the assay of analytes which cannot be directly oxidized. If sucrose has to be measured, for instance, it must first be hydrolyzed into α-D-glucose and fructose by invertase. After mutarotation, β-D-glucose is oxidized by glucose oxidase. In this case, endogenous glucose present in the sample will interfere, and an elegant approach has been proposed by Scheller and Renneberg to circumvent this drawback [21]. A highly sophisticated multilayer system has been designed by these authors. Briefly, an outer enzyme layer acting as an anti-interference layer, with entrapped glucose oxidase and catalase, degrades into non-responsive products the hydrogen peroxide which is formed if glucose is present. A dialysis membrane separates this anti-interference layer from the sucrose-indicating layer: thus sucrose can reach the second active layer, where it is converted by the bienzyme system into hydrogen peroxide, which is measured by the indicating electrode (for a review see ref [22]).

Mediated systems. The O2 involved in oxidase reactions can be replaced by electron mediators like ferrocene or its derivatives. A pen-size biosensor for glucose monitoring in whole blood (ExacTech) has recently been launched based on this principle. It mainly involves a printed carbon enzyme electrode with the mediator embedded in it; it can be operated whatever the O2 content and is not subject to interferences at the chosen potential (+160 mV vs SCE) [23].
3.2 Biosensors based on thermal and mass transduction
As already underlined, when the molecular recognition of the target molecule occurs, the consecutive physical or chemical signal must be converted by the transducer into an electrical signal, processed to obtain a readable displayed result. Numerous attempts to find a "universal" transducer, i.e., one matching any kind of reaction, have been reported.
Heat variation appears as a signal consecutive to practically any chemical or biochemical reaction. Danielsson and Mosbach [24] have developed calorimetric sensors based on temperature-sensing transducers arranged in a differential setting. Enzymatic reactions occurring in microcolumns allowed heat variation to be measured between the inlet and outlet with thermistors. Mass variation following molecular recognition also appears very attractive as a universal signal to be transduced, especially for the antigen-antibody reaction when no biocatalysis occurs. Piezoelectric devices sensitive to mass, density or viscosity changes can be used as transducers. Briefly, the change in the oscillation frequency can be correlated to the change in interfacial mass. Quartz-based piezoelectric oscillators and surface acoustic wave devices (SAW) are the two types currently used [25, 26]. Beside sensing in the gas phase, recent works deal now with the possibility of using such devices in the liquid phase with bound immunoreagents for improving selectivity.
4. The new frontier: microelectronics and optronics
Solid-state microsensors
Trials to miniaturize biosensors are regularly reported, and field effect transistors (FETs) appear as excellent candidates for achieving such a goal [27]. Several methods for selectively depositing immobilized enzyme membranes on the pH-sensitive gate of an ion selective field effect transistor (ISFET) have been described, as exemplified in Figure 3. Kuriyama and coworkers [28] have proposed for instance a lift-off technique, which can be briefly summarized as follows. A layer of photoresist is first deposited on the whole wafer but removed selectively from the gate region. The surface is then silanized and an albumin-glutaraldehyde-enzyme mixture is spin coated. After enzyme cross-linking, the photoresist layer is removed by acetone treatment and only the enzymic membrane remains on the sensitive area.

Figure 3. Lift-off method for enzyme membrane deposition in FET-based biosensors [Ref. 28].
In the example reported here, a silicon-on-sapphire (SOS) wafer was used. The chip dimensions were only 1.6 x 8 mm, with a membrane thickness of about 1 micrometer.
Optical transduction and fiber optic sensors
The design of sensors based on fiber optics and optoelectronics has raised growing interest in recent years, especially for waveguides associated with chemical or biochemical layers [29-34]. The two main methods involve either the direct attachment of the sensing layer to the optical transducer or coupling on or in a closely associated membrane. For this purpose, various types of supports have been utilized: cellulose, polyvinylchloride, polyamide, silicone, polyacrylamide, glutaraldehyde-bovine serum albumin membranes and, more recently, Langmuir-Blodgett films.
Membrane biosensors based on bio- and chemiluminescence. In our group, a novel fiber optic biosensor based on bioluminescence reactions for ATP and NADH, and on chemiluminescence for hydrogen peroxide, has recently been developed [34]. For example, in the reaction catalyzed by the firefly luciferase, the ATP concentration can be directly correlated to the intensity of the light emitted according to the reaction:

    ATP + luciferin  ---->  AMP + oxyluciferin + PPi + CO2 + light

Based on the same principle, NAD(P)H measurements can be performed with a marine bacterial system involving two enzymes, an oxidoreductase and a luciferase respectively:

    NAD(P)H + H+ + FMN  ---->  NAD(P)+ + FMNH2
    FMNH2 + R-CHO + O2  ---->  FMN + R-COOH + H2O + light
The specific enzymic membrane was prepared from polyamide membranes according to the method already described for enzyme electrodes. The membrane is maintained in close contact with one end of a glass fibre bundle by a screw-cap (Fig. 4). The other end is connected to the photomultiplier of a luminometer (Fig. 5). Provided the measurement cell is light-tight, concentrations of ATP and NADH in the nanomolar range could easily be detected with such a device, which can be equipped with a flow-through cell [35]. The system is specific and very sensitive. There is no need for a light source or monochromators, in contrast with other optical methods, and extension to several metabolites with associated NADH-dependent dehydrogenases has also been realized [36].
Figure 4. Schematic representation of the luminescence fiber-optic biosensor; (a) septum; (b) needle guide; (c) thermostated reaction vessel; (d) fiber bundle; (e) enzymatic membrane; (f) screw-cap; (g) stirring bar; (h) reaction medium; (i) black PVC jacket; (j) O-ring [Ref. 34].

Figure 5. Overall setup of the fibre optic biosensor [Ref. 36].
5. Trends and prospects
Improvements of existing sensors or the design of new ones will depend on advances in the different domains related to sensing layers, transducers and signal processing systems. For the sensing layer the demand is still for more selectivity and more stability. This can be achieved by different approaches, briefly but not exhaustively evoked below:
Protein engineering
The possibility of modifying the structure of proteins by site-directed mutagenesis is actively studied in many laboratories. Such modifications can increase the thermostability, which may be a decisive advantage for designing highly reliable biosensors [37].
Catalytic monoclonal antibodies (catMABs)
CatMABs are new biologically engineered tools combining molecular recognition and catalytic properties. To obtain such biomolecules, it is required that a molecule which resembles the transition state of the catalyzed reaction be stable enough to be used as a hapten for inducing antibody formation. The complementarity-determining region of the antibody may then behave as an enzyme-like active site. These tailor-made biocatalysts appear very promising when target molecules cannot be sensed by available enzymes [38, 39].
Biological chemoreceptors
Recently, Buch and Rechnitz [40] described a chemoreceptor-based biosensing device derived from electrophysiology experiments using antennules of the blue crab. A signal could be recorded for concentrations of kainic acid and quisqualic acid as low as M. In an excellent review, Tedesco et al. pointed out that one of the central problems with the use of biological membrane receptors is the need for a transduction mechanism. In living organisms, the nicotinic receptor (ion-permeability mechanism) and the β-adrenergic receptor system (with cyclic AMP acting as a messenger within the cell) are the two major classes involved in complex sequences. To incorporate reconstituted molecular assemblies with the natural receptor as a key component in a suitable bioactive layer associated with a transducer still appears extremely difficult, taking into account not only the extraction and purification of the receptor itself but also its stability and functionality after reconstitution [41].
Supramolecular chemistry
Supramolecular chemistry has been widely developed by Lehn and his group in the past decade [42]. It has been possible to design artificial receptor molecules containing intramolecular cavities into which a substrate may fit, leading to cryptates, which are inclusion complexes. The molecular behaviour at the supramolecular level holds promise not only in molecular recognition but also in mimicked biocatalysis and transport. This appears as the first age of "chemionics", with the expected development of molecular electronics, photonics or ionics.
Novel transducing modes; miniaturization
A lot of effort is also dedicated to miniaturization. Solid-state microsensors using integrated circuit fabrication technology appear as excellent candidates to bridge the gap in an elegant way between biology and electronics, and enzyme field effect transistors (ENFETs) are promising. However, problems of stability of the transducer itself and of encapsulation are not completely solved. It must be pointed out that miniaturization of the biosensing tip or array is also a prerequisite for most in vivo applications, as well as the use of reagentless techniques or at least of systems incorporating previously added reagents with possible regeneration in situ. On this track, a very elegant approach was the use of a hollow dialysis fiber coupled with an optical fiber, as reported by Schultz and his group for glucose analysis. A continuous biosensor with a lectin as bioreceptor and fluorescein-labeled dextran competing with glucose could be obtained by placing the bioreagents within the miniature hollow-fiber dialysis compartment [43].
Signal processing system, multifunction probe or probe array
Finally, improvements can also be expected in the treatment of the signal with smart systems adapted from other techniques to chemical sensors or biosensors. As an example, an array of sensors with different specificities, leading to a specific result pattern treated in real time, would certainly be of major interest in urgent situations when a prompt decision has to be taken.
6. Conclusion
The interdisciplinary approach necessary for the decisive breakthroughs expected in analytical monitoring is now recognized as compulsory. This is not so easy, since specialists from apparently distant disciplines must share a common language and have partially
overlapping interfaces for developing in a synergistic way new concepts and new analytical tools. In the near future, biomimetic devices taking advantage of the extraordinary possibilities of Nature in the domain of selectivity via biomolecular recognition, and of the tremendous evolution in the past decades of electronics and optics together with the unequalled rapidity of real-time treatment of information, will certainly be a milestone of the "Bioelectronics era". Some years may still be necessary to know if the golden age is ahead...
References
1. Glajch JL. Anal Chem 1986; 58: 385A-394A.
2. Riebe MT, Eustace DJ. Anal Chem 1990; 62: 65A-71A.
3. Janata J, Bezegh A. Anal Chem 1988; 60: 62R-74R.
4. Mosbach K, Ed. Methods in Enzymology, Vol 44. New York: Academic Press, 1976: 999 p.
5. Avrameas S, Guilbert B. Biochimie 1972; 54: 837-842.
6. Wingard LB Jr., Katchalsky E, Goldstein L, Eds. Immobilized enzyme principles. Applied Biochemistry and Bioengineering, Vol 1. New York: Academic Press, 1976: 364 p.
7. Clark LC Jr, Lyons C. Ann NY Acad Sci 1962; 102: 29-45.
8. Guilbault GG. Analytical uses of immobilized enzymes. Marcel Dekker, 1984: 453 p.
9. Turner APF, Karube I, Wilson GS, Eds. Biosensors, Fundamentals and applications. Oxford University Press, 1987: 770 p.
10. Romette JL. GBF Monographs 1987; Vol. 10: 81-86.
11. Thévenot DR, Sternberg R, Coulet PR, Laurent J, Gautheron DC. Anal Chem 1979; 51: 96-100.
12. Coulet PR, Julliard JH, Gautheron DC. French Patent 2,235,153, 1973.
13. Coulet PR, Julliard JH, Gautheron DC. Biotechnol Bioeng 1974; 16: 1055-1068.
14. Coulet PR, Bertrand C. Anal Lett 1979; 12: 581-587.
15. Bardeletti G, Coulet PR. Anal Chem 1984; 56: 591-593.
16. Assolant-Vinet CH, Coulet PR. Anal Lett 1986; 19: 875-885.
17. Coulet PR. GBF Monographs 1987; Vol. 10: 75-80.
18. Bardeletti G, Séchaud F, Coulet PR. Anal Chim Acta 1986; 187: 47-54.
19. Maisterrena B, Blum LJ, Coulet PR. Biotechnol Letters 1986; 8: 305-310.
20. Clark LC Jr, Noyes LK, Grooms TA, Gleason CA. Clinical Biochem 1984; 17: 288-291.
21. Scheller F, Renneberg R. Anal Chim Acta 1983; 152: 265-269.
22. Scheller FW, Schubert F, Renneberg R, Muller H-S, Jänchen M, Weise H. Biosensors 1985; 1: 135-160.
23. Cass AEG, Francis DG, Hill HAO, Aston WJ, Higgins IJ, Plotkin EV, Scott LDL, Turner APF. Anal Chem 1984; 56: 667-671.
24. Danielsson B, Mosbach K. In: Turner APF, Karube I, Wilson GS, Eds. Biosensors, Fundamentals and applications. Oxford University Press, 1987: 575-595.
25. Guilbault GG, Ngeh-Ngwainbi J. In: Guilbault GG, Mascini M, Eds. Analytical uses of immobilized compounds for detection, medical and industrial uses. NATO ASI Series. Dordrecht, Holland: D. Reidel Publishing Co., 1988: 187-194.
26. Ballantine D, Wohltjen H. Anal Chem 1989; 61: 704A-715A.
27. Van der Schoot BH, Bergveld P. Biosensors 1987/88; 3: 161-186.
28. Nakamoto S, Kimura I, Kuriyama T. GBF Monographs 1987; Vol. 10: 289-290.
29. Wolfbeis OS. Pure Appl Chem 1987; 59: 663-672.
30. Arnold MA, Meyerhoff ME. CRC Critical Rev Anal Chem 1988; 20: 149-196.
31. Seitz WR. CRC Critical Rev Anal Chem 1988; 19: 135-173.
32. Dessy RE. Anal Chem 1989; 61: 1079A-1094A.
33. Coulet PR, Blum LJ, Gautier SM. J Pharm Biomed Anal 1989; 7: 1361-1376.
34. Blum LJ, Gautier SM, Coulet PR. Anal Lett 1988; 21: 717-726.
35. Blum LJ, Gautier SM, Coulet PR. Anal Chim Acta 1989; 226: 331-336.
36. Gautier SM, Blum LJ, Coulet PR. J Biolumin Chemilumin 1990; 5: 57-63.
37. Nosoh Y, Sekiguchi T. Tibtech 1990; 8: 16-20.
38. Schultz PR. Angew Chem Int Ed Engl 1989; 28: 1283-1295.
39. Green BS, Tawfik DS. Tibtech 1989; 7: 304-310.
40. Buch RM, Rechnitz GA. Anal Lett 1989; 22: 2685-2702.
41. Tedesco JL, Krull UJ, Thompson M. Biosensors 1989; 4: 135-167.
42. Lehn J-M. Angew Chem Int Ed Engl 1988; 27: 89-112.
43. Schultz JS, Mansouri S, Goldstein I. Diabetes Care 1982; 5: 245-253.
CHAPTER 20
A Turbo Pascal Program for On-line Analysis of Spontaneous Neuronal Unit Activity
L. Gaál and P. Molnár
Pharmacological Research Centre, Chemical Works of Gedeon Richter Ltd., 1475 Budapest, P.O. Box 27, Hungary
1. Introduction
Extracellular unit activity measurement is a widely used method in neuroscience and especially in pharmacological research. A large number of papers proves the power of this and related methods in research [1-4]. Although extracellular unit activity measurement is one of the simplest methods of microelectrophysiology, the experimenter should have extensive experience using the conventional experimental setup and way of evaluation [1]. There are no suitable programs available for IBM PCs to provide full support in cell identification and in recording of extracellular neuronal activity together with experimental manipulation and evaluation of data. On the other hand, the experimental conditions should be varied according to the needs of the task. A data acquisition and analysis program should fulfill the above mentioned requirements as much as possible [5]. The simplicity and high capacity of the measurements and the regulations of Good Laboratory Practice (GLP) require further solutions in pharmacological research. These reasons prompted us to create a program which meets most of our claims. The main goals in the development of the program were:
- to support cell identification,
- to allow real-time visualization and analysis of neuronal activity,
- to record experimental manipulation,
- measurement and evaluation according to GLP,
- to create an easy-to-use and flexible program.
2. Methods and results
The program, named IMPULSE, was written in Turbo Pascal 5.5, because it is a high-level language, supports assembler-language routines and its graphical capabilities are
excellent [6]. The source code is well structured and documented; thus, it may be modified to suit particular research needs. It contains carefully optimized assembly-language routines for the critical acquisition to allow high-speed sampling. A new, fully graphic command interface was developed that can be used by single-keystroke commands or keyboard menu selection; in addition, it contains a sophisticated context-sensitive help available at all times. This command interface provides the easy creation of different menu structures containing not only menu points but fill-in form fields as well. The hardware requirements of the program are summarized in Table 1.

TABLE 1. Hardware requirements of IMPULSE 3.0
- IBM PC, XT, AT compatible computer
- Graphics adapter and monitor
- Hard disk
- 640 K memory
- TL-1 series acquisition system
- Epson compatible printer

The inputs of the program are the following:
- analogue signal of the extracellular amplifier,
- TTL pulse of an event detector,
- TTL pulses of any instrument (e.g., iontophoretic pump),
- keyboard.
Figure 1. The problem of spontaneous spike sampling. Action potentials of spontaneously active neurons are random. The detection of a spike is done by an event detector (or window discriminator). This device sends a trigger when the input signal exceeds the discrimination level. Using the conventional way of sampling (i.e., sampling started by the trigger), only a part of a spike could be sampled (upper right panel). To sample the whole spike, the pretrigger part of the signal should also be acquired (bottom right panel).
Figure 2. Screen dump of the IMPULSE <Spike> menu and screen. The left windows contain (downwards) the already sampled spike, the firing pattern and the interspike time distribution (IHTG) of a cerebellar Purkinje neuron. The right window is the active window showing the current spike. The amplitude of the current spike and the status of the sampling mode are displayed in a data line. (See text for the details of the <Spike> menu).
The first technical problem to be solved was the acquisition of spontaneous spikes by sampling the full signal (involving the pre-spike signal as well). The principle of the problem is featured in Figure 1. Our solution was the use of a ring buffer. When the user starts the sampling of a spike, the output of the AD converter is loaded continuously into the ring buffer. The signal of the event detector, with a certain delay, stops the acquisition. At this time the ring buffer contains the pre- and posttrigger parts of the signal. IMPULSE can be characterized by the following functions:
1. SPIKE: shape and amplitude of spontaneous action potentials with pretrigger intervals, averaging possibility.
2. PATTERN: time-series of spikes with real amplitudes.
3. IHTG: interspike time distribution of action potentials.
4. FHTG: frequency histogram of neuronal activity.
5. CONFIGURATION: parameters of hardware environment, sampling, displaying and protocols.
The first three functions serve cell identification.
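To make the ring-buffer scheme described above concrete, the following short sketch (written in Python rather than the authors' Turbo Pascal and assembly routines; all names are illustrative) shows how a fixed-length buffer naturally retains the pretrigger samples, and how the trigger, delayed by a posttrigger count, stops the acquisition:

```python
from collections import deque

class PretriggerSampler:
    """Ring buffer for spike capture: it always holds the last
    (pretrigger + posttrigger) samples, so when acquisition stops a fixed
    number of samples after the event-detector trigger, the buffer contains
    both the pretrigger and the posttrigger part of the spike."""

    def __init__(self, pretrigger, posttrigger):
        self.buffer = deque(maxlen=pretrigger + posttrigger)
        self.posttrigger = posttrigger
        self.countdown = None              # becomes a counter once triggered

    def feed(self, sample, triggered=False):
        """Call once per A/D conversion; returns the full spike when ready."""
        self.buffer.append(sample)
        if triggered and self.countdown is None:
            self.countdown = self.posttrigger   # delay the stop past the trigger
        if self.countdown is not None:
            self.countdown -= 1
            if self.countdown <= 0:
                return list(self.buffer)        # pretrigger + posttrigger samples
        return None
```

In IMPULSE itself this loop runs in optimized assembly at the sampling rate; the sketch only illustrates the buffering logic.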
Figure 3. Screen dump of the IMPULSE <IHTG> menu and screen (see legend of Fig. 2). The active window shows the progress of IHTG sampling. Neurons of the same type can be characterized by a certain distribution of interspike times; thus, the IHTG is very helpful in cell identification.
The SPIKE function calls a menu (Fig. 2) which contains certain instructions for spike sampling. Spikes can be sampled continuously or one by one, and can be averaged and displayed with real or normalized amplitude in a window or full screen (Zoom). The amplitude is measured for every spike. Selected spikes can be stored for further evaluation.

The PATTERN function serves the visualization of the firing pattern of the neuron, which is very helpful information in cell identification. Spikes are represented by vertical lines (the length is proportional to the amplitude of the spike); the time between successive spikes is represented by the distance between the lines. The sampled pattern can be stored for further evaluation.

The IHTG function (Fig. 3) is a new power in cell identification and also in the study of neuronal activity. It measures the time between successive spikes (interspike time) and displays the current distribution of these intervals. Sampled IHTGs can be stored for further evaluation.

The main line of the experiment is the registration of the activity (i.e., the firing rate) of the neuron studied; thus, the FHTG function is the central one. The number of spikes is integrated on a certain timebase (bin) and is displayed continuously in the form of a frequency histogram (Fig. 4).
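The two histograms just described are simple to state precisely; the following minimal sketch (Python with NumPy, written here only for illustration and not part of IMPULSE) computes an interspike time distribution and a frequency histogram from a list of spike times:

```python
import numpy as np

def interspike_histogram(spike_times_s, bin_ms=1.0, max_ms=200.0):
    """IHTG: distribution of the times between successive spikes."""
    isi_ms = np.diff(np.sort(np.asarray(spike_times_s))) * 1000.0
    edges = np.arange(0.0, max_ms + bin_ms, bin_ms)
    counts, _ = np.histogram(isi_ms, bins=edges)
    return counts, edges

def frequency_histogram(spike_times_s, bin_s=1.0, duration_s=None):
    """FHTG: number of spikes in each time bin (firing-rate histogram)."""
    t = np.asarray(spike_times_s, dtype=float)
    if duration_s is None:
        duration_s = float(t.max()) if t.size else bin_s
    edges = np.arange(0.0, duration_s + bin_s, bin_s)
    counts, _ = np.histogram(t, bins=edges)
    return counts, edges
```

The bin widths and the maximum interspike time shown are arbitrary illustrative defaults; in IMPULSE they are configuration parameters.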
Figure 4. Screen dump of an IMPULSE menu and screen. Extracellular unit activity measurement of a noradrenergic locus coeruleus neuron. The upper windows show the spike and the firing pattern sampled simultaneously with the recording of the FHTG. Markers above the histogram and the lines of different patterns in the histogram indicate the onset of drug administrations. See the text for details.
Other functions can be used simultaneously (e.g., the variation of spike shape and pattern can be displayed and stored as illustrated in Fig. 4). This means that the measurement is not restricted to the recording of the firing rate: the registration of spike shape and of variations in firing properties is also included. Experimental manipulation (generally drug administration) can be marked and connected to the FHTG record in two ways:
1. Keyboard input: the number keys represent the onset of different types of drug administration.
2. TTL pulses of any instrument: the switch-on and switch-off of a maximum of five devices (e.g., an iontophoretic pump).
The markers are stored together with the FHTG data. They are displayed over the FHTG graph to facilitate the exact evaluation of effects (Fig. 4). The length of FHTG recording is limited only by the free memory of the computer (e.g., if the bin is one second, more than 5 hours of continuous recording is possible). The parameters of the experiment (e.g., sampling rate, time base of FHTG, etc.) can be varied in a wide range and stored in a configuration file. By means of the mentioned
Figure 5. Main functions of the evaluation program, EVALEXT. EVALEXT evaluates the data created by IMPULSE. Data can be edited and divided into intervals, which are the inputs of the statistical analysis. The recorded spikes, patterns and IHTGs can be displayed and printed. The results are presented in a protocol according to the prescriptions of GLP.
Figure 6. Screen dump of the EVALEXT menu and screen. Data can be divided into intervals. Markers above the histogram and the different patterns of the histogram represent the already defined intervals. The two cursors allow an interactive selection of intervals according to the particular needs of the user. Statistical comparison is performed between the selected intervals.
variability, the program makes it possible to satisfy a broad range of requirements. The storage of parameters provides the complete registration of experimental conditions, allowing the reproduction of an experiment with the same conditions. Data are stored on hard disk in standard binary files. The file contains the frequency histogram, markers, the shapes of selected spikes, patterns and IHTGs, and any comments. Neuronal activity and the effects of the experimental manipulations recorded (hereafter referred to as drug administration) are evaluated and printed in final GLP form by an independent program, EVALEXT. The main functions of EVALEXT are illustrated in Figure 5. Data can be edited (e.g., to remove artifacts) in graphical form. Drug administration can be commented and also edited in a full window editor. Data can be divided into intervals. This is done either in an automatic way (predefined length of intervals connected to the markers) or in an interactive way using graphical selection of intervals (Fig. 6). The intervals represent the different states of neuronal
Figure 7. Protocol of an extracellular unit activity experiment. The head of the protocol contains the description of the experiment. The results of ANOVA and the detailed effects of the different drugs are presented. "Mean FR" is the average number of spikes per bin for the given interval. "SEM" is the standard error of the mean. "Percent" values are expressed as the percentage of the baseline activity (considered as 100%). "SEP" is the standard error of the "Percent" value. The stars indicate a significant difference from the baseline value at the p < 0.05 level (one sample t-test). "N" is the number of bins in the given interval.
activity (i.e., the baseline activity, the changes in firing rate caused by drug administration). EVALEXT performs statistical analysis (ANOVA followed by a one sample t-test) of the intervals (see above) and creates a protocol according to the rules of GLP (Fig. 7). The recorded spikes, patterns and IHTGs can be displayed and printed.
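As an illustration of the quantities reported in the Figure 7 protocol (the column names follow the figure legend; the implementation itself is a sketch of one plausible reading, not the authors' code), the per-interval statistics could be computed as follows:

```python
import numpy as np
from scipy import stats

def interval_stats(baseline_counts, interval_counts):
    """Per-interval summary in the spirit of the Figure 7 protocol columns."""
    baseline = np.asarray(baseline_counts, dtype=float)
    interval = np.asarray(interval_counts, dtype=float)

    baseline_mean = baseline.mean()                 # baseline activity = 100%
    mean_fr = interval.mean()                       # "Mean FR" (spikes per bin)
    sem = stats.sem(interval)                       # "SEM"

    percent_per_bin = 100.0 * interval / baseline_mean
    percent = percent_per_bin.mean()                # "Percent"
    sep = stats.sem(percent_per_bin)                # "SEP"

    # one-sample t-test of the per-bin percentages against the 100% baseline
    t_stat, p_value = stats.ttest_1samp(percent_per_bin, 100.0)
    return {"Mean FR": mean_fr, "SEM": sem, "Percent": percent,
            "SEP": sep, "p": p_value, "N": interval.size}
```

The ANOVA across all intervals is run separately; the t-test here corresponds to the stars in the protocol, testing the interval against the 100% baseline.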
3. Discussion
The main goals in developing the program were to help cell identification, to allow real-time analysis of neuronal activity and to record any experimental manipulation. Cell identification is a hard job; the experimenter should have extensive experience to attain it. IMPULSE provides full support in cell identification by the visualization of the action potentials and firing properties of a neuron. It records and displays the firing activity and the experimental manipulations. IMPULSE supports most of the experimental requirements of electrophysiology and related areas, especially pharmacology. Parameters of the experiment (e.g., sampling rate, time base of the frequency histogram, etc.) can be varied in a wide range and stored in a configuration file, allowing both flexibility and standard measurements. The additional program EVALEXT performs statistical evaluation of changes in firing rate caused by experimental manipulations. Data can be edited and divided into intervals, which are the basis of the statistical analysis (ANOVA followed by a one sample t-test). The recorded spikes, patterns and IHTGs can be displayed and printed. Learning to use IMPULSE is very simple. It has a command interface that can be used either by single-keystroke commands or keyboard menu selection, without any compromises between speed, flexibility and ease of use. In addition, IMPULSE has sophisticated context-sensitive help available at all times with the press of a key. Our laboratory has some years of good experience in using IMPULSE and its earlier versions [6-8]. The Figures illustrate not only the program, but also some experimental results obtained by the use of IMPULSE (see the legends of the Figures). The system can be useful in related electrophysiological methods (e.g., in the measurement of field potentials and long-term potentiation, etc.) as well, due to its flexible modular form.
References
1. Thompson RF, Patterson MM, eds. Bioelectric recording techniques. New York: Academic Press, 1973.
2. Purves RD. Microelectrode methods for intracellular recording and iontophoresis. London: Academic Press, 1981.
3. Dingledine R, ed. Brain Slices. New York: Plenum Press, 1984.
4. Soto E, Vega R. A Turbo Pascal program for on line spike data acquisition and analysis using a standard serial port. J Neurosci Meths 1987; 19: 61-68.
5. Kegel DR, Sheridan RE, Lester HA. Software for electrophysiological experiments with a personal computer. J Neurosci Meths 1985; 12: 317-330.
6. Turbo Pascal Reference Guide, 1989.
7. Gaál L, Groó D, Pálosi É. Dopamine agonists reverse temporarily the impulse flow blocking effect of gamma-hydroxy-butyric acid on nigral dopaminergic neurons. Neuroscience 1987; 22 (Suppl): 84.
8. Gaál L, Groó D, Pálosi É. Effects of RGH-6141, a new ergot derivative, on the nigral dopamine neurons in the rat. Pharmacol Res Commun 1988; 20 (Suppl 1): 33-34.
9. Gaál L, Molnár P. Effects of vinpocetine on noradrenergic neurons in rat locus coeruleus. Eur J Pharmacol 1990; (submitted).
Laboratory Robotics
CHAPTER 21
Automation of Screening Reactions in Organic Synthesis
P. Josses, B. Joux, R. Barrier, J.R. Desmurs, H. Bulliot, Y. Ploquin, and P. Mitivier
Rhône-Poulenc Recherches, Centre des Carrières, 85 avenue des frères Perret, BP 62, 69162 Saint-Fons Cedex, France
Abstract
One of the most difficult problems in organic synthesis is to find new chemical reactions. A systematic screening experimentation stage, involving a great number of assays covering the largest possible experimental field, is often the necessary condition for success. We looked into the automation of this research stage, which is often fastidious and repetitive, and developed a robot whose two main qualities are:
1. Versatility: the robot is adaptable to a great number of organic reactions, and this in very different fields of research.
2. "Friendly usership": chemists from any research team (and non-specialists in robotics) can prepare and carry out the assays with the robot.
Our choice was to develop a robot around a Zymate II system, associated with Zymark peripherals and also other peripherals specially developed by Rhône-Poulenc, such as a multireactor oven and a multireactor refrigerating system. The role of the robot is to be able to conduct reactions (manipulation of solids and liquids, weighing, pipetting, introduction of reagents, heating, cooling, stirring). A friendly user interface has been developed so that the chemist may design the different assays through a succession of menus. The results we obtained show that the initial target has been reached: more than one thousand assays have been carried out with the robot on very different subjects within one year.
1. Introduction
One of the essential goals of industrial chemical research is to find new reactions. It is those discoveries which will allow your company to make the difference and to have in the end the best process to synthesize one or more molecules industrially. The key to
success in this exploratory part of research is often the number of reactions you try. Moreover, this part is often tedious and repetitive, in that the same experiment is repeated while just modifying factors such as reagents, catalysts or temperature. We were very interested in developing a system able to perform these screening reactions quickly. The two big advantages that we were foreseeing in this realisation were:
1. to be able to screen a wide range of conditions, and particularly to increase the exploratory field;
2. to have a quick answer on the validity of new ideas.
Two major constraints on the utilisation of such a system were to be taken into account:
1. The system must be versatile, i.e., able to realize a wide scope of organic reactions.
2. It must be user-friendly, i.e., any chemist without specific knowledge should be able to use the system.
2. Definition of a screening reaction in organic synthesis
Consultation with chemists on the elementary operations mainly involved in a complete screening reaction led us to consider a reaction as a succession of five steps, as represented in Figure 1. We will call the first three steps the synthesis, the fourth step the treatment and the last the analysis. A screening reaction is not complete until the result of the analysis, which allows you to answer the initial question. Our goal is to implement a complete system able to carry out those five steps, and this for a set of multiple reactions. In the first stage we limited ourselves to the implementation of the synthetic part of the screening reaction. To our knowledge, such a realisation would be new and needed a lot of development, whereas the preparation of analytical samples and the analysis itself were already widely automatised.
3. Specifications for one synthesis
Precise specifications were discussed with different teams of chemists for the synthetic steps of the reaction. The specifications are given here for each step of the synthesis.
Step 1 - Introduction of reagents: The system must be able to introduce various liquids, which can be volatile (e.g., chloroform), "exothermic" (e.g., H2SO4; a liquid is called exothermic when the addition of the reagent to the reaction mixture is exothermic), or melted solids (e.g., phenol, melting point = 40°C). The system must also be able to introduce various powdered solids. All these operations are to be carried out under an inert atmosphere to protect the reagents if necessary, and the effective weight of each reagent introduced must be as close as possible to the required one. The system must also be able to homogenise the reaction mixture once the reagents have been introduced.
Figure 1. The 5 successive stages of a complete screening reaction in organic synthesis.
Step 2: The system must be able to heat and stir the reaction mixture at a given temperature and for a given time. As we want to carry out more than one reaction at a time, a multi-vessel oven is necessary. Step 3: The system must be able to cool and stir the reaction mixture at a given temperature and for a given time. The system should be able to quench the reaction mixture by introducing a reagent, as defined above, into the reaction vessel.
4. Standardisation of the organic chemist's reaction vessel
The design of a reaction vessel convenient for almost any type of chemistry for screening reactions has been an important concern: consultation with many teams of chemists was necessary to point out what could be the ideal standard to perform screening reactions or test reactions. The standard reaction flask that we designed is shown in Figure 2; it looks like a large test tube which can be capped by different tools like a refrigerating head or a simple cap. The refrigerating head allows one to reflux, to add inert gas, to sample and to
Figure 2. Standard reaction vessel.
check the temperature of the reaction mixture. This vessel may be used either on the robot or by hand just as any classical reactor, and it is interesting to point out that all the chemists' teams are equipped with these reaction flasks. The total capacity of the flask is 30 ml and the standard volume for reaction may vary from 2-3 ml to 20-25 ml.
5. Design of the system
5.1 Overview
The system is constructed around a Zymate II robot arm associated with different peripherals; the current layout is represented in Figure 3. The system is able to carry out six reactions at one time, and to perform thirty in a row. Up to 15 reagents may be introduced in one reaction flask, and the system is currently able to handle 30 solids and 12 liquids at one time. The general flowchart for the execution of one reaction is given in Figure 4. We will now describe the different operations more precisely, particularly when a special module has been developed for this application.
5.2 Introduction of reagents
All reagents are weighed on a precision balance. This standard balance has been adapted to receive the reaction vessel and to maintain the reaction mixture under an inert atmosphere when the flask is open.
Figure 3. Robot table layout for synthesis.
Liquids may be introduced through syringes (MLS station in Fig. 3) or with pipets. Different pipetting protocols have been developed according to the type of liquid to be introduced (i.e., volatile, exothermic or melted solid). In the case of an exothermic liquid, the reagent is added while cooling the reaction flask to a given temperature (vessel cooling station in Fig. 3). The introduction of solids is not an easy task. Only powdered solids are treated, and still, in some cases, sticky products will not be distributed correctly. We say that a solid is introduced correctly if the weight introduced is within a 10% range of the initially given weight. We have successively developed two modules to handle this task correctly, both being based on the addition of solids through a vibrating hand. In the case of solids looking honey-like, we overcome the problem by dissolving the reagent in a solvent and handling it as a liquid.
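The 10% acceptance criterion stated above is simple enough to express directly; the following one-line check (an illustrative Python helper, not part of the EasyLab programs) is all that is meant:

```python
def solid_added_correctly(target_g, actual_g, tol=0.10):
    """A solid is introduced correctly if the dispensed weight lies
    within 10% of the requested weight (the acceptance criterion above)."""
    return abs(actual_g - target_g) <= tol * target_g
```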
Figure 4. General flowchart for the execution of one reaction (with the data used as input and the data produced as output at each step): check and tare the flask; open it; introduce each reagent R at its requested weight w, recording the exact weight w' introduced; close and check the flask; weigh it (weight of the reaction mixture before reaction); vortex the flask; heat at temperature T1 for time t1; cool for time t2; weigh the flask (weight of the reaction mixture after reaction); open the flask, add the quenching reagent A at weight a, and weigh the flask again (weight of the added quenching reagent); put the flask back.
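To give a concrete feel for this sequence, here is a compact sketch of the Figure 4 flowchart written as a driver routine. It is in Python purely for illustration; all station helpers are invented stand-ins (they only print), not the EasyLab commands actually used on the Zymate II system:

```python
"""Sketch of the Figure 4 sequence with hypothetical station helpers."""

def _station(action):
    def op(*args):
        print(f"{action}: {args}")
        return 0.0                      # a weighing step would return grams
    return op

check, tare, open_flask, close_flask = (_station(a) for a in
    ("check flask", "tare flask", "open flask", "close flask"))
weigh, vortex, put_back = (_station(a) for a in ("weigh flask", "vortex", "put back"))
heat, cool, add_reagent = (_station(a) for a in ("heat", "cool", "add reagent"))

def run_reaction(flask, reagents, t1_temp, t1_time, t2_time, quench):
    """Execute one screening reaction following the Figure 4 flowchart."""
    log = {}
    check(flask)
    tare(flask)
    open_flask(flask)
    for name, target_weight in reagents:                     # introduce each reagent R
        log[name] = add_reagent(flask, name, target_weight)  # exact weight w' added
    close_flask(flask)
    check(flask)
    log["before"] = weigh(flask)                             # mixture before reaction
    vortex(flask)
    heat(flask, t1_temp, t1_time)                            # step 2: heat and stir
    cool(flask, t2_time)                                     # step 3: cool and stir
    log["after"] = weigh(flask)                              # mixture after reaction
    open_flask(flask)
    log["quench"] = add_reagent(flask, *quench)              # quenching reagent A, weight a
    log["final"] = weigh(flask)
    put_back(flask)
    return log

if __name__ == "__main__":
    run_reaction("flask-01", [("phenol", 2.0), ("H2SO4", 1.5)],
                 t1_temp=120, t1_time=3600, t2_time=600, quench=("water", 5.0))
```

Each weighing call returns a value that is logged, mirroring the data outputs listed in the flowchart.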
5.3 Multi-flask oven and refrigerating unit
We have developed an oven able to heat six flasks at one time at three different temperatures (two flasks for each temperature) ranging from 25°C to 250°C. In the same way we have developed a refrigerating unit based on the Peltier effect. The reaction mixture can be cooled down to -30°C. Both units are placed over a multi-field magnetic stirrer so that any reaction mixture containing a magnetic stirring bar is continuously stirred during the
Figure 5. Overview of the multi-oven unit (placed over the magnetic stirrer).
reaction (Fig. 5). These modules can also be used by themselves to carry out reactions apart from the robot.
5.4 Interface with the chemist
A great amount of time has been spent developing the software. The usefulness of such a robot depends drastically on the ease, for the chemist, of preparing and carrying out a set of assays. A friendly user interface has been realized which allows chemists to enter quickly the different characteristics of their assays. This can be done without specific knowledge of the system. The chemist designs his assays through a succession of menus and questions. The program is currently written in the Zymark EASYLAB language. At completion, a personal computer will pilot the complete system and a new interface will be written.
5.5 Security
Security is of first importance in the design of such a system. We must recall here that the robot is conceived to work alone once programmed, and particularly to work at night. Three types of security have been installed to protect both the system and the environment. First, manipulation test-points are implemented to avoid mis-manipulations of the robot arm or to detect adjustment troubles. These checks allow one to avoid the
Figure 6. Overall view of the future complete system (synthesis and analysis).
loss of a full set of reactions when trouble is detected. Second, chemical safety is ensured by a hood covering the complete system when at work. Third, an electrical security is implemented in case of a general or partial power cut.
6. The results
Over 1,000 assays have been performed during the past year. We cannot disclose exactly the types of reactions that were tried, but 12 different subjects were treated in very different fields of organic chemistry. Following some of these subjects we were able to file patents, which shows the effectiveness of this tool. The main types of subjects are the screening of catalysts to find a new reaction, studying the scope of one reaction by varying the starting reagents, and studies on the reaction medium needed to perform one reaction (variation of solvents and co-reagents).
7. The future
The tool will not be complete until the full chain of the reaction is integrated, i.e., until the treatment and the analysis are also automatically performed. We have begun to develop a second robot arm whose task will be to perform these operations in relation with the first, synthetic part. The final design of the system is represented in Figure 6 and we believe it will be operational by mid 1990. For this application we will mainly use standard peripherals, as well as a specific liquid-liquid extractor developed by Rhône-Poulenc that does not require any sampling beforehand.
8. Conclusion
With this application Rhône-Poulenc possesses an original and very powerful tool. The results, as well as the chemists' enthusiasm to use such a system, demonstrate that this type of tool is useful for testing quickly and automatically the validity of new ideas in organic chemistry.
CHAPTER 22
A Smart Robotics System for the Design and Optimization of Spectrophotometric Experiments
F.A. Settle, Jr., J. Blankenship, S. Costello, M. Sprouse, and P. Wick
Department of Chemistry, Virginia Military Institute, Lexington, VA 24450, USA
1. Introduction
Analytical chemistry is important in almost every aspect of current society. Areas such as environmental regulation, drug enforcement, medicine, electronics and computers, industrial health and safety, as well as national defense are dependent upon the ability of analytical methods to detect, identify and quantify elements, ions and compounds. In order to produce useful information, reliable methods of analysis must be developed. Interlaboratory studies indicate large variations in both the precision and accuracy of analyses. An example of this problem is found in a recent survey of results from existing methods for determining ortho- and poly-phosphates in water and waste water [1]. Many analytical laboratories are faced with increased sample loads that require tedious, redundant preparations and measurements. Intelligent analysts quickly become bored with these repetitive tasks, while less competent personnel cannot perform these procedures with the required precision and accuracy. Intelligent, automated laboratory systems offer a way to increase the reliability of the data by minimizing the involvement of the analyst in the performance of routine tasks. Thus, it becomes important to provide the analyst with efficient ways to implement automated methods of analysis. Although many laboratories have demonstrated that the automation of one or two steps of a method can result in more reliable data, only a few analytical methods have been totally automated [2]. Critical problems in implementing automated methods for analysis are method development and validation. Currently, the resources required to totally automate analytical methods are often lacking in the laboratories where automation is most needed. Once an automated method has been implemented and validated, it can be transferred electronically among laboratories having identical automation facilities. The system described in this paper will develop and test a generic, flexible, intelligent laboratory system to expedite the design, optimization and transfer of automated analytical methods. This system will integrate commercial software for experimental design,
Figure 1. An intelligent automated system for chemical analysis: an expert system for experimental design and evaluation of results; experimental design and data analysis programs; communications and control programs (BASIC, Pascal or C); robot and instrument control programs (EasyLab, robot control language); and the workstations and instruments.
statistical analysis, expert system development and laboratory automation with the hardware components required for sample preparation and analyte measurement (Fig. 1). The hardware and software components will be made as compatible and interchangeable as possible to facilitate the efficient configuration, optimization and transfer of automated methods. In order to test this concept the system will be used to build and validate four fully automated methods for spectrophotometric and potentiometric determinations of ortho-phosphates in water samples.
2. System hardware
A Zymate II laboratory cylindrical robot arm and a System V controller form the nucleus of the automated laboratory system. The arm can service system components or
Figure 2. Robotics system for quantitative colorimetric analysis (master lab station, reagent and sample racks, balance, gripper, precipitate detector and spectrophotometer).
workstations arranged in pie-shaped sections (PySections) about the base of the arm. Once a section has been bolted into position and its location entered into the controller, the spatial coordinates for arm movements and other operations associated with the section need not be specified. This technology simplifies programming the tasks required to automate analyses. The configuration of the colorimetric analysis system is shown in Fig. 2. The components purchased from Zymark as PySections included the two test tube racks used for samples and reagents, the weighing station (Mettler AE200 balance), the dilute and dissolve station (a vortex mixer and a pumping station for dispensing measured quantities of three different liquids into a test tube or small flask), and a sipping station to deliver samples to a flow-through cell in the Milton-Roy 601 ultraviolet-visible spectrophotometer. Stations constructed by our group included a precipitate detector, a pH measurement station, a heater station, a magnetic stirrer and a flask rack. The latter two stations are necessary to handle the 125 mL flasks required for water samples. The signals to control the laboratory equipment and instruments, as well as the analog data signals generated by these devices, are connected to the system controller through circuits specific for each workstation or through a more flexible interface board known as a power and event controller. This interface board has 8 digital I/O lines, the ability to condition and multiplex
Figure 3. System software: a supervisor expert system with a knowledge base of heuristic information and a database of structured data and validated methods, linked through a common data structure to the design, optimization, system control, statistical calculation and user interface modules.
up to six analog signals to a 12-bit analog-to-digital converter, as well as provisions for providing several ranges of power.
3. System software
The intelligent analysis system consists of different software components controlled by an expert system supervisor (Fig. 3). Data and information are transferred among the components using a common data structure. Although most of the components are commercially available, it has been necessary to write some programs to facilitate communication among the components. When completed, the integrated components will function smoothly and efficiently to achieve the goals of the project.
3.1 Supervisory expert system
This program manages the entire automated system and must therefore interact with each of the software components shown in Fig. 4. The analyst enters information concerning the method to be automated through a consultation with the supervisor expert system. In addition to assembling information, this component of the expert system provides guidance and advice to the analyst for selecting parameters and initial conditions for the analysis. Once the parameters and initial conditions have been chosen, another component of
Figure 4. Experimental design and evaluation: from a method description (e.g., phosphate determination by the SnCl2 method; mixing time with an average of 2 min, high level 3 min, low level 1 min), choose the factors (parameters), select an experimental design (e.g., Plackett-Burman, 7 factors, 8 experiments), run the experiments and calculate the results (standard error, variance, main effects), observe the chemical results and suggest improvements, leading to a revised method description and choice of factors.
the supervisor expert system selects the tasks required for the analysis from the robotic system controller's dictionary and arranges these tasks into a program to execute the specified analysis. At this point the supervisor system calls a secondary expert system to assist the analyst in developing a statistical design to generate a matrix of test points for the experiment. Once an appropriate design has been selected, the test points are generated and transferred to the robotic controller program. The supervisor system then calls the automated laboratory controller, which performs the set of experiments and collects the resulting data. Finally, an appropriate statistical program will be used to analyze the data and return the results to the supervisor system, which then determines if the conditions for the analysis have been fulfilled.
VP-Expert [3], a rule-based expert system shell, was selected as the supervisor for the prototype system. The principal considerations in the selection of this expert system shell were: (1) memory requirements that allow it to operate with the other software components in a PC/XT or PS/2 environment, (2) ease of development and implementation, (3) ease of interfacing with the other software components of the system and (4) a price of $250. The following features of VP-Expert were used in developing the system supervisor: (1) forward and backward chaining, (2) rule induction from tabular information, (3) transfer of data and information to and from external ASCII files, (4) "chaining" or linking of different knowledge bases, (5) friendly, informative screens for the analyst and (6) communication with the other software components of the analysis system. While VP-Expert has served as a good shell for the initial prototype system, it has limitations in the areas of efficient communication among system components and in its ability to support a dynamic, interactive user interface. The next version of the supervisor system will be implemented in an object-oriented shell, Nexpert Object. Tasks, instruments, workstations and even reagents can be treated as objects. Each object can store information, process information, create new information and communicate with other objects. An object represents a specialized part of the system and has its own local knowledge and expertise. Objects are defined by their response to messages from other objects. The object-oriented approach has several advantages: (1) it permits sharing of knowledge among related groups of objects; (2) it is well suited for parallel processor operations; and (3) it has been used successfully for many diverse applications.
3.2 Chemical knowledge base
The knowledge bases of the supervisor system address both chemical and statistical domains. The chemical domain of ortho-phosphate analysis will include expertise on three colorimetric methods (the vanadomolybdophosphoric acid, stannous chloride and ascorbic acid methods) as well as one potentiometric titration method that employs an ion-selective lead electrode. This knowledge base will assist the analyst in selecting the most appropriate method for different sample types. Potential interferences and expected ranges of ortho-phosphate concentrations are examples of factors that influence the choice of a method. Once a method has been selected, the expert system consults the method database file to find the record containing the analytical parameters for the method. If the method is not contained in the database file, the system asks if the analyst would like assistance in developing an automated procedure.
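A minimal sketch of this kind of method-selection logic is given below in Python for illustration only; the actual knowledge base is built in an expert system shell, and the numeric cut-off and the silicate rule used here are invented assumptions (the text states only that the expected concentration range and interferences influence the choice, and that the stannous chloride method has the lower detection limit).

# Hedged sketch of method selection; thresholds and rule details are assumed.
def select_method(expected_conc_mg_per_l, interferences):
    """Return a candidate ortho-phosphate method given sample information."""
    if expected_conc_mg_per_l is not None and expected_conc_mg_per_l < 0.1:
        # hypothetical cut-off: very low levels favour the more sensitive method
        return "stannous chloride"
    if "silicate" in interferences:
        # silicates are named in the text as a likely interference; which method
        # tolerates them best is not stated, so this branch is only a placeholder
        return "ascorbic acid"
    return "vanadomolybdophosphoric acid"

print(select_method(0.05, set()))          # -> stannous chloride
print(select_method(2.0, {"silicate"}))    # -> ascorbic acid (placeholder rule)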
3.3 Knowledge bases for experimental design
Two software packages that can serve as statistical knowledge bases are currently being investigated for use as software components of the system, DESIGN-EASE [4] and
EXPERIMENTAL DESIGN [5]. Both programs assist the analyst in determining the best type of experimental design to use for a given project. Factorial designs can be used to evaluate the relative significance of several experimental parameters. The statistical technique of analysis of variance allows the total variance of the results to be separated into components associated with each effect and also provides tests of significance. Fractional factorial design [6], a subset of the full factorial design [7], can be used to reduce the number of experiments required to determine the critical parameters for an analysis [8]. These standardized chemometric procedures will be used (1) to simplify the choice of experimental parameters to be optimized and (2) to compare the results from different methods of phosphate analysis. In colorimetric determinations, the number of experimental parameters (factors) usually ranges between 2 and 10. The (saturated) fractional factorial design allows up to 15 factors to be tested. In this method the factors are tested efficiently using a small number of experiments. For example, the Plackett-Burman design [9] permits seven factors to be tested with only eight experiments. These saturated fractional factorial designs assume that all interactions between factors are negligible and thus provide only the effect of single factors. The magnitude of the effect of a single factor on the performance of the analysis is known as the main effect. Evaluation of first-order interactions between factors in a fractional factorial design is limited because the interactions are incorporated with the main effects. If the number of factors to be tested is less than 4 or 5, then a full factorial design can be employed. This design allows interaction effects to be estimated without a drastic increase in the number of experiments. DESIGN-EASE guides the user to one of the following designs: two-level factorial, fractional factorial or Plackett-Burman designs. Once a design has been recommended, the analyst enters the names and levels of the variables to be studied and the responses to be measured. The program then assigns values to the parameters for the experiments that need to be run and can also randomize the order of these experiments. Results are analyzed and plotted in several graphical formats. EXPERIMENTAL DESIGN is another expert system that helps the analyst determine which of 17 types of experimental design is most appropriate for the problem under consideration. This program does not generate the actual design or analyze the results but can be linked to programs which perform these functions. After the automated laboratory system performs the set of experiments recommended by the experimental design component of the system, statistical information, such as the main effect for each parameter, standard errors and variance, can be extracted from the resulting data using data analysis programs. A final knowledge base, yet to be developed, will assist the analyst in converting these results into chemical information which can then be evaluated to see if the method meets the initial criteria, including ruggedness tests. If the criteria are met, the procedures (LUO's and values for critical variables) are stored as an automated method in the system database for future use.
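As a concrete illustration of the design step, the Python sketch below generates the eight-run, seven-factor Plackett-Burman matrix mentioned above and computes main effects from a set of responses. It is not part of the system (which delegates this work to DESIGN-EASE or EXPERIMENTAL DESIGN), and the response values are invented.

# Illustrative Plackett-Burman design (7 factors, 8 runs) and main effects.
# The generating row is the standard N = 8 Plackett-Burman row; the response
# values below are invented purely to show the arithmetic.

def plackett_burman_8():
    """Build the 8-run design: cyclic shifts of the generating row plus an all-minus run."""
    gen = [+1, +1, +1, -1, +1, -1, -1]
    rows = [gen[-i:] + gen[:-i] for i in range(7)]   # 7 cyclic permutations
    rows.append([-1] * 7)                            # final run: all factors at low level
    return rows

def main_effects(design, responses):
    """Main effect of each factor: mean response at +1 minus mean response at -1."""
    n_runs = len(design)
    effects = []
    for j in range(len(design[0])):
        plus = sum(y for row, y in zip(design, responses) if row[j] == +1)
        minus = sum(y for row, y in zip(design, responses) if row[j] == -1)
        effects.append((plus - minus) / (n_runs / 2))
    return effects

design = plackett_burman_8()
responses = [0.42, 0.47, 0.45, 0.51, 0.40, 0.44, 0.48, 0.39]  # invented absorbances
print(main_effects(design, responses))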
TABLE 1
Laboratory unit operations required for quantitative colorimetric analysis.
1. REAGENT DELIVERY: pipetting, syringe (mL)
2. MIXING: vortexing, magnetic stirring
3. WEIGHING
4. TRANSFER OF SAMPLE TO ABSORPTION CELL
5. ABSORPTION MEASUREMENT
3.4 Automated system control
The operation of the robot arm and all hardware components is managed by the system controller, which is programmed in EASYLAB, a PL/M-based robot control language. The system controller is interfaced to two IBM AT microcomputers; one is used for programming the controller, while the other contains the supervisor expert system shell and the experimental design and data analysis programs. The tasks or laboratory unit operations (LUO's) required for colorimetric analysis are shown in Table 1. Groups of primary commands provided in the Zymark software are assembled into subroutines for each LUO. These user-defined subroutines, known as "words", are stored in the controller's dictionary. A controller program consists of a sequence of these executable "words". The expert system uses information from the analyst and a knowledge of the colorimetric method to arrange sequences of these "words" into automated procedures. In addition to building these procedures, the expert system supervises the transfer of the variables associated with LUO's, such as volumes of reagents and samples, mixing times and wavelengths for absorption measurements, to the correct "words" for the specified experiment. SHAKER (Fig. 5) is an example of a user-defined "word" that defines an LUO for the robotics system to handle vortex mixing of samples and reagents. The commands that comprise SHAKER are provided with the system software and have been organized to form a "word" that mixes samples and reagents. Variables for the speed and time of mixing must be provided by the analyst directly or through communication with the supervisor system. Once SHAKER has been defined and stored in the controller's dictionary, it may be used as a component of higher level robotics programs. PRETEST (Fig. 6) is an example of a high level robotics control program used for the qualitative analysis of selected metal cations.
EASYLAB PROGRAM: SHAKER
- This program will mix any sample that is in the
- gripper hand at the time the SHAKER command is issued
PUT.INTO.VORTEX
VORTEX.SPEED.1 = 90
VORTEX.TIME = 45
VORTEX.TIMED.RUN
GET.FROM.VORTEX

Figure 5. EASYLAB program SHAKER.
The use of IF/THEN commands in the programs gives the robotics system limited, local intelligence: the ability to make instantaneous decisions based upon the results of sensor inputs. In this portion of the program, 1.3 mL of hydrochloric acid is added to a test tube containing the unknown solution to test for the presence of lead and silver. After this addition the test tube is moved to the precipitate detector to check for the presence of a solid, silver chloride or lead chloride, that may have been produced by reaction with the hydrochloric acid. If the reading from the phototransistor of the precipitate detector is less than 100, no precipitate is present and the absence of lead and silver in the solution is confirmed. If the reading is equal to or greater than 100, then the presence of lead or silver is indicated. The solution containing the precipitate is mixed in the vortexer using SHAKER and then centrifuged. After the liquid and solid have been separated by centrifuging, the liquid portion is transferred to another test tube and more hydrochloric acid is added to remove any lead ions remaining in the solution. The "word" TEST.DECANT.FOR.REMAINING.PPT contains the robotic commands required for this LUO. After the addition of the second portion of hydrochloric acid, the resulting solution is again tested for the presence of a precipitate and appropriate actions are taken based on the reading from the phototransistor. The program continues until the tests required to separate and identify all seven of the selected metal cations have been completed. Three levels of commands appear in the robotics control language. At the lowest level are the commands supplied with the system, such as PUT.INTO.VORTEX. User-defined "words" like SHAKER and TEST.DECANT.FOR.REMAINING.PPT represent the intermediate level. Finally, the program PRETEST is an example of the highest level of command, which contains looping and branching.
EASYLAB PROGRAM: PRETEST
- check for Pb or Ag
DISPLAY OFF
AG::ADD.1.3.ML.HCL
VERTICLE.LED
IF PPT >= 100 THEN 190
IF PPT < 100 THEN 150
- if Pb or Ag present
150 SHAKER
CENTRIFUGE
AG::SAVE.DECANT.FOR.TEST
AG::TEST.DECANT.FOR.REMAINING.PPT
VERTICLE.LED
IF PPT >= 200 THEN 160
AG::IF.PPT.DETECTED.IN.DECANT
GOTO 175
- all Pb removed
160 RACK.1.INDEX=DECANT.TUBE
PUT.INTO.RACK.1
- all Ag removed
175 AG::IF.NO.PPT.DETECTED.IN.DECANT
PUT.INTO.RACK.1
GOTO 200
- no class I ions present
190 UNKNOWNS=UNKNOWN.SALT
DECANT.TUBE=1
T$ = 'CLASS I IONS NOT PRESENT'
PUT.ONTO.RACK.1
HEAT.OFF
200 FE::ADD.1ML.NH3.BUFF

Figure 6. EASYLAB program PRETEST.
3.5 Databases
Two types of databases are used by the supervisor system. One contains the structured data required to develop a procedure, such as wavelengths and extinction coefficients for analytes, while the other contains validated analytical procedures for phosphate analyses. The nature and quantity of the former type of information is such that it is best stored in a database format rather than in the form of heuristic rules in a knowledge base. The latter database is used to store automated procedures that have been previously validated by the system. Before attempting to develop a method, the supervisor expert system interacts with the procedures database to see if a suitable procedure exists for the determination requested by the analyst. If a procedure exists, the analyst is informed and the LUO's and values for the parameters are loaded into the robot controller, the determinations are performed and the results are displayed to the analyst for final approval.
3.6 Data structure
The present prototype system transfers data and information among the software components through the use of ASCII text files. The component programs have the ability to read and write to this fundamental type of file. In some instances it has been necessary to write simple driver programs in QUICK BASIC to create the files and to open the communications links between computers. While the use of ASCII files may seem primitive, it has been effective in developing the prototype system. The use of an object-oriented expert system will change the format of the common data structure and should increase the efficiency of communications among system components.
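The following Python sketch illustrates the kind of ASCII exchange described above: one component writes parameter values to a plain text file and another reads them back. The file name and the keyword = value layout are assumptions for illustration; the prototype's actual drivers were written in QUICK BASIC.

# Hedged illustration of component-to-component transfer via a plain ASCII file.
# File name and "keyword = value" layout are assumptions, not the project's format.

PARAMS = {
    "MIXING.TIME": 2.0,        # minutes
    "REAGENT.VOLUME": 5.0,     # mL
    "WAVELENGTH": 400,         # nm (illustrative value)
}

def write_exchange(path, params):
    with open(path, "w") as fh:
        for key, value in params.items():
            fh.write(f"{key} = {value}\n")

def read_exchange(path):
    params = {}
    with open(path) as fh:
        for line in fh:
            key, _, value = line.partition("=")
            params[key.strip()] = float(value)
    return params

write_exchange("method_params.txt", PARAMS)
print(read_exchange("method_params.txt"))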
4. Automated phosphate determination
Efforts to date have concentrated on developing and testing the software required to control the laboratory apparatus and instrumentation. Dictionary entries ("words") for the LUO's required to calibrate and determine ortho-phosphate concentrations by the vanadomolybdophosphoric acid colorimetric method have been defined and validated.

TABLE 2
Variances of laboratory operations.
Dilution of CuSO4·6H2O stock solution, measured at 840 nm:
σ²(TOTAL) = σ²(MIX) + σ²(ABS) + σ²(SIP)
σ²(SIP) = 1.17 × 10⁻⁵
σ²(ABS) = 1.66 × 10⁻⁵
σ²(MIX) = 0.86 × 10⁻⁵
σ²(TOTAL) = 3.69 × 10⁻⁵

An example of the work in progress is the analysis of the variances involved with the mixing, transfer and absorption measurement. Aqueous copper sulfate solutions, known to be chemically
stable over long periods of time, were used for this study in order to separate the variances associated with the automated system components from the uncertainties associated with the reactions involved in the phosphate analysis. Table 2 shows the results of the copper sulfate experiments. In order to obtain the variance associated with the spectrophotometric measurement, a single copper sulfate solution was measured 40 times at a wavelength of 840 nm over a period of 30 minutes. One absorption value represents the average of 5 measurements. Next, absorbance measurements were obtained for 40 aliquots taken from a single test tube using the sipper workstation of the automated system. This experiment included the variances associated with the spectrophotometer absorption measurement and the transfer of the sample from the test tube to the spectrophotometer cell. The variance for the sample transfer was then obtained by subtracting the spectrophotometer variance from that of the single test tube experiment. Finally, a variance was obtained for the total process by pipetting a specified amount of stock copper sulfate solution into 40 different test tubes, adding a constant amount of distilled water to each, mixing the solutions with a vortex mixer, sipping one sample from each test tube into the absorption cell and obtaining an absorption measurement. The variance of the dilution process is calculated by subtracting the variances for measurement and transfer from the variance for the total process, which includes dilution. Our results indicate that the largest error is associated with the spectrophotometer absorbance measurement. This analysis of the variances of the individual tasks comprising an automated process illustrates the function of the controller software in an automated design and analysis. In the next phase of the project the experimental design and data analysis software components will be interfaced to the controller software, thus permitting the analyst to design experiments and transfer the procedures directly to the automated system for execution. The LUO's required to obtain a calibration curve for the vanadomolybdophosphoric acid determination of ortho-phosphates were implemented, and calibration curves for three repetitive sets of calibration solutions prepared by the automated system were obtained (Fig. 7). The results verified that uncertainties in absorption measurements are largest at the extremes of the plot, below 0.20 AU and above 0.80 AU. The calibration was both linear and reproducible for concentrations between these extreme absorption values. If the absorption value of a sample solution is greater than 0.80 AU, then the system will calculate the dilution required to give an optimum absorption value of 0.434, perform the required dilution and measure the absorption of the diluted solution. If an absorption value is below 0.200 AU, the system will warn the analyst and suggest another method for the phosphate determination, the stannous chloride method, which is reported to have a lower detection limit than the vanadomolybdophosphoric acid method. Work in progress is concerned with selecting and optimizing the critical experimental parameters for the vanadomolybdophosphoric acid determination.

Figure 7. Calibration plots (overlay of calibration plots of absorbance versus phosphate (PO43-) concentration for the replicate calibration sets).
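The arithmetic behind Table 2 and the dilution rule can be restated in a few lines. The Python sketch below is an illustration only, not part of the system software; the intermediate experiment variance and the example absorbance are assumed values, and Beer's law linearity is assumed for the dilution factor.

# Variance decomposition from the copper sulfate study (Table 2 values) and
# the dilution rule described for the calibration range. Illustrative only.

var_measurement = 1.66e-5            # experiment 1: one solution, repeated readings
var_meas_plus_transfer = 2.83e-5     # experiment 2: aliquots sipped from one tube (assumed here as ABS + SIP)
var_total = 3.69e-5                  # experiment 3: full pipette/dilute/mix/sip/measure process

var_transfer = var_meas_plus_transfer - var_measurement     # sip contribution
var_dilution = var_total - var_meas_plus_transfer           # mixing/dilution contribution

print(f"sip variance: {var_transfer:.2e}")    # ~1.17e-5, as in Table 2
print(f"mix variance: {var_dilution:.2e}")    # ~0.86e-5, as in Table 2

# Dilution rule from the calibration discussion: if a sample reads above 0.80 AU,
# dilute it so the expected reading falls at the 0.434 AU optimum (assumes Beer's law).
measured_absorbance = 1.10                    # invented example reading
if measured_absorbance > 0.80:
    dilution_factor = measured_absorbance / 0.434
    print(f"dilute by a factor of {dilution_factor:.2f}")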
The factors under consideration are: (1) mixing and reaction times, (2) the pH of the sample solutions, (3) the wavelength for absorption measurements, (4) the presence of interferences, particularly silicates, in the sample solution and (5) the temperature of the sample solution. The two programs mentioned in the preceding section, DESIGN-EASE and EXPERIMENTAL DESIGN, are currently being evaluated for use in designing the experiments. At the present time it appears that a full factorial method can be used for the design due to the limited number of experimental parameters. When the critical factors have been determined, a sequential simplex method such as SIMPLEX-V will be interfaced to the laboratory system to automatically determine optimum values for the critical experimental parameters. Once these values have been determined, the system will be calibrated with a series of standard solutions and the phosphate concentrations of water samples will be determined. The two other colorimetric methods and a potentiometric method for phosphate determinations will be studied using the techniques developed for the vanadomolybdophosphoric acid method. All four of the methods for phosphate determinations will then be compared. Information concerning the optimum conditions (concentration ranges, interferences, etc.) will be incorporated into the chemical knowledge base of the supervisor system.
5. Summary
The intelligent automated system allows the analyst to focus on the chemistry involved in automating a method by providing expertise in experimental design, data analysis and automated procedures. The interface between the analyst and the system facilitates the communication required to design and validate the operation of automated methods of analysis. The proposed system permits analysts to design and implement rugged automated methods for replication in many laboratories. The creation, electronic transfer and implementation of this technology can improve the reliability of interlaboratory data. Every attempt is being made to use existing, commercial software packages in the development of the system. The hardware and software components will be made as compatible as possible to facilitate efficient configuration, optimization and transfer of automated methods.
Acknowledgements
This work was funded in part by National Science Foundation Research in Undergraduate Institutions Grant # CHE-8805930, a matching funds equipment grant from Zymark Corporation, Hopkinton, MA, and a grant from local research funds of the VMI Foundation.
References
1. Franson M, ed. Standard Methods for the Examination of Water and Wastewater. 16th Edition. Washington, DC: American Public Health Association, 1985, 440.
2. Granchi MP, Biggerstaff JA, Hillard LJ, Grey P. Spectrochimica Acta 1987; 42: 169-180.
3. VP-Expert (Version 2.2). Berkeley, CA: Paperback Software, 1989.
4. STAT-EASE. Hennepin Square, Suite 191, 2021 East Hennepin Ave., Minneapolis, MN 55413, 1989.
5. Statistical Programs. 9941 Rowlett, Suite 6, Houston, TX 77075, 1989.
6. Deming SN, Morgan SL. Anal Chem 1974; 9: 1174.
7. Massart DL, Dijkstra A, Kaufman L. Evaluation and Optimization of Laboratory Methods and Analytical Procedures. Amsterdam: Elsevier, 1978, 213-302.
8. Morgan SL, Deming SN. Anal Chem 1974; 46: 1170-1181.
9. Plackett RL, Burman JP. Biometrika 1946; 33: 305-325.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
© 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 23
Laboratory Automation and Robotics - Quo Vadis?
M. Linder
Mettler-Toledo AG, Greifensee, Switzerland
1. Introduction
Growing demands of society have promoted the development of many aspects of analytical chemistry. Break-throughs in the field of microelectronics and microcomputer science have expanded the scope of analytical instrumentation. Sophisticated instruments provide valuable analytical information. However, they require qualified operators. This situation often forces trained technicians and scientists to perform repetitive tasks rather than delegate them to less skilled personnel. Improving productivity is of top priority in most laboratories. There is a distinct trend in industry for more efforts in R&D and Quality Control. Often staff numbers are constrained and cannot expand to match the ever increasing workloads and demands for laboratory support. The performance of an analytical laboratory is judged by the quality of the results and the speed at which they are produced. Automation permits laboratories to maintain a high quality standard that conforms to the Good Laboratory Practice (GLP) guidelines and creates challenging work for professionally trained personnel. Laboratory automation involves two areas: instrument automation and laboratory management automation. The means of achieving management automation is through a Laboratory Information Management System [1], generally known by the abbreviation LIMS. The main topic of this contribution is the discussion of instrument automation.
2. Objectives of laboratory automation in chemical analysis
Laboratory automation is a means of achieving objectives, some of which can be justified by economic factors such as cost savings. Economic justification is obvious for procedures such as extended operations (more than 8 hours a day) and unattended operations. There are other objectives and motivating factors beyond cost savings that lead to several benefits and increased performance. They are summarized in Table 1. Improving laboratory quality and productivity has become an urgent strategy in many organizations.
TABLE 1
Objectives and benefits of laboratory automation.
Cost reduction and improved productivity: savings in personnel and material; unattended/extended operations; reliability of the instruments; ease of operation.
Accuracy and increased quality of measurements and results: high degree of instrument control; improved specificity/sensitivity of the procedures; increased throughput allows multiple measurements.
Reliability/availability: solid design; high MTBF (Mean Time Between Failure); low MTTR (Mean Time To Repair); proven technique.
Safety: reduced risks to employees and the environment; precise control of the measurement process; no breakdowns.
Speed: reduced analysis time; reduced turnaround time.
Flexibility: easy adaptation to different procedures.
The leading chemical, pharmaceutical, food, energy and biotechnology companies face intense, world-wide competition. To meet this challenge, their strategies demand the following: (i) develop innovative new products that fulfill the customers' needs, (ii) manufacture products that meet higher quality expectations and standards, (iii) improve organizational productivity and (iv) reduce risks to employees and the environment. These strategies rely on the ability of the industrial laboratories to provide increased analytical support for decision making in research, product development and quality control.
3. Elements of the analytical process
The tasks in the analytical chemical laboratory can be divided into three general areas (Fig. 1a): (i) sample preparation, (ii) analytical measurement and (iii) data evaluation. For the further discussion of automation in chemical analysis this simple scheme needs to be refined.
Figure 1. The analytical process: a) Basic elements of the analytical process (sample preparation, analytical measurement, data evaluation). b) Elements of automated chemical analysis (sampling, sample preparation, measurement, data acquisition and data processing, data validation and decision, documentation).
It must consist of the following elements (Fig. 1b): (i) sampling, (ii) sample preparation and sample handling (which includes analyte release and analyte separation procedures), (iii) analytical measurement (including standardization and calibration), (iv) data acquisition and data processing, (v) data validation and decision and (vi) documentation. The data validation element may include a decision step that allows feedback control of each element of the automated system. In consideration of the potential benefits of automation, each item above should be addressed. However, sampling will often be an external operation not amenable to automation.
4. Status of laboratory automation
Research & Development in analytical chemistry in the sixties and seventies created new and improved measurement techniques. An industry for analytical instruments with highly automated measurement procedures has emerged. This business has now grown to annual sales of about $5 billion world-wide. The enormous developments in computer technology over the past decade have resulted in a completely new generation of computer applications. This is also apparent in the field of laboratory automation. It has allowed the analytical laboratory to automate data handling and documentation. Whole Laboratory Information Management Systems (LIMS) automate sample management and record-keeping functions.
This leaves sample preparation as the weak element in automated analysis. Sample preparation is highly application dependent. Several non-trivial operations are necessary to bring the sample into the right state for analytical measurement. An incoming sample may be inhomogeneous, too concentrated, too dilute, contaminated with interfering compounds, unstable under normal laboratory conditions or in another state that prevents direct analysis. However, sample preparation has a high potential for automation. A typical example is the determination of water content by the Karl Fischer titration method [2]. Individual preparation steps for the different types of sample are necessary, whereas the titration procedure is always the same (using either a dedicated volumetric or coulometric titrator). Sample preparation is probably the most critical factor for the accuracy and precision of analytical results and for sample throughput and turnaround time. The productivity of laboratory resources (personnel and instruments) and laboratory safety are mainly influenced by sample preparation procedures. Manual sample preparation is subject to human variability, labor intensive and thus expensive, tedious and time-consuming, often dangerous (exposes people to hazardous environments) and difficult to reproduce after personnel changes.
5. Approaches to laboratory automation
Two fundamentally different approaches to automation can be distinguished: (i) the automation of the status quo and (ii) the change of the procedure for easier automation (possibly including the technological principle or the method of analysis). The strict automation of existing procedures is not always successful. The duplication of manual methods with a machine has its limits. Tailoring the techniques and procedures to the automation is a better approach to gain maximum benefit. Streamlined procedures, lower implementation costs, operational economies, better data and faster throughput can often be achieved by critically reviewing the existing methods. The change of a technological principle or of the method of analysis is a long term approach, which is a possibility whenever a technological breakthrough occurs. Examples of this kind are the evolution of balance technology from mechanical to electronic balances and the introduction of robotics in analytical chemistry. The search for new assay methods and new instrumentation that reduce the amount of labor for sample preparation is an ongoing process. However, alternative methods are often not known or, for legislative reasons, they may not be allowed. At this point it is worth mentioning the concept of process analytical chemistry (PAC) [3]. Unlike traditional analytical chemistry, which is performed in sophisticated laboratories by highly trained specialists, process analytical chemistry is performed on the front lines of the chemical process industry.
Figure 2. Different strategies of traditional analytical chemistry and process analytical chemistry. (Schematic comparison of off-line analysis (sampling and transmission to a central analytical laboratory), at-line analysis (measurement in an industrial laboratory close to the production line), on-line analysis and in-line analysis, each followed by validation and decision steps.)
The analytical instruments are physically and operationally a part of the process. The output data are used immediately for process control and optimization. The major difference between the strategies of PAC and traditional analytical chemistry is shown in Figure 2. Depending on how process analyzers are integrated in the process, the procedures can be classified as at-line, on-line or in-line measurements. In at-line analysis, a dedicated instrument is installed in close proximity to the process unit. This permits faster sample processing without too much loss of time caused by sample transportation. In on-line analysis, sampling as well as sample preparation are completely automated and form an integral part of the analyzing instrument. The difficult process of sampling can be avoided completely by in-line analysis, where one or more selective sensing devices, e.g., ion-selective electrodes, are placed in direct contact with the process solution or gas. The choice between off-line, at-line or on-line procedures depends on the sampling frequency imposed by the time constant of the process, the complexity of the samples and the availability of the necessary sensors. Automatic methods applied to the analysis of a series of samples can be divided into two general categories: (i) discrete or batch methods and (ii) continuous-flow methods. In batch methods each sample is kept in a separate vessel in which the different analytical stages (dilution, reagent addition, mixing, measurement) take place through mechanical processes. In continuous-flow methods the samples are introduced at regular intervals into a carrier stream containing a suitable reagent. The injected sample forms a zone which disperses and reacts with the components of the carrier stream. The flow then passes through a flow cell of a detector. The shape and magnitude of the resulting recorded signal reflect the concentration of the injected analyte along with kinetic and thermodynamic information about the chemical reactions taking place in the flowing stream. Flow methods are gaining more importance these days, especially in process analytical chemistry. The introduction of unsegmented-flow methods in 1974, now referred to as Flow Injection Analysis (FIA) [4], has remarkably simplified the necessary equipment. Due to speed and reagent economy, most common colorimetric, electroanalytical and spectroscopic methods have been adapted to FIA. In addition, sample preparation techniques such as solvent extraction, dialysis and gas diffusion have also been realized with flow injection analysis. In the last ten years, the problem of automated sample preparation and manipulation has been addressed by the use of flexible laboratory robots and dedicated sample handling systems, such as autodiluters. Laboratory robotics, commercially introduced in 1982 [5] as an alternative to manual sample handling, brings several benefits. Analytical results are more reliable. Users get them faster, more safely and often at less cost than before. Various objectives of laboratory automation, discussed in Section 2, can be achieved.
Such robots are manipulators in the form of articulated flexible arms with various possible geometric configurations (Cartesian, cylindrical, spherical or rotary), designed to move objects and programmable for different tasks. These latest developments provide the final piece for automated chemical analysis. Tying the automated sample handling system to the chemical instrumentation and data handling network leads to a complete system approach for a totally automated laboratory. Automation procedures in industry traditionally have required a large quantity of identical, repetitive operations to justify the large initial investments of automation. This fixed or dedicated automation is most suitable for those processes where production volumes are high and process change-overs are low. This explains the success of automation in clinical analysis. The clinical laboratory is dealing mainly with two types of sample, blood and urine, whereas most industrial laboratories have to work on a wide spectrum of products. In addition, changing needs (new products and new analyses) are typical in modern laboratories. Laboratory robotics provides flexible automation able to meet these changing needs. Flexible automation systems are programmed by individual users to perform multiple procedures. They have to be reprogrammed to accommodate new or revised methods. As illustrated in Figure 3, flexible automation bridges the gap between manual techniques and specialized, dedicated automation.

Figure 3. Flexible automation. (Schematic: number of samples per day versus complexity of procedure, with regions for dedicated automation, flexible automation and manual operation.)

Automation of entire laboratory procedures that are unique to the compound and matrix being analyzed seems to be an impossible task. The diversity of sample materials is reflected in the number of assay methods that are in use. However, the constituents of a method are common partial steps or building blocks that occur in one form or another in many assays. Examples of the most frequent unit operations are given in Table 2. Laboratory procedures can be represented as a sequence of unit operations, each specified by a set of execution parameters. This suggests that these unit operations should be automated in order to automate individual analyses. This flexible approach takes advantage of the assembly line concept of manufacturing automation. From this step-by-step approach it is possible to derive a conceptual design of an automated analysis system that makes it possible to run individual samples as well as small and large series of samples. These findings are in contrast to the developments in clinical chemistry, where, with the automation of serial analysis, entire assay methods have been automated.
TABLE 2
Common unit operations in analytical chemistry.
Weighing: quantitative measurement of sample mass using a balance.
Grinding: reducing sample particle size.
Dispensing: adding exact amounts of reagent using a burette.
Dissolution: dissolving a solid sample in an appropriate solvent.
Dilution: adjusting the concentration of a liquid sample.
Solid-liquid and liquid-liquid extraction: separating unwanted components of the sample.
Direct measurement: direct measurement of physical properties (pH, conductivity, absorbance, fluorescence, etc.).
Data reduction: conversion of raw analytical data to usable information (peak integration, spectrum analysis, etc.).
Documentation: creating records and files for retrieval (printouts, graphs, ASCII files).
With the combination of a limited number of automated self-cleaning units and the corresponding infrastructure of transport mechanisms and electronic controls, it was shown as early as 1976 by METTLER that a large part of the analytical workload in an industrial laboratory becomes amenable to automation [6]. The instrument developed employs several types of units for automation of the basic operations of sample preparation, a sample transport mechanism, an entry and weighing station, a central control minicomputer and a line printer for result documentation. The actual configuration could be tailored to individual laboratory requirements.
Figure 4. PyTechnology (Zymark Corporation).
The laboratory robotic systems of today use the same concept. The individual unit operations are automated with dedicated workstations, and the robotic arm is used to transfer samples from one station to another according to user-programmed procedures. Zymark Corporation, the leading manufacturer of laboratory robots, has implemented this idea of dedicated laboratory workstations for unit operations in their PyTechnology [7]. In the PyTechnology architecture each laboratory station is rigidly mounted on a wedge-shaped locating plate, called a PySection. Each of the PySections has the necessary hardware and software to allow the rapid installation and operation of a particular unit operation such as weighing, pipetting, mixing or centrifugation. The PySections are attached to the robot on a circular locating base plate (Fig. 4). By indicating the location of the particular PySection to the robot, the section is ready to run. Once a PySection has been put in place, the Zymate robot will be able to access all working positions on that particular PySection without any additional robot teaching or positioning programming by the user. This leads to a very rapid system set-up and the ability to easily reconfigure the system for changing applications. This new system architecture allows centralized method development. As long as several users have the same PySections, standardization and transfer of assay methods from one laboratory to another is easy. Based on this approach, several dedicated turn-key solutions for common sample preparation problems such as solid phase extraction, automated dilution and membrane filtration are offered today. The reliability of robots has been improved in the last few years. Modern robotic systems use both feedforward and feedback techniques, such as tactile sensing of the robot's grip and position-sensing switches, to assure reliable operation.
6. Workstation concept of analytical instrumentation
In spite of the numerous benefits that complete automation can provide an analytical laboratory, the process of implementing such systems is often difficult. Based on many years of experience in the manufacturing of instruments for analytical and process systems, we advocate the following general workstation concept (Fig. 5) as opposed to total centralization. A workstation may consist of the following main components: (i) an external general purpose computer, (ii) an analytical instrument and (iii) a sample changer or a robotic sample preparation system including dedicated workstations for the necessary unit operations. These on-site workstation computers can be linked to a LIMS. Each workstation can be developed independently with little danger of adverse interactions with others. Workstations of this kind are not simple. Most of them will have some or all of the tasks of sample identification, method selection, sample preparation, running the analysis, data reduction and evaluation, result validation, reporting and record-keeping.
Figure 5. General workstation concept for automated chemical analysis. (Main components: an external computer, linked to a LIMS/LAN, controlling the analytical instrument and a sample changer or robotic sample preparation system.)
Workstations must also provide calibration methods (running standards and blanks) and maintenance procedures (wash and rinse). They should provide operator help and include procedures to deal with failures and emergencies. The instrument control is preferably not under the complete guidance of the external computer. The preferred scheme is to pass control parameters from the main workstation computer to the internal computer systems of the analytical instrument and the robot, which normally run without close supervision by the master computer. This workstation concept can be found in many commercially available analytical systems. The user may start with the basic unit, the analytical instrument. He can later improve versatility and flexibility with the addition of a Personal Computer with corresponding software packages.
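As a rough illustration of this scheme (the workstation computer handing a parameter set to the instrument's internal computer, which then runs unsupervised), the Python sketch below frames a method as keyword = value lines. The framing, names and values are invented; no particular instrument protocol is implied.

# Illustration only: packaging method parameters for download to an instrument's
# internal computer. The framing (keyword=value lines, END terminator) is invented.

def frame_method(name, parameters):
    """Build the ASCII frame that the workstation would send to the instrument."""
    lines = [f"METHOD {name}"]
    lines += [f"{key}={value}" for key, value in parameters.items()]
    lines.append("END")
    return "\r\n".join(lines) + "\r\n"

method = frame_method("ASSAY-01", {"STIR.TIME": 45, "TEMPERATURE": 25.0})
print(method)
# In a real link the frame would be written to a serial port or network socket,
# and the instrument would run the method under its own internal control,
# reporting results back when finished.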
7. Analytical instrumentation of the future
Based on the discussions in the previous sections on the state of the art and the benefits and problems of laboratory automation, the necessary requirements for analytical instruments for the next decade can be formulated: The instruments must be easy to operate and the man-machine interface must be transparent also to less skilled users. High performance and high throughput are obvious features. The instruments must provide the necessary
mechanical, electronic and software interfaces to allow links with a master computer, sample changer, robotic sample preparation and robotic transport systems, balance, barcode reader, LIMS and other computer networks. The instruments must allow method storage. Methods must be up- and downloadable to and from a computer. The instrument software includes sophisticated data reduction and data evaluation schemes using chemometrics. Federal laws and GLP require intermediate storage of data, and the ever-changing needs for the analysis of new types of samples require solutions that are flexible with regard to sample preparation and data evaluation and presentation. The following trends for the immediate future can be seen today: (i) more automation of sample preparation, (ii) more reliable laboratory robots (either dedicated turn-key systems or general-purpose systems), (iii) flexible analytical instruments that partially include and thus automate sample preparation, (iv) analysis moving closer to the process (more dedicated workstations), (v) improved communications (common data structures, common protocols), (vi) more flexibility for changing needs and (vii) links with other information systems (corporate systems, production systems). However, these developments depend on the instrument and LIMS suppliers to provide the necessary equipment, and on corporate management and finally the user to accept these developments in automated chemical analysis. A laboratory has several alternatives for the implementation of analytical systems. It can buy either a total system or some subunits.
TABLE 3
Alternatives in system implementation.
Buy total system - systems responsibility: vendor; advantages: ready-made solution; disadvantages: few adaptations to specific requirements possible (price).
Buy all subunits - systems responsibility: user; advantages: flexible solution; disadvantages: interfacing problems, service, systems responsibility.
Buy/make subunits - systems responsibility: user; advantages: optimal solution, takes account of application and organizational aspects; disadvantages: interfacing problems, time and costs.
Make total system - systems responsibility: user; advantages: latest technology, optimal solution; disadvantages: time and costs, service, documentation.
Complete systems from one vendor have the advantage of a clearly defined systems responsibility but may not fulfill all the requirements of the user. Table 3 summarizes the advantages and disadvantages of the different approaches to system implementation. In the long-range future we will see new methods of analysis that drastically reduce sample preparation. The chemical process industry will have automated batch production with in-line measurement of the important process parameters. However, this needs new robust selective sensing devices. The addition of natural language processing and vision systems to robots will open new areas for automation. Continued growth in the power of microcomputers will offer exciting prospects to laboratory automation. New chemometric tools will provide complex database management, artificial intelligence and multivariate statistics. As the tasks assigned to laboratory instruments become more complicated, it becomes necessary for the system to make "if-then-else" decisions (rules) during daily operation. Intelligent instruments must be able to adapt to changes in experimental conditions and to make appropriate modifications in their procedures without human intervention. This type of intelligent feedback to the different elements of the automated system can be realized with expert systems. The instrument must be able to learn from its past experience. The advent of this kind of sophistication promises a continued revolution in the future practice of laboratory automation. These are some trends and, partially, visions perceived by the author. They may be obvious or new to the reader, or perhaps they may be incorrect. In any event, these are exciting times to be an analytical chemist and to participate in the laboratory revolution.
References
1. McDowall RD, ed. Laboratory Information Management Systems - Concepts, Integration and Implementation. Wilmslow, UK: Sigma Press, 1987.
2. Scholz E. Karl Fischer Titration: Determination of Water. Berlin, Heidelberg, New York: Springer-Verlag, 1984.
3. Riebe MT, Eustace DJ. Process Analytical Chemistry - An Industrial Perspective. Anal Chem 1990; 62(2): 65A.
4. Valcarcel M, Luque de Castro MD. Flow-Injection Analysis, Principles and Applications. Chichester, England: Ellis Horwood Limited, 1987.
5. Strimaitis JR, Hawk GL, eds. Advances in Laboratory Automation Robotics. Vol. 1 to Vol. 5. Hopkinton, MA: Zymark Corporation.
6. Arndt RW, Werder RD. Automated Individual Analysis in the Wet Chemistry Laboratory. In: Foreman JK, Stockwell PB, eds. Topics in Automatic Analytical Chemistry, Vol. 1, p. 73. Chichester: Horwood, 1979.
7. Franzen KH. Labor-Roboter - die neueste Entwicklung: PyTechnology. GIT 1987: 450.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
© 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 24
Report of Two Years' Activity of an Automatic Immunoassay Section Linked with a Laboratory Information System in a Clinical Laboratory
Abstract The organization of an immunoassay section developed in 1987 capable to automatically process 17 different RIA and EIA tests is described. The section is equipped with a Kemtek 1000 Sample Processor, a spectrophotometer, a 16 wells NE 1600 gamma counter, a IBM PCKT running commercial software for data reduction and PC-SYSLAB, a home made program, which interlinks the section with the LIS. After the entry of the patients and of the required tests, worklists relative to the daily workload are transferred to a floppy-disk and a hard copy is simultaneouslyprinted for reference. The samples are dispensed by Kemtek lo00 according to the worklist and, after the end of the procedure, radioactivity or absorbance measurements are transferred over a RS-232 interface to the IBM PC. After the calculation of the doses, PC-SYSLAB links the doses to the identification number of the patients and then transfers these to LDM/SYSLAB. Medical validation is helped by the printing of a complete table of results. An archive relative to the thyroid panel (containing today about 1,200 patients) can be searched for the presence of known patients. Every year more than 50,000 endocrinological tests have been reported without any manual entry and transcription of data by 2 medical technologists and 1 laboratory physician.
1. Introduction During this century the diagnosis in medicine changed radically; in the beginning of the century historical information was of paramount importance and physical examination made a modest contribution; in the fifties analytical data were added to historical and physical data while in the current era the information provided by instruments is very often of critical importance [ 11.
An avalanche of commercial instruments automated to different extents were marketed to execute the wide array of rcquired tests. The introduction in 1957 of AutoAnalyzer started a revolution in clinical chemistry. It placed reasonably reliable and rapid analysis of many blood analytes within reach of most hospital laboratories and made possible the production of large quantities of reliable data in short time with very reasonable labour requirements [2, 31. AutoAnalyzcr led the way to the admission screen proposed by A. Burlina in Italy consisting of a biochemical scrccn, physical examination and history collection [4]. The analytical instruments exist as “automation islands” within the laboratory with little interconnection between them. Today it is essential to overcome this problem and effectively bridge the gap between the “islands” [5]. In 1987 we started a project for optimizing the workflow of the immunoassay section in processing 17 RIA and EIA tests employing the common instrumentation of clinical chcmistry laboratory and for linking it to LDM/SYSLAB.
2. Materials and methods
The laboratory of Legnago Hospital acquired a Technicon LDM/SYSLAB system in 1985: it is a "turnkey" system using two SEMS 16/65 minicomputers with 1 Mbyte of memory and twin 20 Mbyte disk drives. Three external laboratories are connected to LDM/SYSLAB by standard phone lines via modem. LDM is, directly or via an intelligent terminal, connected to terminals, instruments and printers. In our laboratory there are routine assays by RIA for estriol, HPL, progesterone, 17-beta-estradiol, testosterone, T4, fT4, TSH, ferritin, CEA, PSA, insulin and digoxin, and by EIA for FSH, LH, prolactin and alpha-fetoprotein. The immunoassay section of our laboratory is equipped with standard instruments: a Kemtek 1000 Sample Processor (Kemble, UK); a mechanical fluid aspirator MAIA-SEP (Ares-Serono, Italy); a spectrophotometer Serozyme II (Ares-Serono, Italy); and a 16-well gamma counter (Thorn EMI, UK) connected to an IBM PC. A commercial program (Gammaton, Guanzate, Italy) allows the calculation of the doses by standard procedures (Wilkins' 4-parameter, spline, linear regression, Rodbard's weighted logit-log, point-to-point) and the execution of on-line Quality Control procedures.
3. Manual organization of the immunoassay section
The samples obtained from the 4 connected laboratories were entered in the LDM/SYSLAB by the laboratory clerks. Two full-time medical technologists and one part-time laboratory physician supervisor worked in the section; the shift of the 2 medical technologists was 8 a.m.-2 p.m. from Monday to Saturday. On the day of the execution of the specific assay, the identification number of every sample was manually registered on a paper sheet. Samples and reagents were dispensed by the Kemtek 1000 and the samples were further processed according to the manufacturer's recommendations. The supernatant was
discarded by a MAIA SEP mechanical aspirator and then the radioactivity or the absorbance of the tubes was measured by a gamma counter or by a spectrophotometer, respectively. After technical and medical validation, the results were transferred to the hand-made worklists and then to the computer-made worklists. Finally the results were manually entered in the LDM. The organization of the work, even if partially automated, was tedious and time consuming for the 2 medical technologists, who had to enter the final results of the dose, i.e., the most delicate procedure, at the very end of their work shift. Therefore, clerical mistakes were possible and it was very difficult for the physician in charge of the section to carefully evaluate the results, especially for the thyroid panel (thyroxine, free thyroxine, TSH), pregnancy panel (estriol and HPL) and female gonadic function panel (estradiol, progesterone, testosterone), etc. This represents a crucial problem in the modern clinical chemistry laboratory, since it is widely accepted that the data generated by the laboratory become ACTUAL information only when they are utilized in patient care. The laboratory physician plays an essential role at this stage in providing fast and accurate results to the clinician [5].
4. Computerized organization of the immunoassay section
4.1 PC-SYSLAB package
The PC-SYSLAB package requires at least two microcomputers: one is directly linked to the LDM system, and one is connected, through a switch, alternatively to a radioactivity detector or to a spectrophotometer.
4.2 Patient names and worksheet uploading
A BASIC program simultaneously emulates a VDU peripheral and a line printer. The patient names and the worklists are "printed" to the microcomputer (TELE-PC, Televideo Systems Inc., Sunnyvale, CA, USA), where they are stored as a sequential text file. Two PASCAL programs process this file: one updates the patient name random file (FITRAV), and one builds the test request list (LISTA). Each request is identified by a workplace and a test number.
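A compact sketch of this capture-and-parse step is given below in Python for illustration; the record layout (an identification number followed by a test code) is an assumption, and the actual PC-SYSLAB programs are written in BASIC and PASCAL.

# Hedged sketch: turning a captured worklist "print-out" into a request list.
# The line format used here is invented for illustration.

captured = [
    "000123  TSH",
    "000123  FT4",
    "000124  CEA",
]

def build_request_list(lines):
    """Return (patient_id, test_code) pairs, one per request."""
    requests = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            requests.append((fields[0], fields[1]))
    return requests

for patient_id, test in build_request_list(captured):
    print(patient_id, test)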
4.3 Test list printing and result acquisition
A BASIC program builds and prints the specific test lists. The samples are sorted in the list order and the assay is performed exactly in the same way as previously described. The medical technologists place the sample tubes on the racks of the KEMTEK following the test list.
4.4 Downloading of results to LDM
A BASIC program builds the results list from a sequential file recorded by the Gammaton software. The results lists, merged and sorted, are sent to LDM by the Tele-PC through a program that emulates an automated instrument. Simultaneously, a table containing all patients' results is printed. This table is checked by the laboratory physician in charge of the section, who verifies the clinical consistency of the results and updates the patient database relative to the thyroid panel results. When necessary, he prepares interpretive comments using an LDM/SYSLAB terminal.
4.5 Looking for knowns
Known patients are recorded in a random indexed file using a "public domain" program (PC-FILE, Buttonware Inc., Bellevue, WA, USA). A small alphabetic index is then extracted from this database, and each name in the test list is looked for by a binary search algorithm in a PASCAL program. The output of this program is a list named KNOWNS OF TODAY.
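The lookup itself amounts to a binary search over the sorted alphabetic index; a short Python equivalent of the PASCAL routine is sketched below (the names and index contents are invented).

# Hedged sketch of the known-patient lookup: binary search in a sorted name index.

def is_known(index, name):
    """Return True if name occurs in the alphabetically sorted index."""
    lo, hi = 0, len(index) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if index[mid] == name:
            return True
        if index[mid] < name:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

known_index = sorted(["BIANCHI M", "ROSSI G", "VERDI L"])   # invented names
todays_patients = ["ROSSI G", "NERI A"]
knowns_of_today = [p for p in todays_patients if is_known(known_index, p)]
print(knowns_of_today)    # -> ['ROSSI G']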
5. Results
Since 1987, 17 of the 25 immunological assays performed in the section have been carried out using the described procedure. Every day, 6 days/week, 2 medical technologists perform a mean of 6 tests for a total of about 150-200 patient tests per day and a total of more than 50,000 tests/year. After some months of debugging, PC-SYSLAB has been used without interruption since October 1987. It failed for only a few days due to the breakdown of the LDM and of the hard disk of the IBM PC connected to the counters. Table 1 shows the benefits in terms of time brought by the adoption of PC-SYSLAB for the processing of the pregnancy panel (estriol and HPL). Since 6 tests are assayed every day, the total daily gain is about 2 work hours, depending on the number of samples for every assay.
6. Discussion and conclusion
LDM showed overall a good performance in producing conventional and bar-coded sample labels and worksheets, in data acquisition from analytical instruments and in printing patient reports. However, the LDM system does not easily perform calculations and database functions.
TABLE 1
Immunoassay computerization benefits.
Operation           No computer    Computer
Worklist            20 min         10 min
Dispensation        28 min         28 min
Input of results    25 min         5 min
Total               73 min         43 min
The first immunoassays were developed in endocrinology and nuclear medicine laboratories not familiar with automation and little oriented toward standardization and simplification of techniques. The wide range of methodologies adopted by laboratories exhibited little uniformity, was extremely labour intensive and relied to a high degree on very skilled staff. In the '80s several firms marketed very complex instruments, requiring a lot of maintenance, to automate isotopic and non-isotopic immunoassays. The majority of these instruments were "closed" and proved very expensive, requiring dedicated reagents. Since this market is greatly fragmented, there is no commercial pressure toward linking the immunoassay section to the LIS, which is also hampered by the well-known absence of a standard protocol of communication between equipment and the laboratory information system [9, 10]. Very few reports of microcomputer links with a commercial "turn-key" LIS are reported in the literature. Davies and Mills [6] linked a clinical chemistry centrifugal analyzer and Hobbs et al. [7] linked, by a Commodore 4032 microcomputer, a DEC 11/23 minicomputer used for immunoassays. Only the implementation reported by Hobbs et al. was designed to add a "high level" function (patient database) to LIS data management. The use of microcomputers allows the transfer of laboratory data to many widely available programs to obtain sophisticated elaborations. PC-SYSLAB uses "public domain" software for database functions and list editing; in other applications, such as in serology [11], it also uses a spreadsheet for result calculations. In summary, microcomputers are able to reduce the technology gap in small laboratories for which the cost of a commercial LIS ($1,000 per patient bed) is a frustrating factor. They are able to carry out several computing tasks also in larger laboratories [12]. They make it possible for the laboratory personnel to take advantage of well documented, error-free, powerful programs such as word processors, databases, spreadsheets and graphic packages.
References
1. Strandjord PE. Laboratory medicine - excellence must be maintained. In: Bermes EW, Ed. The clinical laboratory in the new era. Washington: AACC Press, 1985.
2. Burtis CA. Advanced technology and its impact on the clinical laboratory. Clin Chem 1987; 33: 352-7.
3. Valcarcel M, Luque de Castro MD. Automatic methods of analysis. Amsterdam: Elsevier, 1988.
4. Burlina A. La logica diagnostica del laboratorio. Padova: Piccin, 1988.
5. McDowall RD. Introduction to Laboratory Information Management Systems. In: McDowall RD, Ed. Laboratory Information Management Systems. Wilmslow: Sigma, 1987.
6. Davies C, Mills RJ. Development of a data manager linking a Baker Encore to a LDM computer. Ann Clin Biochem 1987; 24: S1-51-52.
7. Hobbs DR, Lloyd GC, Alabaster C, Davies KW. The use of a DEC 11/23 minicomputer to allow the entry of immunoassay results into a Technicon LDM/Syslab data management system. Ann Clin Biochem 1987; 24: S1-52-54.
8. Forrest GC. A general review of automated RIA. In: Hunter WM, Corrie JET, Eds. Immunoassays for clinical chemistry. Edinburgh: Churchill-Livingstone, 1983.
9. Blick KE, Tiffany TO. Tower of Babel has interfacing lessons for labs. CCN 1990; 16 (4): 5.
10. McDonald CJ, Hammond WE. Standard formats for electronic transfer of clinical data. Ann Intern Med 1989; 110: 333-5.
11. Pradella M. Personal unpublished data, 1988.
12. McNeely MDD. Microcomputer applications in the clinical laboratory. Chicago: ASCP Press, 1987.
LIMS and Validation of Computer Systems
CHAPTER 25
An Integrated Approach to the Analysis and Design of Automated Manufacturing Systems
S.P. Maj*
50 Gowing Rd, Mulbarton, Norwich NR14 8AT, UK
Abstract
All organisations should have an information strategy and hence a long term view to prevent the fragmented and unco-ordinated introduction of computer-based production systems. This is especially true when it is recognised that companies successfully using Computer Integrated Manufacture (CIM), Flexible Manufacturing Systems (FMS) etc can be considered to be in a minority. Further, the technology employed should be part of a computer-based system that integrates, both vertically and horizontally, the total manufacturing complex. What is perhaps needed is an integrated method applicable to the analysis and design of such automated manufacturing systems.
1. Introduction
Even though the technology exists, few manufacturing complexes have succeeded in fully integrating the production cycle [1]. Communication systems and conveyor belts etc allow the physical integration of manufacturing activities, with distributed databases serving to integrate the information. However, production systems vary considerably in the basic production cycle of product/production ratio, from job, batch and line to continuous, each with its own associated production techniques. A major problem is optimal and reliable functional integration to give a consistent and cohesive production cycle [2]. This is especially so when a system is converted from manual/semi-automatic data processing to a fully integrated information management system. The total manufacturing system must be considered, thus avoiding the dangers of having 'islands of automation'. A large range of structured methods exist for the analysis and design of computer based systems [3]. Whilst these methods are perhaps adequate for commercial applications, they are largely unused by production engineers and, further, it is considered that
* Independent Consultant
they lack the rigour demanded in the specification of critical applications. This is especially true when it is recognised that many major software systems contain errors, errors that were introduced during analysis, design and implementation. More suitable formal specification techniques, based on mathematical theory, can be used [4]. They are, however, complex. Due to the wide range of processing criteria that can be found in manufacturing complexes, it can be considered that the use of both structured methods of analysis and a formal specification is best suited. But without doubt what is needed is an integrated approach, thus ensuring verification and validation to the required standard.
2. Manufacturing integration
Computer Integrated Manufacture (CIM) can be considered as the use of computers to plan and control the total manufacturing activities within an organisation. Horizontal integration links the activities and processes that start with the design of a product and finish with its delivery to the customer. Vertical integration links the detailed design and manufacturing activities that make up the horizontal dimension, through a hierarchy of control, to the strategic plans of the organisation. Vertical activities are typically discontinuous, e.g., corporate strategy. CIM can be said to include a family of activities, the three major components being Computer Aided Engineering (CAE), Computer Aided Quality Assurance (CAQA) and Computer Aided Production Management (CAPM). CAE is often taken to include Computer Aided Design (CAD) and Computer Aided Manufacture (CAM). Thus computer-based systems are available for the entire spectrum from product design and manufacturing systems design to process planning, monitoring and control. It should be recognised that fully exploited CIM demands both communications and information integration.
3. Hardware
Rapid advances in electronic device fabrication have taken us from the thermionic valve (large, expensive, inefficient and unreliable) to solid state transistors integrated during manufacture onto a single piece of semiconductor, i.e., integrated circuits. Advanced fabrication techniques now employ submicron lithography. Future trends can be clearly identified as higher packing densities, higher operating speeds and new semiconductor materials. As a programmable, general purpose device, the microprocessor is made application specific only by the associated software. General purpose devices command large production volumes with minimal unit cost. Data exchange between individual units of computer-controlled equipment is typically achieved by hard-wired, point-to-point cables with specialised electronics, i.e., 'islands of automation'. These manufacturer dependent, closed communications systems are being replaced by the International Standards Organisation (ISO) Reference Model for Open Systems
Interconnection (OSI). The OSI model addresses the problems of providing reliable, manufacturer independent, data transparent communication services [5]. Networks (local and wide area) provide the framework for distributed data processing, i.e., a network with a high degree of cohesion and transparency in which the system consists of several autonomous processors and data stores supporting processes and databases in order to achieve an overall goal. A practical realisation is the Manufacturing Automation Protocols (MAP) initiative. The OSI system allows for a number of alternative protocols, each of which provides a means of achieving a specific distributed information processing function in an open manner. The specific application services required can be selected along with different modes of operation and classes of service. The MAP set of protocols is selected to achieve open systems interconnection within an automated manufacturing plant. The result is networked manufacturing cells of intelligent instruments and computers linked via a gateway to the corporate network of other on-site functions. The question is why do systems fail, and why can software costs represent in excess of 80% of the total system cost [6]? Modern computer hardware is extremely reliable and can be made fault tolerant. The problem is that errors are introduced and errors propagate. Further, large software systems are not static. They exist in a constantly changing environment requiring perfective, adaptive and corrective maintenance.
4. Software
Computer Integrated Manufacture is supported by software that must include a database management system. Traditional file-based systems had the intrinsic problems of file proliferation and chronological inconsistency. A database, however, can be defined as a collection of nonredundant data shareable between different application systems. The database conceptual schema is the description of all data to be shared by the users. The external schema of each user is a specific local view or subset of the global schema as required by that particular application. The shared, yet selective, access to non-redundant data allows schema changes to affect all users with the appropriate access, hence reducing program maintenance costs. The developments in networking and distributed systems have made the distributed database a practical solution. A distributed database is a collection of logically related data distributed across several machines interconnected by a computer network. Thus an application program operating on a distributed database may access data stored at more than one machine. The advantages of distribution include each site having direct control over its local data, with a resulting increase in data integrity and data processing efficiency. In comparison, the centralised approach requires the data to be transferred from each site to the host computer with the subsequent communication overhead. The distributed system is a natural solution for geographically dispersed organisations. The need to
provide a logically integrated but physically distributed information system is the basis for a distributed database [7-9]. The system software must provide high independence from the distributed environment. Relational databases in particular have been successful at providing data independence and hence system transparency.
5. Systems analysis and design
Manufacturing systems are complex. Computer Aided Production Management (CAPM) is concerned with manufacturing planning and control. The earlier systems were concerned primarily with inventory control. Later developments encompassed production planning; these combined approaches are known as Materials Requirements Planning (MRP). Manufacturing Resource Planning (MRP II) differs from MRP in placing less emphasis on material planning and more on resource planning and control. The difficulties associated with MRP must not be underestimated [10]. Other systems include Kanban, Just In Time (JIT) and Optimised Production Technology (OPT). Whichever system is used, it is considered that the production of complex systems requires the use of the software or system life cycle. This consists of a series of distinct stages, each stage having clearly defined activities. Typically the stages are: statement of requirements, requirements analysis, system specification, system design, detailed design, coding, integration, implementation and maintenance. Methods applicable to smaller systems, if simply scaled up, result in overdue, unreliable, expensive and difficult to maintain computer based data processing systems. Many methods, with varying degrees of complexity, have been developed, such as Information Engineering and Structured Design and Analysis [11]. These have largely been in the context of commercial applications. Most methods employ the basic principles of stepwise, top-down decomposition in which stepwise refinement allows the deferment of detailed considerations by the use of abstraction to suppress and emphasise detail as appropriate. They all attempt to be understandable, expressive, implementation independent and generally applicable. Progression through the system development life cycle consists of a series of transformations from the user statement of requirements to the detailed design [12]. This involves documentation employing different notations appropriate to the requirements of each stage. The statement of requirements document will be natural language with some graphics for clarity. This document will, as a result of the complex semantics of English (or any other natural language), be ambiguous, incomplete and contain contradictions. From this document the requirements analysis stage has to produce a requirements specification to be used as a reference document for all subsequent work and for final acceptance testing prior to handover. As such it has to be complete, consistent and unambiguous. Progression through the development cycle reduces the natural language content with a subsequent increase in more diagrammatic notations. The output of each stage is a specification for the following stage from which the appropriate design is made.
Verification is the process of ensuring that the design of each stage is correct with respect to the specification of each preceding stage, i.e., is the product right? Validation ensures design integrity in that the final design should satisfy the initial user requirements, i.e., is it the right product? [13-16].
6. Structured systems analysis and design
All organisations should have an information strategy and hence a long term view to prevent the fragmented and un-coordinated introduction of computer-based systems. The considerable conceptual, organisational and technical difficulties with regard to the successful implementation of CIM must not be underestimated. Further, laboratories, for example, are subject to Good Laboratory Practice (GLP) regulations [17]. These regulations include the definition, generation and retention of raw data, Standard Operating Procedures (SOPs) etc. The concept of quality assurance is to produce automated data systems that meet the user requirements and maintain data integrity. The distinct phases of the system life cycle are considered to give only minimal guidance; the enhancements to this basic framework came by demand. However, one of the biggest problems in system work is 'navigation': who does what, when, where and how? What is needed is a method to act as a procedural template giving comprehensive guidance. Structured Systems Analysis and Design Methodology (SSADM), legally owned by the Central Computer and Telecommunications Agency (CCTA), is an integrated set of standards for the analysis and design of computer based systems [18]. It is a generally applicable method, suitable for widely differing project circumstances, with clearly defined structure, procedures and documentation. The structure consists of clearly defined tasks of limited scope, clearly defined interfaces and specified products. The procedures use proven, usable techniques and tools with detailed rules and guidelines for use. The three different views, based on functions, data and events, give a complete and consistent system view with intrinsic documentation in the structure and procedures. Productivity gains are due to the standard approach to project planning, with known techniques and clearly defined user needs. SSADM addresses the problems of confidentiality, data integrity and availability. Project quality is ensured by early error detection and correction with readable, portable and maintainable solutions, thus ensuring verification and validation to the required standard.
7. Formal specification techniques
Whilst structured methods of analysis are adequate for commercial and business organisations, they lack the rigour demanded by on-line critical applications. Errors can be introduced into software by the incorrect specification of the environment and the failure of
the design to match the specification. Errors are introduced and propagate through the design process, with unacceptable consequences in life-critical applications. Formal techniques are mathematical systems to generate and manipulate abstract symbols [20]. They can be used to describe mathematically correct system specifications and software designs, together with the techniques for verification and validation. There are several types of formal technique, such as Z and VDM, but typically they consist of a language to provide the domain description and a deductive apparatus for the manipulation of the abstract symbols. The languages have an alphabet to define the symbols to be used, rules of grammar to define how the symbols may be combined in order to write acceptable strings of symbols or well formed formulae (wff), i.e., rules of syntax, and the interpretation of the language onto the domain of interest by the rules of semantics. The deductive apparatus defines the axioms, or wffs that can be written without reference to other wffs, and also the rules of inference that allow wffs to be written as a consequence of other wffs. Due to the complexity of proof procedures, the original aims of fully automated proof systems, to demonstrate automatically that programs meet their specification without the need for testing, have achieved only limited success. Using a formal technique (Z) it has been possible to write a limited specification for an automated analyser (ion selective electrode measurements) to act as a behaviour model of states and operations. Theorem syntax was checked for consistency. In recognition that most formal systems are incomplete, only limited completeness checks were performed.
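As a rough illustration of the idea of a behaviour model expressed as states and operations with explicit pre-conditions, the following Python sketch may help; it is not Z notation and does not reproduce the specification referred to above, and the states and operations are invented for this sketch:

```python
# Illustration only: a behaviour model as states and operations with explicit
# pre-conditions, in the spirit of (but not equivalent to) a formal Z schema.
# The states and operations below are invented for this sketch.

class AnalyserModel:
    STATES = {"IDLE", "CALIBRATED", "MEASURING"}

    def __init__(self):
        self.state = "IDLE"
        self.results = []            # recorded electrode measurements

    def calibrate(self):
        # Pre-condition: calibration is only permitted from the idle state.
        assert self.state == "IDLE", "pre-condition violated"
        self.state = "CALIBRATED"

    def measure(self, potential_mv):
        # Pre-condition: measurement requires a calibrated instrument.
        assert self.state in {"CALIBRATED", "MEASURING"}, "pre-condition violated"
        self.state = "MEASURING"
        self.results.append(potential_mv)

    def reset(self):
        # Post-condition: the model returns to the initial state.
        self.state = "IDLE"
```

Each operation states what must hold before it may be applied, which is the property a formal specification makes precise and a proof tool can check.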
8. Discussion
Work to date indicates that structured methods of analysis (SSADM) may be suitable for an integrated approach to the analysis and design of automated manufacturing systems. The procedural method helps ensure system quality, i.e., the features and characteristics that bear on its ability to satisfy total system requirements. Formal specification techniques can be used selectively and appropriately for critical application specification and design, together with reduced testing. The use of natural language is not eliminated but is used to enhance the techniques as appropriate. Formal techniques can perhaps be considered as tools in the repertoire of SSADM, thereby allowing the integration of the complete business system with the more critical demands of industrial processes, i.e., Total Integrated Manufacturing (TIM).
Acknowledgements
In acknowledgement of Dr. R. Dowsing and Mr. A. Booth of the University of East Anglia.
References
1. Barker K. CAPM - little cause for optimism. Production Engineer 1984; November: 12.
2. Woodcock K. The best laid plans of MRP II's. Technology Vol 8(14): 10.
3. Connor D. Information Systems Specification and Design Road Map. Prentice Hall, 1985, Chapter 1.
4. Denvir T. Introduction to Discrete Mathematics for Software Engineering. Macmillan, 1986, Chapter 1.
5. Halsall F. Data Communications, Computer Networks and OSI. Addison-Wesley, 2nd Ed, 1988, Chapter 10.
6. Sommerville I. Software Engineering. International Computer Science Series, Addison-Wesley, 1987, Chapter 1.
7. Kleinrock L. Distributed Systems. Computer 1985; November: 90-103.
8. Van Rensselaer C. Centralize? Decentralize? Distribute? Datamation 1979; April: 90-97.
9. Bender M. Distributed Databases: Needs and Solutions. Mini-micro Systems 1982; October: 229-235.
CHAPTER 26
A Universal LIMS Architecture
D.C. Mattes1 and R.D. McDowall2
1SmithKline Beecham Pharmaceuticals, P.O. Box 1539, King of Prussia, PA 19446, USA and 2Wellcome Research Ltd., Langley Court, Beckenham, Kent, BR3 3BS, UK
In this presentation we will be discussing the need for, and demonstrating the benefit of, providing a consistent LIMS architecture in order to meet user requirements. In order to discuss this issue we will lay some initial ground work to address the following:
- What are we really trying to manage in a LIMS?
- What is Architecture?
From these points we will move into the presentation of a model architecture for LIMS and the system design and implementation that this architecture leads to. The first point to be considered is that in a LIMS we should be managing INFORMATION and not just data. Information is: data that has been processed into a form that is meaningful to the recipient or user and is of real or perceived value in current or prospective decision processes. This definition begins to emphasize the difference between DATA and INFORMATION. The following begins to demonstrate the difference. The following set of numbers is a set of DATA:

-42  -0.5  35  69  98  126
3  4  5  6  7  8

Although this set of DATA is accurate it is of little value. Why? Because there is no information. If we add the following then we have some valuable information:

Hydrocarbon Boiling Points

Data (°C)    Carbon Chain
-42          3 (Propane)
-0.5         4 (Butane)
35           5 (Pentane)
69           6 (Hexane)
98           7 (Heptane)
126          8 (Octane)
In this last table, in addition to adding data, I have also added CONTEXT or relationships to the data. Having the same data with no structure or relationships would be useless (except perhaps as a puzzle). Therefore, in order to have information we must have DATA and CONTEXT (or the relationships of the data). Therefore we can say that more information value is NOT more DATA but more CONTEXT. Often we have plenty of data, which leads management and the organization to the point at which they are "drowning in data" and need "meaningful information". In most organizations which we represent as scientists, a key function is to convert data into meaningful and useful information. How does this conversion take place? The following is a simple representation of a cycle which goes on within the scientific process.
a. We execute experiments which generate data.
b. We perform analysis on data which generates information.
c. We apply intelligence to information to gain knowledge.
d. We design experiments based on knowledge to (see a).
Today most LIMS only address the DATA part of this process, and at times minimal analysis. What is the source of this deficiency (i.e., not addressing the entire process): the capability of the systems or the implementation of the systems? In reality it is probably a mixture of both, but I think that "Some LIMS CAN'T and most LIMS DON'T". In order to reach this goal of integration we have developed a LIMS model. This model is based on a simple architecture which allows a LIMS to be implemented and integrated with the organizational requirements while not overlooking functions which are required for the system to be successful. We discuss this model in the light of an ARCHITECTURE, which is an orderly design, presented in various views for different audiences, in order to implement a system which meets ALL perceived needs. We can draw on the construction industry as an example to explain architecture. To meet the needs of a buyer an architect will prepare drawings which describe the finished product. These drawings are designed for several audiences or different groups involved in the "building".
The OWNER needs to know what a building will contain to ensure that it meets the required FUNCTIONS. This is the USER. The DESIGNER needs to understand the required RELATIONSHIPS of these functions to ensure a structurally stable building. This is the SYSTEM ANALYST. The BUILDER needs to understand the TOOLS and FOUNDATION in order to construct a finished building which meets the goals of the initial architectural drawings which the owner approved. This is the SYSTEM PROGRAMMER. The point of the architecture is to keep the various views of the "building" in synchronization in order to assure a satisfactory product. In the LIMS world this model provides an architectural view to communicate to all groups involved in the implementation of the LIMS and assure a satisfactory product. In the development of this architecture we had several objectives which we felt a model like this could fulfill:
- To define the scope of the LIMS.
- To define the organization of the LIMS.
- To facilitate communication between diverse technical groups about LIMS.
- To provide a tool for training in the implementation and use of a LIMS.
The components of the LIMS model are:
- Database (a common data repository).
- Data Collection (the path for data to enter the LIMS).
- Data Analysis (modules which read and write to the repository as they "generate information").
- Data Reporting (the path for information to leave the LIMS for external use).
- Lab Management (the tools for the management of activities and resources within the lab).
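As a code-level counterpart to this list, the five components can be pictured as interfaces around a shared repository. The sketch below is illustrative only; the class and method names are invented and are not taken from this presentation:

```python
# Illustrative skeleton of the five LIMS model components around a shared
# repository. Names are invented for this sketch, not taken from the paper.

class Database:
    """Common data repository."""
    def __init__(self):
        self.records = []

class DataCollection:
    """Path for data to enter the LIMS (manual entry, instrument interfaces)."""
    def __init__(self, db): self.db = db
    def collect(self, record): self.db.records.append(record)

class DataAnalysis:
    """Modules that read and write the repository as they generate information."""
    def __init__(self, db): self.db = db
    def summarise(self): return len(self.db.records)

class DataReporting:
    """Path for information to leave the LIMS for external use."""
    def __init__(self, db): self.db = db
    def report(self): return f"{len(self.db.records)} results on file"

class LabManagement:
    """Tools for managing activities and resources within the lab."""
    def __init__(self, db): self.db = db
    def backlog(self): return [r for r in self.db.records if r.get("status") == "pending"]
```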
A pictorial representation of the model is included in Figure 1.

Figure 1. The LIMS model. Arrows represent data and information flow.
Figure 2. The designer's view of the LIMS model identifies the "structural relationship" of system components.
This model can help us look at the complete picture of a LIMS and therefore begin to apply the LIMS to meeting a larger part of the scientific process discussed previously. In Figure 2 a similar yet different picture of the model presents a potential designer's view of this model. This view increases the level of detail regarding the various components and outlines the organization or relationships of these components. In Figure 3 a similar picture, i.e., based on the same model, is presented which is the structure of an actual LIMS application. This application, CUTLAS (Clinical Unit Testing and Lab Automation System), was designed around an architectural model. The result has been that the application has been very successful, as demonstrated by the following facts:
- The system went from introduction to full production in less than 6 months.
- The functions and relationships of the system components have been easily understood by the user community.
- The system has been in operation for 5 years.
- The system has proved to be extensible. It has been enhanced through the integration of new functions which in no way impacted the "running system".
Figure 3. The high-level structure of the CUTLAS-LIMS application.
To summarize, from this presentation it is important to walk away with the following key ideas:
1. A LIMS should be focused on managing INFORMATION and not just data.
2. The ARCHITECTURE behind a LIMS is an important requirement for a successful system. It provides a critical foundation.
3. The LIMS MODEL or architecture presented here has been successfully used in systems development.
CHAPTER 27
Designing and Implementing a LIMS for the Use of a Quality Assurance Laboratory Within a Brewery
K. Dickinson1, R. Kennedy2, and P. Smith1
1Sunderland Polytechnic, Sunderland, UK and 2Vaux Breweries, Sunderland, UK
1. Introduction
This paper presents an overview of the Vaux Laboratory Information Database (VaLID) developed by the authors for use in the quality assurance laboratories of Vaux Breweries in the North East of England. The LIMS has been developed in-house and this paper discusses some of the problems inherent in this development. The use of a structured systems analysis and design methodology such as SSADM is presented as one method of overcoming some of these problems.
2. Background
The authors are currently involved in a jointly funded project between Vaux Breweries, Sunderland and Sunderland Polytechnic. The purpose of the project is to develop a LIMS for use within the brewery whilst carrying out research into LIMS and their application elsewhere. The project is funded by Sunderland Polytechnic, Vaux Breweries and the National Advisory Board of the UK government. It is one of a number of research projects into User Friendly Decision Support Systems for Manufacturing Management. Vaux Breweries are the second largest regional brewery in the UK. They produce a range of beers, lagers and soft drinks. The entire production process from raw material to packaged product takes place on a single site. The primary role of the laboratories within the brewery is to ensure the quality of the product at each stage of the manufacturing process. There are a total of three laboratories, each specialising in a particular area. An analytical chemistry laboratory carries out analyses on a wide variety of samples, obtaining a number of analytical results which are used in the day to day production process. A microbiological laboratory assesses the microbiological integrity of process materials and final products, as well as ensuring plant hygiene.
There are approximately 70 different types of samples and 50 different analyses carried out on a routine basis. It is estimated that a total of 250,000 analyses are performed per year by the three laboratories with an average of 5 analyses per sample.
3. The VaLID system
The VaLID system has been developed on 80286 based PCs connected via an Ethernet network to an 80386 fileserver. The final system will have 8 such PCs linked into the system. The PC workstations are used to enter data in the laboratories and to query the data in production. All of the data is held on the central fileserver. The system has been developed using the commercial Paradox database system and runs under the Novell operating system. The system consists of six major modules, as follows:
3.1 Sample registration module
The sample registration module is used by the laboratory staff to enter the details of samples as they arrive in the laboratory. The user selects a sample type from a menu and is prompted for the sample identifying information. The information requested depends upon the sample selected and the group to which that sample belongs. For example, a bright beer tank sample requires brew number, tank, and quality to be input, whilst a packaged product requires this information in addition to product name, package size, best before date, etc. A master sample registration table records the fact that a sample has been logged in, together with the sample's status and the date and time.
3.2 Analysis data entry module
This module allows the analysis results to be recorded. A system of worksheets has been designed, each of which holds results for one or more analyses. The analyses have been grouped according to location and time of data entry so as to allow laboratory staff to continue with a system similar to the previous manual one. When a worksheet is selected, all the outstanding samples scheduled for data entry via that worksheet are obtained and the user may select one or more for which he wishes to enter data. After a number of discussions with users, and demonstrations of possible methods of data entry, a spreadsheet-like method was decided upon as being the most appropriate for the majority of worksheets. After data has been entered it is checked against a set of specifications. The specifications allow a maximum and minimum value to be defined for each analysis, the exact value depending upon up to two of the sample registration details.
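A minimal sketch of this kind of specification check is given below; the table layout, field names and example limits are assumptions for illustration and are not the actual Paradox schema used in VaLID:

```python
# Sketch of the specification check described above. The keys, field names and
# limits here are illustrative assumptions, not the VaLID/Paradox schema.

SPECIFICATIONS = {
    # (analysis, sample_type): (minimum, maximum)
    ("original_gravity", "bright beer tank"): (1038.0, 1042.0),
    ("co2_vol",          "bright beer tank"): (2.2, 2.8),
}

def check_result(analysis, sample_type, value):
    """Return 'in spec', 'out of spec' or 'no specification' for one result."""
    limits = SPECIFICATIONS.get((analysis, sample_type))
    if limits is None:
        return "no specification"
    low, high = limits
    return "in spec" if low <= value <= high else "out of spec"

# Out-of-specification results would be highlighted on the worksheet and held
# back from production staff until a supervisor validates them.
print(check_result("co2_vol", "bright beer tank", 3.1))   # out of spec
```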
3.3 Validation
Analysis results are not available outside the laboratory until after they have been validated by laboratory supervisors. Validation consists of viewing a worksheet with analysis results displayed and those out of specification highlighted. The supervisor may either hold analyses or allow them to be validated, after which they may be accessed by production personnel.
3.4 Reporting
A number of pre-defined reports linking data from various worksheets have been defined. These are produced on a daily basis as part of a daily laboratory report. A period report giving summary statistics may also be produced. 'Pass chits' are automatically produced when certain analyses are validated within specification, allowing the product sampled to move to the next stage of production.
3.5 Query module
The main method by which production staff may obtain data is via the query system. This system is currently still under development but will eventually allow data to be selected via a number of parameters including analysis or sample registration details.
3.6 Archiving
To improve performance, data not required on a regular basis may be archived. This data is still available but is not accessed during regular queries.
4. LIMS choices
The VaLID system is a custom-made LIMS developed in house. This was one of three possible options available when installation of a LIMS was being considered, the others being to buy an 'off the shelf' commercial LIMS or to commission a software house to develop a system to Vaux specifications. By far the simplest and preferred option should be to purchase a commercial LIMS system. There are a number of these now available on the market and careful consideration of these systems must be made before a decision to develop a unique system is taken. After considering a number of commercial systems on the market it was felt that they could not meet the specific needs of Vaux; in particular, the need to include some of the particular requirements of the production department was thought to be difficult. If a tailored LIMS was to be developed it was felt that a system developed on site was more
likely to meet the specific requirements, since the development staff would be in much closer communication with the end user than if the system had been developed externally. Obviously this option is only available if the necessary computing skills exist internally within the organisation.
5. Potential problems when developing a custom-made LIMS
Care must be taken when developing a LIMS. Software development is one of the most difficult tasks which can be undertaken by an organisation and careful planning and design are required if the project is to be successful. A number of potential problems arise which are common to many software development projects. A full analysis of the requirements of the system needs to be made and hardware and software chosen which will be able to meet these needs. A common cause of software failure is an underestimation of the data requirements or an overestimation of system performance. This may give rise to a conflict of requirements versus constraints which should be resolved as early as possible. The requirements of the laboratory must be balanced against the constraints of cost and facilities available. A realistic set of requirements needs to be determined if the development is to be successful. Careful analysis of the work involved in achieving these requirements needs to be made and a realistic timescale obtained. Often in software development the work involved is underestimated, leading to systems which run over time and, therefore, over budget. It has been estimated that approximately 90% of software projects run beyond their budgeted timescale. As well as being realistic, requirements must accurately reflect the needs of the laboratory. This requires good communication between those developing the system and laboratory staff. The LIMS needs to be fully and unambiguously specified and these specifications understood by all concerned. The needs of laboratories are constantly changing as new samples, analyses and methods are undertaken. A successful LIMS needs to be flexible enough to be modified to meet these changing needs. Development staff unfamiliar with a laboratory set up may fail to incorporate facilities for such modifications if they are not fully specified. The system must be designed with flexibility in mind. A LIMS, like any other software system, needs to be fully documented. Often documentation is left until a system has been developed, which may lead to poor or incomplete documents. It has been shown that it is far more efficient to develop documentation whilst a system is being developed. This documentation should include both user and program manuals.
6. Suggestions for avoidance of potential problems
Many of the problems discussed above result either directly or indirectly from a lack of communication between the user and the system developer.
Figure 1. Example of a data flow diagram.
It is vital that the software engineer fully understands the operation of the laboratory and that laboratory staff understand the specifications being proposed. The computer personnel can gain knowledge of the laboratory from discussion with laboratory personnel and through studying the written standard operating procedures. It is unlikely, however, that such knowledge will encompass all the needs of the laboratory. It is therefore important to have regular meetings between computer personnel and laboratory personnel to discuss the developing specifications. During development of the VaLID system a thorough understanding of the laboratory operations was gained by 'shadowing' laboratory staff at work and studying the written procedures. Once sufficient detail had been gathered, a detailed specification was prepared and discussed with all concerned. This resulted in a number of modifications and refinements until a system to which all agreed was obtained. During such meetings it was found to be advantageous to use some of the techniques from the Structured Systems Analysis and Design Methodology (SSADM).
7. SSADM
SSADM is a formal methodology for use in the analysis, design and development of computer systems. It has six stages, each of which has defined inputs and outputs. The outputs from each stage should be agreed upon before continuing to the next stage. The six stages of SSADM are Analysis, Specification of Requirements, Selection of System Options, Logical Data Design, Logical Process Design, and Physical Design.
Figure 2. Example of an entity life history diagram.
A number of techniques are available for use within each of these stages. These stages, when used with the defined techniques, provide a step by step approach to system design. The techniques result in a non-contradictory, precise specification of the system. The deliverables at the end of each stage also form a base for documentation of the completed system. The techniques each have precise rules laid down so that the ambiguities associated with natural language descriptions are avoided.
8. Data flow diagrams, entity life histories and logical data structures
These three techniques provide a diagrammatic representation of the system being considered. This can be either the existing physical system or the proposed computer system. They can be easily and quickly modified and restructured until an agreed specification has been reached. Such modification is often lengthy and difficult with written descriptions. Figure 1 shows the data flow diagram for data associated with a Bright Beer Tank sample within the VaLID system. This data flow was shown to be extremely similar for all beer samples within the system. The diagram shows how analysis data moves around the laboratory. It does not, however, show the order in which each of the functions occurs. To achieve this an Entity Life History is drawn up (Fig. 2). This shows the order in which each of the functions takes place and what has to have occurred before data can move from one location to another. A Logical Data Structure is used to show the relationship between items of data, or entities, within the system. Figure 3 shows a simplified Logical Data Structure for a BBT sample relating it to a number of analysis groups.
Figure 3. Example of a logical data structure diagram: a BBT Sample related to CO2, Haze and Scaba analyses.
All of these techniques proved extremely useful during development of the system. A full description of the SSADM techniques is beyond the scope of this paper, but the interested reader may refer to any of the standard text books on the subject. SSADM was primarily designed for use with commercial data processing applications and some of the techniques and stages involved are, therefore, not applicable to the LIMS application. However, following the stages and techniques laid down provides a framework from which a successful system can be developed.
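As a rough code rendering of the logical data structure of Figure 3 (one sample related to several analysis groups), the following sketch may help; the class and field names are illustrative only, not the VaLID schema:

```python
# Rough rendering of the one-to-many relationship in Figure 3 as code: one BBT
# sample related to several analysis groups. Names are invented for this sketch.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AnalysisResult:
    analysis_group: str      # e.g., "CO2", "Haze", "Scaba"
    analysis_name: str
    value: float

@dataclass
class BBTSample:
    brew_number: str
    tank: str
    quality: str
    results: List[AnalysisResult] = field(default_factory=list)   # one-to-many

sample = BBTSample(brew_number="B1234", tank="BBT 7", quality="Best Bitter")
sample.results.append(AnalysisResult("CO2", "co2_vol", 2.5))
```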
9. Conclusion
The decision to implement a LIMS within a laboratory is one of the most important strategic decisions which can be made. This decision should not be entered into lightly or without planning and fully specifying the proposed system. Wherever possible, commercial systems should be sought which meet the specified requirements. If a decision is made to develop a specific LIMS then care must be taken to ensure a structured approach is taken. A number of formal methods are available which provide a structured approach to software development, of which SSADM is only one. The use of such formal methods can help to ensure that no ambiguities exist, either in the system itself or in the perception of the system between the laboratory and software engineer. The important point is that full communication takes place between the LIMS developer and the laboratory at all stages of development.
CHAPTER 28
Selection of LIMS for a Pharmaceutical Research and Development Laboratory - A Case Study
L.A. Broad1, T.A. Maloney2, and E.J. Subak, Jr2
1Analytical Chemistry Department, Pfizer Central Research, Sandwich, Kent, UK and 2Analytical Research Department, Pfizer Central Research, Groton, CT, USA
Abstract
The Analytical Chemistry laboratories of Pfizer Central Research on both sides of the Atlantic are jointly evaluating, selecting and implementing LIMS in both departments. A two-phased strategy was devised by an international project team to meet the particular challenges of selecting a single LIMS for our research and development based laboratories that would satisfy both groups. Throughout the evaluation, the commitment of the future users of the system has been actively encouraged by involving them directly in the requirements analysis and evaluation, primarily through the use of workgroups. The two-phased strategy has been highly successful, resulting in the selection of a single vendor LIMS, and we now proceed into implementation with the support and commitment of the project team and future users.
1. Introduction
The selection of a commercial Laboratory Information Management System (LIMS) product for the analytical research and development departments of a pharmaceutical organisation is a challenging process. The flexibility and functionality needed extend beyond that normally required in a QC environment. A system is required that can accommodate the unexpected, unknown sample for which no tests or specifications are yet defined, that can handle evolving methods and specifications, and that can be configured to model the roles and structure of various different laboratories. The two Analytical Chemistry departments of Pfizer Central Research in Sandwich, UK and Groton, USA have been tackling this problem as a joint project with the objective of evaluating, selecting and implementing a single LIMS product at both sites. This paper will firstly provide an outline of the departments involved in this project, a description of the international project team, and its joint objectives and constraints. The
two phases of the LIMS evaluation process itself will then be described, highlighting experiences gained. Finally, the plans for implementation of the selected system are outlined.
2. Background
The two Analytical departments of Pfizer Central Research in Sandwich, Kent, UK and Groton, Connecticut, USA are primarily research oriented. The common mission is the development and application of the analytical chemistry associated with bringing new drug candidates to market. There is an increasing need to assist the analyst in the easy and efficient capture of data, to facilitate the transformation of these data into information for both internal and regulatory use, and to share this information between the departments. In practice, the two departments are functionally very similar, but some organisational differences are apparent. Both comprise a number of project development, resource and technology groups responsible for all analytical activities from initial discovery of a drug through to its marketing and launch. This includes in-process, final QC and stability testing of the drug substance and its dosage forms. Both departments are also responsible for collating data for world-wide registration purposes. Historically most information generated in the laboratories was stored on paper. Many staff in both groups have, however, become accustomed to using a variety of statistical, structure drawing, word-processing and spreadsheet packages. In addition, a number of in-house developed databases have been implemented, handling such functions as sample tracking, storage of stability protocols and results, stability scheduling and logging clinical supplies' expiry and disposition. Neither group has yet, however, interfaced instruments to its databases. As a result of the use of databases, the value of electronic information management, from day-to-day work and ad-hoc queries through to the collation of data for regulatory submissions, has quickly been appreciated by all levels of staff. By 1987 both departments recognised the need for LIMS and began to prepare for evaluation. Since there is close trans-Atlantic co-operation and frequent transfer and sharing of information, the decision was made to work jointly in an international project to evaluate, select and implement LIMS. This approach was seen as preferable to either independently selecting, maybe different, LIMS products at each site or even to having one site conduct the evaluation and selection on behalf of the other. A project team was established, comprising a LIMS coordinator on both sides of the Atlantic and additional management involvement. The responsibilities of the team are to coordinate the joint project, set its objectives and to plan and recommend decisions to senior management for endorsement.
3. Objectives, constraints and challenges
The selection and implementation of LIMS has been widely discussed in the literature, including general perspectives [1] and specific examples from a variety of organisations [2-5]. Most publications consider implementations at a single site with fairly routine sample handling needs: the challenges of a joint, international project in a pharmaceutical research organisation have not received attention. The paramount objective of this joint effort was to establish a unified system in Sandwich and Groton. The goal was to identify a system that would be flexible enough to accommodate the research and development nature of both laboratories. The team also anticipated the need to implement the system over many months, maybe years, so a system was needed that could be developed in a stepwise fashion. One major question to be answered, however, was whether to buy a LIMS or develop a system in-house. The decision to purchase a system was relatively easy: Pfizer is a pharmaceutical company and, while having excellent computer support, resources cannot meet the demands required for full scale LIMS development, nor is there the many years of expertise that the commercial suppliers can call on. The joint objective therefore became to select and purchase a product from the same vendor for both sites. In addition to the constraint of selecting a system that suited both sites, there were a number of other practical constraints. The LIMS would have to run on DEC VAX computers, a corporate and connectivity requirement. In speaking to some potential vendors there was often criticism of this 'must'. However, the consideration of the hardware is a vital issue in any selection. For a large, two-site installation such as was planned, the full and committed support of the computing groups was needed, which necessarily required abiding by their policies and recommendations. The connectivity offered by the existing Pfizer Central Research trans-Atlantic DEC computer network was also required. The system would also have to be compatible with the VAX based chromatographic data acquisition system installed at both our sites. Overall, it had to be possible to transfer data easily between chromatography and LIMS systems on a given site and across the Atlantic. There are many challenges associated with an international LIMS project, both administrative and technical. Obviously the geographical separation provides project administration challenges for the international team. In order to overcome these there is regular electronic mail and telephone communication between the project coordinators and international meetings are held on an approximately quarterly basis. These meetings are used to make major project decisions and to develop and agree strategies and timetables for the project. As has already been identified, the selection of a commercial LIMS product for a research and development laboratory presents its own series of technical challenges. The
flexibility and functionality needed extend beyond that usually required in a QC environment. A system is required that can accommodate the unexpected, unknown sample for which no tests or specifications are yet defined, that can handle evolving methods and specifications, and that can be configured to model a range of laboratory disciplines. These challenges are increased when the same LIMS product has to be suitable for multiple laboratories, each with its different product types, roles, organisational structure and, therefore, LIMS requirements. The two departments are functionally similar but do have differences in organisational structure. Throughout the evaluation, selection and implementation process it was necessary to work toward obtaining a product that suited both departments equitably. A further challenge for the international project team is to balance the joint project needs and plans against the background of all other work in the two departments. It is often necessary to justify priorities and resources for the LIMS project against the competing requirements for the time of staff involved. This can be further complicated when the extent or timing of available resource differs between the two departments. Considerable flexibility is required to schedule joint LIMS activities against this changing background. Finally, there is the challenge of developing, and sustaining, user commitment in a long-term project such as this. The joint process for the selection and implementation of LIMS has been designed specifically to include significant participation and contributions from the potential users of that system.
4. Strategy
Having made the decision to select a single vendor system, it was clear that a thorough evaluation of those available would be required to determine which was the most suitable and whether that system could be implemented to meet our requirements. The basic strategy developed for our evaluation comprised two main phases. Firstly, a preliminary evaluation would be conducted of all commercially available systems that met the original constraints. Both departments would conduct similar evaluations, essentially concurrently, and reach a joint decision as to the leading contender for further evaluation. The second phase would then be based on an extensive on-site evaluation of the selected product. Again, the evaluation would be conducted on both sides of the Atlantic. In addition to the 'hands-on' demonstrations and evaluations, both of these evaluation phases would include significant preparation and other associated research, including analysing and developing requirements, literature review and discussions with vendors. These activities would involve many people outside the international project team: the team decided to implement a workgroup approach throughout the evaluation and implementation process with the objective of developing user involvement in, and commitment to, the project.
5. Phase 1 preparation
The details of the first phases of the project were developed at a first international meeting. During this meeting the project team developed the strategy and timing for evaluation of the commercially available LIMS products that met our original constraints. In preparation for the first evaluation phase, each department had previously researched its basic LIMS requirements through review of the departmental structure, operation, sample and information flow, and through interviews with staff and brainstorming sessions. Joint brainstorming sessions were then held during the first international meeting to identify, merge and organize the LIMS requirements. These requirements were then classed as essential or desirable and were also organised into six general categories. This classification was designed to help structure the evaluation and to provide areas of study for workgroups. Following this preparation, the strategy developed for the Phase 1 evaluation included the following activities:
(i) The participating vendors would be invited to demonstrate their systems at both sites for two days. During these sessions each workgroup would have its own session with the vendor and the vendor would be invited to make an overview presentation of the system.
(ii) Sales and system literature would be obtained from all vendors and reviewed. Copies of all available manuals (system manager manuals, user guides and reference manuals) would be requested to be provided to each site.
(iii) A joint LIMS Request For Proposal would be prepared and submitted to all the vendors that were to participate in our selection process. Their proposals would be available at both sites for review.
(iv) The project team would then select the most suitable product for extended demonstration (some 6 months) on site.
(v) At the end of the extended evaluation a joint recommendation for purchase would be prepared.
Throughout the process, both departments would enjoy the same opportunity to evaluate the products.
6. Request for proposal
The Request For Proposal was designed to obtain from the vendors the information needed to reduce the number of systems for final evaluation. It also provided the opportunity to describe to the vendors the joint nature of the evaluation and to emphasise the need for their cooperation in meeting joint demonstration requirements and schedules on both sides of the Atlantic. The document, some 20 pages, was prepared during a second international meeting and contained six sections: (i) the mission of the departments, (ii) the existing laboratory information management systems-this included an overview of the departments on both
sides of the Atlantic, including their interactions and sample and information flow, (iii) background and objectives-a discussion of the current LIMS and automation strategy and an outline of laboratory instrumentation, connectivity and data import and export requirements, (iv) LIMS functionality required-a list summarising our essential and desirable requirements, separated into six categories, (v) selection and implementation approach-this contained information for vendors such as requirements for the on-site demonstrations and (vi) response to RFP-here was described the response required from the vendor.
7. Requirements & workgroup categories
The initial joint LIMS requirements were organised into six general categories:
(i) Sample management: this considered information available to the laboratory exclusive of that obtained by analysis, for example sample identity, description, tracking, scheduling and status.
(ii) Data management: this was concerned with information gathered from sample testing and included the recording, processing, manipulation, validation, storage, retrieval and reporting of data.
(iii) Quality management: this was a term used to include two areas. Firstly, electronic quality assurance functions such as data validation, results validation, method performance and system performance. Secondly, method and specification management, looking for a database and tools for the preparation, validation, storage, revision, indexing and performance monitoring of analytical methods and specifications, and also at on-line validation of test results against specifications.
(iv) Instrument management: an assessment of the database and tools for the identification, indexing, validation, calibration and maintenance of instruments and associated electronic data transfer (with error checking).
(v) Users and customers: this covered searching and reporting functions and the user interface and ergonomics, for example on-line help, screen displays and response times.
(vi) Technology and validation: in this category was considered the technology of the LIMS database and tools, for example the database management system structure, archival and retrieval functions, audit trailing and also regulatory needs in the area of computer system validation.
8. The formation of workgroups

Since the project team aimed to develop user commitment primarily through the use of workgroups, these workgroups were established in the earliest stages of the evaluation. In both departments the participants were invited from various groups and represented all
levels of staff, from technician to management. In this way a wide range of experience and expertise was included in the LIMS evaluation process. Both departments were free to structure and staff the workgroups to fit in with local organisation and resource availability. The use of workgroups allowed a large number of people to participate directly in the project. In Sandwich, for example, approximately one third of the department was involved in one or more workgroups during Phase 1. During the first phase, each department had a group of people studying each of the six categories outlined previously. Each workgroup included a local coordinator and secretary and reported its findings through minutes, presentations or meetings to its international LIMS coordinator. The international LIMS coordinator usually attended the meetings to provide tutorials, background information, information from other workgroups, both local and trans-Atlantic, or other information as needed. Otherwise the workgroups were given the freedom to conduct the meetings as they saw fit.
9. Workgroup objectives and activities-Phase 1
The main feature of the first phase was the series of two-day on-site demonstrations of the products. The first objectives of the workgroups were to take the basic joint lists of requirements and functions derived from earlier brainstorming sessions and to develop these requirements in order to (i) provide input into a more detailed functional specification and (ii) produce a list of questions to be put to the vendors during demonstration of their systems. The workgroups were then expected to (i) attend the demonstrations and (ii) report their findings-the answers to the questions formed the basis of the workgroups' summary reports. The use of predefined question lists during the vendor demonstrations enabled the workgroups to obtain the information they needed in addition to the information that the vendors selectively chose to give them. It also helped the groups obtain corresponding information from all vendors. At the end of this first phase the level of user involvement and activity was high. The degree of understanding of LIMS was growing steadily. The level of detail reached in developing the requirements and reviewing the vendor systems was excellent, and this provided a large part of the input into the subsequent selection decision process.
10. Phase 1: Selection decision

At the end of Phase 1, the leading system was identified by means of a thorough and structured decision analysis process which took place during the third joint meeting of the international project team. The Decision Analysis technique, named Kepner-Tregoe after its founders [6], comprises a series of steps as follows: (i) define the objective-"to select a vendor for further evaluation", (ii) establish criteria-some time was spent formulating
twenty criteria that could be used to sort the systems and also reflected the most important requirements of the system, (iii) weight the criteria-it was found easiest to start with all criteria weighted 5 on a scale of 1-10, and then to increase or decrease weightings from that starting position, (iv) scoring-the systems were scored out of 10, with 5 representing a 'satisfactory' score; the overall weighted scores were then calculated for each system, at which point an overall winner emerged, (v) consider the adverse consequences-any potential problems or adverse consequences with the decision obtained from the scoring exercise were then considered and, finally, (vi) take an overview of all the available data and make the best balanced decision-the final decision was to progress with the single system that had scored highest in the first part of the process. Information in support of the decision making was drawn from project team and workgroup findings arising from the on-site demonstrations, from the vendors' responses to our Request For Proposal, from a review of sales and system literature, and from additional project team research. The selected vendor system was then progressed to Phase 2 of the process, with the objective of gaining a more detailed understanding of the system and its suitability for both departments. A further objective was to identify any additional requirements to be specified in the system before making a purchase.
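As an illustration of the arithmetic behind steps (iii) and (iv), the short Python sketch below computes overall weighted scores; the criteria names, weights and scores are invented for the example and are not the values used in the actual evaluation.

# Hypothetical illustration of the weighted scoring in steps (iii) and (iv):
# criteria are weighted on a 1-10 scale and each system is scored out of 10
# (5 = 'satisfactory'); the overall score is the sum of weight x score.
weights = {"sample management": 8, "data management": 9,
           "instrument interfacing": 6, "vendor support": 7}

scores = {
    "System A": {"sample management": 7, "data management": 6,
                 "instrument interfacing": 5, "vendor support": 8},
    "System B": {"sample management": 6, "data management": 8,
                 "instrument interfacing": 7, "vendor support": 6},
}

def weighted_total(system_scores, weights):
    return sum(weights[c] * system_scores[c] for c in weights)

for system, system_scores in scores.items():
    print(system, weighted_total(system_scores, weights))
# System A: 8*7 + 9*6 + 6*5 + 7*8 = 196; System B: 8*6 + 9*8 + 6*7 + 7*6 = 204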
11. Phase 2: Preparation and activities

The strategy proposed for Phase 2 centred on installing the selected product on site, at both sites, for a detailed technical evaluation. Following a meeting with the vendor at the end of 1988 it was agreed that, in Sandwich, the configuration and installation for on-site evaluation would be performed in much the same way as it would be for a permanent installation. The main features would include (i) a preparatory range-finding and scheduling meeting at Sandwich to determine requirements for the configuration, (ii) two 3-day configuration meetings at the vendor site to prepare the system for installation, (iii) a 2-day programming training course to enable instrument interfacing to be evaluated in-house and (iv) 5-day installation and training at Sandwich. Once installed, the system would be evaluated by means of comprehensive and representative worked examples and scenarios. In Groton, a similar on-site evaluation was also conducted. There were slight differences in the structuring of the collaboration with the vendor and the details of the in-house activities, but the objectives were the same.
12. Workgroup activities-Phase 2

As in the first phase of this project, there was extensive workgroup involvement. The initial LIMS requirements had been identified during Phase 1. During preparation for the
on-site evaluation, the objectives of the workgroups were to develop these requirements in more detail. In Sandwich, much of this work was developed by a core workgroup comprised of selected members of the earlier workgroups. Prior to the first configuration meeting the workgroups identified sample types, representative specifications and reports, required database fields, example test procedures, calculations and specifications, and reporting requirements. Unique sample identification numbering systems were also developed. During this time the workgroups also developed worked examples, scenarios and other exercises to be used during the hands-on evaluation period and identified various aspects that could challenge the system. The workgroups contributed to collation of specimen documentation for reference during configuration-a document containing the identified methods, specifications, database fields and report types to be accommodated in the evaluation system was compiled for the vendor to use as reference. The workgroups were then expected to attend hands-on evaluation of the system during the six months, report their findings and input into the final decision. During Phase 2 the workgroups were often able to identify the exceptions to routine QC sampling, testing and reporting that would be most likely to offer a challenge to the system.
13. Configuration process

The configuration process was similar at both sites but, for clarity, only the Sandwich process will be described here. The first configuration meeting was held in January 1989 at Sandwich. During this meeting the dates for the remaining configuration and installation meetings were agreed and Pfizer were actioned to prepare for the next meeting: it was necessary to develop various unique identification schemes and to identify database fields and formats. The two 3-day configuration meetings were then held at the vendor site. During these meetings the basic vendor system was configured to match the requirements developed at Sandwich. The main activities were as follows: (i) the database fields were built into the system, (ii) our unique numbering schemes were incorporated, (iii) example screens were prepared for most basic functions, (iv) a skeleton menu structure was configured, (v) some specifications and tests were configured, (vi) examples of various types of customer report were generated for use as templates and (vii) a provisional security structure was configured. By the end of the meetings the software had been configured successfully to accommodate our starting requirements for sample tracking, method and specification management, result entry and reporting. A Pfizer programmer also attended the final part of the second meeting to be trained in the interfacing and data transaction languages, offering the opportunity for a more informed evaluation of direct instrument interfacing.
14. Installation and on-site evaluation activities

The software was installed on-site at Sandwich in May 1989 by the vendor. Hardware for instrument interfacing was also supplied. Installation took less than a day: the remainder of the installation week was spent on user and system management training. To structure the evaluation, two major scenarios were proposed for examination. These scenarios would be based on 'real' work and would identify and involve associated genuine samples and data. The first scenario was designed to look at how the system could be used for work associated with products in the advanced stage of development-a current product was used as the worked example. The evaluation considered aspects such as sample logging and tracking, configuring and using methods and specifications, and reporting. In addition to the drug substance and dosage forms, consideration was given to how the system would be used for associated raw materials, in-process samples, excipients and comparative agents, providing a comprehensive scenario for a representative advanced candidate. The fewest problems were anticipated in implementing LIMS for advanced candidates, but it was the area where it was essential that the system could accommodate our requirements. The second scenario considered early development projects. The ability of the system to handle projects from the earliest development sample, through to formal stability, was assessed. To simulate this, the early development of a candidate, examined retrospectively, was used as a model. Greater problems were anticipated in handling these early samples, since there can be rapid changes in sample types, methods and specifications. It was the area of work that would probably be implemented last, but a high degree of confidence was still required that the system could accommodate such work to a reasonable extent.
15. Hands-on activities

A range of representative test procedures was analysed and translated into tests. A series of product specifications was entered to allow evaluation of the use, maintenance and revision of specifications. Screens, menus and standard reports were developed to assess how easy it would be to routinely modify and configure the system in house in a live implementation. Custom retrievals and reports were developed, again to assess the ease and flexibility of operation. During the evaluation the core workgroup held a series of meetings and examined the system on-line. The users were able to get hands-on experience of the system and were able to identify more detailed requirements, questions and concerns. Members of the core workgroup logged in a number of samples to the system, using real laboratory data from completed work. The use of the instrument interface was examined on-line through the
interfacing of an analytical balance and a UV spectrophotometer, and the development of a program to enable content uniformity and dissolution of capsules to be tested.
16. Experiences, observations and benefits

In general it was found that the configuration developed and evaluated could accommodate most of the sample types, with varying degrees of ease and elegance. Most of the basic functionality required was provided. Because of the limited training and experience available, first efforts at setting up tests, specifications, screens and reports were time-consuming but not difficult. In general, most methods or test procedures could be modelled and translated into tests but, in particularly complex cases, the need for custom work was identified. As experience was gained with the system, so some of the future implementation issues began to emerge. As was expected, one of the challenges is the 'unknown' sample-we soon recognised that an implication of using specifications for early development candidates would be the need to be able to create and update specifications at short notice to accommodate these new, unknown, samples and to keep up with test and specification updates. The need to further consider the issue of responsibilities for system updates in a live implementation was recognised. Would test and specification management be restricted to LIMS system managers and implementers or would specified laboratory staff be able to maintain their own records? The experience of interfacing instruments was extremely worthwhile. Workgroup users were able to gain a valuable insight into practical aspects of online data acquisition and were able to provide valuable feedback on ergonometric issues such as how to manage the sharing of interfaces between users. The two-phase strategy, in particular the extended evaluation phase, has been extremely successful and has a number of benefits. An in-depth understanding of the functionality, flexibility, ease of use and scope of the system has been gained. A better understanding of the product now enables consideration of the role of the vendor LIMS product within a wider information management strategy: the long term implementation and evolution of the system can be planned. The evaluation has helped to assess the resource implications for implementation: it was recognised that the resource required throughout our implementation would depend partly on how many requirements are met by the off-the-shelf purchase and how much customisation (either in-house or by consultancy) is required. The groups are now better placed to assess this and can also estimate and plan for the operational support of the system, including user training. As the role of the workgroups also now changes from evaluation to implementation, the benefits of the workgroup approach become increasingly evident. Through their involvement in all stages of analysis, specification and evaluation, the workgroup members have developed a good understanding of LIMS in general and a working knowledge
of the selected product. They have a greater awareness of the likely impact of LIMS in the workplace and realise the need to consider the future implementation of LIMS when, for example, evaluating instruments for purchase, writing methods or automating procedures. There is also an enthusiasm for the benefits that they expect from the implementation of LIMS in their area. The project team has collaborated throughout the selection process in order to achieve the joint objectives and to make the best use of available resources on both sides of the Atlantic. Where appropriate, certain aspects of the evaluation have been shared, to be examined in greater detail by one or other group, for example where the requirements of both departments were the same. In other areas of study it has been necessary for both groups to evaluate the system to ensure that specific local requirements were properly addressed and assessed. Through this collaboration both groups have a clearer idea of how to meet our unification objectives with this system. A clearly defined objective at the start of this project was to implement the same LIMS product at both our sites. What was not clearly defined at that stage was how similar the local implementations of that single product should be. Knowing the technology as well as the goals, it is possible now to identify prerequisites to developing this technical unity.
17. Completion of Phase 2

At the conclusion of Phase 2 of the evaluation it was agreed that the system under evaluation should be purchased for implementation in our two departments. Phase 2 has provided the confidence that the system can be implemented for handling our more routine work-for example stability and clinical trial samples. There is less certainty about how the system will handle non-QC samples, but preliminary evaluations have been encouraging. From an understanding of the expected enhancements and the vendor's mission and product commitment, there is confidence that the system can form the nucleus of a joint LIMS implementation. Most of the basic requirements are met satisfactorily, either as delivered functions or achievable by customisation.
18. Future plans-Implementation

Both laboratories place value on an implementation that is modular in nature and expect to prototype the selected LIMS in a portion of each department prior to implementation laboratory-wide. The objectives in prototyping include continued collaboration and user involvement through workgroups to (i) learn the LIMS product, (ii) uncover problems earlier with less impact on laboratory operations, (iii) gain implementation experience, (iv) identify future product development and implementation requirements, both shared and site-specific, (v) enhance the user interface and develop training requirements and (vi) determine system validation needs.
The first area of effort in Phase 3 is to prove the LIMS configuration. Then follows a refinement of the requirements for advanced candidates, i.e., for stability work, followed by implementation in that area. Effort in other associated areas such as interfacing with our chromatographic data acquisition system will also be started.
19. Conclusion

The two-phase evaluation strategy has been highly successful, resulting in the selection of a single vendor LIMS for our two departments. The selection decision has been made with a well-developed understanding of requirements and of the evaluated product. The evaluation has also given an insight into a number of implementation issues and both groups now proceed into the next phase with the support and commitment of the project team and future users. We are certain that the international effort and user commitment will continue to contribute to the goal of a unified LIMS for the two departments and continue to knit the two departments together as a whole.
Acknowledgement

The authors thank Dr. J. C. Berridge for his help during the preparation of this manuscript.
References

1. McDowall RD. Laboratory Information Management Systems. Wilmslow, England: Sigma Press, 1987.
2. Berthrong PG. Computerization in a Pharmaceutical QC Laboratory. Am Lab 1984; 16(2): 20.
3. Cooper EL, Turkel EJ. Performance of a Paperless Laboratory. Am Lab 1988; 20(3): 42.
4. Henderson AD. Use of the Beckman CALS System in Quality Control. Anal Proc 1988; 25: 147.
5. Dessy RE. Laboratory Information Management Systems: Part II. Anal Chem 1983; 55(2): 211A.
6. Kepner CH, Tregoe BB. The New Rational Manager. Princeton, New Jersey: Princeton Research Press, 1981.
CHAPTER 29
A New Pharmacokinetic LIMS-System (KINLIMS) with Special Emphasis on GLP U. Timm and B. Hirth Pharmaceutical Research and Technology Management Departments, F. Hoffmann-La Roche Ltd, 4002 Basel, Switzerland
Summary

A new, 'in-house'-developed laboratory information and management system for pharmacokinetic studies (KINLIMS) is presented. Chromatography data systems are connected via PCs, terminal servers and Ethernet with a central DEC-VAX computer. Bioanalytical data acquired in the laboratories are sent electronically to a central ORACLE database for GLP-conform data handling. The data are then exported 'on-line' to RS/1 programs for subsequent kinetic treatment. The main benefits of KINLIMS with respect to 'Good Laboratory Practice' may be summarized as follows:
- No 'off-line' data transfer steps involved between acquisition and pharmacokinetic treatment.
- KINLIMS ensures adherence to GLP with respect to data handling/manipulation as laid down in departmental SOPs.
- All data manipulation steps and administration activities associated with data handling are documented in an electronic GLP-journal.
- A sophisticated built-in authorization hierarchy ensures GLP-conforming system handling.
- A future-oriented, system-independent and secure concept for the archiving of KINLIMS-produced data.
- An automated validation procedure allows rapid re-validation of all vital KINLIMS-functions.
1. Introduction

Pharmacokinetic studies dealing with the time course of absorption, distribution and excretion of a drug are an important step in the development of new pharmaceuticals.
Drug analysis in biological fluids is an indispensable part of these investigations and represents a time-consuming and cost-intensive factor during drug development. Modern data systems are, therefore, widely used to improve the efficiency of collection, reduction, and documentation of analytical data, with concomitant reduction in cost per sample analyzed. The next step in computerisation of analytical laboratories is the introduction of laboratory information management systems (LIMS), and this is now being intensively investigated. The aim of these new systems is to co-ordinate the automation of analytical instruments and data handling and extend this to the distribution of information, management and control within the overall structure of an organisation. According to our experience, none of the available commercial LIMS-systems fulfill all the data processing and management requirements encountered in trace drug analysis in biological fluids. For this reason, a LIMS-system with special emphasis on pharmacokinetic studies (KINLIMS) has been developed in our company. The main advantages of the new system may be summarized as follows:
- Easy and quick input of parameters from all kinds of pre-clinical and clinical studies.
- All data import, handling and export steps are in compliance with Good Laboratory Practice (GLP) regulations.
- Duration of analysis is considerably decreased by avoiding manual data transfer steps between acquisition and pharmacokinetic treatment of data.
KINLIMS is a very complex and sophisticated system (it needed 4 man-years for developing the concept and realizing the application). A detailed description of all aspects of the system would obviously be beyond the scope of this paper. For this reason, the paper briefly describes the concept and functionality, and then concentrates on a single aspect of the new system, namely the benefits of KINLIMS with respect to GLP.
2. Brief description of the system

KINLIMS is installed on a central DEC-VAX computer and uses VMS as the operating system, ORACLE as the central database and RS/1 for graphics and statistics. The application was developed 'in-house', using the 4th generation development tool UNIFACE. According to the hardware concept shown in Figure 1, chromatography data systems in the laboratories are connected via personal computers (PCs) and terminal servers with the central VAX computer, using Ethernet for data transport. Up to now, three different chromatography data systems are supported by the system, including SP 4200 integrators (Spectra Physics), Nelson 4430 XWZ data systems (Perkin-Elmer) and Nelson 2600 data systems (Perkin-Elmer). In order to import the acquired data in the form of standardized daily reports to the host computer, all data systems have been equipped with user programs, which were developed in our own laboratories. Transfer of daily reports to the central computer is achieved by the protocol-driven programs Kermit (SP 4200, Nelson 2600) and HP Datapass software (Nelson 4430 XWZ).
Figure 1. KINLIMS hardware concept.
VT-200 compatible terminals or PCs with terminal emulation are used to handle the application and to provide the system with information about the kinetic study, investigated drug, etc. Data output is possible by means of local printers or central QMS-, PostScript- and LN03-laser printers.
3. Overview of KINLIMS-functions

KINLIMS offers three different types of functions (project, management and system). To perform a function, the user must have obtained the corresponding authorization from his KINLIMS-manager (see paragraph 4.4). The project-functions represent the heart of the application and are applied to laboratory projects (see Fig. 2). A lab-project represents a complete pharmacokinetic study or only a part of it (e.g. the urine samples). For test purposes (training, validation, etc.), it is possible to define test projects which may be deleted after use. Only three persons (supervisor, analyst, pharmacokineticist), together with their seniors and deputies, have access to the data and can carry out project-functions (details are given below). The status of a lab-project can be either 'planned', 'ongoing', 'closed', or 'archived', depending on the progress within the lab-project. During initialization, the status is 'planned'. After activation, the status changes to 'ongoing' and the lab-project is now ready for data acquisition. No further data handling is possible in lab-projects with status 'closed', while only archived lab-projects can be deleted from the system. During the initialization phase, a new lab-project is defined and all relevant information concerning the investigated drug, design of the kinetic study, involved samples (including calibration and quality control (QC) samples), and analytical methodology is entered
Figure 2. KINLIMS project-functions. For each project-function the name of the activity, the status of the lab-project and the necessary authorization are shown (S = Supervisor, A = Analyst, P = Pharmacokineticist, I = Import).
to the system. Inputs are made either directly via input masks, or by selecting the information from pre-defined dictionaries, which are maintained by KINLIMS-managers. The initialization procedure has been specifically designed for treatment of pharmacokinetic studies and allows an easy and rapid input of sample descriptions from all kinds of preclinical and clinical studies, such as experimental kinetic studies, toxicokinetic studies, tolerance studies, bioavailability studies, randomized multiple dose studies, etc. After activation by the supervisor, the new lab-project is ready for data acquisition. Analytical data are imported in the form of daily reports generated by chromatography data systems and transferred electronically to the host computer. The daily reports contain only reduced data from a single day, namely names, concentrations and qualifying remarks for all three sample types, as well as information about the quality of the calibration for that particular day. Daily reports can also be corrected under GLP-control after acquisition. However, corrections must be justified by means of a comment (details are given later). The stored data can be selectively retrieved from the database, displayed in 'working tables' and treated statistically in compliance with GLP (exclusion of invalid data, removal of statistical outliers, calculation of means and relative standard deviations for replicate determinations, statistics with calibration and QC data). The quality of the data
can be evaluated by exporting the data to RS/1 and generating graphical outputs, such as quality control charts, cumulation curves, etc. For laboratory management, various reports, tables, lists, etc., can be generated and displayed on the screen or printed out on local and central printers. After data release by the supervisor, analytical reports for calibration, QC and unknown samples can be generated. The pharmacokineticist transfers the analytical end data 'on-line' into his private ORACLE-account and evaluates kinetic parameters by means of pharmacokinetic programs based on RS/1. Completed lab-projects may be closed and then archived for long-term data storage. Archived lab-projects can be removed from the database, leaving only some cardinal data in the system to allow the management of archived projects. Management-functions are used to gain management data from all individual lab-projects, e.g. overview and description of existing or archived lab-projects, number of analysed samples per year, number of released concentration values per year with respect to various parameters, such as applied analytical methodology, involved species and biological fluids, etc. System-functions are used to maintain the application. The authorized KINLIMS-manager can define new users, edit existing user definitions, maintain dictionaries for species, biological fluids, etc., and run the automated re-validation procedure.
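As a loose illustration of the lab-project lifecycle described at the start of this section ('planned', 'ongoing', 'closed', 'archived'), the following Python sketch encodes the transition rules given in the text; the function names and data layout are invented and this is not KINLIMS code.

# Sketch of the lab-project status lifecycle: 'planned' -> 'ongoing' ->
# 'closed' -> 'archived'; only archived lab-projects may be deleted, and no
# further data handling is possible once a lab-project is closed.
ALLOWED_TRANSITIONS = {
    "planned": {"ongoing"},   # activation by the supervisor
    "ongoing": {"closed"},    # work completed
    "closed": {"archived"},   # long-term storage
    "archived": set(),        # terminal status
}

def change_status(current, new):
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal status change: {current} -> {new}")
    return new

def may_acquire_data(status):
    return status == "ongoing"

def may_delete(status):
    return status == "archived"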
4. Benefits of system with respect to GLP

The Federal Good Laboratory Practices regulations for all non-clinical laboratory studies, and the impending Good Clinical Practices (GCP) regulations, require that all analytical data for pivotal studies included in an IND/NDA submission meet specific criteria for acceptability. KINLIMS plays a central role during the analysis of pivotal studies and, therefore, underlies the principles of Good Laboratory Computing Practice (GLCP). Considerable effort has been invested to incorporate major GLCP principles into the concept of KINLIMS, including:
- Data integrity: all analytical data and supporting information is maintained in a secure and consistent manner through all steps between import and archiving.
- System integrity: the system is developed, maintained, operated and used according to the highest standards of computer technology.
- Traceability: suitable controls have been incorporated into the system to ensure that data handling is performed in compliance with GLP and that particular activities, including project-, management- and system-functions, are performed by the correct people.
The following six paragraphs indicate in which way these major GLCP principles were realised.
Figure 3. Data import.
4.1 On-line data transfer

Before KINLIMS was introduced to our laboratories, daily reports were only available in the form of printouts. Pocket calculators were used for data processing and final data had to be entered manually into kinetic programs. In KINLIMS no 'off-line' data transfer step is involved between acquisition and pharmacokinetic treatment, thus avoiding any time-consuming and sometimes faulty transcription from raw data to final pharmacokinetic report. Figure 3 shows schematically the import of acquired data into the central database. After establishing automatically the connection to the host computer (user name and password are required), the transfer routine is started. All relevant information is extracted from the imported daily reports, stored in a temporary file and tested for syntax and logic. In case of errors, the import program sends error messages to the chromatography data system and rejects the imported data. Otherwise, the temporary data are stored in the database and the user receives a message that the data import was successful. In Figure 4, the on-line export of data from KINLIMS into other programs is shown schematically. The quality of acquired data can be graphically evaluated in the following way: after starting an RS/1-session via KINLIMS, the data are transferred temporarily from ORACLE to RS/1 and processed by means of RPL-procedures. At the end of the RS/1-session, the system automatically returns to KINLIMS. For end data treatment, a number of interfaces to pharmacokinetic programs or text-systems have been developed. The data are transferred in the form of ORACLE-tables from the database into the private
Figure 4. Data export.
accounts of end users. RPL-procedures are started, picking up the exported data and preparing suitable RS/1-tables for direct data input into kinetic programs based on RS/1 (INDEPEND, ELSFIT), or into text-systems.
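To make the import step of Figure 3 concrete, here is a rough Python caricature of the 'extract, check, reject or store' logic; the field names and the particular syntax and logic checks are assumptions, since the real daily-report format is not reproduced in this paper.

# Caricature of the Figure 3 import routine: extract the daily report into
# a temporary structure, test it for syntax and logic, then either reject it
# with error messages (returned to the data system) or store it.
def check_daily_report(report):
    errors = []
    for field in ("study", "date", "samples"):          # assumed fields
        if field not in report:
            errors.append(f"missing field: {field}")
    for sample in report.get("samples", []):
        if "name" not in sample or "conc" not in sample:
            errors.append(f"incomplete sample record: {sample}")
    return errors

def import_daily_report(report, database):
    errors = check_daily_report(report)
    if errors:
        return "rejected", errors      # error messages sent back to the PC
    database.append(report)            # temporary data stored in the database
    return "stored", []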
4.2 Data handling according to GLP

KINLIMS ensures adherence to GLP with respect to data manipulation/handling, as laid down in departmental standard operating procedures (SOPs). All analytical laboratories work according to the same quality criteria with respect to treatment, rejection and protocolling of data. This may be illustrated by the exclusion of invalid raw data within the KINLIMS-system. All laboratory data systems linked to KINLIMS carefully monitor the quality of data during acquisition and, if necessary, flag invalid concentration values with 'qualifying remarks'. For example, concentrations falling outside the calibrated range receive the remark 'OUT' (above calibrated range) or 'BLC' (below limit of calibration), respectively, as demonstrated in Table 1. All imported data flagged with a qualifying remark are then identified by KINLIMS as invalid data and excluded automatically from any further data treatment. Exclusion of suspicious data is also possible after data import at the KINLIMS-level. For example, the user may manually flag statistical outliers with 'EXC'. It is also possible to reject all data from a daily report, or to correct individual sample names or qualifying remarks in daily reports by means of a special GLP-correction routine. For GLP-reasons,
TABLE 1
Exclusion of invalid raw data in 'WORKING TABLES'.

SAMPLE NO  SAMPLE TIME  DATE      CONC FOUND  REM A  REM B  CONC MEAN  RSD* (%)  N
UA100      0 m          26.03.90    -2.43     NOP
UA100      0 m          27.03.90     3.24     BLC
UA101      10 m         26.03.90   100.24     OUT
UA101      10 m         27.03.90   130.11                    130.11     -         1
UA101      10 m         28.03.90    83.56            EXC
UA102      30 m         26.03.90   210.98
UA102      30 m         27.03.90   205.13                    208.06     1.99      2
UA102      30 m         28.03.90   312.45            EXC

* Relative standard deviation
REMARK A (imported together with the values by the laboratory data system):
NOP: No Peak Found
BLC: Below Limit of Calibration
OUT: Out of Quality Range
EXC: Excluded from further treatment
CLE: Calibration Level Excluded
AAR: Acquired After Release
REMARK B (manually set by the user on KINLIMS-level):
EXC: Excluded from further data treatment
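The replicate treatment shown in Table 1 can be mimicked with a few lines of Python: values carrying a qualifying remark are excluded, and the mean and relative standard deviation are calculated from the remaining replicates. Using the sample (n-1) standard deviation reproduces the 1.99% RSD given for UA102; the code itself is only an illustration, not part of KINLIMS.

# Illustration of the working-table treatment in Table 1: replicates flagged
# with a qualifying remark (REM A or REM B) are excluded; the mean and the
# relative standard deviation (sample SD, n-1) are computed from the rest.
from statistics import mean, stdev

def replicate_stats(replicates):
    # replicates: list of (concentration, remark) pairs; remark is None if valid
    valid = [conc for conc, remark in replicates if remark is None]
    if not valid:
        return None, 0, None
    m = mean(valid)
    rsd = 100 * stdev(valid) / m if len(valid) > 1 else None
    return m, len(valid), rsd

# Sample UA102 from Table 1: the 312.45 replicate was manually excluded ('EXC')
print(replicate_stats([(210.98, None), (205.13, None), (312.45, "EXC")]))
# mean ~ 208.06, n = 2, RSD ~ 1.99 %, as reported for UA102 in Table 1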
Figure 5. GLP-journal. (Screen panels: a 'Comment for GLP-Journal' prompt requesting a comment, e.g. for an excluded daily report, and the 'GLP-Journal, data correction' listing with Date, Time, User, Activity and Comment columns.)
TABLE 2
List of activities recorded in the GLP-journal.

ACTIVITY                                                    COMMENT
Activate lab-project                                        No
Import of daily report                                      No
Extend lab-project by new parameters                        No
Extend lab-project by new samples                           No
Exclude data from further treatment                         Yes
Withdraw 'Exclude data from further treatment'              Yes
Exclude daily report from further treatment                 Yes
Modify header of daily report                               Yes
Rename quality control sample                               Yes
Rename unknown sample                                       Yes
Change qualifying remark of quality control sample          Yes
Change qualifying remark of unknown sample                  Yes
Release data                                                No
Withdraw 'Release data'                                     Yes
Close lab-project                                           No
Archive lab-project                                         No
all handling steps associated with modification or rejection of data are documented in an electronic GLP-journal, as described in paragraph 4.3.
4.3 Electronic GLP-journal

One major aim of GLCP is the inclusion of a history into the LIMS, showing which person was responsible for the various activities carried out on particular items of information. For this reason, all important activities in ongoing lab-projects are documented, with date, time, user and type of activity, in an electronic GLP-journal. In the case of critical data handling steps, the system even asks for a comment, which is also protocolled in the GLP-journal as shown in Figure 5. The GLP-journal is divided into three sections, dealing with lab-project activities, working table activities and corrections of daily reports. The user has only read-access to the entries and, therefore, cannot overwrite or even delete inputs in the GLP-journal. Table 2 shows all activities underlying the GLP-control and indicates in which case a comment is requested by the system.
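A toy Python sketch of such an append-only journal entry is given below; the activities that demand a comment follow Table 2, while the data structure and function names are invented for illustration.

# Toy sketch of an append-only GLP-journal: every entry records date, time,
# user and activity; critical activities (see Table 2) must carry a comment.
from datetime import datetime

ACTIVITIES_REQUIRING_COMMENT = {     # subset taken from Table 2
    "Exclude data from further treatment",
    "Exclude daily report from further treatment",
    "Modify header of daily report",
    "Rename unknown sample",
    "Change qualifying remark of unknown sample",
    "Withdraw 'Release data'",
}

def add_journal_entry(journal, user, activity, comment=None):
    if activity in ACTIVITIES_REQUIRING_COMMENT and not comment:
        raise ValueError("this activity must be justified by a comment")
    journal.append({
        "timestamp": datetime.now().isoformat(timespec="minutes"),
        "user": user,
        "activity": activity,
        "comment": comment,
    })   # entries are only ever appended; there is no edit or delete function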
4.4 Authorization hierarchy

A sophisticated built-in authorization hierarchy ensures GLP-conform system handling and guarantees a high standard of data security. Four authorization levels have been installed to control access to the system (Fig. 6).
Figure 6. Authorization levels. (First level, operating system: VMS user-ID, username and password. Second level, VMS-identifiers: KINLIMS-Owner, KINLIMS-Manager, KINLIMS-User. Third level, functions: project, data import, project definition. Fourth level, project: supervisor, analyst, pharmacokineticist; data access also granted to seniors and deputies.)
The first two levels are controlled by VMS. All users require a VMS user identification and a valid password and must be authorized for KINLIMS. Only users managing the application receive the VMS-identifier 'KINLIMS-manager', and have access to programs and VMS-identifiers. The third authorization level concerns global KINLIMS-functions such as 'project', 'project definition', 'data import', 'management' and 'system'. Depending on the responsibility within KINLIMS, a user may be authorized for one or more of these functions. All staff members obtain authorization for project-functions, while only laboratory supervisors are also authorized to define new lab-projects. Only staff members with special training obtain authorization for import of daily reports. The management function is dependent on the position of the user in the organisation: managers and group leaders can search through all lab-projects of their group members, while a laboratory supervisor can only manage his own lab-projects. For security reasons, only two persons in each department (the KINLIMS-manager and his deputy) are authorized to carry out system-functions. The fourth level regulates the privileges within a particular lab-project. Only three people (supervisor, analyst, and pharmacokineticist), together with their seniors and deputies, have access to the data, and can perform project-functions. However, according to the different responsibilities in the pharmacokinetic study, they have different privileges within the lab-project, as shown in Figure 2.
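The third- and fourth-level checks described above might be pictured as follows; the dictionary layout, names and function are assumptions made purely for illustration.

# Rough picture of the authorization check: the third level requires a global
# KINLIMS-function authorization, and the fourth level additionally requires a
# role (supervisor, analyst, pharmacokineticist, or one of their seniors or
# deputies) within the particular lab-project before project-functions run.
def may_perform(user, function, lab_project=None):
    if function not in user["authorized_functions"]:          # third level
        return False
    if function == "project" and lab_project is not None:     # fourth level
        return user["name"] in lab_project["authorized_people"]
    return True

# Example with invented data:
analyst = {"name": "A. Smith", "authorized_functions": {"project", "data import"}}
lab_project = {"authorized_people": {"A. Smith", "B. Jones", "C. Brown"}}
print(may_perform(analyst, "project", lab_project))   # True
print(may_perform(analyst, "system"))                 # False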
4.5 Archiving of data

For GLP reasons, all KINLIMS-produced data from pharmacokinetic studies must be retained for a period of at least 10 years after the last introduction of the pharmaceutical.
Figure 7. KINLIMS archiving concept.
Keeping all the data on-line would need a huge database and lead to serious response-time problems. For this reason, a secure, system-independent and future-oriented concept for the archiving of KINLIMS-produced data was developed (Fig. 7). A large report-file, including the lab-project description, all imported daily reports, the working tables (showing all data manipulations) and the GLP-journal, is produced together with an ASCII-file containing all released end data of the lab-project; a rough sketch of this package is given after the list below. Report-file and end data-file are sent on-line to the electronic archive and stored finally on tapes, cassettes, optical discs, etc. During the dearchiving process the report-file is sent back into a defined VMS-account and can be displayed on the screen or printed out without the need for any special utilities. The dearchived end data-file is loaded into a private ORACLE table which may be re-used for data transfer into pharmacokinetic programs, as already described. The main advantages of the archiving concept may be summarized as follows:
- Secure long-term data storage: several copies of the archived data files are produced and stored at different locations. Stored data are protected against modification and accidental loss. Only authorized persons have access to the archive rooms and are allowed to dearchive data.
- System-independent data storage: KINLIMS-data are not archived with their data structure and, therefore, need not be restored into the database during dearchiving. Interpretation of the dearchived ASCII report- and end data-files is possible without any special tools, and is not dependent on VMS, ORACLE, UNIFACE or the KINLIMS-application itself.
- Future-oriented data storage: long-term storage of KINLIMS-produced data saves space in expensive theftproof, watertight, and fireproof storage rooms. Dearchiving is reasonably fast and is completed in less than 2 hours.
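The archive package itself (one self-contained report-file plus one plain ASCII end data-file per lab-project) could be caricatured as below; the file names, layout and field names are invented, the point being only that both files remain readable without KINLIMS, ORACLE or VMS.

# Caricature of the archive package: a human-readable report file containing
# the lab-project description, daily reports, working tables and GLP-journal,
# plus a plain ASCII end-data file with the released results.
import csv

def build_archive_package(project, prefix):
    with open(prefix + "_report.txt", "w") as report:
        report.write(project["description"] + "\n\n")
        for section in ("daily_reports", "working_tables", "glp_journal"):
            report.write("\n".join(project[section]) + "\n\n")
    with open(prefix + "_enddata.txt", "w", newline="") as enddata:
        writer = csv.writer(enddata)
        writer.writerow(["sample", "time", "concentration"])
        writer.writerows(project["released_end_data"])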
4.6 Automated validation

All LIMS-systems working under GLP conditions have to be revalidated at regular time intervals, or after relevant changes in hardware and software. Because of the complexity of the application, re-validation of all vital KINLIMS-functions is a tedious and time-consuming task. For this reason, a procedure for automatic validation has been developed, speeding up the re-validation process and needing only minimal input by the user. In the 'Learning Mode', the operator prepares validation modules for all relevant lab-project functions, including initialization of a validation test-project, import of test data sets, data correction, data handling, data reporting, data export, data security, and data archiving. In the 'Executive Mode', execution of one or more validation modules is started and the application is tested automatically.
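One way to picture the two modes, purely as an illustration and not the actual KINLIMS mechanism, is a registry of named validation modules that is filled in the 'Learning Mode' and replayed in the 'Executive Mode':

# Illustrative picture of the two-mode re-validation: 'learning' registers
# named validation modules (a callable plus its expected result); 'executive'
# replays all registered modules and reports pass/fail for each.
validation_modules = {}

def learn(name, test_function, expected_result):
    validation_modules[name] = (test_function, expected_result)

def execute_all():
    results = {}
    for name, (test_function, expected_result) in validation_modules.items():
        results[name] = (test_function() == expected_result)
    return results

learn("arithmetic sanity check", lambda: 2 + 2, 4)
print(execute_all())   # {'arithmetic sanity check': True}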
Acknowledgements

The authors wish to thank Mr. U. Blattler, Mr. A. Rook (Multihouse, The Netherlands) and Mr. Y. Scherlen for their participation during development of the KINLIMS-system, Dr. H. Eggers, Dr. J. Kneer and Mr. M. Zell for helpful and stimulating discussions, and Dr. H. A. Welker and Mr. G. Zaidman for developing RPL-procedures in connection with KINLIMS. Thanks are also due to Mr. H. Suter for drawing the figures. Finally, the authors are grateful to Dr. D. Dell for continual encouragement during development of the system and for correcting the manuscript.
CHAPTER 30
Validation and Certification: Commercial and Regulatory Aspects M. Murphy Sema Group Systems Ltd, Wilmslow, UK
1. Introduction

There has been a significant upsurge of interest in the subject of computer system validation over the last few years, largely triggered by the pronouncements of regulatory bodies. Much of this interest has been fearful, concentrating on the difficulties, timescales and costs involved in validation, and exacerbated by the relative lack of knowledge which scientific professionals have of computers and their works. I would like to try and dispel some of this fear and trepidation, so my intention in this paper is to look at the wider reasons for going through a validation exercise, and to show the various benefits which an organisation can obtain from its successful completion. In order to achieve these ends this paper will be divided into four main areas, namely:
- a review of the current position in terms of regulation and certification, and the likely future trends;
- a discussion of what validation actually is, and what it is intended to achieve;
- a brief look at how one actually goes about doing it, and
- a consideration of the benefits which flow from the completion of the process.
2. Regulatory situation report: Now and in the future

I do not propose here to re-cover the well trodden ground of developments in the regulator's position on laboratory computer systems. Nevertheless, a quick look at the status quo will help to set the framework for the rest of the paper. The present position is that the US FDA and EPA, and the UK Department of Health, regard it as essential for certain categories of laboratory computer systems to be validated. This applies to systems involved in the pre-clinical safety assessment of drugs (human and veterinary), food additives, cosmetics, agro-chemicals, etc. Even though this covers a relatively small subset of what might be called 'commercial science', this requirement has been enough to worry very large numbers of people. The global nature of the markets affected by these requirements is such that there is little comfort to be
gained from the fact that most other nations are sticking strictly to the OECD requirements on GLP, which tends not to say much about computer systems at all. Those of you who are looking to the regulators for a degree of comfort may find it in the fact that no regulatory position has been adopted on the use of computers in other parts of the product development cycle, namely research (as opposed to development), clinical trialing or manufacturing. Be warned, however, that it does not appear that this situation will obtain in perpetuity. On one hand, meetings have already taken place in the UK to consider whether a similar set of requirements should be built into GMP inspections-a development that the US would be unlikely to ignore. On the other hand, the CANDA project (in which NDAs are being submitted to FDA electronically, albeit on a trial basis) currently mixes data from validated and unvalidated systems (e.g., clinical)-a position which cannot be tenable in the long term. Having seen that the scope of mandatory regulation is likely to increase, what about the 'voluntary' sector? There is no doubt that the major commercial change affecting manufacturing industry in the late 1980s and early 1990s is the emphasis on quality. In the mid 1980s quality was seen by both buyers and sellers as an optional extra which was available if you were prepared to pay for it. This is no longer true! In today's markets it is necessary to be able to demonstrate quality in a commodity in order to be able to sell it at all. In science related industries this not only increases the laboratory work loads (thereby increasing the reliance on computer systems) but also raises the need to demonstrate the reliability of the data and information produced. Presently this development is having the effect of requiring more and more organisations to seek certification or accreditation to recognised quality standards such as ISO 9000 (or some equivalent). The rapid approach of the European Single Market in 1992, with its emphasis on mutual recognition of test results, is creating additional pressure for the laboratories of manufacturing companies to seek accreditation for the ability to carry out specific CEN or CENELEC tests. This pressure can only continue, and will be increased by the implications of Product Liability legislation and so forth. What has this to do with the validation of computer systems? Simply this: the accreditation authorities have begun to take notice of the work that has been done in both the pharmaceutical and computer industries, and are seriously considering its application to the laboratories they assess. The day when validation and 'Good Computing Practice' become a requirement in these areas cannot be long delayed. If we look at the position in purely commercial terms then a number of lessons can be drawn:
1. The pressure of work on laboratories is going to increase with the increasing need to
demonstrate quality and safety in products.
2. The falling numbers of science graduates means that this can only be accommodated by increased efficiency, which inevitably involves the use of computers.
3. Market and regulatory pressures will increasingly require those computer systems to be validated in order to demonstrate the reliability of the information they produce. The bottom line therefore is that if your computer systems are not validated, or worse still cannot be validated, your commercial position will be seriously and increasingly impaired.
3. What is validation and what are we validating?

Having established that the failure to validate laboratory computer systems is (at best) commercially undesirable, it would be good to consider the basics of what exactly validation is. The commonly accepted definition of validation, for regulatory purposes at least, is that produced by the IEEE, namely: "(1) The process of evaluating a system at the end of the development process to assure compliance with user requirements. (2) The process of evaluating software at the end of the software development process to ensure compliance with software
requirements". Clearly there is some room for interpretation in this definition-if that were not so there would be much less worry and discussion about the whole subject. It is also worth remembering that the definition was not coined with GLP specifically in mind. It is therefore worthwhile to look at the definition and provide a more concrete interpretation. The first difficult point to arise is the word "system". What does this mean? It must mean more than just software, or there would be no need for part 2 of the definition; what is the GLP interpretation of the term? I suggest that the term "system" needs to be interpreted widely, and includes the software, the hardware, the documentation and the people involved. The second difficult area is what is meant by "the end of the development process"? This has been interpreted as meaning the end of system or integration testing, i.e., before the system goes off to the user site. I suggest that this interpretation is inadequate as it excludes consideration of the hardware on which the system will run, a proportion of the documentation (e.g., SOPs) and the knowledge and training of the people who are actually going to use the system. That being so, "the end of the development process" must mean the end of commissioning, i.e., the point at which the system is installed in its target environment. This, I believe, is the commonly accepted view. The only remaining problem with part 1 of the definition is the term "user requirements". It is likely that the originators of the definition meant this to mean conformity to the agreed specification, but will this do for a GLP environment (whether formally
regulated or not)? Clearly conformity with the specification is important, in the sense that the system does what it should, but we need to consider also whether the system provides functions or allows actions which would not be acceptable for GLP purposes. It seems necessary to presume the existence of a user requirement for certain functions to be present (such as audit trail) almost regardless of what the specification actually says. Given these interpretations we can provide an expression of part 1 of the definition of validation in terms of a concrete set of questions to be answered. These are:
1. Are the functions of the system restricted to those which are acceptable under GLP?
2. Within that, does the system provide the functions that the user requires?
3. Are those functions properly documented in terms of user manuals, SOPs, operators' guides, etc?
4. Are the people who will use the system (including computer operations staff, if any) properly trained?
5. Does the installed system actually work in its production environment?
It might be thought that if part 1 of the definition of validation covers all of these points
then part 2 is redundant. Part 2 refers to "the process of evaluating software at the end of the software development process to ensure compliance with software requirements". Since this cannot mean user requirements, which are covered by part 1, we are forced to conclude that this is related to the requirement that laboratory equipment needs to be properly designed and produced, so as to be capable of maintenance to the standard of the best of current good practice. Expressed bluntly, if perhaps contentiously, this boils down to the simple question: "have the software developers done their job properly?" This is probably the part of validation which most worries the scientist and most annoys the computer specialist. The latter often objects strongly to a perceived implication of incompetence (much as many scientists did in the early days of GLP itself). The scientist, on the other hand, is frequently conscious of his lack of knowledge. The problem is not made easier by the fact that there is no single, universal statement of what constitutes good practice in system development. Nevertheless there is a degree of consensus on what should and should not be done in a properly run development project. It is therefore possible to break the second part of the definition down into a further series of concrete questions; they are these:
1. Are there SOPs (even if referred to by a different name) covering the software development process?
2. Do they require rigorous specification and design in detail?
3. Do they require thorough testing?
4. Do they require comprehensive quality assurance and quality control?
5. Do they require rigorous change management processes?
6. Do they require the use of suitably qualified personnel?
7. Have the SOPs been consistently and demonstrably applied?
In thus breaking the definition of validation down into discrete and specific questions I hope to have provided a practical basis from which validation can commence. Before moving on to consider the mechanics of validation, however, I would like to make one further point. Many people start to think about validating systems at the end of the development. Even the briefest consideration of the questions that need to be asked makes it obvious that this is far too late. If you do not consider validation and quality from the outset you are almost certainly doomed. If the system is not built to be capable of being validated then no amount of validation will help you.
4. The mechanics of validation

Now that we have determined the objectives of the study (for that is what validation is) we can start to look at the way we are going to go about doing it. I do not propose to talk about this topic in great detail as it deserves a paper to itself (and we will shortly hear Hr. Ziegler presenting just such a paper). It is nevertheless necessary to give an overview of the subject in order to provide the basis for discussing the positive benefits which arise from validating a system. The first essential is to determine what you are going to regard as raw data and what you are not. In a GLP context this decision is fundamental to the validation of a system, since any system which has no dealings with raw data may well be excluded from the need for validation. This is not to say, however, that deciding that all raw data will be on paper removes all need for validation as, for example, it is virtually certain that you will need to push electronic copies of the data through such things as statistics packages in order to create reduced data for reports. In any event most people with terminals or PCs on their desk will refer to electronic copies of data before they will walk a hundred meters to go and find the original paper. The next step is to define the system you are going to validate. This activity has two components: the first being to define the boundaries of the system, and the second to determine the components within that boundary. The definition of the boundary, which I often refer to as the "domain of compliance", is important partly because we want to be sure that we cover all parts of the system which handle raw data, and partly to ensure that we do not expend effort on systems that do not require it. The identification of the component parts of the system is important because we need to know exactly what skills and knowledge are required to make an adequate assessment and because it helps in estimating the time and effort required to do it, but more of that later. Before discussing what we are going to do with all this 'configuration' information it is worth considering the importance of selecting the right team. The question of skills and
experience is just as important as in any other GLP study. The information gathered during the project phases already described can be placed alongside the list of questions to be addressed in order to determine the expertise necessary to obtain reliable answers. Whilst it is impossible here to provide hard and fast guidelines on team selection, it is possible to make some observations which may be helpful:
1. Do not be afraid to involve your specialist computing staff. In-house specialists can provide an independent assessment of externally produced software, or it may be possible to get such an assessment of an internally produced system from some other part of the organisation. It is essential to have a competent external assessment of the software, and equally to have the developers around to answer questions as they arise.
2. Be sure you get the application aspects of the system (i.e., what it does) assessed by people who have been and will continue to be involved in the areas being assessed. Just as you would not ask a junior technician to validate (e.g.) an advanced spectroscopy package, remember that rank does not confer omniscience, and the junior staff often have valid and useful points to make.
3. Remember those whom the laboratory exists to serve! This may not sound like a GLP issue, but if you send to Regulatory Affairs information which they need, or choose, to re-manipulate before submitting to licensing authorities, your compliance might be questioned.
4. Last, but by no means least, remember the Quality Assurance Unit. They are, after all, mandated to act as the guardians of GLP.
The last important point to be made about staffing a validation exercise is that it is likely to require significant effort, particularly if the system in question is a large one. To get the job done in a reasonable time it is essential that management make a commitment to provide the necessary resources. The definition of the scope of the system to be validated can be used to estimate the resources required, and such an estimate will make the management commitment easier to obtain.
The assignment of individual people to the validation study will normally result in their allocation to look at specific parts of the system. Once that allocation has been made it is possible for the sub-teams thus formed to identify the detailed yardsticks against which compliance is to be measured. This is important, as it is no good coming up with results like "it looks ok" - after all, you would never do that with a new chemical. The degree of detail required, and the degree of specificity, will vary from case to case, so that it may be appropriate, for example, to have a single, generally agreed yardstick against which to measure the adequacy of a program specification. As a rough rule of thumb, yardsticks for inspections will be more general (if not necessarily less detailed) than those for actual tests. It is, of course, almost certain that the validation exercise will involve both these elements.
All the points I have covered so far in this section are concerned with the planning of the study. The completion of these tasks will put you in a position to actually start to do the work, knowing what to look for, how to find it, and (to a degree) how to assess it. Before beginning the study proper there is one other job I would recommend you to do, which is to devise some method of quantifying the results. Conformity to GLP requirements is rarely a black and white issue, and an agreed means of measuring the shades of grey can bring additional benefits, as I shall shortly seek to show.
I propose to say very little about the 'doing' part of the validation exercise. This is not to say that it is unimportant, but it is essentially a question of putting into effect what has already been planned. One point I would make is this: keep good records of the observations made and the test results obtained. It is unlikely that this 'raw data' will be subject to retrospective scrutiny in the way your laboratory data might, but it will certainly be used, and it might also serve to impress any passing inspector.
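The paper does not prescribe any particular scoring scheme. Purely as a hedged illustration of how such 'shades of grey' might be quantified, the short C sketch below computes a weighted compliance score for a handful of inspection items; the item names, scores and weights are hypothetical and are not taken from the paper.

    #include <stdio.h>

    /* Hypothetical yardstick: each inspection item is scored 0-4
     * (0 = no compliance, 4 = full compliance) and carries a weight
     * reflecting its agreed importance.                              */
    struct finding {
        const char *item;   /* what was inspected        */
        int         score;  /* assessor's score, 0..4    */
        int         weight; /* agreed importance, 1..5   */
    };

    int main(void)
    {
        /* Example results - entirely illustrative. */
        struct finding results[] = {
            { "Program specification vs. agreed yardstick", 3, 5 },
            { "Audit trail on raw-data tables",             4, 5 },
            { "Change-management records",                  2, 3 },
            { "Test documentation for statistics module",   3, 4 },
        };
        int n = sizeof results / sizeof results[0];
        int got = 0, max = 0, i;

        for (i = 0; i < n; i++) {
            got += results[i].score * results[i].weight;
            max += 4 * results[i].weight;
            printf("%-45s %d/4\n", results[i].item, results[i].score);
        }
        printf("Overall compliance: %.0f%% of maximum\n", 100.0 * got / max);
        return 0;
    }

Any comparable scheme agreed in advance by the validation team would serve the same purpose; the point is simply that the numbers make the results comparable between sub-teams and between successive validations.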
5. Reaping the benefits
So far this paper has contained a large proportion of doom and gloom, lightened only by the possibility of keeping the regulators happy and, maybe, even staying in business. Clearly these benefits are not sufficient, or the subject of validation would not have caused the worry that it so obviously has. So where are the other benefits?
If we choose to begin with the technical benefits, the first must be the discipline it imposes on the developers. Any developers who recognise that the system is going to be validated, and properly validated at that, would be fools not to take steps to ensure that their work will meet the challenge. The results of this approach must have many tangible benefits for the laboratory and its staff:
1. The rate of system failure will be lower than might otherwise have been the case, which leads to reduced data loss. The savings to be achieved here will vary greatly depending on the way the system operates. At the very least there will be less need to re-enter data lost or corrupted by system failure, and hence less scope to query the validity of such 'raw' data. In more advanced systems, where data is being acquired directly from instruments, there will be savings to be made from reduced re-analysis and the re-preparation of samples, and less danger of total data loss when re-analysis is impossible for whatever reason. Lastly, of course, there are many instances in which speed of analysis and reporting is crucial, for example in flow manufacturing environments, where total system loss is wholly unacceptable.
2. Another technical benefit arises from the fact that the system is likely to have been better built if it is known that detailed validation is to follow. Such systems are generally much easier to modify when the need arises. This means that not only can any failures be more swiftly remedied, but the changes that become necessary during the life of a system can be more quickly, and hence more cheaply, applied. The bottom line is faster response to change at reduced costs.
3. Talk of changes to systems raises the question of re-validation, which is a requirement after any change to hardware or software, as well as periodically if there is a long period during which the system is static. The existence of a detailed, segmented validation protocol makes it much easier to re-validate selected parts of the system and to evaluate the results obtained; a further benefit to users anxiously awaiting a new or upgraded facility. This is a specific instance of the actual use of the raw data accumulated during the initial validation of the system.
4. Another use of this raw data comes from the fact that no system is ever going to be perfect. In quantifying the conformity of parts of the system to the agreed yardsticks there will always be cases where a better result might have been hoped for. Not only will the quantification of these instances allow their correction or enhancement to be prioritised, but the raw data will be invaluable to the developers in actually doing something about the problems which have been identified.
5. The benefit to the people involved in the validation exercise cannot be over-emphasised. Not only will they become familiar in detail with 'their' part of the system, but they will develop confidence in it. This will not only make them champions of the system among their colleagues, but it will make them ideally placed to act as centres of expertise, and to assist in the training of those around them. This enhancement of knowledge and confidence will, in itself, make the system more effective by virtue of the fact that people will use it more willingly and with a lower incidence of error.
6. All of these points lead to the fact that the system will produce more reliable and consistent data, and hence better and more timely information. Aside from the fact that this will improve the reputation and influence of the laboratory, it is likely that wider and different uses will be found for the information produced, thereby beginning an upward spiral which can only benefit all concerned.
6. Summary and conclusions
In the course of this paper I have sought to show that the pressures of regulation and the commercial need for laboratory accreditation are increasing. There seems little doubt that this will continue to be the case and that science-based industries will need to conform to survive.
It is clear that the validation of laboratory computer systems is fundamental to meeting the requirements of both regulatory and accreditation authorities. Although the definition of validation adopted for official purposes contains few specifics, I believe it can be boiled down to the basic questions of "does it operate as GLP would require?" and "can it be maintained and enhanced without prejudice to its satisfactory operation?". Based on that premise I have tried to show that the way to complete validation thoroughly in a reasonable time is to define properly the system to be considered, identify the skills required for the validation, define the tests and inspections and the yardsticks against which they are to be assessed, and quantify the results. Finally, I hope I have shown that the benefits to be accrued from this exercise go far beyond merely keeping the authorities happy; they are fundamental to the success of the system, the laboratory it serves and the business served, in turn, by the laboratory.
CHAPTER 31
Developing a Data System for the Regulated Laboratory
P.W. Yendle, K.P. Smith, J.M.T. Farrie, and B.J. Last
VG Data Systems, Tudor Road, Hanover Business Park, Altrincham, Cheshire, WA14 5RZ, UK
1. Introduction
The ability to demonstrate Good Laboratory Practice (GLP) and high standards of quality control to regulatory authorities is increasingly a requirement for many analytical laboratories [1]. Since computer data systems form an integral part of the modern laboratory, validation of such systems can form a major part of the demonstration of GLP, and it is essential that such systems do not compromise the standards of quality in use in the laboratory [2, 3].
The regulatory bodies place the responsibility for validation of computer systems with the laboratory. There is, however, a great deal that the vendors of such systems can do to assist in this process, and users of laboratory data systems are increasingly demanding evidence of high quality control during software development, and support for their validation procedures.
This paper describes the adoption of a software development environment suitable for the production of software for use in the regulated laboratory, with examples taken from the development of the XChrom chromatography data system. The software development life cycle (SDLC) adopted is illustrated, and the implications of the requirements of validation for both vendor and user are discussed.
2. Validation and verification
Although often taken as synonymous in everyday use, the terms validation and verification have well-defined meanings in the context of software development [4]:
Validation: The process of evaluating software at the end of the software development process to ensure compliance with software requirements.
Verification:
1. The process of determining whether or not the products of a given phase of the software development cycle fulfill the requirements established during the previous phase.
2. Formal proof of program correctness.
3. The act of reviewing, inspecting, testing, checking, auditing, or otherwise establishing whether or not items, processes, services or documents conform to specified requirements.
From these definitions it can be seen that it is possible for a user to validate a system (in the strictest sense of the word) with no input from the supplier of the system, since validation is a "black-box" technique performed on the finished product. Indeed, it can be argued that validation can only be performed by the user, since only the user knows the requirements of the system in their particular environment. The supplier can, however, simplify the user's task by providing skeleton validation protocols on which the user can base their own validation procedures, and by providing example data sets and expected results from these for testing.
Although validation of a system in the user environment is essential, this black-box approach alone is not sufficient to satisfy the requirements of many regulatory bodies, which require evidence that a system is both validated and verified. Since verification requires both examination of the internals of a data system (sometimes referred to as grey- and white-box testing) and demonstration of quality control during development of the system, this can only be achieved with the full co-operation of the software developer. Furthermore, the system can only be truly verified if the complete software development life cycle (SDLC) is designed with verification in mind.
3. The software development life cycle
The SDLC is the complete process of software development, from conception of an application through release of the product to its eventual retirement. Although the exact SDLC adopted will vary among projects (the SDLC adopted for XChrom development is represented schematically in Fig. 1), the following stages can usually be identified:
Analysis
Specification
Design
Implementation
Testing
Release
Support
Each of these stages is discussed in detail below.
Figure 1. Schematic of the software development life cycle (SDLC) adopted for XChrom development.
3.1 Analysis
The analysis stage of the SDLC involves identification of the scope of an application, and will itself consist of several stages. For the developer of commercial data systems, a common stage of analysis will be identification of market niches that are vacant. This will typically involve reviewing the existing products of both the developer and competitors, and will often be combined with a review of the state of the art of both hardware and software.
As an example, conception of the XChrom data system was prompted by several major trends in the development of both hardware and software. The intention was to utilise the availability of high performance graphics workstations with powerful networking and distributed processing facilities (which had previously been beyond the budget of the average laboratory), and to follow the industry trend to open systems by use of the X Window System [5] and relational database management systems supporting Structured Query Language (SQL) [6].
Having identified an application in concept, its feasibility on both technical and marketing grounds must be checked. By documenting the concept and having it reviewed by technical staff, sales and support staff, existing customers and potential users, the initial direction that development should take (and indeed whether development should proceed at all) can be identified. It has been shown [7] that the costs of identification of errors (including incorrect analysis of a system) escalate rapidly as the SDLC proceeds, and so the aim of the analysis stage is to identify any erroneous concepts at the earliest possible instant.
3.2 Specification
The specification stage of the SDLC involves a formal summary of the analysis stage, and a detailed description of the requirements of the system from the point of view of the user. This description should be formally documented in the Software Requirements Specification (SRS). A major part of validation of the system will involve demonstrating that the requirements in the SRS have been met by the system, and so the specification stage of the SDLC should also produce the System Validation Plan (SVP), which formally sets out how compliance with requirements will be tested.
To identify errors in specification at the earliest possible stage, both the SRS and SVP should be thoroughly and formally reviewed. Although this review process may be performed internally by the developer, it makes sense to involve users in this process, since both SRS and SVP should be expressed from the user's point of view. This may pose logistic problems when (as in the case of XChrom) the user community is widely distributed geographically. The approach adopted in XChrom development to overcome these problems
has been to conduct external informal review prior to formal internal review, with the views of the external reviewers being collated and expressed by a representative at the internal review.
3.3 Design
The design stage is the first stage in converting the requirements of the users (identified at the specification stage) to computer code, although little or no code will be produced in this stage (the possible exception being for prototyping). Of the numerous decisions to be made at the design stage, perhaps the most important is how the design itself will proceed. There are various "methodologies" for formalising the design process, often embodied in Computer Aided Software Engineering (CASE) tools to assist the developer. For a system such as XChrom, which combines elements of real-time software engineering, object-oriented programming and relational database design, no single methodology fulfills all design requirements, but similar design processes can be applied to all elements of the system.
System level design involves identification of the functional modules into which the system can be divided. Within these modules the various layers of the software need to be identified, and then the contents of the layers and the interfaces between them can be designed. As in the case of analysis and specification, all stages of design need to be documented and reviewed, to ensure the earliest possible identification of potential errors in the system.
3.4 Implementation
The implementation stage of the SDLC involves committing the detailed design to computer code, using the programming language(s) specified in the design stage. In order to check the quality of coding, it is necessary to specify coding standards against which any piece of code can be verified. In the case of XChrom these standards cover topics ranging from preferred layout style (e.g., standard module and function headers) and external standards to adhere to (e.g., ANSI C) to allowed data types and mechanisms for error handling. Development and enforcement of such standards requires tactful planning, since there is the risk of offending professional pride, for which software engineers are notorious [8]. One method for minimising problems here is to make code reviews a peer-group activity to reduce friction, although ultimately the outcome of peer-group review will of course itself need to be reviewed by project managers.
One aim of coding standards should be to promote "robust" or "defensive" programming [9], in which the programmer attempts to cater for all possible eventualities in a
piece of code, even those that are considered extremely unlikely. In the development of XChrom, protocols have been designed to encourage defensive programming, including mechanisms for internal status checking, parameter checking and comprehensive internal memory management (a minimal sketch of this style is given at the end of this section).
In addition to code reviews to ensure compliance with coding standards, source code can be subjected to automated analysis using CASE tools. VAX Source Code Analyzer (Digital Equipment Corporation) and the Unix utility lint have been used on the XChrom project for this purpose. Source code can be further checked by ensuring that (where appropriate) it will compile and execute on a range of hardware platforms using a range of compilers. When possible, code for XChrom is compiled and checked on Digital Equipment Corporation VAX and RISC hardware (running VMS and Ultrix respectively), on Hewlett-Packard HP-9000 series hardware (running HP-UX), and with a range of compilers on personal computers running MS-DOS.
In a multi-platform development, it is essential to employ both code and module management systems to ensure consistent versions of software across all platforms. For XChrom a central code and module management system has been established using VAX Code Management System and VAX Module Management System (Digital Equipment Corporation). All revisions to code must be checked through this central system, which allows the nature, date and author(s) of all modifications to be tracked.
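The XChrom protocols themselves are not reproduced in the paper. As a minimal, hypothetical sketch of the defensive style described above (explicit status codes, parameter checking before any use of the arguments), a routine written to such a standard might look as follows; all names and types are illustrative only, not taken from XChrom.

    #include <stddef.h>

    /* Hypothetical status codes returned by every routine. */
    typedef enum { ST_OK, ST_NULL_ARG, ST_BAD_RANGE } status_t;

    /* Compute the mean of a slice of data points defensively:
     * every precondition is checked and reported, even those
     * "that are considered extremely unlikely".               */
    status_t slice_mean(const double *points, size_t n,
                        size_t first, size_t last, double *mean)
    {
        size_t i;
        double sum = 0.0;

        if (points == NULL || mean == NULL) return ST_NULL_ARG;
        if (n == 0 || first > last || last >= n) return ST_BAD_RANGE;

        for (i = first; i <= last; i++)
            sum += points[i];
        *mean = sum / (double)(last - first + 1);
        return ST_OK;
    }

The design choice is simply that no routine assumes its caller is correct: bad input is reported through a status code rather than allowed to corrupt data silently.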
3.5 Testing
Although represented here as a single stage of the SDLC, testing actually comprises several stages. The lowest level of testing (sometimes referred to as "white-box" testing) is performed during implementation, and is encouraged by defensive programming, and enforced by source code analysis and examination of code using interactive debugging tools.
Once a particular module of a system has been implemented (and white-box tested) a test harness can be constructed for the module. This consists of a piece of code that exercises all of the functionality provided by that module, without the module being inserted into the system itself. This level of testing is sometimes described as "grey-box" testing.
Once all of the modules in a system have been grey-box tested, integration of these modules can begin. As individual modules are integrated together, the composite modules can themselves be grey-box tested. During this integration testing it is desirable to ascertain whether program flow is passing between modules as predicted by the system design, and this can be achieved using CASE tools for profiling or coverage analysis. VAX Performance and Coverage Analyzer has been used during XChrom development for this purpose.
Once module integration and integration testing is complete, the system as a whole can be validated against the SRS, using the SVP. Once this procedure has been performed
once, it can be simplified by the use of regression testing, i.e., the results of one test can be compared to those of previous identical tests to ensure validity, without demonstrating validity from first principles. This testing can be greatly simplified by the use of CASE tools for regression testing (such as VAX Test Manager; Digital Equipment Corporation), although during XChrom development it has not yet been possible to find an automated test manager that will adequately test all aspects of a highly interactive graphical application.
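As an illustration of the grey-box and regression testing just described, a minimal test harness for the hypothetical slice_mean() routine of the previous sketch might exercise the module outside the full system and compare its results with stored expectations, so that later runs become simple regression checks. Everything here is illustrative rather than XChrom code.

    #include <math.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Declarations from the previous sketch (the module under test). */
    typedef enum { ST_OK, ST_NULL_ARG, ST_BAD_RANGE } status_t;
    status_t slice_mean(const double *points, size_t n,
                        size_t first, size_t last, double *mean);

    int main(void)
    {
        double data[] = { 1.0, 2.0, 3.0, 4.0 };
        struct { size_t first, last; double expect; } cases[] = {
            { 0, 3, 2.5 },   /* whole slice   */
            { 1, 2, 2.5 },   /* interior pair */
            { 3, 3, 4.0 },   /* single point  */
        };
        int i, failures = 0;

        for (i = 0; i < 3; i++) {
            double m;
            if (slice_mean(data, 4, cases[i].first, cases[i].last, &m) != ST_OK
                || fabs(m - cases[i].expect) > 1e-9) {
                printf("case %d FAILED\n", i);
                failures++;
            }
        }
        printf("%d of 3 cases failed\n", failures);
        return failures;
    }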
3.6 Release
Although regression testing is the simplest method of system testing during development, the system should be completely validated using the SVP at least once prior to release to users. Before general release to all users the system needs to undergo "live" testing in a laboratory situation. It is not, of course, possible to perform such testing in a regulated environment, but given the current concerns over data system validation, we have found that many regulated users will assign resources to live testing of new releases of software. It is essential that the developer promotes a strong working relationship with such users.
Release of a new version of a system may require new user documentation (manuals, etc.). Although not discussed here, development of user documentation should parallel development of the software. Documentation should pass through its own specification (based on the SRS), design, implementation and testing phases, with appropriate review at each stage. In our experience, review and validation of documentation is more troublesome than the equivalent validation of software, since the evaluation of documentation relies on subjective measures, such as style and other aesthetic concerns, as much as on objective measures such as factual accuracy.
In addition to user documentation, a new release of a system will require an installation protocol and a validation protocol. The installation protocol should be supported by appropriate documentation and software to allow the user to install and configure the system to the needs of their laboratory. The validation protocol should consist of documentation and test data to allow the user to perform basic system validation, which can be augmented by the user's own validation and regression testing. In addition, details of the verification and validation procedures employed during development should be made available to those users that require them.
3.7 Support
Once the system has been released, the supplier can support the regulated user in a number of ways. Training of users of the system will promote correct use of the system and explain modes of operation, simplifying the design of standard operating procedures by the users. This can be augmented by technical support (both remote and on-site) to both
assist in routine use of the system and to monitor feedback by users of the system. Support for a regulated user should provide a formal system for monitoring and recording user feedback, and should provide a formal escalation mechanism in the event that the user is not satisfied with the technical support provided.
Specific support for user validation can be provided in a number of ways. In addition to providing a basic validation protocol at release to all users that require it, the supplier can collate information and results from users' own validation procedures. This information should be regularly distributed to users (in the case of XChrom this is achieved through a user newsletter and user group meetings). In addition, where appropriate, support can be provided for hardware that requires routine validation (in the case of XChrom, annual revalidation of the VG Chromatography Server acquisition device may be performed either by VG or by the user).
4. Implications for the software developer
Production of a data system for the regulated laboratory requires development of a system that can be both validated and verified. We have seen that this requires development and adoption of an SDLC which promotes software verification, and this has both advantages and disadvantages for the software developer.
4.1 Advantages
The principal advantage of developing a system in this way is that it aims to provide the user with the system they want, which should lead to satisfied users and (hopefully) an expanding user base. In addition, adoption of the latest standards and techniques of software engineering should promote feelings of professional pride (identified as a possible problem in Section 3.4) in software engineers, and therefore produce a more satisfied and motivated development staff.
The verified software produced should be of a higher quality, and this should lead to reduced maintenance effort on the part of the developer. The effort liberated in this way may be used to offset the extra effort required in adopting a verifiable SDLC, or may be channeled into development of new products. The reduction in maintenance, coupled with the tight definition of each stage of the development process, should simplify project management, and enable the developer to make the most efficient use of the development team.
4.2 Disadvantages The principal disadvantage to the developer is the increased development effort required in the adoption of a verifiable SDLC. Since the distribution of effort throughout the
SDLC is not uniform (more staff are required during implementation than specification, for instance), this increase in effort cannot be met simply by recruitment of more staff, but must also result in an increase in the length of the development cycle. Both factors increase the cost of development, as does the need to purchase the CASE tools necessary for verified development.
In addition to evidence of validation and verification, regulatory bodies may also require access to commercially sensitive material such as proprietary algorithms and source code. The only circumstances under which access to such material can be justified are in the event that a developer refuses to verify software (unlikely given the current concerns of regulated laboratories) or ceases support for a system. In the case of XChrom, the latter possibility has been covered by provision of an escrow facility. The source code and development documentation are lodged with an independent third party, and a legal agreement is drawn up which allows the user access to these in the event that the developer ceases support of the system. In our experience the provision of an escrow facility satisfies the requirement of regulatory bodies for access to proprietary information, while restricting access to this information to essential cases only.
5. Implications for the laboratory Although development of a verified system is intended to meet the requirements of the regulated laboratory, there are both advantages and disadvantages to the users of such systems.
5.1 Advantages
The principal advantage is of course that demonstrating that the system meets the requirements of regulatory authorities should be greatly simplified. The effort that the user will need to dedicate to this activity should be reduced, and the system validation procedure should take the shortest possible time.
The system should also be more reliable, reducing any down-time or effort spent on trouble-shooting, and generally increasing user confidence in the system. This should in turn promote use and understanding of the system, which will hopefully increase the efficiency of the laboratory.
5.2 Disadvantages
The principal disadvantages to the user will be a reflection of the increased effort committed by the developer, such as an increased cost of the system. In addition, the increased length of the development cycle will mean that any requested changes to the system will take longer, and so the response time for verified user-requested modifications will increase.
There may also be a disadvantage in the performance of the initial releases of the system. Since the product will be extensively modularised and layered, there will necessarily be more code to execute than in a system which has not been designed with verification in mind. Furthermore, initial versions will contain a high proportion of internal test code (see Section 3.4 above) which has to be executed and therefore reduces the performance of the system. Once the system has been extensively tested in a "live" situation, many of these internal tests can be identified as redundant, and removed from the system. System performance may therefore be expected to increase with successive releases.
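The mechanism for removing redundant internal tests is not described in the paper. One common approach, shown in the hedged C sketch below (not taken from XChrom), is to guard the internal checks with a compile-time switch, so that early "live-test" builds carry the checks and later releases can be built without them at no run-time cost.

    #include <stdio.h>
    #include <stdlib.h>

    /* Internal consistency checks are compiled in for early releases and
     * compiled out once they are shown to be redundant, by defining
     * NO_INTERNAL_CHECKS at build time.                                  */
    #ifdef NO_INTERNAL_CHECKS
    #define INTERNAL_CHECK(cond, msg) ((void)0)
    #else
    #define INTERNAL_CHECK(cond, msg) \
        do { if (!(cond)) { fprintf(stderr, "internal error: %s\n", msg); \
                            abort(); } } while (0)
    #endif

    /* Hypothetical routine with an internal check on its input. */
    double baseline_correct(double raw, double baseline)
    {
        INTERNAL_CHECK(baseline >= 0.0, "negative baseline");
        return raw - baseline;
    }

    int main(void)
    {
        printf("%.1f\n", baseline_correct(10.0, 2.0));  /* prints 8.0 */
        return 0;
    }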
6. Conclusions
Validation of computer systems in the laboratory may be performed by the user without support from the system developer. Such validation will, however, need to be augmented by system verification in order to satisfy the requirements of most regulatory authorities. This verification can only be performed with the full support of the supplier, as it needs to be designed into the system, and incorporated into all stages of the software development life cycle (SDLC).
Development and adoption of an SDLC which allows system verification requires commitment of increased resources by the developer. In the short term this may present problems for both developer and user, but these minor inconveniences should be accepted, since in the long term the development of a verifiable data system is to the mutual advantage of both parties.
Acknowledgements
The authors would like to thank David Giles (Logica Communications and Electronic Systems Ltd, Stockport, England) for information and discussion on the adoption of a verifiable SDLC, for designing the SDLC shown in Figure 1, and for producing that figure.
References
1. Good Laboratory Practice: The United Kingdom Compliance Program. Department of Health, London, 1989.
2. Good Laboratory Practice: The Application of GLP Principles to Computer Systems. Department of Health, London, 1989.
3. Computerized Data Systems for Nonclinical Safety Assessment. Drug Information Association, Maple Glen PA, 1988.
4. IEEE Standard Glossary of Software Engineering Terminology. ANSI/IEEE Std 729-1983. IEEE Inc, New York, 1983.
5. Scheifler R, Newman R. X Window System Protocol, Version 11. Massachusetts Institute of Technology, 1985.
6. Date CJ. A Guide to the SQL Standard. Addison-Wesley, Reading MA, 1987.
7. Grady RB, Caswell DL. Software Metrics: Establishing a Company-Wide Program. Prentice-Hall, New Jersey, 1987.
8. Weinberg GM. The Psychology of Computer Programming. Van Nostrand Reinhold, 1971.
9. Sommerville I. Software Engineering. Addison-Wesley, Wokingham, 1989.
Standards Activities
CHAPTER 32
Standards in Health Care Informatics, a European AIM
J. Noothoven van Goor
Commission of the European Communities, DG XIII-F/AIM, Brussels, Belgium
Abstract
To create a common market in Europe it is not sufficient to support common understanding and cooperation in research and basic technological developments. Harmonized applications and common solutions for practical problems are of equal priority. The Commission of the European Communities thus undertook a number of programmes with the aim of advancing informatics and telecommunications in important application fields. One of these programmes is concerned with medicine and health care. It is called AIM, Advanced Informatics in Medicine.
Compared to other fields, the introduction of informatics in health care is a late and slow process. It is often said that the problem of medical informatics is the medical information itself. This information is potentially as complicated as life itself. Based on the universal one-patient-one-physician relation, the organization of health care is shallow and extended, while the information is extremely diversified. Moreover, the nature of the information may be both of vital importance and of private interest. Therefore, both security and privacy should be guaranteed to the highest degree.
On the other hand, the technologies of informatics and telecommunications can support medicine and health care only when a certain level of standardization is accomplished. What is more, standardization is also a condition for industrial development of applications.
The current AIM action supports 42 projects. Five of these aim directly at making proposals for standards, while most of the others have standardization as their second or third objective. From the other side, the Commission has mandated CEN/CENELEC to take standardization action in the field of medical informatics. EWOS, the European Workshop for Open Systems, will play an intermediate role. Finally, EFMI, the European Federation for Medical Informatics, founded a Working Group on the subject.
Figure 1. Computers entered the hospital in two corners: CAE, computer aided equipment; CAA, computer aided administration; POS, patient oriented systems; T, time; UPK, use of pertinent knowledge.
1. Introduction
From a historical point of view computer systems entered the hospitals in two corners (see Fig. 1). First, as a device supporting the performance of medical equipment, and secondly, in about the same period, as an automatic card tray in the accounting department supporting simple office duties.
Soon after that, developments started of medical equipment with a computer as an essential part. For computer tomography, digital angiography, ultrasonic imaging, nuclear medicine, magnetic resonance, and analyzers, dedicated computers are indispensable. On the other end, the automatic card tray developed into billing systems, accounting systems, and departmental management systems.
The overall situation is still characterized by stand-alone systems which are not connected to each other, and which conceivably even could not be. This arrangement reflects that of relations in health care in general. Of old, health care is characterized by small operational units, final responsibility at the base, individual patients, no authority structures; in short, a great multiple of the basic and primary one-patient-one-doctor relation.
Historically this social situation led to a significant characteristic of medical information as a derivative of medical language. For long, the use of medical information as expressed in medical language was practically limited to the primary one-patient-one-doctor relation. Neither need nor possibility existed for a wide dissemination, and therefore medical language remained individual, diversified and not generally defined. What is more,
medical language is usually about exceptions, and in some cases could only express the inherent uncertainties of the medical issue.
In the past decade technical means to integrate and to communicate became available on a large scale. The diversification of the medical information, however, forms the main obstacle to using these means in health care. At a Conference on Scientific Computing and Automation it should be emphasized that the highly sophisticated features in operation or in development in current computer systems are scarcely of any use in the administration oriented systems for health care. The problem of medical informatics is the character of the medical information itself.
However, not only communication over distances should be considered. The current generation of systems can accommodate knowledge bases and make accumulation, and therefore communication of information over time durations, a useful perspective. The combination of both communication modes - over distance and over time duration - would create possibilities of systems in health care that are more oriented towards the patient.
Indeed, originating from the isolated corner of a single medical apparatus, computers now serve IMACS, image archiving and communicating systems. These systems are coupled to a number of medical devices for receiving the image data, receive alphanumeric patient data from the administration systems, and should transmit the information to other places. The orientation of these systems is towards the patient. At the other corner, the computer that originally replaced the card tray in the accounting department will gradually become a knowledge based system that supports the administrative staff in planning and accounting as well as the medical staff in deciding on diagnosis and treatment. Also here an orientation towards the patient will occur. Eventually, any overall system or combination of systems will combine characteristics of the three corners and handle data from the three sources: the test results, the administrative data, and the data concerning clinical judgements and the like.
A harmonization of semantics and syntaxes might mean an effort for the individual physician. However, it is a condition for extended knowledge bases, it facilitates epidemiology, it will support professional education, it will be at the foundation of advanced health care policy, and it will be given an enthusiastic welcome by third party payers.
2. The AIM programme
By promoting international cooperation in research and development the Commission of the European Communities aims at a number of objectives. Of primary importance are the realization of a common and uniform market and the creation of chances on that market for its own industries.
In the Commission two Directorates General are charged with the task of promoting research and development, viz. DG XII for fundamental research and life sciences, and
DG XIII for informatics and telecommunications. The plans of the Commission are periodically unfolded and updated in the so-called Framework Programme of Research and Development.
The application areas of the informatics and telecommunications technologies are also included in the Framework Programme: for the application in road traffic and transport, the programme DRIVE; and for that in education, the programme DELTA. The AIM Programme, Advanced Informatics in Medicine, aims at the promotion of the applications of these technologies in health care and medicine.
For the AIM Programme - as for the other programmes - an initial Exploratory Phase was defined, and needs were indicated for a subsequent Main Phase and a conclusive Evaluation Phase. As general objectives were taken: the improvement of the efficacy of health care; the reinforcement of the position of the European Community in the field of medical, biological and health care informatics; and the realization of a favourable climate for a fast implementation and a proper application of informatics in health care. Furthermore, it was considered that the costs of medical care are high and still rising, and that the applications of informatics and telecommunications form an ideal opportunity to improve the quality, the accessibility, the efficacy and the cost-effectiveness of this care.
By broad consensus a Workplan was defined in which three Action Lines are drawn. The first pertains to the development of a common conceptual framework for cooperation, the second is composed of five more technical chapters, which will be described below, and the third mentions the non-technological factors. The chapters of the second Action Line are: the medical informatics climate; data structures and medical records; communications and functional integration; the integration of knowledge based systems in health care; and advanced instrumentation and services for health care and medical research.
The main activity of the AIM Programme is to subsidize projects that should fulfil the tasks, or parts of the tasks, as described in the Workplan. In this "cost-shared" model the Commission pays half the costs of the projects. These projects should be undertaken by international consortia formed for the purpose. At least one of the partners of a consortium should be a commercial enterprise in one of the member states, and at least a second partner should come from another member state. A further requirement was that at least one of the partners was either an institute or a company profoundly concerned with medicine or health care. Furthermore, the consortia could have partners from one or more EFTA countries (European Free Trade Association; the countries are Austria, Switzerland, Iceland, Norway, Sweden, and Finland). The costs EFTA partners make are not refunded by the Commission.
The desired intensification of the use of advanced IT services in health care could hardly be conceived without a harmonization of protocols, syntaxes, and semantics. The development of common standards is one of the main requirements for a common market for IT systems, and therefore a primary mission of the AIM Programme.
Most of the 42 projects in the current AIM phase have the development of standards in their particular fields of interest as an important objective. Five projects even undertook to formulate complete sets of proposals for standards.
3. Standardization
In Europe the national institutes for technical normalization founded a platform for cooperation called CEN, Comité Européen de Normalisation. It applies both the formal and the practical procedures to obtain European standards. Within CEN a great number of TCs, Technical Committees, are charged with the direct responsibility for standardization in the various areas. The members of a TC are delegated by the national institutes. Observers from international organizations concerned with the specific field may participate in the meetings. Institutes analogous to CEN are CENELEC for electrotechnical products, and ETSI for telecommunications. To confer on demarcations between their operating areas they formed a joint Information Technology Steering Committee, ITSTC.
Recently EWOS, the European Workshop for Open Systems, was founded. By organizing rather informal workshops and forming small project teams EWOS intends to achieve quick results. On the one hand these are practical recommendations for standards which are fed into the formal CEN procedures, and on the other hand EWOS supports the actual implementation of standards. Any institution can become a member of EWOS.
In 1989 the Commission of the European Communities, DG XIII-E, issued a mandate to the standardization institutes to explore the general aspects of standardization in medical informatics in order to define the requirements in this area. The ITSTC decided that CEN should coordinate and be primarily responsible for the total work, and that EWOS should undertake a part of it, namely the standardization of the transaction of medical data.
CEN formed a Technical Committee for Medical Informatics to make a first classification of the area. This TC 251 had a first meeting in June 1990. Apart from the delegates of many countries, representatives of the AIM office, of EFMI (see below), and of COCIR (the committee of manufacturers of radiological equipment) participated. From many nominees the TC chose a Project Team, PT 001, to do the actual proposing work. A first report is to be expected soon. Already in March 1990, thanks to their rapid procedures, EWOS formed a Project Team, PT 007, of six experts. They directly started to work, and a first draft of their report has been issued.
4. EFMI, European Federation for Medical Informatics
Generally, participation in the discussions about standards for technical products is limited to a relatively small number of manufacturers. To identify a set of suppliers in the
vast area of health care informatics is difficult, however. The usual actors form a varied group which ranges from computer researchers to medical practitioners, and from salesmen of systems to project managers. They discuss the desired standards, and their representatives form the CEN TC 251.
In 1975 the national societies for medical informatics founded EFMI, the European Federation for Medical Informatics. Its main activity is the organization of a yearly congress, MIE, Medical Informatics Europe. In 1988 EFMI started a Working Group on Standardization in Health Care Informatics. So far the Working Group has held five meetings. The participants, either as contact persons of the national societies or as individual experts, exchange views and inform each other of standardization developments and projects.
The EFMI Working Group proved to be an ideal instrument to identify in the various countries those experts that are interested in standardization, and to get them enthusiastic about international cooperation. In this way, and via relations in the AIM projects, it was possible to activate many national circles to participate in the CEN meetings.
In the world of standardization, cooperation is a first requirement. It is considered beneficial when the same experts are members of the necessarily great number of committees. From the outside, the persons in the consortia of the AIM projects, the members of the EFMI Working Group, and the delegates and representatives in the CEN and EWOS committees may seem to form an inextricable entwining and a closed shop. However, they will welcome anybody who, like themselves, is willing to make an effort towards standardization.
5. Conclusions
Standardization is a condition for the wide-scale use of health care and medical informatics and for the creation of a common market. In the last two years three important categories, namely the Commission of the European Communities with their programmes and their mandates, the medical informaticians via their European professional federation, and the national normalization institutes through their European committee, have shown themselves to be aware of this problem and have taken action. As a result, a number of AIM projects, the CEC mandates to CEN and EWOS, the EFMI Working Group on standardization, the Technical Committee of CEN, and the Project Teams of CEN and EWOS are working on the subject. Because of personal unions and good mutual relations an excellent cooperation is achieved.
CHAPTER 33
EUCLIDES, a European Standard for Clinical Laboratory Data Exchange between Independent Medical Information Systems
C. Sevens1, G. De Moor2, and C. Vandewalle1
1Department of Clinical Chemistry, Vrije Universiteit Brussel, Brussel, and 2Medical Informatics Department, Rijksuniversiteit Gent, Gent, Belgium
Clinical laboratory medicine has developed in the past two decades into a major medical specialty. Its primary goal being informative, it may actually miss its point because of the inability to transfer timely and correctly an ever increasing load of data. Almost all clinical laboratories provide computerized information, but the way they exchange data with the clinicians who request the tests differs considerably.
Clinical laboratories vary in their structure according to their status. They are hospital-integrated, university-associated, or stand-alone, mostly private. In a hospital where various computers are used - administration, laboratories, clinical departments, pharmacy - the interconnection may present real difficulties. In private instances, one is tempted to create unique electronic bonds between users, private practitioners, and the laboratory. This type of connection, which implies compatible software and hardware, is of course vendor oriented, not compatible with other systems, and limits the practitioners' freedom of choice of services. The need for standardized, vendor independent laboratory exchange systems is obvious.
Euclides is a project within the exploratory phase of the AIM programme (Advanced Informatics in Medicine) of the Commission of the European Communities. It was started in 1989 in Belgium within two university centers, the Medical Informatics Department at the Rijksuniversiteit Gent and the Department of Clinical Chemistry at the Vrije Universiteit Brussel. There are in total thirteen partners belonging to seven countries: Belgium, France, Greece, Ireland, Italy, Norway and the UK. G. De Moor is the project leader.
The project is aimed at the production of a standard suitable for use in at least four areas of application:
1. two-way routine transmission of laboratory messages (requests, results) between remote computers in primary care physicians' offices, hospitals and laboratories (hospital, university or private).
2. the transmission of data for external quality assessment programmes.
3. the use of the semantic model to interface analysers with laboratory computers.
4. forwarding of anonymous, aggregated data to public authorities for financial analysis and budgetary control.
In the pilot phase, which should not exceed a duration of sixteen months, three main subprojects are conducted simultaneously.
1. The message handling system (MHS)
Euclides has chosen to use the powerful existing X.400 MHS standard recommended by the CCITT (Comité Consultatif International de Télégraphie et Téléphonie). The standard is based on the International Standards Organisation (ISO) Open Systems Interconnection (OSI) model. Euclides will be implemented on top of the 1984 version. However, the use of the 1988 version, which enhances universality and as such matches the goals of Euclides even better, will be recommended.
The MHS (Fig. 1) can easily be compared to a postal service. A sender utilizes the facility of a user agent (UA) to create envelopes. They contain messages, not to be deciphered by strangers, which should be delivered to another user or receiver. The message transfer system consists of routes created between sorting areas, the message transfer agents (MTA). The 1988 version, not yet fully commercially available, improves the routing and personalizes the service [1].

Figure 1. MHS model (1984).

Euclides' message transfer system put forward in the prototyping phase (Fig. 2) will involve the public domain provided by national administrations as well as the private domain.

Figure 2. EUCLIDES prototype.

The latter can contain MTA's, like a front-end computer to a large clinical laboratory system, as well as local UA's with full X.400 functionality or remote UA's in combination with an MTA from the public or the private domain.
Within this subproject, security and data protection are major issues. Authorization, authentication, encryption of all or part of the messages and the use of check functions to ensure data integrity are all considered in the Euclides standards. The general security service of X.400 will evidently be used. At the users' end, however, Euclides will not attempt to consider security measures such as physical security of the local hardware or access control within the local system software. These are the responsibilities of the users.
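The X.400 structures themselves are defined in the CCITT recommendations and are not reproduced here. The simplified C sketch below merely illustrates the postal analogy used above: a user agent wraps a message in an addressed envelope, and the message transfer agents route on the envelope alone without interpreting the (possibly encrypted) content. All types and names are illustrative, not those of the standard.

    #include <stdio.h>
    #include <stddef.h>
    #include <string.h>

    /* Simplified postal analogy only - not the actual X.400 structures. */
    struct envelope {
        char originator[64];          /* sender's address                 */
        char recipient[64];           /* receiver's address               */
        const unsigned char *content; /* body, possibly encrypted; an MTA */
        size_t content_len;           /* relays it without reading it     */
    };

    /* The user agent (UA) builds the envelope; the message transfer
     * agents (MTA) look only at the envelope fields when routing.   */
    static struct envelope ua_submit(const char *from, const char *to,
                                     const unsigned char *body, size_t len)
    {
        struct envelope e;
        strncpy(e.originator, from, sizeof e.originator - 1);
        e.originator[sizeof e.originator - 1] = '\0';
        strncpy(e.recipient, to, sizeof e.recipient - 1);
        e.recipient[sizeof e.recipient - 1] = '\0';
        e.content = body;
        e.content_len = len;
        return e;
    }

    int main(void)
    {
        const unsigned char msg[] = "request: serum glucose";
        struct envelope e = ua_submit("lab@hospital", "gp@practice",
                                      msg, sizeof msg);
        printf("route %s -> %s (%lu bytes, content not inspected)\n",
               e.originator, e.recipient, (unsigned long)e.content_len);
        return 0;
    }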
2. Syntax
In every language, rules are used in communication exchanges but are not explicitly described at every exchange. The set of rules, or syntax, is part of the hidden knowledge [2]. The purpose of the Euclides syntax message is to mimic reality and avoid the embedded overhead of syntax rules in each exchange.
An object-oriented approach has been adopted. The kernel of the system is the information exchange unit (I.E.U.) (Fig. 3), composed of a header, a body and a trailer, all mandatory features.

Figure 3. EUCLIDES information exchange unit. >: sequence, O: selection, *: iteration.

Although structurally similar, the syntax (system) messages are quite distinct from the data messages. The former contain meta-data, i.e., syntax rules about the objects of the data message. The latter contain the data themselves and their relationships. Each dialogue between sender and receiver within Euclides starts with one or more syntax messages laying the ground rules for the exchange. When data messages are transmitted they follow the syntax rules but do not contain them.
Figure 4 shows the format of a data message set. The label is the unique identifier of the data message, e.g., test request. The body contains the low-level objects which are qualified as mandatory, conditional, optional or prohibited, e.g., patient, analyte, specimen. The object again is composed of attributes which contain the actual values of the message. From the point of view of the clinical pathologist, they are the alpha-numeric values attributed to the result of a chemical test.
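The actual encoding of a data message is governed by the Euclides syntax messages rather than by any fixed record layout. Purely as an illustration of the structure of Figure 4, the hypothetical C declarations below model a data message set as a labelled collection of objects, each carrying its attributes and a presence qualifier; none of the field names comes from the standard itself.

    /* Illustrative model of the Figure 4 data message set; the real
     * encoding is described by Euclides syntax messages, not by fixed
     * C structures.                                                     */
    enum presence { MANDATORY, CONDITIONAL, OPTIONAL, PROHIBITED };

    struct attribute {            /* actual values of the message            */
        const char *name;         /* e.g. "value", "unit"                    */
        const char *value;        /* e.g. "5.2", "mmol/L"                    */
    };

    struct object {               /* low-level object, e.g. patient, analyte */
        const char      *name;
        enum presence    presence;
        struct attribute *attributes;
        int              n_attributes;
    };

    struct data_message_set {
        const char    *label;     /* unique identifier, e.g. "test request"  */
        const char    *version_tag;
        struct object *body;      /* the objects making up the message body  */
        int            n_objects;
    };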
Figure 4. EUCLIDES data message set. >: sequence, O: selection, *: iteration.
3. Semantics
Two problems are tackled in this part of the project: terminology and classification. The existing literature is rather scarce and does not meet the expectations of Euclides [3]. Lists of tests do not address clinical pathology as a whole but limit themselves to one or two subspecialties like clinical chemistry, haematology, immunology, microbiology, toxicology. On the other hand, classification systems of clinical laboratory procedures have been set up with goals like financial purposes or as an aid to medical diagnostics.
The Euclides lists are being set up by compiling existing nomenclatures with the content of laboratory guides. These include tests in all subspecialties and are gathered from representative clinical laboratories from all over Europe. Synonyms and acronyms were
identified, unique codes were given, standard units were considered and basic common rules for a standard nomenclature system were developed. Items and their relationships have been chosen in accordance with the syntax rules and with the aim of being universal. The lists are currently being translated into the languages of the partners in the project and even beyond these borders.
All the features described here will make it possible for any user to send messages in his usual terminology, in his own language. The messages will be transported by the message handling system using the Euclides codes. The Euclides standard resources translate the message into the receiver's own terminology and/or language. The use of different units of measurement and their conversion have also been taken into account. At the time of writing, a minimum basic data set comprising about 1,000 tests (or analytes), 100 units of measurement and 70 types of specimens has been extracted for use in the prototyping phase. It is already available in four languages.
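The conversion tables themselves are not given in the paper. As a small worked example of the kind of unit conversion the standard resources must support, serum glucose reported in mg/dL can be converted to SI units (mmol/L) using the molar mass of glucose (about 180.16 g/mol); the analyte and factor below are illustrative and are not taken from the Euclides data set.

    #include <stdio.h>

    /* mg/dL -> mmol/L for glucose:
     * (mg/dL) * (10 dL/L) / (180.16 mg/mmol) = (mg/dL) / 18.016 */
    double glucose_mgdl_to_mmoll(double mg_per_dl)
    {
        return mg_per_dl / 18.016;
    }

    int main(void)
    {
        /* 90 mg/dL is approximately 5.00 mmol/L */
        printf("90 mg/dL = %.2f mmol/L\n", glucose_mgdl_to_mmoll(90.0));
        return 0;
    }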
4. Current development and conclusion Based on the preliminary research conducted over the past year within the three subprojects described in this paper, the implementation phase has started. A software package is being developed, called the “Euclides bridge” to provide the functions and tables
v COMPRESS/ DECCMPRESS SYSTEM
Figure 5 . EUCLIDES bridge.
EUCLIDES
b
# BRIDGE
ENCRYPT/ DECRYPT SYSTEM
311
necessary to the communication with local systems. The objective is to implement Euclides next to and in communication with existing systems without having to make any major modifications to the local software nor interfere with it. The dialogue between partncrs is initiated in the locally available bridge. The syntax rules are called into action and the mapping of the local file with the Euclides I.E.U. format (dialogue tables) lakes place (Fig. 5). These dialogue tables are used to obtain the values of the local data message. There is one dialogue table for each local system. After the creation of the dialogue table, the local data are translated into the Euclides syntax which is checked for errors. External packages (which are not part of the bridge) are called to compress/decompress, encrypVdecrypt and transmit the message. The Euclides project, after a year of existence, is entering the implementation phase. The work that has been achieved in three areas can be summarized as follows: 1. a standard message handling system X.400 is being used for direct application in a medical domain 2. a flexible object-oriented syntax, applicable to all fields of the clinical laboratory data exchanges has been created; its particular features are to convey metadata and avoid the burden of systematic overhead of syntactic rules 3. multiple lists of clinical laboratory objects have been elaborated; they can match any local files, they make the local data transferable to another party and as such are of universal use.
References 1.
2. 3.
Schicker P. Message Handling Systems, X 400. In: Stefferud E. Jacobsen 0-J. Schicker, Eds. Message Handling Systems and Distributed Applications. Elsevier Science Publishers B .V. (North-Holland)IFIP, 1989: 3-41. Adapted from Collins Dictionary of the English Language, 1985. CPT4, SNOMED. Institut Pasteur, ASTM, ICD9-CM. CAP Chemistry.
This Page Intentionally Left Blank
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
319
0 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 34
Conformance Testing of Graphics Standard Software R.Ziegler Fraunhofer-Arheitsgruppe f u r Graphische Datenverarbeitung (FhG-AGD), WilhelminenstraJe 7,0-6100Darmstadt, FRG
Abstract The widesprcad acceptance of Graphics Standards like GKS, GKS-3D, CGM, and PHIGS as international standards for computer graphics leads to a software market offcring a lot of implcmentations of these standards, even before they become official. This paper dcscribcs conformance testing of GKS and CGI implementations. The first testing service was cstablishcd for GKS. Thc developmcnt of this service was finished 1989. The testing scrvice for CGI is still under dcvclopmcnt and a prototype is now available. The testing serviccs for both standards rely very highly on the visual checks of a human tester. Thus an important considcration is whether thc human judgement of the corrcctncss of picturcs can be rcplaccd, at lcast in part, by automatic processes.
1. Introduction Pcoplc who buy an implemcntation of a graphics standard like GKS, GKS-3D, CGM, and PHIGS rcquire a guarantee of the corrcct functioning and its compliance with the standard. Tcst suitcs that test implcrncntations to detcrmine whether the functions pcrform corrcctly and that the language bindings (and data cncodings) have been implemcnted are needcd. It is dcsirable to have the test suite available at the time the standardization process of a certain graphics standard has finishcd. This means that both, standard and test suite, have to be developed in parallel. Thc foundations for the certification of graphics standard software were laid at a workshop held at Rixensart near Brussels in 1981 [Thom-84]. Experts on graphics, and expcrts on certification of software systems discussed the issue of how to apply certifications to graphics software. In 1985, the first phasc of the conformance testing service programme (CTS1) was launchcd by the Commission of the Europcan Communities. Within CTS 1 thc tcsting scrvicc for the Graphical Kerncl System (GKS) was developed.
350
Two years later the second phase of the CTS-programme (CTS2) started. Within CTS2 a project is now underway to develop testing tools for the emerging Computer Graphics Interface (CGI) standard. Implementations of a certain graphics standard can be tested only by an accredited testing laboratory [IS0-90]. The client of an implcmentation can get the test suite in order to apply a pre-test. This enables him to make changes to the implementation to correct any errors. The formal test of the implementation, which leads to a final test report, is then executed by the testing laboratory. The results of the execution of the tests are listed within this final report in detail. The certification authority tests whether the criteria for issuing a certificate are fulfilled. Testing laboratories for GKS testing are GMD (Germany F.R.), NCC (United Kingdom), AFNOR (France), NIST (United States), and IMQ (Italy). NCC (United Kingdom) and AFNOR (France) are currently discussing the establishing of these laboratories for CGI testing.
2. GKS conformance testing The Graphical Kernel System (GKS) DSO-851 is the first international standard for programming computer graphics applications. GKS defines a standardized interface betwecn application programs and a graphical system. Moreover it is a unified methodology for defining graphical systcms and their concepts. GKS supports graphics output, interactive operator input, picture segmentation and segment manipulation. The graphical standard GKS covers a wide field of 2D-applications in intcractive computer graphics. The first testing service for graphics software was developed for testing implemcntations of GKS. The GKS documcnt is essentially an informal specification written in natural language. Interpreting the specification in terms of a computer program requires human intellectual effort. Errors can occur. Thc testers take the complete GKS implemenLition, subject it to an intensive test suite, and hope to discover errors. The test suite is a sequence of test programs and is based on a simple model which involvcs two interfaces: the application intcrface between GKS and an application program, and the human interface where GKS output is observed and input devices are operated. The testing strategy for GKS involves five distinct tcst series [KiPf-90]: - data consistency, - data structure, - crror handling, - inpul/output, and - metafile. The data consistcncy test series examine the GKS description tables which describe the configuration and certain properties of the GKS implementation. The tests check the
38 1 test o p l d 2 :frameI values in the description tables for consisset polymarker representation tency and conformity with the GKS standard. ix mk,msf,colix bundled individual The data structure test series ensure that the values in the GKS state lists are manipu- 1 5 1.0 I X X lated correctly. This is done by setting, mod5 4 3.3 4 Ll 0 ifying and inquiring the values. The error handling test series produce 10 3 5.5 3 x error situations and then check that an error mechanism in line with the GKS standard is I5 2 7.8 2 supported by the implementation. The input/output test series provide a 20 1 10.0 1 check of the G K S Implementation as a whole. This is done through a comprehen- Figure 1. Polymarker representation test. sive set of tests which cover all the input and output capabilities of GKS. The graphical output is produced by the tests and checked against a set of reference pictures which show the expected appearance of the test output on the display. The Evaluator’s Manual gives a list of items to be checked for each picture. It describes the process of execution for the tests accurately. The input is tested by a set of defined operator actions which should produce specific results on a display. Figure 1 shows the reference picture for the test of polymarker representation. It has to be checked whether the certain representations are drawn with the attributes described under the headings marker type (mkt), marker size factor (msf) and colour index (colix). The ‘bundled’ and ‘individual’ drawn markers have to be identical for each index (ix). The metafile test series check that the GKS metafile is used correctly. Metafiles are created and checkpoints are used to enable visual comparison between screen output and reference pictures. The metafiles are then interpreted and the sessions are interrupted at exactly thc same checkpoints so that the output from metafile interpretation can be checked against the reference pictures. The GKS state list entries from the generating and interpreting sessions are also checked to ensure that they are identical. The test programs in each series are grouped into sets, each of which contains the cumulative test programs for a specific level of GKS. The five test series are testing different areas of the GKS standard. They are of different complexity. Each test series assumes that the previous test series ran without major errors. If the description tables that are examined in the first test series cannot be inquired it would not make sense to run the data structure test series as this test series gets its information about the workstation under test from the workstation description table. Therefore the test programs should be executed in the order listed above. The current GKS test suite only tests GKS implementations with FORTRAN language binding. For the GKS C language binding a pilot
* +
+
382
version is developed. Clients can buy the GKS test tools from the national testing laboratories. These laboratories offer a list of all tested and certificated GKS implementations.
3. CGI conformance testing The Computer Graphics Interface (CGI) [ISO-891 defines the interface from a graphics system to a graphical device. A CGI virtual device may be a hardware device or a soltware implementation. A specific implementation is bound to an environment (like hardware, operating system, control software) and may be influenced by other controlling interfaces in the environment. These dependencies have to be taken into account and increase the complexity of the tests. CGI defines control, output, segment, input, and raster functions. This set of functions covers h e whole GKS functionality (CGI as GKS workstation) and in addition provides raster functionality. It is expected that CGI will bccomc International Standard at the beginning of 1991. A project is now underway to develop testing tools for the emerging Computer Graphics Interface (CGI) standard in parallel to the standardization process. The developmen1 team aims to build on the experience gained in constructing and using the GKS tools.
3.1 Test system structure The definition of the CGI standard covers several parts. Beside the functional description there exist additional standards for data encodings (binary, character, clear text) and language bindings (FORTRAN, Pascal, Ada, C). The CGI testing service aims to build testing tools which cover all these requirements, CGI implementations with different language bindings as well as implementations with different data encodings. The evaluation of the requirements leads to the illustrated CGI test system structure (Fig. 2). The main components are the description of the implementation under test, the test case database and the test suite interpreter (TSI). Valid CGI implementations (which conform to the standard) can differ in the functionality (profiles) and in the capabilities of the virtual device. Therefore the CGI standard defines description tables which describe a certain implementation. The test system component ‘description of the implementation under test’ reflects all description tables defined within the CGI standard. Furthermore certain state list entries which are noted as implementation dependent (e.g., the bundle representations) are included. This description must be set independently from the test. A program called inquiry tool will call the necessary inquiry functions to gain the information. If necessary the entries will be set manually. This effort is needed to have the information available in a file in a unique and well defined format. Dependent on the description of the CGI implementation under test the test cases are selected (‘selection’).
383
I
CGI Implementation Under Test 4
Figure 2. CGI test system structure.
The ‘test case database’ contains all implemented test cases. The structure of the database is a directory structure subdividcd according to the functional parts of the CGI standard (control, output, input, segments, raster). During runtime the complete test set for a specific implementation undcr test will be selected. Each selected test case contains additional documentation (help utility) which describes and documents the test.
384
Example: Following is the help text for Test Case ‘Polymarker Geometry-qpe and Posit ion ’: TARGET: Type and Position of the Polymarker primitive are checked here. DISPLAY This is done by drawing several markers. All markers are centrcd horizontally and their position is marked by annotation lines. They are drawn within one box and the type is described by annotation text. CHECKS: Please check, that 5 markers are visible. The 5 markers should be centred to the position annotated by lines and a surrounding box. The marker typcs should be (from top to bottom) describcd by annotation text: Plus Sign, Star, Circle, X, Dot. The CGI test suite is written in a self-defined C-like test description language (pseudo code). Selected test cases are interpretcd by a test suite interpreter (TSI) which interprets the pseudo code according to the language binding or data encoding of the implementation under test. Thus the tests are portable to different environments. The available prototype can interpret pseudo code to an implementation realized as a procedural C language binding. Until the end of the project (mid of 1991) additionally the TSI will be able to interpret to FORTRAN, binary and character encoding. The standards for the Pascal and Ada language binding and the clear text encoding won’t be considered, because up to now their standardization process has not been started. The selection of the test cases depends on the description of the implementation under test. The ‘test results’ are collected within a file and contain pass/fail answcrs and additional remarks.
3.2 Test picture design The testing strategy is similar to the GKS testing strategy (no metafile tests) but the design of the test software was changed in some major parts [BrRo-89]. The CGI test suite includes automatic testing and visual chccking. Data consistency/structure and error handling tests can be tested automatically, the main test will be done by visual chccking of output by a human tester. The decision of the test suite developers was to cover as many aspects as possible of the CGI implementation under test. But the test software must be designed in a way that the tester won’t become bored or tired. Therefore the requirements for the designer of the required test software were to keep the tests interesting, simple, and uncluttered. Visual cues (self-annotating test pictures) have to be included to aid the judgement of the tester (see Fig. 3). Finally, redundant tests have to be avoided.
385
The specification of the CGI test suite does not rely I on the concept of reference pictures (such as was used in GKS). The wide range of functionality of CGI in addiPlus Sign tion to diverse hardware capabilities makes the selection I of appropriate reference pictures impossible. Test cases Star interpreted by the TSI generate visual output. The human tester has to examine whether the result corresponds to 0 the required behaviour, defined within the standard. Circle The CGI test system satisfies all requirements according to the design and specification criteria. The X ‘polymarker representation’ test within the GKS testing (see Fig. 1) can be applied to CGI testing, too. However, the design must be changed. The describing columns (ix, Dot mkt, msf, colix) will be removed. The test documentation will contain this description (see the previous example). Visual cues will be added to evaluate the correct positioning and sizing. This example illustrates the appli- Figure 3. Test of polymarker. cation of the design criteria for CGI testing. Consequently all test pictures are very simple with self-annotating visual cues. The annotation utilities use the POLYLINE, POLYMARKER and TEXT function and certain attribute functions. A possibility for using cues is boxing i.e., POLYLINE drawn round the primitives to be checked. Each test case is documented explicitly. The human tester can get additional help by this documentation which is available either on a separate screen (on-line documentation) or within a comprehensive manual. The TSI interprets the test case. Furthermore the TSI manages the reporting of the test results and the on-line test documentation (if feasible). Thus a user interface was designed to handle interaction with the tester and to access the test documents, so as to present information to the tester. Finally, information is passed to the automatic report generator, concerning the tester’s assessment of whether a test has passed or failed. As mentioned above the CGI test system is available as a prototype. This first version will be capable to test CGI devices covering the GKS level OA functionality realized as ‘C’ language binding implementations.
-+-
-*-II
-x-
- . -I
3.3 Application of automatic testing The realized CGI test system includes visual checking and automatic testing. The output/input test which is the main part of the test is done by visual checking. Automatic tests are performed for checking whether a certain profile is implemented, whether the inquiry functions deliver correct information and the error handling mechanism behaves
386
correctly. These tests concentrate on checking consistency of description table entries and dcfault/currcnt setting of state list entries, but not on output visible on the screen. The described scheme as used in the GKS and CGI validation service relies strongly on human checking: it is thus subjective and limited by the accuracy of the human eye and brain. An important consideration is whether the human judgement of the correctness of pictures within the CGI test system can be replaced, at least in part, by some automatic processes. The necd is certainly great: CGI defines a much richer set of primitives and atwibutes, compared to GKS, and the demands on the human tester will be severe. Thus we evaluated whether automatic testing of visual output could be applied and integrated within the CGI test system. How can the test program ‘see’ the generated visual output? The first approach [Brod891 we made was to examine the raster image-we call this testing at the raster interface. CGI includes a function (GET PIXEL ARRAY) that returns the colour of each pixel within a specified rectangle. The raster interface is thus a practical point at which to examine graphical data. Our preferred approach is to analyse the raster data generated by the implementation in response to a test program. We define a number of conditions that must be satisfied, and analyse raster data to verify whether this is indeed the case. For example we develop a number of conditions that characterise a line. These conditions will include a set of pixels that must not be illuminated, a set of pixels that must be illuminated, and relationships between remaining pixels. We have looked in detail at the issue of whether a sequence of pixels adequately represents a given line, and have derived a characterisation based on human visual assessment. Furthermore we developed similar characterisations of other primitives (line primitives, polymarker, polygon primitives, cell array) and have defined certain test methods. Similarly, we dcvclopcd conditions that must be satisfied in certain graphical operations. We have looked at the attribute binding mechanism (bundle representations), the clipping mechanism, all segment mechanisms (creation, deletion, display, transformation, copy, detection, inheritance) and all raster functions. Generally these conditions relate to the raster data before and after the operation. A trivial example is segment (in)visibility. Thc condition for segment invisibility is that the raster data should be identically zcro aftcr the operation. Furthermore clipping (see Fig. 4) is well suitcd to automatic testing of pixel maps. We draw a picture using a set of output primitives and attributes, with clipping set to ‘off’. The pixel map is storcd, and the screen cleared. We repeat the process exactly, but with clipping this time set to ‘on’. The pixel map is again retrieved. The area outside the clipping rectangle must be cleared to the background colour, and the interior must match on a pixel-by-pixel basis with the interior of the original. We have seen that many graphical operations such as segment visibility, have a very simple cffcct on the pixel map of a rastcr device and so can be checked by automatic means. Our work has shown, however, that care is needed in this comparison process to
387
I
CGI I
Reference
*
Inquiries
Database
Documentation
Figure 4. Clipping test.
allow for rounding errors-particularly in the discretisation step. The automatic testing of output primitives is not evaluated completely. The attributes (e.g., line type) have no precise definition within the standard. Thus the definition of evaluation conditions can be very subjective. Additional work must be done to solve this problem. In the case of CGI devices which do not provide the function GET PIXEL ARRAY (allowed in CGI) we have to find an additional way of pixel readback, e.g., picture capture by a camera (camera input). To decide whether camera input can be used as pixel
388
readback depends upon the question ‘Can we predict the camera frame buffer by a given graphical output generated by CGI?’. To answer this question we made some experiments with different sets of graphical output on a monochrome screen. The camera ‘sees’ the physical reality. In fact the pixels on a certain screen were not square. The x-size was smaller than the y-size. Therefore the horizontal lines seemed to be thicker than the vertical lines. The camera ‘sees’ this thicker line. This is also true for the visual tester. We did not investigate on the acceptance of users of these deficiencies. Unfortunately there arose an additional problem. As mentioned above the CGI output was limited to colour indices 0 and 1. In contrast the camera device provided a range of 256 grey levels (0..255). The experiments showed that dependent on the adjustment of the camera one pixel is ‘lightcr’ than another pixel. That means if one pixel will hit exactly one camera pixel, the edge of two camera pixels or the corner of four camera pixels the camera ‘sees’ different light intcnsity. These are the kinds of problems which still have to be resolved. One major problem is to find out the definition of the function which defines the mapping from one CGI device pixel to the camera pixels (different light intensities). The second major problem is the calibration of the camera (calibration phase). We have to analyse how the physical capabilities of the device under test (pixel aperture and size) have to be fed into our camera input pipeline. All in all if we find feasible solutions (which can be implemented) we can apply the same automatic test methods described above (testing at the raster interface). Then conformance testing by automatic means will become more and more important and indeed will be included within the test suite of CGI testing.
4. Conclusions This paper has described the test systems for testing of GKS and CGI implementations. The GKS testing service is the first testing service in the area of graphics standards. Since establishing the GKS testing service experience was gained. This experience showed that conformance testing services are very useful to get implementations of Graphics Standards conforming to the certain standard. The developers of the CGI testing tools aimed to build on the experience gained in building and using the GKS testing tools. In fact, the CGI standard changed and functions have been added (e.g.. the inquiry functions) during the standardization process. Therefore the test specification had to be adapted according to these changes. At least, that neither a language binding nor a data encoding is standardized yet, made (and still makes) matters worse. We described the issues and possible solution of automatic testing. In general, conformance testing of Graphics Standard software should be automated as much as possible and less subjective judgement would be required. This could be realized if standards will be more precise. The GKS and CGI standards leave some room for the implementors, e.g., there is no definition of line end styles. Thus the developers of automatic tests have to decide whether the generated test pictures are a “good” or a “bad” representation.
389
Finally, the experience developing the automatic tests for CGI testing showed that those tests which check graphical operations (e.g., segment visibility, clipping) are “safe” candidates. The geometric tests of output primitives are critical (but realisable) and need future research work. The use of camera for accessing the pictures of an implementation under test requires a larger set of verification strategies, including image processing capabilities.
Acknowledgement I want to thank all those who helped in the preparation of this report. In particular, I thank Alexander Bolloni who spent a lot of time and work in investigating the feasibility of automatic testing of CGI at the raster interface. Thanks also to Ann Roberts, Ken Brodlie and Roger Boyle which spent many hours in discussing and experimenting during my visit at the University of Leeds.
References [Brod-891Brodlie KW, Goebel M, Roberts A, Ziegler R. When is a Line a Line? Eurographics ‘89. Participants Edition. 1989; 427-438. [BrRo-891 Brodlie KW, Roberts A. Visual Testing of CGZ. Internal Report. School of Computer Science, University of Leeds, 1989. [KiPf-901 Kirsch B, Pflueger C. Conformance Testing for Computer Graphics Standards. Internal Project Report (CTS2-CGI-072), 1990. [ISO-851 ISODIS 79424raphical Kernel System (GKS).Functional Description. 1985. [ISO-891 ISO/DIS 963Womputer Graphics Interface (CGI).Functional Specification, 1989. [ISO-90] ISO/SC24/WG5 N474onformance Testing of Implementations of Graphics Standards. DP text, 1990. [Thorn-841 Thompson K. Graphics certification at the European Community level. Computer Graphics & Applications 1984; 8(1): 59-61.
This Page Intentionally Left Blank
Databases and D ocunz eIZta tioi z
This Page Intentionally Left Blank
E.J. Karjalainen (Editor), Scienrific Computing and Automarion (Europe) 1990 0 1990 Elsevier Science Publishers B.V., Amsterdam
393
CHAPTER 35
A System for Creating Collections of Chemical Compounds Based on Structures S. Bohanecl, M. Tusar2, L. Tusar2, T. Ljubic', and J. Zupanl 'Boris Kidric', Institute of Chemistry, Ljubljana and 2SRC Kemija. Ljubljana, Yugoslavia
Abstract A system for creating collections of properties of chemical compounds based on their
structures is described. This system enables the chemist to handle (characterize, save, retrieve, change, etc.) correlations between selected features and parts of chemical structures. At the present step of development, the system can handle only spectral features. Using the structure editor as the central module of the system, the chemist can generate any structure and search for all features saved in the databases with respect to it. Some features in the structure defined databases are already built-in the system (I3C NMR basic spectra collection, IR spectra collection). There are also substructure defined correlation tables (table of 13C NMR basic chemical shifts and increments defined with the neighborhood of an isotopic atom, similar tables for 1H Nh4R spectroscopy, etc.). The user can add new features to the system or delete existing ones. It is also possible to add, correct, or delete only one or more data.
1. Introduction The common goal of most chemical information systems can be described as a search for possible connections on relation: structure <-> features A feature is any structure-dependent characteristics of a compound in question (physi-
cal, chemical, biological, pharmacological, ecological, etc.). Some examples of features are spectra, activities, melting points, boiling points or merely chemical names of compounds. Typically, a database is composed of structures of chemical compounds and corresponding features.
394
tructure
"C NMR
Structure
Structure Chem. names
IR
1 CLL 0
I
a
I
Structure
Chem. namcs
1 2
4
3 4
Figure 1. Structure dependcnt databases can have the structures attached to all records in each database (top row) or only maintaining links to central structure database (bottom row).
The system we are describing was designed as a general and efficient tool for helping the chemist in the process of solving the structure oriented problems. A very popular example for such task is the identification and characterization of chemical compounds on the basis of different spectra [1,2].
395
The structure editor, which connects different databases with other components (information an/or expert system modules) is a central part of our system. Due to a unique representation of chemical structures handled by all parts of the system (structure editor, databases, information and/or expert system modules) the transfer of results and the connections between different parts or modules becomes easy and transparent for the uscr.
2. General concept To extract chemical structure from different features or the properties of intcrcst from chemical structurcs wc need different databases, general system for building and maintaining these databases and providing links and communications between them, and finally, a number of information and/or expert systems to obtain some specific information. The databases [3, 41 consist of uniformly composed records. In principle, each rccord should contain a chemical structure and the corresponding information (Fig. 1 , top row) most frequently a spectrum (I3C NMR, IR,mass, IH NMR, etc.) or some other complcx (multivariatc) information (ecologically dangerous effects and properties, recipes, technological pararnetcrs, ctc.). All such data (oftcn called supplemental information) contained in diffcrcnt databases are linked with the central structure database and can be accessed through it (Fig. 1, bottom row). The system for building and maintaining databases [5, 61 with structure editor as its ccntral part are used for: building new structures, deleting, correcting, searching, and chccking old structures ctc. The output of structure editor is the connection table [ 13 of the handlcd structure, while other components assure checking, handling, and updating of records with supplemental information. The system for building and maintaining the databases must provide a number of tasks completely hidden (transparcnt) to the uscr such as generation and update of inverted files, establishing links bctwcen diffcrcnt databases and keys, decomposition of structurcs on fragments, normalizing, base-line correction, smoothing, peak dctection of spectra, etc. The chemical informalion andfor expert systems [3, 7-17] enable the chemist to use various activities offcrcd by the system combincd with data pooled from the databases and finally to combine partial rcsults or resulting files obtained at different modules into complex information. The information and expert systems may be ranked from very simplc data scarchcs in various databases, to complex simulations (spectra simulation) and complete or partial structure predictions, etc. It is evident that most of these systems must be connected to an efficient chemical structure manipulation system.
396
EXPERT SYSTEMS
DATABASES
I
I
IRSYS
I
(CARBON
I
IGENSTR
I
-[W] I
AD1
13CNMR
STRUCTURE
collection of
I
-
I
-
I
1
IR
1
INES
1
I SIMULA I a a
8
8
a
I Figure 2. General scheme of our system. It is always possible to access the information and/or expert systems and the structure editor and over the editor any collection of the structures in the databases. The modules are also accessible directly or from the structure editor. The following information and expert systems are included in our system: UPGEN (for building, checking, correcting, and organizing databases, selecting compounds due to the common structure characteristics or other properties), VODIK (for 1H NMR spectra simulation and for the supplementation of correlation tables), SIMULA (for 13C NMR spectra simulation), IRSYS (for structure and IR spectra collection managing), AD1 (for the supplementation of tables of 13C NMR chemical shifts), GENSTR (for building all possible structures from structure fragments, generated with structure editor or obtained born CARBON or IRSYS systems).
397
I
llT0M BRIDGE BOND
/’”
,8-9
CHAIN ERllSE 1HSERT
\ /’
‘L6
RING
3\
/. l\
1
ROPEM RBOND “IBERS UNDO NEY SAUE LOllD RENRBE DELEIE S EARCH EKSfW FILE
Fl-Help
F2-Keys
--Print
ESC-Exit
Figure 3. Structure editor as seen from the PC monitor with displayed p-bromo acetophenone structure, which was built using commands: RING, RBOND, CHAIN, ATOM, BOND.
3. System description A general scheme of the whole system is represented in Figure 2. Every information and expert system (on the right) are accessible directly or via structure editor, which is the central part of the whole system. The databases (on the right) can also be accessed via the structure editor or from the information and expcrt systems (on the left) that require data from the databases. The information and expert systems can be used sequentially one after other or in cycles. For example: the results from the first system are input data for the second onc, etc. All logical operations (AND, OR, XOR, and NOT) can be applied on files and resulting files used again as input files of other systems. One specific application which has employed a number of different parts (modules) of our systems is described in section 5. As already mentioned, the structure editor is a central part of the whole system [51. With simple commands (CHAIN, BOND, RING, etc.) understandable to the chemist any chemical structure to a ccrtain size (in our case the limit is set to 60 non-hydrogen atoms) can be built (Fig. 3).
398
I. 2.
3. 4. 5. 6.
7. 8. 9. 0.
nToN BR I DGE BUM0
2= 6- 7c I - 3c 2- 4. C 3= 5- 8c 4- 0. c s= 1BR 1C 4- B= toC
o c
CHlIN ERASE INSERT RING
ROPEN RBOHD NUHBERS UNDO CT
8= 8-
UEY SlUE
Lolo REMNE DELETE I
SEllRCH EKS -S yt
Prsrr SPACE to continua Figure 4. Conncction table of p-bromo acetophenone. Each row of connection table contains data of one (not hydrogen) atom: identification number of atom, chemical symbol of atom, identification number of the first neighboring atoms and the type of bonds to the first neighboring atoms.
Transparent to the user, during the editing process of any structure, its connection table is maintained all the time (Fig. 4). As a matter of fact, in this form all structurcs in the database are handled in the entire systcm. With commands SAVE, LOAD, RENAME and DELETE h e structures can be savcd on or loaded from the temporary files, the temporary files can be renamed and/or deleted. The structure that is currently active in the structure editor can be used in three diffcrcnt ways: first, searched for (SEARCH command) in the central collection of structures or in any partial onc that was previously generated as output of another module (SEARCH for spectra, for example), second, used as an input for expert or information systcm (EKS-SYS command), or, third, written on pcrmancnt file (FILE command). A rcsult of the SEARCH (with complete or partial structure-substructure) in the collection is the list of the identification numbers of structures that match the query structure. If the sought structure is a substructure the SEARCH will yield all appearances of it in any structure, which mcans that thcre can be more hits for only one reference structure. Atom-to-atom connections bctwccn the query and reference structure arc given for all hits what makes a good tool for studying the symmetry of compounds.
399
In our scheme the expert systems are used for: spectra simulation (SIMULA and VODIK for simulation of I3CNMR and 'H NMR spectra), generation of possible structures from some substructures (GENSTR), decomposition of structures on atomic centered fragments, and classification of structures due to the common fragments (decomposition and classification is described in the next chapter as a part of UPGEN system). The edited chemical structures can be down-loaded on files. The structures on these files can be accessed sequentially or directly. In first type of files the connection tables of structures are saved one after another as alphanumeric records. This form is suitable for structure transfer, particularly between personal computers. Direct access files and inverted files [l], with structures classified according to common structural characteristics, are more suitable for versatile processing of structural data. The inverted direct access files enable fast searching through large collections of structures, specially in the case of substructure searching and fast access to partial supplemental information associated with only parts of structures (fragments), etc.
4. Database improvements While using a chemical system in the qucst for various information at any stage the data that are inadequate (misleading, faulty, incomplete, completely wrong, duplicates, etc.) can be found or at least assumed that they are such. System UPGEN enables the user to handle such cases and maintain database. In order to maintain databases in an adequate state the database manager (this can be any user if our system is implemented on a PC) has a direct access to any database to do one of the following actions:
- adding new data, - deleting data, - correcting old data, - organizing (classifying) whole database, - dclcting the whole database and preparing empty files for new
database.
After every correction or input of a new chemical structure the system automatically checks, if this new structure already exists in the collection or not. If it does, then user can choose bctwecn abandoning the update or incorporating the new structure (and supplemental information, if any) into the collection. Chemical structures entered using the module UPGEN are decomposed on atomic centered fragments and classified upon different characteristics (heteroatoms, bonds, topology, etc). The identification number of structure is written on records of inverted file as shown in Figure 5. The procedure according to which the structures are decomposed into fragments and stored into the inverted file is as follows:
400
Structure 164
Decomposition on fragments f1
Inverted file
... 164 ...
4536 164 ...
... 164 0c5 7720
... 164 ...
I Figurc 5. An example of decomposition of a chemical structure (ID = 164) on atomic centered fragments and updating the identification number on different records in the inverted file of fragments is shown. The code representing each fragment is calculated from the bit mapped pattern of atomic centered fragments [ 11. From each code the position of corresponding record in the inverted file is determined by hash algorithm [l].
- dccornposition of
the structures on atomic centered fragments, coding each fragment by formation of 64-bit mapped patterns [ 13, - obtaining one number for each fragmcnt from bit mapped patterns by XOR function, - calculation of a proper hash address for each number representing a fragment, - saving structure’s ID number to the address in the inverted file. -
Each ID number of a chemical structure is stored into as many records of the inverted file as thcre were different atomic ccntcrcd fragments found in these structure.
5. An application of our system At the cnd we would like to dcscribe a problem that was solved in our laboratory using the discussed system. During the work on 13C NMR spectroscopy of furan dcrivatives it was asccrtaincd that the system docs not contain enough data (neither complctc spectra nor corrcclions of chemical shifts due to the ncighborhoods of observed atoms for correct 13C NMR spectra simulation) for any of these compounds.
401
shifts 2 3
C
c
143.0 wn 1W.O PPR
4 5
C
109.9 pon
c
143.0
PPR
-2.3 o m
1
811 i n c r c m t tatus=1,2... No. o f nirring incrcncnts
I
Picture o f spcctrrm (Y/W?
Figure 6. Simulated spectrum of 2-methyl furan. Incomplete tables were used with missing increments for some substituents of furan. Number of missing data is expressed with STATUS.
The case started when the structure of 2-methyl furan was built with the structure editor and then it was established that there was no such structure in assigned collection of 13C NMR spectra (SEARCH module was used) and simulated spectra (SIMULA system was uscd) was not correct due to nonadequate data in the tables of the increments (Fig. 6). In any structure, the chemical shift of an isotopic carbon atom is simulated by adding to the basic chemical shift (standard chemical shift A, in equation (1)) [17] increments produced by all substituents. These increments are dependent on the type of substituent, presence of other substituents, and relative position of the substituent with the respect to thc isotopic atom: D i = A , + Ck B kJ D; is chemical shift of ith carbon atom, A , is basic chcmical shift for functional group z, CB,, is the sum of increments due to the substituents (the system can recognize 150 diffcrent substituents [17, 221 and then determines belonging increments for such substitucnt on distancej (& p, yor 9 from the isotopic carbon atom i.
402
TABLE 1 Increments, Bkj (k is a position of the substituent a n d j is a position of isotopic carbon i in furan ring) for furan rings with substituents on positions 2 or 5. ~
~
Substituent
zsp3 -C=C-
-CHO -CO4O-O-CH3
~~
Increments (ppm) B22=B 5 5
B23=R54
9.2 7.6 10.8 6.8 1.8
-2.8 6.8 11.7 10.1 8.0
B24=B53
0.7 3.0 3.0 2.4 2.0
B25=R52
-1 .o 2.9
5.7 4.8 3.4
The data base of chemical shifts and increments used in SIMULA, was taken from the literature [18-201 and at present contains about 40,000 values. Nevertheless, the adequate shifts and increments for furan derivatives were not at hand. In ow case two types of data were missing for the simulation of I3C NMR chemical shifts. The first type represents the influence of furan on chemical shift of methyl group substituted on the position 2 of the furan ring and the second type represents the influence of methyl group as a substituent of furan ring on the same position on the chemical shifts of all carbon atoms in this ring (for a, p, yand 6 positions [17]). In this case only first of two possiblc positions in furan (2 or 5 and 3 or 4) was in our interest. The missing increments were determined with AD1 system and by generation of small specialized database (UPGEN system was used). From existing collection and from literature [21] all compounds with furan ring were extracted (SEARCH and UPGEN systems were used). A new small collection of 10 assigned specua was used as input to AD1 system. It is necessary to emphasize that only assigned specua should be included in such collection, otherwise new increments cannot be determined. In the first run of AD1 program new increments can be calculated only from the cases where in the simulation of one shift exactly one increment is missing (STATUS = 1).In the second and next runs, however, after the increments obtained from prcvious runs are already updated into the tables, the other increments can be determined as well. To be precise, the increments determined in the described process are written on a special temporary file. Only after checking the simulation with a number of cases the new increments are updated permanently into the tables with a special command. The increments for some substituents on furans obtained with the described procedure are given in Table 1. This data are related only with the substituents on position 2 or 5 on furan ring. Besides the increments for methyl group, a complete set of data for four additional substituents were obtained.
403
Ilo. Bton
2 3 4
c
5
c
&
C
C C
tatus0 tatur=1,2..
Chemical shifts 152.2 107.1 110.4 142.0 12.7
Status
ppm P P ~
0
opn
0 0 0
PQlr PQS
0
All increments
. No.
o f missing ineraments
Pictura o f reactrun ( Y A ) ? Fl-Ulp
F2-Keys
F5-Print
F3-In
rt
F6-
Figure 7. Simulated spectrum of 2-methyl furan obtained with improved tables of chemical shifts. Experimental chemical shifts for this compound are: Dc-2 = 152.0 ppm, Dc-3 = 105.7 ppm. D ~ A = 110.5 ppm, Dc-5 = 141.0 ppm, Dc. = 12.9ppm.
The basic chemical shirk, A,, on the positions 2 or 5, Ac-2.5, and 3 or 4, A~-3,4,of the furan ring are 143.0 and 109.9 ppm, respectively 1181. With the same procedure the system AD1 was able to determine the increment to chemical shift of methyl group due to the furan ring on a position. The increment is 15.0 ppm to the basic chemical shift -2.3 ppm for alkanes. With all necessary increments (Bkj for furans and alkanes) determined in due process by the system ADI, the 13C NMR spectrum of the 2-methyl furan was simulated again. The simulated values (Fig. 7) are very similar to the real values which was in the meantime obtained from the literature [20]. Denotation of atoms on Figure 7 corresponds to that on Figure 6. The differences between the simulated and corrcsponding experimental chemical shifts are small and amount for the positions C-2, C-3, C-4, C-5, and C-6 for 0.2, 1.4, 0.1, 1.0, and 0.2 ppm, respectively. The standard deviation of a difference being 0.5 ppm. As shown in the above example, using the module AD1 the table of increments can be successfully supplemented for any [22] functional group of user’s choice. However, bcsidcs the AD1 system a rcprcsentative collection of assigned I3C-NMR spectra containing of structures (compounds) containing the functional group is mandatory.
,
404
6. Conclusion We hope that the explained approach of a combination between the databases and information and/or expert systems where the links are provided with a set of powerful structure handling tools has been shown convincingly, Without a flexible structure handling capability no data base, expert system, or a knowledge base can be fully exploited. The extension of this work is aimed towards a system similar to UPGEN but with much more power in dealing with general type of databases containing (besides the chemical structures) urbirrury other data, enabling the cross-links between the structural data and different textual, numeric, spectral and other types of databases. The presented system is implemented and runs IBM PC/XT/AT/PS or compatible computers under VGA/EGA/Hercules graphics environment. In part (infrared spectra and chemical structures) the described system is additionally implemented on the Institute’s rnicroVax system under VMS 0 s and can be accessed via JUPAK (official Yugoslav data communication net) at no charge. For the access procedure and arrangements contact the authors.
References 1. 2. 3.
4. 5. 6. 7.
8.
9.
10. 11.
12.
Zupan J. Algorithms for Chemists. Chichester: John Wiley & Sons, 1989. Gray NAB. Computer-Assisted Structure Elucidation, New South Wales: John Wiley & Sons, 1986. Zupan J, Ed. Computer-Supported Spectroscopic Databases. Chichester: Ellis Horwood. Int., 1986. Bremser W, Ernst L, Franke B, Gerhards R, Hardt A. Carbon-I3 NMR Spectral Data. Weinheim: Verlag Chemie. 3rd ed.,1981. Zupan J, Bohanec S. Creation and Use of Chemical Data Bases with Substructure Search Capability. VestnSlov Kem Drust 1987; 34(1): 71-81. Zupan J, Razinger M, Bohanec S, Novic M, Tusar M, Lah L. Building Knowledge into an Expert System. Chem Intell Lab Syst 1988; 4: 307-314. Zupan J, Novic M, Bohanec S, Razinger M, Lah L, Tusar M, Kosir I. Expert System for Solving Problems in Carbon-13 Nuclear Magnetic Resonance Spectroscopy. Anal Chim Acta 1987; 200: 333-345. Lindsay RK, Buchanan BG, Feigenbaum EA, Lederberg J. Applications of Artificial Intelligence for Organic C h e m i s t y T h e Dendral Project. McGraw-Hill, New York, 1980. Picchiottino R, Sicouri G, Dubois E.DARC-SYNOPSYS Expert System. Production Rules in Organic Chemistry and Application to Synthesis Design. In: Z. Hippe, Dubois JE Eds. Computer Science and Data Bank. Polish Academy of Sciences, Warsaw, 1984. Milne GWA, Fisk CL, Heller SR, Potenzone R. Environmental Uses of the NTH-EPA Chemical Information System. Science 1982; 215: 371. Milne GWA, Heller SR. NIH-EPA Chemical Information System. J Chem Inf Comput Sci 1980; 20: 204. Zupan J, Penca M, Razinger M, Barlic B, Hadzi D. KISIK, Combined Chemical Information System for a Minicomputer.Am1 Chim Acta 1980; 112: 103.
405
13. Sasaki S , Abe H, Hirota Y, Ishida Y, Kuda Y, Ochiai S , Saito K, Yamasaki K. CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds. J Chem Inf Comput Sci 1978; 18: 211. 14. Oshima T, Ishida Y,Saito K, Sasaki S . CHEMICS-UBE, A Modified System of CHEMICS. Anal Chim Acta 1980; 122: 95. 15. Shelley CA, Munk ME. CASE, Computer Model of the Structure Elucidation Process. Anal ChimActa 1981; 133: 507. 16. Robien W. Computer-Assisted Structure Elucidation of Organic Compounds III: Automatic Fragment Generation from 13C-NMR Spectra. Mikrochim Acta, Wien, 1987; 1986-11: 27 1-279. 17. Lah L, Tusar M, Zupan J. Simulation of 13C NMR Spectra. TetrahedronComputer Methodology 1989; 2(2): 5-15. 18. Pretsch E, Clerc JT, Seibl J, Simon W. Tabellen zur Strukturaujklarung organischer Verbindungen mit spectroskopischenMethoden. Berlin: Springer-Verlang, 1976. 19. Brown DW. A Short Set of C-13 NMR Correlation Tables, J Chem Education 1985; 62(3): 209-2 12. 20. Stothers IB.Carbon-13NMR Spectroscopy. New York and London: Academic Press, 1972. 21. Johnson LF, Jankowski CW. Carbon-13 NMR Spectra, A Collection of Assigned, Coded, and Indexed Spectra. Wiley & Sons, 1972. 22. With described simulation of 13C-NMR spectra 65 different functional groups with belonging basic chemical shifts were determined. In specba simulation [ 171 only 20 different functional goups were determined.
This Page Intentionally Left Blank
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
407
0 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 36
TICA: A Program for the Extraction of Analytical Chemical Information from Texts G.J. Postma, B. van der Linden, J.R.M. Smits, and G. Kateman Department of Analytical Chemislry, University of Nijmegen, Toerrwoiveld,6525 ED, Nijmegen, The Netherlands
Abstract A program for the extraction of factual and methodological information from abstract-
like texts on analytical chemical methods is described. The system consists of a parser/intcrpretcr and a frame-based reasoning system. The current domain is inorganic rcdox titrimetry. Some results are given. Possible sources of information on analytical chemical methods are discussed.
1. Introduction Within Analytical Chemistry the analytical instruments are being equipped with computers. Thcsc computcrs oftcn have knowledge and reasoning systems connectcd to the tcchniqucs in which the instrumcnt is used. They can assist in the development of the analytical mcthod and can perform the actual management of the analytical procedure [l-31. In the future analytical instruments are expected to become automatic analytical units for which the only information that is needed is the analyte and information on the sample to be analyscd. For the development of an analytical procedure usually the literature (or some inhousc mclhod database) is checked first and if a suitable method is found an expert system is consulted for the finetuning or modification of the procedure to the situation at hand. If there is no directly applicable method an expert system can be used for the development of a mcthod. Such an expcrt system has to be continuously updated with new knowledgc. For this task beside a human expert also literature data could be used in combination with some lcarning expert system. Dircctly by computer accessiblc and usable analytical literature databases hardly cxist. Information on analytical methods has to be manually extracted from the literaturc. In thc secondary litcrature such as Chemical Abstracts and Analytical Abstracts [4, 51
408
much analytical interesting information is also not directly accessible. In Analytical Abstracts (the online database) there are a number of search fields with which the functions of the contents of the field are determined (indexed). For instance the field ANALYTE determines that the chemical in that field is used in the corresponding article as analyte (has the role of analyte). There are also ficlds like CONCEPT and MATRIX but for for most of chemicals, data, equipment, etc., their role is not directly searchable and accessible. The procedural information must be extracted by human effort, too. Still, this information exists in the abstract or can be inferred from the abstract. Chemical Abstracts lacks even these from an analytical chemical point of view interesting search fields. The text analysis system outlined in this article is aiming at the automation of the extraction of factual and procedural analytical method information from texts. The information contained in the descriptions of analytical methods can be subdivided into factual and procedural information. Factual information is data on, e.g., the analyte, thc working-range of the procedure, the accuracy of the procedure, the composition of the reagents, etc. Procedural information entails all the actions that have to be performed, inclusive information on the roles of the chemicals, solutions, instruments, etc. that participate in the actions and information on the circumstances under which the actions take place. This information can be extracted from text by means of Natural Language Processing techniques. Text analysis consists of morphological, syntactic, semantic and discourse analysis (Fig. 1). Morphological analysis deals with the structure of words, its inflections and how lcxcmcs can be derived. Syntactic analysis deals with the relative ordering of words within sentences and sentence elements in terms of their syntactic classes. Semantic analysis
morphological analysis
syntactic ana Iysis
text a n alysis semantic analysis
Figure 1. Natural language text processing parts.
discourse analysis
409
produces information on the meaning function of the various sentence elements and their relationships. After the semantic analysis of sentences some kind of semantic representation is produced in which as much as possible the meaning of the individual sentences is represented. The semantic representations of each sentence serve as input for the discourse analysis. During the discourse analysis various kinds of intra and inter sentential references and ambiguities are resolved and some kind of discourse representation is produced by comparing the input with background knowledge on the domain of the subject of the text.
2. The program TICA The program TICA consists of two parts. The first part performs the sentence analysis, the second part performs the discourse analysis and extracts the information. For sentence analysis we have chosen for the method introduced by Riesbech, Schank [61 and Gershman [7]. The morphological, syntactic and semantic analyses are performed concurrently. Initially we have chosen a semantic representation that was close to that of Shank (the Conceptual Dependency theory) [ 8 ] . His theory uses a limited set of types of concepts (Actions, Picture Producers, Properties and Relations) and these types of concepts are subdivided into a limited set of members. All the concepts that appear in a sentence are represented by means of these basic concepts. This representation proved to be too abstract and distant from the actual meaning and use of the sentences and was difficult to handle. Because of this at the moment a case based representation is used and implemented (see Fillmore [91 for the original ideas and, e.g., Nishida [lo] for a adopted and extended set of cases). Cases are relations mainly between the main verb of a sentence or clause and the other elements of the sentence or clause. These relations represent the semantic function that the various elements fulfil in relation to the main verb. These relations can be the Actor of the action represented by the verb, the Object, the Location, the Manner, the Instrument, the Purpose, the Goal or product, etc. These case relations are most of the times linked to the syntactic functions or leading prepositions of the various sentence elements and determined by the semantic class of the main verb and by that of the main part (noun or verb) of the sentence element. The verbs are not represented by means of a limited set of primitive Actions but used as they are, sometimes replaced by a synonym. The semantic representation is represented in frames. After the sentence analysis the discourse analysis is performed. For the discourse analysis a relative simple ‘script’approach (Schank [ l l , 123 and Cullingford [13, 141) is implemented in a frame-based reasoning system for the description of the background knowledge. The principle of this approach is that texts frequently describe a story in which the sentences and sentence parts describe a sequence of events and states. These events are ordered and this order can be captured in a script. Within a script about a certain story there are a number of possible routes or tracks describing different event
sequences which lead from the start to the end of the story. A script can furthermore be subdivided into small units of related events and states, called scenes or episodes. A text about an analytical method can also be captured within a script, e.g., titration. A titration consists of a limited number of analytical actions, but the existence and order of these actions can differ along different routes such as a direct titration and a back-titration. The discourse analysis part of the program takes care of the reference resolution and uses the script information about the domain to determine the function and meaning of the sentences within the text. After this all relevant information is extracted. The determination of the relevant information is done via marking of all relevant concepts in the knowledge base. The program is written in Prolog. More information about the program and the semantic representation can be found in reference 15.
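To give a rough, purely illustrative impression of the two representations discussed above (TICA itself is written in Prolog and its frame format is described in reference 15; the attribute names and the small matching helper below are invented for this sketch), a case-frame for a sentence such as "The unreacted ferrocyanate is titrated with ascorbic acid" and a minimal titration script with two tracks could be written in Python as follows:

# Hypothetical case-frame: case relations linking the main verb to the
# other sentence elements.
sentence_frame = {
    "action": "titrate",
    "object": {"concept": "ferrocyanate", "state": "unreacted"},
    "instrument": {"concept": "ascorbic acid", "role": "titrant"},
}

# A minimal 'script' for titrimetric methods with two possible tracks.
titration_script = {
    "direct": ["add titrant to analyte solution", "detect end-point"],
    "back": ["add excess reagent to analyte", "let reaction proceed",
             "titrate unreacted reagent", "detect end-point"],
}

def matches_track(events, script):
    """Return the names of the tracks whose event sequence contains the
    observed events in order (a crude form of script matching)."""
    def subsequence(small, big):
        it = iter(big)
        return all(any(e == b for b in it) for e in small)
    return [name for name, track in script.items() if subsequence(events, track)]

observed = ["add excess reagent to analyte", "titrate unreacted reagent"]
print(matches_track(observed, titration_script))   # -> ['back']

A real discourse analysis, of course, works on frames produced by the sentence analysis rather than on such pre-labelled event strings; the sketch only shows how a matched track identifies the type of titration.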
3. Results and discussion

The current program is capable of analysing short abstract-like texts within the domain of redox titrimetry. An example of such a text is:

The determination of iodine.
1. Samples containing 62-254 mg iodine are reduced with an excess of 0.1 N potassium ferrocyanate.
2. The ferrocyanate is oxidized by the iodine to the ferricyanate.
3. The unreacted ferrocyanate is titrated with ascorbic acid.
4. The titration is carried out in a solution buffered with bicarbonate.
5. The indicator is 2-hydroxyvariamin blue.
6. A solution of it is prepared by mixing 1 g of 2-hydroxyvariamin blue with 500 ml sodium chloride solution.
7. A portion of this mixture weighing 0.3-0.9 g is used for each titration.
8. The standard deviation is 0.11%.

Some of the questions that are resolved by the program are:
- What is the analyte?
- What is the function of the second sentence?
- What is the type of titration?
- What is the titrant?
- To what does "it" in sentence 6 refer?
- To what does "this mixture" in sentence 7 refer?
The extracted information is:
- Analyte: I2
- Working-range: 62-254 mg
- Method: back-titration
- Titrant: ascorbic acid
- Reagent: K4Fe(CN)6
- Reagent-concentration: 0.1 N
- Indication-method: indicator
- Indicator: 2-hydroxyvariamin blue
- etc.

Beside factual information, procedural information such as the preparation of the indicator solution can also be extracted. This procedural information can be represented using a recursive frame representation consisting of: frame, attribute, value. In this representation 'attribute' stands for some property or case and 'value' is the value of the property or case and can be a frame itself. If in the above example the preparation of the indicator consisted of the mentioned mixing followed by some filtration, this could be represented in the simplified nested list form of Figure 2.
[Preparation,
  [object, ['Indicator solution']],
  [method,
    [mix,
      [object, ['2-hydroxyvariamin blue' ...]],
      [applied, [solution, [has-part, ['sodium chloride' ...]]]],
      [output, ['mix output 1']],
      [followed-by,
        [filter,
          [object, ['mix output 1']],
          [output, ['indicator solution']]]]]]].
Figure 2. An example of procedural information represented in a nested list form.
TABLE 1
Result of an analysis of 40 abstracts from Analytical Abstracts.

Type of information (n = 40): working range; detection limit; conditions; analyte (matrix, interferents); complete sample pretreatment (*); main reagent (*); figures on main reagent (*); complete method description; method performance figures.

% values: 48, 37, 66, 65, 98, 73, 85, 52.

(*) relative to those abstracts that contain the type of information.
The procedural information can be stored in relational tables such as described by Nishida et al [10, 16] or represented along the lines of the method used by the TOSAR system for organic synthesis representation and storage (Fugmann et al [17]), extended by specific case information of the participants of each action (reaction, process, etc.). If the information is to be used by an expert system linked to, e.g., a robot and/or analytical instrument, the information could be transferred directly in the form of frames.

There are different sources of text on analytical methods. The current program is being developed for abstracts. One of the drawbacks of abstracts is that they are not complete. This, of course, follows from the nature of abstracts. But even when a basic set of types of method description data is selected, these data are frequently not present. The results of an investigation of 40 abstracts from Analytical Abstracts are presented in Table 1. The abstracts are taken from the end of 1988 and the start of 1989. The percentages of 'complete description of sample pretreatment' and 'complete analytical method description' are rough: the completeness is only evaluated using general knowledge on the analytical techniques and not by comparing the abstracts with the articles themselves. The category 'main reagent' includes reagents for the production of coloured compounds which are measured, eluents for chromatography, and titrants. In the category 'complete method description' the main reagent (if it exists) is not included in the evaluation. The division of method description data into the presented types can of course be improved, but the incorporation of the most important information about the applied or developed analytical methods in abstracts facilitates better access to these methods. This study will be continued.

The predominant source of information is the article itself. When the same list of types of method description data is used, a manual pilot study on 6 randomly selected
articles describing one or more analytical methods from 5 different frequently used journals on Analytical Chemistry shows that for none of the articles can all the information be found in the Material and Method section. In most cases the complete article must be analysed in order to obtain all relevant information. This was also observed in three randomly chosen recent articles of the Journal of the Association of Official Analytical Chemists. Although it is even possible to extract information from graphs [18], the fully automatic extraction of information from complete articles is seen as troublesome at the moment. Perhaps for a number of articles reasonable results can be obtained by combining text analysis techniques for the Material and Method section with some combination of a keyword search on relevant factual information (such as method statistics) and textual analysis of the environment of the keywords found. The situation would be improved if all relevant method information (also) appeared in one closed section.

Another source of information is Official Methods of Analysis [19]. A drawback of this source is that only a small number of the published methods are included in this volume (after extensive testing and, if necessary, modification) and that the methods are not recent (because of the evaluation procedures).
4. Conclusions

It is possible to extract information from short texts on a subdomain of Analytical Chemistry with the methods presented. Further work will be undertaken to incorporate other fields of analytical chemistry. The extraction of all relevant method information from articles will be difficult because the information is spread throughout the article.
References

1. Goulder D, Blaffert T, Blokland A, et al. Expert Systems for Chemical Analysis (ESPRIT Project 1570). Chromatographia 1988; 26: 237-243.
2. van Leeuwen JA, Buydens LMC, Vandeginste BGM, Kateman G. Expert Systems in Chemical Analysis. Trends in Analytical Chemistry 1990; 9: 49-54.
3. Isenhour TL, Eckert SE, Marshall JC. Intelligent Robots - The Next Step in Laboratory Automation. Analytical Chemistry 1989; 61: 805A-814A.
4. The American Chemical Society. Chemical Abstracts. Chemical Abstracts Service, Columbus, USA.
5. Analytical Abstracts. The Royal Society of Chemistry, Letchworth, Herts, England.
6. Riesbeck CK, Schank RC. Comprehension by Computer. Technical Report 78, Yale University, New Haven, 1976.
7. Gershman AV. Knowledge-based Parsing. Research Report 156, Department of Computer Science, Yale University, New Haven, 1979.
8. Schank RC. Conceptual Information Processing. Fundamental Studies in Computer Science, volume 3. Amsterdam: North-Holland Publishing Company, 1975.
9. Fillmore C. Some problems for case grammar. In: O'Brien RJ, ed. Report of the twenty-second annual round table meeting on linguistics and language studies. Monograph Series on Languages and Linguistics, no. 24. Georgetown University Press, Washington, DC, 1971: 35-56.
10. Nishida F, Takamatsu S. Structured-information extraction from patent-claim sentences. Information Processing & Management 1982; 18: 1-13.
11. Schank RC. SAM, a story understander. Research Report 43, Department of Computer Science, Yale University, New Haven, 1975.
12. Schank RC, Abelson RP. Scripts, Plans, Goals and Understanding. An Inquiry into Human Knowledge Structures. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1977.
13. Cullingford R. Script Applications. Computer Understanding of Newspaper Stories. Technical Report 116, Yale University, New Haven, 1978.
14. Cullingford R. SAM. In: Schank RC, Riesbeck CK, eds. Inside Computer Understanding: Five Programs Plus Miniatures. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1981.
15. Postma GJ, van der Linden B, Smits JRM, Kateman G. TICA: a system for the extraction of data from analytical chemical text. Chemometrics and Intelligent Laboratory Systems, accepted.
16. Nishida F, Takamatsu S, Fujita Y. Semiautomatic Indexing of Structured Information of Text. Journal of Chemical Information and Computer Sciences 1984; 24: 15-20.
17. Fugmann R, Nickelsen H, Nickelsen I, Winter JH. Representation of Concept Relations Using the TOSAR System of the IDC: Treatise III on Information Retrieval Theory. Journal of the American Society for Information Science 1974; 25: 287-307.
18. Rozas R, Fernandez H. Automatic Processing of Graphics for Image Databases in Science. Journal of Chemical Information and Computer Sciences 1990; 30: 7-12.
19. Official Methods of Analysis. Williams S, ed. The Association of Official Analytical Chemists Inc., Arlington, Virginia, 1984.
CHAPTER 37
Databases for Geodetic Applications

D. Ruland1 and R. Ruland2*
1Siemens AG, Dept. ZU S3, D-8000 München, FRG and 2Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94309, USA

* Work supported by the Department of Energy, contract DE-AC03-SF00515
Abstract

Geodetic applications, even for a single defined project, consist of various different activities and access a vast amount of heterogeneous data. Geodetic activities need a hybrid and heterogeneous hardware environment. This paper gives a brief introduction to the geodetic data flow using a sample application in survey engineering. It states a general multi-level integration model providing an open system architecture. The model yields the GEOMANAGER project, whose data management aspect is addressed by this paper.
1. Introduction

1.1 Preface

Over the last decade data handling in applied geodesy and surveying has changed dramatically. What used to be the fieldbook is now a portable computer, and the fieldbook keeper has been substituted by a microprocessor and an interface. Further down the data processing line one can see the same changes: least-squares adjustments used to require the computational power of mainframes; now there is a multitude of sophisticated program systems available which run on Personal Computers (PC) and provide an even more elegant human interface. Also, there are solutions available for automated data preprocessing, i.e., for the data handling and preparation from the electronic fieldbook to the creation of input files for the least-squares adjustments [FrPuRRu87, RRuFr86]. However, an equally important step has not found much consideration in geodetic discussions and publications: the integration of the geodetic data flow, i.e., the management of geodetic data in large projects. This paper will summarize the geodetic activities and the data flow, shown for a sample and representative geodetic application. A two-level integration model is introduced, consisting of communication integration and information integration. Whereas an integrated communication system can be implemented using today's market-standard components, an information integration requires a customization of new database management systems. The goals, requirements, and solutions for geodetic database management systems, especially for the GEOMANAGER of our sample application, are emphasized.
1.2 Data flow

The geodetic data flow is summarized in Figure 1.

Figure 1. Geodetic data flow.

First, the readings are stored in measurement instruments or data collectors. The data are then uploaded and prepared by DATA PREPARATION programs, yielding a measurement data file for each considered observable. These measurement data are raw measurement data, which must be processed by PREPROCESSING programs yielding reduced data. To do so, the preprocessing
programs need to access the calibration data. Furthermore, point identifiers are normalized to standard point identifiers, i.e., aliases or synonyms are replaced by standard identifiers. The preprocessed measurement data form the input of various DATA ANALYSIS programs, which compute (new) coordinates for the considered points. For each point all sets of new and previously measured coordinates are stored. Each set of coordinates refers to a common or measurement-specific underlying coordinate system. Thus, a huge amount of highly structured data is generated and accessed by various activities, from data collection in the field to time-consuming data analysis programs.

Due to the nature of geodetic applications, the geographical sites of the activities are widely spread. The observables are collected in the field using portable microcomputers (e.g., HP Portable Plus or 71 computers) running the specialized data collection programs. The observation data are either manually entered or the data collectors are interfaced with the survey instruments (e.g., KERN E2 or WILD T3000 theodolites) to transmit bi-directional signals. Preparation and preprocessing of the collected field data are executed on a departmental cluster of workstations and personal computers, respectively. Hence, the field data collection is connected off-line to the cluster. Other activities, like the calibration of survey instruments, are performed at sites located several miles away from the cluster. The data analysis programs run mainly on the cluster. Only some special analysis programs still need a mainframe computer. Summarizing, most of the geodetic activities are performed on the PC/WS cluster.

As shown, the different geodetic actions deal with various data which can be classified as follows (see Fig. 2):

Measurement Data (data concerned with different observables)
- Height Data
- Distance Data
- Direction Data

Calibration Data (data about instruments)
- Tape Data
- Rod Data
- Circle Data

Point Data
- Point Identification Data (synonym identifiers)
- Coordinate Data
- Coordinate System Data
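As a small sketch of the point-identifier normalization mentioned above (the synonym table, the point names and the record fields are invented for illustration and are not taken from the actual GEOMANAGER data), the preprocessing step could be pictured as follows:

# Hypothetical synonym table: alias -> standard point identifier.
SYNONYMS = {
    "LINAC-7A": "P1007",
    "L7A":      "P1007",
    "COLL-3":   "P2003",
}

def normalize(point_id: str) -> str:
    """Replace an alias or synonym by the standard point identifier;
    identifiers that are already standard pass through unchanged."""
    key = point_id.strip().upper()
    return SYNONYMS.get(key, key)

# Preprocessing a raw measurement record
raw = {"from": "l7a", "to": "COLL-3", "distance_m": 123.4567}
reduced = {**raw, "from": normalize(raw["from"]), "to": normalize(raw["to"])}
print(reduced)   # {'from': 'P1007', 'to': 'P2003', 'distance_m': 123.4567}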
Figure 2. Data classification: measurement data (distance, height, direction), calibration data (EDM, rod, circle, ...), and point data (point names and descriptions; coordinate data with xyz, T/S and comments; synonym identifications / alias names; coordinate system data with origin and system definitions).
1.3 Enhanced data integration

The GEONET data management approach [FrPuRRu87] was developed to handle the huge amount of geodetic data originating during the construction survey and subsequent realignment surveys of the Stanford Linear Collider (SLC), built by the Stanford Linear Accelerator Center (SLAC) [Er84]. The SLC is a high energy physics particle collider for research into the behavior and properties of the smallest constituents of matter. During the construction alone, some 100,000 coordinates had to be determined [OrRRu85, Pi85]. The GEONET approach was based on a hierarchical database management (DBM) concept which required the hardwiring of data structures. Nevertheless, GEONET proved to be very successful and has found many applications in the high energy physics survey and alignment community. However, the concept does not provide the flexibility of easy assimilation to changing requirements, of easy integration of new tools, and of establishing new data relationships. Therefore, future projects like the Superconducting Super Collider (SSC), which will produce an at least 20-fold increase in the amount of data and will show more complex data relationships due to an increase of observables and more sophisticated and complex mathematical modelling, will require new concepts. This situation triggered the project GEOMANAGER.
2. Integration of the geodetic data flow

As pointed out in the introduction, an integration of the data flow among the various geodetic activities is necessary.
The major goals and requirements of an integration of the data flow result from the following characteristics:
- Geodetic software tools have different communication interfaces.
- Geodetic software tools (e.g., data gathering, data analysis programs, etc.) use a huge amount of different data.
- Geodetic software tools share the same data.
- Geodetic software tools run in different project environments.
- New geodetic software tools must be easily integrated.
- Geodetic data are highly structured and need heterogeneous types.
- Geodetic data have various complex consistency constraints.

An integration must provide an open system architecture for an easy integration of new tools and instruments. The geodetic integration concept provided by the GEOMANAGER project consists of two levels:
- Communication integration
- Information integration

Information integration requires an integrated communication management. Communication integration emphasizes a full interfacing of all used computers and instruments. The interfaces must be suitable for the required communication. The requirements of the main interfaces of the sample application are:
- Survey instruments and data collection computers: special purpose low-level signal transmission communication
- WS/PC cluster and data collector computers: transmission of small amounts of field data files
- WS/PC cluster: high speed local area network
- WS/PC cluster and mainframes: transmission of large amounts of various data

Interfacing a hybrid computer environment can use today's well-established communication standards. But in some cases (e.g., interfacing survey instruments) the customization of special interface boxes is required.

The major goal of the information integration is a unified high level data management, such that all activities can access the data on a high level of abstraction and in a unified way. Information integration is best fulfilled by a database approach, providing the following concepts:
- Conceptual data centralization
- Data redundancy elimination
- Data sharing
- Data independence
- High level interfaces
- Open system architecture

Figure 3. Hybrid computer environment (WS/PC cluster with shared plotters, printers and disk storage; gateway to a mainframe; data collectors and E2/T3000 theodolites).

The GEOMANAGER's database resides on the WS/PC cluster, because all major geodetic activities take place here (see Fig. 3).
Databases provide some further well-known functions and capabilities, which are also required by geodetic applications. They are not discussed here [DRuRRu87]. There are various problems and aspects in applying a geodetic database system. Because of space limitations, we focus only on the following aspects:
- Data modelling
- Database interface
3. Data modelling

As already pointed out, geodetic data are highly structured and use heterogeneous data types. However, traditional data models do not support all relationship and data types, nor the more sophisticated data abstraction concepts. These limited data modelling capabilities complicate the database design process and the database usage. The lack of semantics becomes more important the more complex the data structure of the application is (especially in more sophisticated "non-standard" database applications, like engineering design, office automation, geographic applications, etc.). Furthermore, geodetic tools run in a wide range of project environments using database systems based on different data models. Thus, the same application data structure must be modelled in different data models, which causes redundant database design processes. These gaps between applications and traditional data models are bridged by semantic data models. We use the Entity/Relationship model (ER model) extended by the data abstraction concepts of aggregation and generalization hierarchies. Extended ER schemes are developed for the following major geodetic data classes:
- distance measurement data
- height measurement data
- point data.
3.1 ER schemes for the sample geodetic application

In Figures 4, 5 and 6 ER diagrams are given for distance measurement data, height measurement data, calibration data, and point data. These ER schemes contain 17 entity types and relationship types, respectively. Because of space limitations, and since the ER diagrams are self-explanatory, only a few aspects are pointed out in the following.

The entity types DISTANCE-MEASUREMENT, TAPE-METHOD, EDM-METHOD, DISTINVAR-METHOD, and INTERFEROMETER-METHOD describe the distance measurement data. Entities of the latter entity types specialize the distance measurement data by adding the properties of a specific method. A DISTANCE-MEASUREMENT entity describes the method-independent properties. It must be related to exactly one entity of exactly one METHOD entity type. Thus, DM-METHOD represents a generalization
among the generalized DISTANCE-MEASUREMENT entity type and the 4 individual METHOD entity types.

The entity types HEIGHT-MEASUREMENT and READING describe the height measurement data. Since a height measurement consists of several readings, which are existence and identification dependent, a PART-OF relationship type represents these associations.

The entity types TAPE-INSTRUMENT, EDM-INSTRUMENT, DISTINVAR-INSTRUMENT-WIRE, and INTERFEROMETER-INSTRUMENT, as well as ROD-INSTRUMENT, describe the calibration data of the instruments used for distance and height measurements, respectively. These entity types are connected to the entity types describing the measurement data. Notice that the relationship types USED-ROD-1 and USED-ROD-2 are the only relationship types with attributes. The attributes represent the raw and reduced readings on the two scales on each of the two rods used.

Finally, the entity types POINT, SYNONYM, COORDINATE, and COORDINATE-SYSTEM describe the point data. Each point owns several coordinate data sets. Notice that the relationship type SAME-SERIES is the only recursive relationship type. It relates coordinates which result from the same measurement epoch.
Figure 4. ER diagram: distance measurement data.
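Read only as a sketch, the generalization hierarchy of Figure 4 could be mirrored on the application side roughly as follows; the class and attribute names, and the instrument identifier in the example, are assumptions for illustration and not the published schema:

from dataclasses import dataclass
from typing import Optional, Union

# Method-specific entity types (specializations).
@dataclass
class TapeMethod:
    tape_id: str

@dataclass
class EDMMethod:
    edm_instrument_id: str
    temperature_c: Optional[float] = None

@dataclass
class DistinvarMethod:
    wire_id: str

@dataclass
class InterferometerMethod:
    interferometer_id: str

# DM-METHOD generalization: exactly one of the four method types.
DMMethod = Union[TapeMethod, EDMMethod, DistinvarMethod, InterferometerMethod]

@dataclass
class DistanceMeasurement:
    # method-independent properties
    from_point: str
    to_point: str
    raw_distance_m: float
    method: DMMethod      # related to exactly one method entity

m = DistanceMeasurement("P1007", "P2003", 51.2043, EDMMethod("EDM-01"))
print(type(m.method).__name__)   # EDMMethod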
Figure 5. ER diagram: height measurement data.

Figure 6. ER diagram: point data.
4. Database interface

The database interface is based on the used data model and must meet the data access and manipulation requirements of the geodetic tools. GEOMANAGER's interface is a hybrid data interface, combining descriptive and procedural elements.

First of all, the interface supports elementary operations to access sets of entities or relationships of a single type. The entities or relationships must be qualified by their identifiers. Thus, the elementary operations support a procedural interface. But most geodetic tools need access to aggregates of associated entities of several types. Thus, operations for accessing data aggregates must be supported by the interface. These aggregate operations define a descriptive interface. Its design is based on the following properties of geodetic applications. First, for each geodetic tool a set of generic data aggregate types accessed by this tool can be specified. Hence, the set of used data aggregates is pre-known. Second, some geodetic applications do not have any direct access to the database provided by the communication system (e.g., data analysis programs running on mainframe computers). Other existing geodetic tools do not yet support any database interface; they use their own dedicated file structures. Thus, the interface supports a pre-defined set of parametrized access modules for data aggregates.

In a first step, the data aggregate type is specified. If data aggregates are retrieved or modified, their qualification is also given. The specification model for qualification statements is derived from predicate logic extended by concepts for handling hierarchies of object classes. This information is especially used by the transaction management for concurrency control and recovery. The second step depends on the communication mode. If direct access is possible, then the specified data aggregates can be retrieved, modified, or written using elementary operations. Thus, this second step access is a procedural one. If no direct access is possible, the retrieved data aggregates are downloaded from the database in a data stream using a standardized interchange format. The interchange format is derived from the database scheme. If data aggregates are entered, they must be given as a data stream, which is uploaded to the database.
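A minimal sketch of what such a hybrid interface could look like from the tool side (the class, method and field names are assumptions made for this illustration, and the "database" is only an in-memory stand-in): the elementary operations qualify single entities by identifier, while the aggregate operation takes a pre-defined aggregate type plus a qualification and either returns objects directly or serializes them to an interchange stream.

import json

class GeodeticDB:
    """Toy in-memory stand-in illustrating the two access styles."""

    def __init__(self):
        self._points = {}          # point_id -> point record
        self._coords = []          # coordinate records

    # --- elementary (procedural) operations, qualified by identifier ---
    def get_point(self, point_id):
        return self._points[point_id]

    def put_point(self, point_id, record):
        self._points[point_id] = record

    def add_coordinates(self, point_id, epoch, xyz):
        self._coords.append({"point": point_id, "epoch": epoch, "xyz": xyz})

    # --- aggregate (descriptive) operation on a pre-defined aggregate type ---
    def aggregate(self, kind, qualification):
        if kind == "point_with_coordinates":
            pid = qualification["point"]
            return {"point": self._points[pid],
                    "coordinates": [c for c in self._coords if c["point"] == pid]}
        raise ValueError(f"unknown aggregate type: {kind}")

    # --- no direct access: download the aggregate as an interchange stream ---
    def download(self, kind, qualification):
        return json.dumps(self.aggregate(kind, qualification))

db = GeodeticDB()
db.put_point("P1007", {"description": "linac girder fiducial"})
db.add_coordinates("P1007", "1990-05", (10.0, 2.5, 0.8))
print(db.download("point_with_coordinates", {"point": "P1007"}))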
5. Conclusion

In this paper, the need of a data flow integration in geodetic applications is shown. The goals of this paper are:
- to provide some understanding of geodetic activities and of the geodetic data flow
- to evaluate the general potential of geodetic data flow integration
- to introduce a two-level integration model
- to show the problems in applying software tools (i.e., DBMSs) in today's market place to this "non-standard" application

The proposed integration model provides an open system architecture and has two integration levels:
- Communication integration
- Information and data integration

This paper addresses the information and data integration level. The requirements are:
- Access to a huge amount of data by the tools
- Tool migration among various projects
- Open system architecture
- Highly structured data
- Complex consistency constraints

The goals of the information and data integration are to provide:
- Conceptual data centralization
- Data redundancy elimination
- Data sharing
- Data independence
and high level database interfaces, using a database approach.
However, DBMSs are commonly used in commercial applications, and not frequently in "non-standard" applications like engineering and scientific applications. The paper mentioned three problems in using DBMSs in geodetic applications, i.e., in the GEOMANAGER project:
- Data modelling. Entity/Relationship schemes, extended by aggregation and generalization hierarchies, are developed for our sample application.
- Database interface. A hybrid, i.e., procedural and descriptive, database interface is developed for accessing simple entities/relationships as well as complex data aggregates. Furthermore, up- and downloading of data is possible.
References

DRuRRu87 Ruland D, Ruland R. Integrated Database Approach for Geodetic Applications. IV International Working Conference on Statistical and Scientific Data Base Management, Rome, 1988.
Er84 Erickson R, ed. SLC Design Handbook. Stanford Linear Accelerator Center, Stanford University, CA.
FrPuRRu87 Friedsam H, Pushor R, Ruland R. The GEONET Approach - A Realization of an Automated Data Flow for Data Collecting, Processing, Storing and Retrieving. ASPRS-ACSM Fall Convention, Reno, 1987.
OrRRu85 Oren W, Ruland R. Survey Computation Problems Associated with Multi-Planar Electron Positron Colliders. In: Proceedings of the ASPRS-ACSM Convention. Washington, 1985: 338-347.
Pi85 Pietryka M, Friedsam H, Oren W, Pitthan R, Ruland R. The Alignment of Stanford's New Linear Electron Positron Collider. In: Proceedings of the ASPRS-ACSM Convention. Washington, 1985: 321-329.
RRuFr86 Ruland R, Friedsam H. GEONET - A Realization of an Automated Data Flow for Collection, Processing, Storing and Retrieving Geodetic Data for Accelerator Alignment Applications. Invited Paper, XVIII Congress, Federation Internationale de Geodesie, Toronto, 1986.
CHAPTER 38
Automatic Documentation of Graphical Schematics

M. May
Academy of Sciences of the GDR, Central Institute of Cybernetics and Information Processes (ZKI), Kurstraße 33, DDR-1086 Berlin, GDR
Abstract

Graphical documentation is still one of the less supported, time consuming and error-prone engineering activities, even in an era of sophisticated CAE/CAD systems and tools. Among graphical documents, schematic diagrams represent a specific class of 2D drawings. They are characterized mainly by structural information, i.e., by graphical (macro-) symbols and their interconnection lines. Schematic drawings are not true-to-scale graphics and may thus be derived automatically from a description containing only simple structural information. The resulting layout problem for graphical schematics applies to various branches of design automation. The entire and very difficult layout process may be decomposed into scheme partitioning, placement of the graphical symbols, and routing of interconnection lines. There is no unique layout algorithm for this problem. We found that different types of diagrams are to be classified and generated by different techniques.

These new layout methods have been proven to be very efficient in various engineering applications such as system design and graphical documentation, e.g., in automation, electronics, technology, and software engineering. Schematics that have been generated by our Computer Aided Schematics (CAS) approach are electric and wiring diagrams, logic schematics, flow charts and technological schematics.
1. Introduction

When designing a complex technical system or process the engineer or designer starts by fixing its general structure and main functions. No matter whether structural, functional or implementational design is considered, he or she has to think over system components, subfunctions, and procedures and how they interconnect and interact.
In most of the technical disciplines, such as electronics and automation or process and software engineering, schematic diagrams are used to depict these difficult interrelations. Typical schematics from those application fields are electric and wiring diagrams, logic and technological schematics, block diagrams, control and programming flow charts, graphs and networks.

It is widely accepted that the manual preparation and updating of diagrams is a time consuming, error-prone, costly, and not very creative job. This makes schematics generation a task to be intensively assisted by the computer. However, despite some promising results, mainly in electric and logic diagram drawing, the automatic generation of general schematics did not receive much attention in the CAD/CAE community. To cope with the inherent layout problems a more general view and unifying approach is necessary.
2. Schematics structure

Schematics are not true-to-scale graphics. Essentially, they are characterized by graphical symbols, by interconnections (lines) between input and output connectors (pins) of these symbols, and possibly some lettering. The detailed arrangement of these graphical constituents on the layout area (display or sheet of paper) is of secondary importance. Consequently, schematics are mainly determined by their structure, which can be described in different ways. Most frequently, an interconnection list is used, specifying the pins to be connected and their relative positions on the symbols they belong to. Figure 1 shows two possible structure descriptions for schematics and a corresponding graphical representation. Future efforts are expected to concentrate on standardized structural schematics interfaces. Additionally, there are user-oriented structural description languages supporting also hierarchical design [1]. It should be noted that structural schematics descriptions need not necessarily be the result of manual input but may result from CAD preprocesses.
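To make the idea of a structural description concrete, a small interconnection list of the kind sketched in Figure 1 could be held in a structure like the following (the symbol, pin and net names are invented for this example and are not a standardized interface):

# Symbols with their pins (relative positions on the symbol boundary)
# and nets as lists of (symbol, pin) pairs to be connected.
symbols = {
    "S1": {"pins": {"P1": (0, 1), "P2": (2, 0)}, "size": (2, 2)},
    "S2": {"pins": {"P1": (0, 1)},               "size": (2, 2)},
    "S3": {"pins": {"P1": (0, 0), "P2": (0, 2)}, "size": (2, 3)},
}

nets = {
    "V1": {"type": "T2", "pins": [("S1", "P2"), ("S3", "P1")]},
    "V2": {"type": "T1", "pins": [("S2", "P1"), ("S3", "P2")]},
}

# Such a list may be entered by hand, produced by a description
# language, or generated by a preceding CAD step.
for name, net in nets.items():
    print(name, "connects", net["pins"])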
3. The layout problem

The major objective in the automatic generation (layout) of schematic diagrams is to produce correct, readable, and easily comprehensible graphics while taking into account the conditions imposed by standards and the aesthetic viewpoint. These objectives are reflected by the layout requirements, such as grouping together strongly interconnected symbols and subschemes, maintenance of the main signal or information flow, small interconnection length, few intersections and bends in the line routing, and uniform utilization of the layout area.
Figure 1. From structure description to graphical representation: two possible structure descriptions for graphical schematics (S = symbol, P = interconnection point (pin), V = interconnection net, T = interconnection (line) type).
Formalizing the general layout problem for schematics is a very complex task [2]. So, we give only a few remarks on those parts of the model that have to be specified. Without loss of generality the symbols are considered to have a rectangular boundary where the pins are located on integer coordinates of a supposed grid. Nets are subsets of pins to be connected by certain line structures. In general these structures are trees. The layout area is represented by a section of the Euclidean plane. It is typical for schematics to embed the interconnection lines into a rectangular grid, which is called rectilinear routing. Furthermore, besides some application-specific layout rules there exist some general ones. Symbols and lines must not overlap. Orthogonal line intersections are generally allowed. Complex schematics have to be divided into several sheets of a given format. Intersheet references ought to be made by connectors. By the layout rules an admissible layout is defined. The most difficult modelling part is the selection of an appropriate goal function for layout optimization [2]. Usually, a function of the symbol (pin) positions and line tracing has to be minimized. Because of its inherent complexity this global optimization problem must be simplified in two ways.
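As a rough illustration of such a goal function (a deliberate simplification, not the model of [2]), a candidate layout could be scored by a weighted sum of an estimated total rectilinear net length and the number of bends produced by routing; the weights below are arbitrary:

def net_length(pins):
    """Half-perimeter (bounding box) estimate of the rectilinear length
    of one net, given absolute pin coordinates."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def layout_cost(nets, bends, alpha=1.0, beta=2.0):
    """Weighted sum of estimated wire length and routing bends."""
    return alpha * sum(net_length(p) for p in nets) + beta * bends

# two nets given by their pin coordinates, plus a bend count from routing
nets = [[(0, 0), (3, 2)], [(1, 1), (1, 4), (2, 4)]]
print(layout_cost(nets, bends=3))   # 9 length units + 6 -> 15.0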
Figure 2. A grid schematic.
At first, schematics must be classified into groups with similar characteristics, such that within one group the same layout techniques can be applied. Secondly, it is indispensable to decompose the general layout problem into easier subproblems that can be treated sequentially. Generally, these subproblems are: decomposition of complex schematics, symbol placement, and line routing.
4. Types of graphical schematics

Looking at the variety of schematic diagrams we can distinguish between three classes, each representing similar layout characteristics [2].
4.1 Grid schematics

They are characterized by graphical symbols of nearly equal size, where the symbols Ki are represented by their enclosing rectangle. Hence, these symbols are arranged on grid points of an (equidistant) rectangular grid (matrix), where each symbol is assigned to exactly one grid point (matrix element). The space on the layout area not occupied by symbols is available for line routing. A sample grid schematic is illustrated in Figure 2. The main criteria for laying out a grid schematic are interconnection length and bend (corner) minimization. Typical representatives of grid schematics are:
- block diagrams
- simple logic diagrams
- hydraulic schematics
- technological diagrams
- graphs and networks.

In principle, any diagram can be considered a grid schematic if each grid point represents an area bigger than the size of the biggest symbol. However, if symbols differ considerably in size, this model results in a very inefficient space utilization. Sometimes the grid model may be too restrictive even for equal-sized symbols, since a certain degree of freedom in symbol displacement often tends to reduce line corners and thus to improve the readability of a graphic.

Figure 3. A row schematic.
4.2 Row schematics

Here the symbols are arranged on consecutive parallel rows [2, 3] of a certain width. We restrict our consideration to vertical rows. The row width depends on the symbol length. Symbols are supposed to have similar length but arbitrary height. Usually, row schematics are signal-flow-oriented representations, i.e., there exists a preferred direction (from left to right) along which the signal flow is to be watched. Beside aesthetic line routing, the main layout objective for row schematics is to maintain this signal flow, resulting in a reduction of line crossings and feedbacks. Between two symbol rows there is always another row (channel) left for embedding the interconnection lines. This special topology allows usage of very efficient (channel)
routing procedures. Examples of row schematics are:
- (arbitrary) logic diagrams
- programming and control flow charts
- Petri nets
- signal flow graphs
- relay ladder diagrams
- state-transition diagrams.

Figure 4. A free schematic.
4.3 Free schematics

This is the most general class of diagrams. Symbols are allowed to take any size. There is no restriction of the placement area, i.e., the free plane is available for symbol arrangement and line tracing. This requires considerable effort in designing acceptable layout procedures. A typical free schematic is depicted in Figure 4. As for grid schematics, the major layout objective is aesthetic and complete line routing as well as uniform utilization of the layout area. Representatives of this class are:
- electric circuit schematics
- wiring diagrams
- general block diagrams
- entity-relationship diagrams
- engineering schematics.
5. Automatic scheme generation

Generally, the entire generation cycle of schematic diagrams is separated into three layout steps:
1) scheme decomposition
2) symbol placement
3) line routing.
5.1 Scheme decomposition

Decomposition means deciding which symbols of a schematic have to be assigned to one sheet and determining intersheet connections. We distinguish between a priori and a posteriori decomposition.

In the a priori approach, also called partitioning, symbols have to be decomposed into groups prior to symbol placement and line routing. The objective is to group those symbols on a sheet that belong strongly together, such that the number of intersheet connectors is minimized. Before partitioning, the diagram has to be replaced by an appropriate graph model (e.g., weighted star, clique or hypergraph). After this transformation the decomposition is obtained by size-constrained clustering [2] similar to IC partitioning [4].

Schematics do not usually exceed the size of several hundred symbols. Efficient placement (and routing) procedures often manage to handle the entire diagram without a priori decomposition. In this case the diagram can be generated without taking into consideration the format of the output sheets. Then an a posteriori decomposition is to tear the overall picture into several subgraphics of sheet size. This may include a slight (local) displacement of those symbols that are cut by sheet boundaries. If routing on the overall schematic is too expensive, a similar a posteriori decomposition can be performed immediately after symbol placement. In this way the entire routing problem reduces to routing on single sheets. A posteriori decomposition can be applied to all types of diagrams, but it is especially suited to grid and row schematics [2, 5].
5.2 Symbol placement

Placement is the most critical part in schematics layout, differing much from the placement problems appearing in PCB and IC design [4]. The objective is not a very compact layout but a placement that allows complete and aesthetically pleasing line routing.

For grid schematics, placement can be transformed to the standard (NP-hard) Quadratic Assignment Problem, for which heuristic solution techniques are well known. Placement for row schematics is usually divided into three steps: assigning symbols to rows, ordering the symbols within their rows, and detailed placement [3, 6]. The first step
leads to a modified version of the Feedback Arc Set Problem [2] and the second one to the Crossing Number Problem in multi-partite graphs [7]. Detailed placement is obtained by local displacement operations so as to maximize the number of straight line segments in routing. An alternative approach to row placement consists in determining and joining horizontal signal chains of symbols, which results in solving a modified Optimal Linear Arrangement Problem.

The most difficult placement problem is that for free schematics. It is not yet deeply investigated. It is related to building block placement and floorplan design in circuit layout [4, 5]. Hence, force-directed and min-cut algorithms could be adapted to symbol placement [2]. However, in this approach it is hardly possible to take into account conditions imposed by routing requirements. Nevertheless, this technique can be applied to obtain an admissible symbol arrangement with acceptable global characteristics. Local properties can be considered more appropriately in a sequential placement algorithm, where in one step exactly one symbol is to be placed on the sheet in such a position that allows easy line routing. Here, very often global effects are neglected when local decisions are taken. Consequently, we suggest a combination of both views, resulting in a hierarchical (bottom-up) placement for free schematics. In this way, subschematics are built and merged step by step until the overall diagram is generated. Using this method the routing could be performed in a similar hierarchical way on each subschematic.
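A minimal sketch of one common improvement heuristic for the quadratic assignment formulation of grid placement, greedy pairwise exchange, is given below; it is only an illustration of the principle and not the placement algorithm of [2]:

import itertools

def wire_length(assign, nets):
    """Total half-perimeter length of all nets; assign maps a symbol
    name to its (row, col) grid point."""
    total = 0
    for net in nets:
        rows = [assign[s][0] for s in net]
        cols = [assign[s][1] for s in net]
        total += (max(rows) - min(rows)) + (max(cols) - min(cols))
    return total

def improve_by_swaps(assign, nets):
    """Swap the grid positions of two symbols whenever this lowers the
    total estimated wire length, until no improving swap remains."""
    best = wire_length(assign, nets)
    improved = True
    while improved:
        improved = False
        for a, b in itertools.combinations(list(assign), 2):
            assign[a], assign[b] = assign[b], assign[a]
            cost = wire_length(assign, nets)
            if cost < best:
                best, improved = cost, True
            else:
                assign[a], assign[b] = assign[b], assign[a]  # undo the swap
    return assign

placement = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}
nets = [["S1", "S4"], ["S2", "S3"], ["S1", "S2"]]
improve_by_swaps(placement, nets)
print(placement, wire_length(placement, nets))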
5.3 Line routing

Routing means the automatic generation of interconnection lines between the symbols of a schematic. Since lines on most diagrams are made of orthogonal segments, we restrict our consideration to rectilinear routing. The objective of routing is to embed each interconnection net under certain restrictions in the routing area (rectangular grid). An embedding of a net is a rectilinear Steiner tree with its leaves just being the net pins. Trees with a minimum total length and/or number of bends are to be found.

Whereas different placement techniques are necessary for the different classes of diagrams, routing can be treated in a universal approach. However, this does not mean that certain routing problems could not be solved more efficiently by specific techniques.

The most flexible (universal) strategy for routing on general schematics is to connect net by net and to reduce the Steiner tree layout to sequential routing of two-point structures using path finding algorithms [4, 5]. Its most popular representative is the Lee algorithm [8], operating on a matrix where each matrix element (cell) corresponds to exactly one grid point. It is a breadth-first search algorithm, sometimes called the wave propagation method. This algorithm operates
on a very general class of monotone path cost functions and may be adapted to the specific needs of schematics routing [2]. Based on this principle a universal line router CARO (Computer Aided Routing) [9] has been developed and applied to many different diagram and document types. Emphasis has been put on performance rate and speed by dynamic sorting of pins and nets, an anti-blocking technique, directed target search, and a multi-level routing hierarchy. Figure 5 shows a section of an electric diagram automatically routed by CARO. Due to the complexity of this routing task a two-level hierarchy was used.

Figure 5. A CARO routing result.

The other major strategy for generating two-point connections is the so-called line search method [4, 10]. Unlike the Lee algorithm, in which a path is represented by a sequence of grid points, the line search algorithms search a path as a sequence of line segments. Starting simultaneously from the source and target point, horizontal and vertical lines are expanded until they hit an obstacle or the routing boundary. From these lines again perpendicular extension lines are constructed, etc., until a polyline originating from the source intersects one from the target. This line search strategy was used in [11].

The third strategy for two-point routing is to exploit the principle of pattern routing, i.e., to find simple-shaped interconnection paths efficiently (e.g., straight lines, corners, u- and z-shaped paths). Pattern routing is especially suited to schematics, but the number of topological routing patterns grows exponentially with the number of line segments. Thus, pattern routing is applicable either for rather simple diagrams or may be used as an initial step in complex routing procedures [11, 12].

Beside these sequential (net by net) routing algorithms for general diagrams there exist highly efficient (semi-parallel) channel routing techniques [4], especially for diagrams with a rather regular structure. A channel is a rectangular section of the routing area with pins only on two opposite sides. For grid and row schematics the decomposition of the entire routing area into channels is obtained in a natural way. The assignment of interconnections to channels, called global routing, can easily be adopted from the IC global routing methods [4]. During channel routing a subset of nets is embedded simultaneously with a minimum number of bends. Furthermore, channel routing on schematics is to aim at a minimum number of line crossings rather than at conventional minimum channel width [2, 5].
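A bare-bones sketch of the breadth-first (wave propagation) principle behind the Lee algorithm on a cell matrix is shown below; real schematic routers add monotone cost functions, bend penalties and anti-blocking control as described above, none of which is modelled here.

from collections import deque

def lee_route(grid, source, target):
    """Breadth-first search from source to target on a 0/1 cell matrix
    (1 = blocked cell); returns one shortest rectilinear path or None."""
    rows, cols = len(grid), len(grid[0])
    prev = {source: None}
    frontier = deque([source])
    while frontier:
        cell = frontier.popleft()
        if cell == target:
            path = []
            while cell is not None:          # backtrace phase
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None

grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]
print(lee_route(grid, (2, 0), (2, 3)))   # path around the blocked cells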
6. Application fields

The first indications of using layout techniques in computer aided schematics design can be found in the early sixties. Lee [8] presented a small electric diagram automatically routed by his well-known breadth-first search path finding algorithm. We also find in the sixties the first efforts to use line generation [10, 13-17], symbol placement [18, 19] and partitioning [20] in the documentation of logic schematics. With the rapidly growing complexity of electronic components in the seventies, these first experiences were extended to integrated circuit design. In producing electric diagrams, refined layout techniques were developed, resulting in more aesthetic and readable drawings, for example diagram-specific line routing by recognizing simple line patterns (pattern routing).

So far the main application field of schematics layout has been the automatic generation and documentation of logic and electric diagrams. This is due to the pioneering role of microelectronics in the CAD/CAE/CAM era. However, CAS techniques are on their way to penetrating into new engineering disciplines. Several efforts are being made to document automation and control systems automatically [2, 21, 22]. Promising results have been obtained in graphical programming and documentation of programmable controllers by logic diagrams and control flow charts [21, 23]. Another field of interest in CAS is design specification and documentation by schematics, such as programming flow charts [24], data flow diagrams [25], PERT networks [26], Petri nets [6], and entity-relationship diagrams [27]. First experiences in automatic documentation of workshop drawings by applying channel routing to the generation of dimension lines [28] have been made.
Figure 6. An automatically generated sketch of a logic diagram.
Furthermore, in engineering as well as in scientific research we encounter the problem of embedding a graph or network in the plane. Again, schematics layout techniques provide an appropriate graphical representation. Finally, single CAS components such as a line router can be used as additional and efficient drawing tools in any graphics system [2].
7. Results

Our first application of schematics layout was the automatic documentation and re-documentation of programmable controllers by logic diagrams [21, 29]. Logic diagrams belong to the class of row schematics. Here new graph-theoretic placement and channel routing methods were used to derive the correct graphics from the control programs describing the schematics structure. Figure 6 illustrates a sketch of an automatically generated logic diagram.
Figure 7. An automatically documented backboard wiring diagram.
Figure 8. An automatically routed electric diagram.
Based on the general routing system CARO a number of different application packages for design and documentation purposes was developed. In Figure 7 we have the documentation result of a backboard wiring diagram. This graphic was derived automatically from structural information supplied by the preceding hardware design process. Furthermore, in Figure 8 an electric diagram is presented which was routed by CARO.

Based on CARO the system FLOWCAD [23] for efficient graphical programming and documentation of programmable controllers by so-called control flow charts was
developed. Here interactive and/or automatic symbol placement, automatic routing of directed interconnections, and text handling were combined. Figure 9 shows a typical control flow chart generated by FLOWCAD.

Figure 9. A control flow chart generated by FLOWCAD.

The application results presented show that general layout tools are in a position to generate a wide range of schematic documentation very efficiently. In our examples the efficiency in schematics documentation compared to conventional drawing methods increased by about 300% to 6,000%.
8. Conclusion

The promising results mentioned should make computer aided schematics a field of more intensive research, including experts from different engineering and scientific disciplines. Future CAS developments will comprise new mathematical models and methods, rule-based techniques, generative computer graphics, flexible schematics description
techniques, interfacing and the integration of postprocessing, such as list generation, simulation and manufacturing. In this way CAS is going to become a standard tool in many design and documentation processes. A so-called CAS system comprising all of these aspects is under development at the ZKI. Among these activities we focus our attention on new layout techniques (e.g., bus routing, free placement, partitioning, hierarchical design) as well as new application fields.
References

1. Plessow M, Simeonov P. Netlike schematics and their structure description. Proc VII Bilateral Workshop GDR-Italy on Informatics in Industrial Automation, Berlin, 31 Oct.-4 Nov. 1989: 144-163.
2. May M. On the layout of netlike schematics. (In German) Doctoral thesis (B). Berlin: Academy of Sciences, 1989.
3. May M. Computer-generated multi-row schematics. Comp-Aided Design 1985; 17(1): 25-29.
4. Ohtsuki T, ed. Layout design and verification. Amsterdam-New York-Oxford-Tokyo: North-Holland, 1986.
5. May M, Nehrlich W, Weese M, eds. Layout design - mathematical problems and procedures. (In German) Berlin: ZKI/AdW, 1988.
6. Rouzeyre B, Alali R. Automatic generation of logic schemata. Proc COMPINT'85 Conf 1985: 414-420.
7. May M, Szkatula K. On the bipartite crossing number. Control and Cybernetics 1988; 17(1): 85-98.
8. Lee CY. An algorithm for path connection and its application. IRE Trans on Electr Comp 1961; EC-10: 346-365.
9. May M, Doering S, Kluge S, Thiede F, Vigerske W. Automatic line routing on system documentations. ZKI-Report 82-1/90, Berlin, 1990.
10. Hightower DW. A solution to line-routing problems on the continuous plane. Proc 6th Design Autom Workshop 1969: 1-24.
11. Venkataraman VV, Wilcox CD. GEMS: an automatic layout tool for MIMOLA schematics. Proc 23rd Design Autom Conf 1986: 131-137.
12. Brennan RJ. An algorithm for automatic line routing on schematic drawings. Proc 12th Design Autom Conf 1975: 324-330.
13. Warburton CR. Automation of logic page printing. IBM Data Systems Division Techn Report No 00,720, 1961.
14. Dehaan WR. The Bell Telephone Laboratories automatic graphic schematic drawing program. Proc 3rd Design Autom Workshop 1966: 1-25.
15. Friedman TD. Alert: a program to produce logic designs from preliminary machine descriptions. IBM Research Report RC-1578, 1966.
16. Balducci EG. Automated logic implementation. Proc 23rd Nat Conf of the ACM 1968: 223-240.
17. Wise DK. LIDO - an integrated system for computer layout and documentation of digital electronics. Proc Int Conf on Comp Aided Design 1969: 72-81.
18. Rocket FA. A systematic method for computer simplification of logic diagrams. IRE Int Convention Record 1961; Part 2: 217-223.
19. Kalish HM. Machine aided preparation of electrical diagrams. Bell Lab Record 1963; 41(9): 338-345.
20. Roth JP. Systematic design of automata. Prep of the Fall Joint Computer Conf 1965; 27(1): 1093-1100.
21. May M. CAS approach to graphical programming and documentation of programmable controllers. Prep 4th IFAC Symp on Comp-Aided Design in Control Systems, Beijing, 23-25 Aug. 1988: 262-268.
22. Barker HA, Chen M, Townsend. Algorithms for transformations between block diagrams and signal flow graphs. Proc 4th IFAC Symp on Comp-Aided Design in Control Systems, Beijing, 23-25 Aug. 1988: 231-236.
23. May M, Thiede F. Rechnergestuetzter Entwurf und Dokumentation von SPS mittels Steuerungsablaufplaenen. Tagungsbeitr Rechnergest Entwurf binaerer Steuerungen, Dresden, 15. Mai 1990: 25-26.
24. Yamada A, Kawaguchi A, Takahashi K, Kato S. Microprogramming design support system. Proc 11th Design Autom Conf 1974: 137-142.
25. Batini C, Nardelli E, Tamassia R. A layout algorithm for data-flow diagrams. IEEE Trans Softw Eng 1986; SE-12(4): 538-546.
26. Sandanadurai R. Private communication. Dec. 1984.
27. Batini C, Talamo M, Tamassia R. Computer aided layout of entity-relationship diagrams. J Syst Softw 1984; 4: 163-173.
28. Iwainsky A, Kaiser D, May M. Computer graphics and layout design in documentation processes. To appear in: Computers and Graphics 1990; 14(3).
29. May M, Mennecke P. Layout of schematic drawings. Syst Anal Model Simul 1984; 1(4): 307-338.
Tools for Spectroscopy
CHAPTER 39
Developments in Scientific Data Transfer

A.N. Davies, H. Hillig, and M. Linscheid
ISAS, Institut für Spektrochemie und angewandte Spektroskopie, D-4600 Dortmund 1, FRG
Abstract

The development and publication of the JCAMP-DX standard transfer format for infrared spectra has opened up a new era of format standardization in spectroscopy. The acceptance of the standard by IR instrument manufacturers has ensured the broad implementation of the standard and has opened the way for a multitude of data exchange and comparison possibilities previously limited to the mass spectroscopists with their EPA format. The simplicity and success of this standard has brought about the call for, and development of, JCAMP-DX-like standard formats for structure information, NMR spectra, UV/Vis spectra, mass spectra and crystallographic information, so that data exchange between scientists can now be a simple matter of decoding ASCII files with standard software. In this paper we present some of our work as the Data Standards Test Center for the German Unified Spectroscopic Database Project, where we look into the current state of the implemented software handling these transfer standards and at future developments. Some of the benefits of this standardization will also be discussed.
1. Background

The German government initiative "Informationssystem-Spektroskopie" has been running for several years with the aim of producing high quality spectroscopic databases and multi-spectroscopic software packages to make these data available. The Institut für Spektrochemie und angewandte Spektroskopie in Dortmund, FRG, has taken on a number of roles within this project:
i. Software development for spectra quality control, valuation and exchange.
ii. Spectral data quality evaluation in the fields of NMR and infrared spectroscopy.
iii. Collaboration in the development of an X-windows version of the software package "SpecInfo".
iv. Coordination of University projects with Chemical Concepts GmbH.
v. Development of sections of the “SpecInfo” package concentrating on the infrared components and advising on algorithm quality.

Figure 1. Severe problems exist in the field of scientific data transfer between scientists with different operating systems and data stations.

We are, however, primarily responsible for spectra collection and evaluation in the fields of infrared and NMR spectroscopy. Through the work carried out in this area the project team at ISAS have become all too acutely aware of the problems currently prevalent in the field of scientific data transfer. Following a ‘call for spectra’ in the forerunner project “Spektrendatenbanken-Verbundsystem”, scientists ran into difficulties when their request for data in any format on any media resulted in such a diverse collection of magnetic tapes and Winchester discs landing at the collection institute that significant amounts of submitted data were not readable and ended up being returned to the submitting organizations. When ISAS took over the task of coordinating the collection and distribution of spectroscopic and other data it was soon obvious that a significant amount of work was needed towards standardizing transfer formats.
2. Some problems and solutions
This diversity of internal formats has led to the somewhat sad situation where often two scientists within the same company or research institute find themselves unable to communicate with one another.
Figure 2. Some organizations have introduced ‘Island’ solutions to overcome the data transfer problems internally but this doesn’t help external data transfer and the solution requires constant software maintenance.
The incompatibility of the internal data storage formats implemented by different manufacturers is like a wall between scientists and prohibits the free flow of information (Fig. 1). One possible solution is of course to purchase only equipment from a single manufacturer, but this is rarely a viable option as no manufacturer can possibly provide for all the needs of a large organization, and the dangers involved in becoming dependent on a single particular supplier of equipment are all too obvious. Many organizations have solved this problem by introducing their own organization-wide data storage format and writing their own software to convert the manufacturers' formats present within the organization into some other format. ‘Island’ solutions make life somewhat easier for those on the ‘Islands’ but require constant software maintenance. These solutions, however, present the same difficulties when contact with the outside world is required, as in our project, where the same lack of conformity problem exists (Fig. 2). Fortunately, there are now standards for data transfer being developed and implemented which should make life much easier. In 1988 McDonald and Wilks published the specifications for a data transfer standard for infrared spectra and interferograms [1]. This format was developed under the Joint Committee on Atomic and Molecular Physical Data (JCAMP) and was given the name JCAMP-DX. The implementation of this standard by infrared equipment manufacturers and software houses now provides an alternative to the ‘single supplier’ solution, allowing transfer of spectra between infrared systems regardless of internal system format (Figs. 3 and 4).
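To make the idea of a labelled ASCII transfer file concrete, the short Python sketch below writes a minimal JCAMP-DX-style infrared file. It is only an illustration: the label set follows the published specification [1] as commonly quoted, the spectrum values are invented, and no claim is made that the output satisfies every requirement of the standard.

# Minimal sketch of writing a JCAMP-DX-like infrared file (illustrative only).
y = [98.2, 97.5, 60.1, 35.4, 88.0, 99.1]          # percent transmittance, invented values
first_x, last_x = 4000.0, 3750.0                  # wavenumbers in 1/cm, invented values

lines = [
    "##TITLE= Example spectrum (illustrative)",
    "##JCAMP-DX= 4.24",
    "##DATA TYPE= INFRARED SPECTRUM",
    "##XUNITS= 1/CM",
    "##YUNITS= TRANSMITTANCE",
    "##FIRSTX= %.1f" % first_x,
    "##LASTX= %.1f" % last_x,
    "##NPOINTS= %d" % len(y),
    "##XFACTOR= 1.0",
    "##YFACTOR= 1.0",
    "##XYDATA= (X++(Y..Y))",
]
dx = (last_x - first_x) / (len(y) - 1)
# one data line per point here for clarity; real files pack several Y values per line
for i, value in enumerate(y):
    lines.append("%.1f %.1f" % (first_x + i * dx, value))
lines.append("##END=")

with open("example.dx", "w") as handle:
    handle.write("\n".join(lines) + "\n")

Because the result is plain ASCII, any system that can read a text file can at least recover the numbers, which is exactly the property that makes the format attractive as a transfer medium.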
Figure 3. The introduction of data transfer standards solves the problem of transfer of data between different operating systems.
Figure 4. The organization type ‘Island’ solutions can also communicate with one another by using a standard transfer format.
Figure 5. Two infrared spectral curves showing the problem of wrongly implemented software.
3. Some problems with the solution!
Unfortunately the JCAMP organization did not watch over the implementation of the standard amongst manufacturers, and this has led to a rather piecemeal and sometimes catastrophic array of so-called JCAMP-DX compatible software. The errors that have come to our attention were detailed recently [2] and include such unexpected problems as the failure of infrared manufacturers to tell the difference between Transmittance and Transmission. A more subtle error was the failure of a major manufacturer's software to convert the fractional laser wavelength x-axis increment of another manufacturer into the regular x-axis required by their own internal format. The rounding error produced resulted in a severe shift in the infrared band position at low wavenumbers (Fig. 5). Recent developments in the right direction have been the interest taken in the development and maintenance of JCAMP-DX standards for specific techniques in the sphere of ASTM E49 (Computerization of Materials Property Data) [3].
4. The demand for more standards
The usefulness of the JCAMP-DX standard has been clearly shown by the free adoption of the software written for infrared spectrometers by manufacturers in other fields of spectroscopy.
Figure 6. NMR file sizes relative to the original binary file showing the reduction in size when good quality information is coded in the JCAMP-DX DIFDUP format.
That JCAMP has not succeeded in publishing standards in other fields two years after the first publication has meant that UV/Vis and NMR spectra are currently being exchanged with the JCAMP-DX label ##DATATYPE=Infrared Spectrum so that the available software will handle the data! This is obviously not desirable, and this year has seen much activity in the field of new JCAMP-DX compatible standard formats. The first to be published will be JCAMP-CS, a standard format for the exchange of chemical structures [4]. Several drafts currently exist for a JCAMP-DX standard in NMR spectroscopy [5, 6], which essentially differ only in the proposed method of storing the data curves themselves. The American proposal wishes to adopt a binary data storage format, while the European proposal retains the ASCII option to allow the data and header information to remain as one file and, more importantly, to avoid the enormous problems associated with transferring binary files between computers with different operating systems and internal word lengths. The reason behind the pressure to allow binary storage in a transport format is one of file size. It is a generally held belief that a spectrum or FID coded as an ASCII file would be far bigger than the same spectrum or FID coded in binary. To probe this assumption several format tests were carried out and the preliminary results are given below. Three NMR data files were taken and coded into several ASCII formats allowed by the JCAMP-DX specifications for infrared spectra and a new format we have developed for NMR spectra.
Figure 7. The reduction in transfer time is even more drastic when binary and JCAMP-DX DIFDUP files are compared, due to the necessity of inserting control codes into binary files for transfer purposes. Transfers were between a 20 MHz AT-386 compatible personal computer running Kermit-MS Version 2.32/A, 21 Jan. 1989, and a DEC GPX running Vax Kermit-32, over an RS-232C line at 9600 baud.
The three files were:
1. TESTFID, a noisy high resolution FID (noisy data sets cause size problems with JCAMP-DX ASDF data compression formats due to the lack of correlation between neighboring data points).
2. TESTSPEC, the transformed version of TESTFID.
3. TESTFID2, an FID with a good signal to noise ratio.
All three data sets consisted of alternating real and imaginary data points and only the Y-values are present in the original data set. The data were first converted to a fixed integer format and then compressed to remove superfluous blanks from the data file. The data were then converted to the standard JCAMP-DX DIFDUP format [1]. Normal FID data sets actually contain paired real and imaginary points, so there is no direct correlation between the value of a data point and its nearest neighbor in the data file, an assumption made in JCAMP-DX DIFDUP coding. As far as the coding is concerned, this lack of correlation has a similar effect to large random fast noise signals, increasing the file storage size significantly. To introduce some degree of correlation between neighboring points, the two data sets in each file were then separated into two files and coded independently in the JCAMP-DX DIFDUP format. Finally, a new idea on FID coding was tested where the two curves were DIFDUP encoded, applying the algorithm to each curve independently but leaving the pair-wise point storage. This format was given the new ASDF code (X++(RI..RI)) (see [1]).
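The size effects discussed in this section can be mimicked with a much simplified difference/run-length coder. The Python sketch below is not the actual JCAMP-DX ASDF pseudo-digit scheme; it only illustrates why smooth, correlated data compress well and why uncorrelated points (noise, or interleaved real and imaginary values) do not.

import random

def diff_runlength(values):
    # simplified stand-in for DIFDUP: first value, then successive differences,
    # with runs of identical differences collapsed to "difference*count"
    diffs = [b - a for a, b in zip(values, values[1:])]
    tokens = [str(values[0])]
    i = 0
    while i < len(diffs):
        j = i
        while j + 1 < len(diffs) and diffs[j + 1] == diffs[i]:
            j += 1
        count = j - i + 1
        tokens.append(("%d*%d" % (diffs[i], count)) if count > 1 else str(diffs[i]))
        i = j + 1
    return " ".join(tokens)

random.seed(0)
smooth = [round(1000 * (1 - abs(i - 50) / 50)) for i in range(101)]   # well-behaved peak
noisy = [random.randint(-1000, 1000) for _ in range(101)]             # FID-like noise
print(len(diff_runlength(smooth)), "characters for the smooth curve")
print(len(diff_runlength(noisy)), "characters for the noisy curve")

The smooth curve collapses to a handful of tokens, while the noisy curve does not compress at all, which is the behaviour described above for FIDs with interleaved real and imaginary points.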
The preliminary results were surprising and encouraging (Fig. 6). The worst case scenario showed only a doubling of the storage space required for the ASCII DIFDUP files, and a reduction in the storage space for the ASCII format over that of the binary file by more than a factor of 2 was obtained for the good quality FID! This would seem to invalidate the argument that binary files are necessary because of the excessive storage requirement for ASCII files, and leave only the disadvantage of a non-transportable transport format if binary is adopted for data storage. For a transport format the network transport times are often more important than the actual file size, and here an even bigger advantage is shown by the good TESTFID2 DIFDUP file over the original binary file, as the transfer program needs to insert control codes into the binary file to facilitate transfer (Fig. 7).
5. Other new standards
Several other JCAMP compatible formats are also currently under discussion, including a proposal from the American Society of Mass Spectroscopists (ASMS) for a Mass Spectra Standard [7], and a specification for X-Ray diffractograms from the International Center for Diffraction Data [8]. Anyone interested in contributing to the development of these standards should contact the author for further information.
6. Standards for multidimensional experiments
These standards are excellent for single dimensional data, but the expansion of multidimensional experiments has revealed a weakness in the original JCAMP-DX concept. This can best be explained if we take the guidelines of the Coblentz Society for infrared reference spectra for GC-IR [9]. The spectral evaluations committee of the Coblentz Society names 34 mandatory labels and an additional 5 desired labels for each IR reference spectrum. Taking this as the norm for good data content, at least 39 lines of text should be added to each spectral curve. For a typical GC-IR experiment very little information actually changes between subsequent spectra except the retention time of the measurement and perhaps the oven temperature, but as that is programmed the information could be detailed at the beginning of the experiment anyway. This means that if cross-referencing between spectra were possible, each spectrum in a GC-IR experiment following the first should be codeable with only a two line header instead of 39, the two lines referencing the initial spectrum header file and the time of measurement or retention time. This type of block structure has been published in the Standard Molecular Data (SMD) Format developed by the European chemical and pharmaceutical companies [10]. Here complex files containing Scopes, Sections, Blocks, and Subblocks, all inter-referable, are defined, and a block structuring format of this nature is required within JCAMP-DX.
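As a sketch of the kind of cross-referencing meant here: a full header is written once, and each subsequent spectrum carries only a reference to it plus its retention time. The two label names used below are invented for illustration and are not part of any published JCAMP-DX draft.

# Illustrative only: the two cross-referencing labels are invented, not standardized.
master_header = [
    "##TITLE= GC-IR run, master header (illustrative)",
    "##DATA TYPE= INFRARED SPECTRUM",
    # ... the remaining 30+ mandatory and desired Coblentz labels would go here ...
]

def per_spectrum_block(retention_time_s, y_values):
    block = [
        "##$REFERENCE HEADER= master header",           # invented label
        "##$RETENTION TIME= %.1f" % retention_time_s,   # invented label
    ]
    block += ["%.4f" % y for y in y_values]             # the data table would follow here
    return block

print("\n".join(per_spectrum_block(432.5, [0.01, 0.12, 0.07])))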
7. Conclusion
It can be seen from the flowering of new standards that great interest exists in the standardization of data transfer. The problems with the early JCAMP-DX implementations have shown the need for a watchdog organization for these standards, and with the involvement of ASTM and hopefully other standards organizations the future of improved data transfer looks good.
References
1. McDonald RS, Wilks PA. JCAMP-DX: A Standard Form for Exchange of Infrared Spectra in Computer Readable Form. Applied Spectroscopy 1988; 42(1): 151-162.
2. Davies AN, Hillig H, Linscheid M. JCAMP-DX, A Standard? In: Gasteiger J, Ed. Software-Development in Chemistry 4. Springer Verlag, 1990.
3. McDonald RS. Private Communication. 7 May 1990.
4. Gasteiger J, Hendriks BMP, Hoever P, Jochum C, Somberg H. JCAMP-CS: A Standard Exchange Format for Chemical Structure Information in Computer Readable Form. Applied Spectroscopy, in press.
5. Davies AN. Proposal for a JCAMP-DX NMR Spectroscopy Standard. ISAS Dortmund, Postfach 1013 52, 4600 Dortmund 1, FRG.
6. Thibault CG. Proposed JCAMP-DX Standard for NMR Data. Software Dept., Bruker Instruments Inc., Manning Park, Billerica MA 01821, USA.
7. Campbell S, Christopher R, Davis TS, Hegedus JKJ, James C, Onstot J, Stranz DD, Watt JG. A Data Exchange Format for Mass Spectrometry. ASMS; c/o David D. Stranz, Hewlett Packard, Scientific Instr. Div., 1601 California Ave., Palo Alto, California 94304, USA.
8. Dismore PF, Hamill GP, Holomany M, Jenkins R, Schreiner WN, Snyder RL, Toby RH (chair). Specifications for Storing X-Ray Diffractograms in a JCAMP-DX Compatible Format. Draft Document, September 1989; PDF-3 Task Group, JCPDS-International Centre for Diffraction Data, Swarthmore, PA, USA.
9. Kalasinsky KS, Griffiths PR, Gurka DF, Lowry SR, Boruta M. Coblentz Society Specifications for Infrared Reference Spectra of Materials in the Vapour Phase above Ambient Temperature. Applied Spectroscopy 1990; 44(2): 211-215.
10. Latest version: Barnard JM. Draft Specification for Revised Version of the Standard Molecular Data (SMD) Format. J Chem Inf Comput Sci 1990; 30: 81-96.
CHAPTER 40
Hypermedia Tools for Structure Elucidation Based on Spectroscopic Methods
M. Farkas, M. Cadisch, and E. Pretsch
Department of Organic Chemistry, Swiss Federal Institute of Technology, CH-8092 Zürich, Switzerland
Summary
SpecTool, a software package having hypermedia features, is a collection of 1H-NMR, 13C-NMR, MS, IR and UV/VIS data and reference spectra, as well as heuristic rules and computer programs used for the interpretation of such spectra. It can be looked at as an “electronic book”. Some of the pages contain only navigation tools and others mainly numerical and/or graphical information. From some pages programs can be started. A high degree of flexibility and forgivingness is built in, so that the same piece of information can be obtained in many different ways. It supports associative searching, i.e., “browsing and looking for something relevant”, a feature of “hypermedia” that is hardly possible with other types of computer programs. A series of simple navigation tools helps promote flexible usage and avoid the feeling of “being lost” in the system. A distinct feature of SpecTool is that, in contrast to expert systems, it does not make decisions. It just shows or calculates data as proposals or aids for the decisions of the user.
1. Introduction
The interpretation of molecular spectra for the structure elucidation of organic compounds relies mainly on empirical correlations and heuristic rules as well as reference data and reference spectra. The necessary information is spread out over many printed volumes for the most relevant techniques: MS (mass spectrometry), NMR (nuclear magnetic resonance), IR (infrared) and UV/VIS (spectroscopy in the ultraviolet and visible spectral range). Because of the complementary nature of the available information from the individual spectroscopic methods, a multimethod view is especially powerful [1]. Although printed media are still the most frequently used sources of information, computer programs are more and more widely applied for various subtasks of the structure elucidation process. Most available programs, including spectroscopic databases, are
out of reach of the majority of potential users. User interfaces not adequate for occasional users create a further barrier. Today's spectroscopists thus have a wealth of fragmented pieces of information distributed over many books and spectra catalogs as well as databases and other computer programs (running on different computers in different environments). Finding the necessary information often means manual searches in various books and catalogs and often takes a large amount of work. The purpose of this contribution is to present a medium which accommodates the necessary information for the spectroscopist's everyday work within one unique environment. Its usage is as simple as using a book. The system contains interfaces to external programs and will at a later date also contain interfaces to external databases. It can be viewed as an electronic book which is also capable of performing calculations. It is a tool supporting the decisions of the spectroscopists. It has, however, on purpose no decision making features within the system. In this paper the overall structure and the most important features at the present state of the development will be described.
2. Hardware and software
Software and hardware for the development of hypermedia became available recently [2, 3]. In hypermedia virtually any links can be made between discrete pieces of information (including computer programs). Such systems can contain a large amount of information (data, spectra, programs) within one unique environment. Macintosh computers are rather widely available in chemical laboratories and their price is at the low end of computers for which adequate hypermedia tools are available. Hypermedia is the combination of multimedia and hypertext in a computer system. Hypertext means that the user may read the information not only in one, sequential way. The information is stored in many chunks. Pieces of information that are related to each other are connected by links. In a hypertext document the pieces of information are therefore embedded in a network of links. Information is provided both by what is stored in each node and by the way information nodes are linked to each other. Reading the document, the user can follow these links according to his interests and information needs. “Multimedia”, i.e., graphics, animation and sound, helps to present the data in a more flexible way and helps the user to be comfortable in the system. Hypercard is a “hypermedia” developing and running tool kit for the object-based programming language HyperTalk. What the user sees on the screen is a card. On a card there can be text (stored in fields), pictures and so-called “buttons”. An action is invoked by clicking a button. If the same object occurs on different cards, instead of storing it for each occurrence separately, it may be stored at one location called the background. Objects stored in the background are visible on all cards belonging to this
particular background. The above-described objects are stored together in a file that is called a stack. Different stacks can easily be linked together. Buttons, fields, cards, backgrounds and stacks may contain scripts, as HyperTalk programs are called. If scripts are attached to objects they may create active areas on the screen that react, e.g., to actions performed with the mouse by the Hypercard user. HyperTalk scripts may modify card pictures, show or hide texts or buttons, show animation, play sounds, do calculations, call compiled programs written in other programming languages or navigate to other cards or stacks.
3. Results and discussion
At present the system is still under development. Many features and overall structures have been established. A part of the reference data and reference spectra is entered into the system. This experimental version of SpecTool uses over 5 MBytes of disk storage. A CD-ROM is envisaged as the storage medium for the distributed version.
3.1 Organization
The file structure of the system has been designed to provide transparency and to serve the development and maintenance of the system (Fig. 1). The main organization groups are the individual spectroscopic methods.
Figure 1. Organization of SpecTool.

Figure 2. Logical organization of SpecTool.
A further group contains a collection of “tools”, i.e., programs for activities other than just navigation or information presentation. External programs which can be called from the Hypercard environment build a further group. Finally there is an overall organization unit, the “manager”. The logical structure, the structure seen by the user, can look quite different. It is designed to achieve transparency from the user's point of view. Several logical structures, e.g., for several types of users, can easily be added to one existing physical structure. At present one such logical structure exists. The top level structure, which appears to the user when he starts the system, can be imagined as a hierarchical network of tables of contents (Fig. 2). Every node is the table of contents of the next deeper level. A step to such a sublevel (achieved by a mouse-click on the corresponding item) shows the table of contents of the sub-sublevels. Technically this step can be done by a jump to another card or by the display of a further field within the same card. With this simple hierarchic organization the user can address a huge amount of data with a few steps. At the same time the inspection of the possibilities for the next step is fast, since only limited information is displayed on the screen at any moment. This structure is not only efficient but easy to use intuitively, i.e., it avoids the feeling of getting lost in the system. The real strength and user friendliness are achieved by adding further connections between selected points of the hierarchy (represented symbolically with dotted lines in Fig. 2).
Figure 3. Example of a simple navigation, showing some possible orthogonal movements at the data card “HNMR, Aliphatic alcohols” (bold lines).
These connections allow “orthogonal” movements. For example the user can enter the system by selecting the Reference Data submenu (Fig. 3 and Fig. 4 top) and here the “HNMR of Alcohols” sub-submenu (Fig. 3 and Fig. 4 middle). With another step he can arrive at the data for aliphatic alcohols (Fig. 4 bottom). Now if he is interested in 13C-NMR, IR, MS or UV/VIS data (all at the same hierarchical level but within different methods), or in reference spectra of the same type of compound, he does not need to go back in the hierarchy. With one mouse click he can address any of these items (bold lines in Fig. 3 and bottom line in Fig. 4 bottom).
3.2 Navigation
Virtually any connections can be made between the cards. The only limitation is that the user must not be confused by the offering of too many possible choices. The purpose of the design of the various navigation tools is thus to achieve the maximum number of possible choices and to avoid confusing complexity. One part of the navigation is a collection of structured paths through the system (these were described in the previous section). Further navigation tools are presented here. On the lower part of the cards navigation buttons are included (Fig. 4 bottom).
Figure 4. Navigation within SpecTool. Top: Reference Data Menu. The user can select a chemical class (top part) and a spectroscopic method (bottom line). Middle: Card selected by choosing “-OH” and “HNMR” in the Reference Data Menu. Bottom: Card selected by choosing “Alkyl” of “Aliphatic Alcohols”.
The buttons on the left-hand side lead to corresponding reference data for the other spectroscopic methods (HNMR is written in inverse style because this card belongs to HNMR). Two general navigation buttons are found on each card of every stack. The first one is the “back-arrow” at the right hand side (Fig. 4 bottom), which allows browsing backwards through the previously seen cards. Another button, “myCds” (= my cards), can be used to mark a card, i.e., to note the card name with its path name in a file. At any time this file can be consulted and direct access to any of the marked cards by clicking onto its name is possible. This feature is analogous to putting a bookmark at some pages in a printed book. The left-hand and right-hand arrow buttons are for browsing within logically connected pages. If there are no more related pages, a stop-bar appears (Fig. 4 bottom). Two final buttons are navigation tools related to the structure of the reference data files. The upward arrow leads to a logically higher level, from a data card (Fig. 4 bottom) to a submenu card (Fig. 4 middle). The system automatically saves the submenu (of the type shown in Figure 4 middle) which was opened the last time, and the upward arrow leads to this submenu card. The button “toMenu” brings the user to a main menu (Fig. 2). As stated above, whenever sensible, direct access from one data card to corresponding cards for the other spectroscopic methods is possible. Thus clicking at CNMR on the card shown in Figure 4 bottom directly shows 13C-NMR data for aliphatic alcohols. In some cases no 1:1 correspondence is sensible. In such cases the selection of the corresponding method leads one level higher, i.e., to the submenu card “Alcohols” of the selected method, corresponding to the one shown in Figure 4 middle. Submenu cards exist for all compound classes of each method. A jump between them is therefore always possible. This overall organization results in three main types of cards:
1. Cards serving mainly or exclusively navigation
2. Cards mainly presenting data or spectra
3. Cards on which programs can be started to perform some kinds of calculations (see 3.4)
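The navigation mechanics described above can be modelled roughly as a history stack plus a bookmark list. The following toy Python model is only meant to make that behaviour explicit; it has nothing to do with the actual HyperTalk scripts of SpecTool.

class Navigator:
    """Toy model of card navigation: a history stack plus a bookmark list."""
    def __init__(self):
        self.current = "Top level menu"
        self.history = []        # cards seen before the current one ("back-arrow")
        self.bookmarks = []      # marked card names ("myCds")

    def go_to(self, card):
        self.history.append(self.current)
        self.current = card

    def back_arrow(self):
        if self.history:
            self.current = self.history.pop()

    def mark_card(self):
        if self.current not in self.bookmarks:
            self.bookmarks.append(self.current)

nav = Navigator()
nav.go_to("Reference Data menu")
nav.go_to("HNMR, Aliphatic alcohols")
nav.mark_card()
nav.back_arrow()
print(nav.current, nav.bookmarks)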
3.3 Selection and presentation of the information on a card
In many cases tables are of interest which contain so many items that only a part of them can be displayed on the screen. Such tables are collected into scrolling fields. Scrolling is the equivalent of a linear search in a printed table. In SpecTool further possibilities are provided which allow more efficient localization of a table entry. First of all, coarse indices are added to such tables (Fig. 5). They are either permanently on the cards or, if there is not enough room for them, they can be blended in upon a mouse-click. Clicking an item on the coarse index scrolls the field to the region of this item.
Figure 5. Table of naturally occurring isotopes. Top: Card as displayed upon opening. Right side: table entries ordered according to increasing masses in a scrolling field. Left side of the table: coarse index for the selection of mass ranges. Leftmost: a scrolling field with an alphabetically ordered list of element symbols and corresponding coarse index. Bottom: Upon selection of an element in the table of isotopes the isotope abundances are displayed graphically.
If sensible, various indices can be added to one table. The table of isotopes (Fig. 5) exhibits three different indices. The table is ordered according to increasing masses of the elements. A coarse index on the left-hand side of the table can be used for the selection of a mass range (this saves scrolling time). The next index is an alphabetically ordered list of the element symbols. Finally the leftmost column is a coarse index of this list. With these tools the selection of an element can be accomplished in various ways. Another example is shown in Figure 6. Here about 250 proton chemical shifts are compiled within one table. The primary order is a list of substituents (y-axis) and skeletons (x-axis). A coarse index of the substituents is blended in through a mouse-click on the button “to choose” in the table heading (Fig. 6 middle). Such an order is not ideal if
somebody searches for a given value of a selected group (e.g., the chemical shift of about 3 ppm for a methylene group in substituted propanes). To give an immediate response in such situations, SpecTool replaces the table by another one, which is rearranged accordingly (Fig. 6 bottom). A great number of similarly organized tables are implemented and can be automatically generated after each update of the corresponding master table. This is another example of how a wealth of data can be searched efficiently. The benefit over printed media is evident.

Figure 6. Table of 1H-NMR chemical shifts of monosubstituted alkanes. Top: Standard arrangement according to the substituents. Middle: Coarse index of the substituents blended in upon clicking at the “to choose” button. Bottom: Rearranged table displayed upon clicking at “-CH2” of the n-propyl skeleton.

Often one searches according to some features in a table. For example the selection of all elements with a given number of isotopes might be of interest (e.g., if an unusual isotope pattern appears in a mass spectrum). Such a task is trivial but time consuming if performed manually. A corresponding script solves this task automatically. It can be started with a mouse-click at “No. of isotopes” (upper right corner in Fig. 5).

Often a switch between graphical and numerical information is of interest. Mass spectra are, for example, more useful for interpretation if displayed graphically, but for reading out exact intensities (for example for identification of some isotopes), the numerical information is needed. Another example is related to Figure 5. Sometimes the graphical information would be more useful (Fig. 5 bottom). For such tasks tools are added to the system which generate the alternative information with one mouse-click.

The present framework allows one to structure the pieces of information so that various depths can be consulted. Less detailed information can be given for a non-specialist at one level and more detailed material can be presented at another one. One possibility is to have further fields on a card, which will only be shown (as an overlay over the other content of the card) upon a user action. There is a button, for example, which is a switch to overlay or to hide bibliographic references to the data, if any.
3.4 Algorithmic tools and external programs
Up to now only the conventional hypermedia features of SpecTool have been presented. There is, however, also a unique possibility to integrate programs for performing calculations which are trivial but time consuming if done without the help of a computer. A series of additivity rules for the estimation of NMR parameters (chemical shifts and coupling constants) is implemented. A representative example is shown in Figure 7. Here substituents can be selected from a linear list (with the help of a coarse index) and the chosen substituent can be placed on any position of the aromatic ring. The buttons “HNMR” and “CNMR” display the corresponding estimated shift values (Fig. 7). Other examples of implemented programs are the identification of homologous ion series from mass spectra, or programs performing the Karplus type estimation of coupling constants [4].
Figure 7. Estimation of 1H- and 13C-chemical shifts of substituted benzenes. The user can pick up a substituent from the list given on the left-hand side (together with a coarse index). The chosen substituent can be placed on the benzene ring. Upon clicking at the HNMR and CNMR buttons, the corresponding estimated shifts will be displayed (see bottom picture).
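A minimal sketch of such an additivity estimate for ring protons is given below, assuming the usual base value for benzene and purely illustrative substituent increments of roughly the right magnitude; the increment tables stored in SpecTool are not reproduced here.

# Illustrative additivity estimate for 1H shifts of a monosubstituted benzene.
# Base value and increments are rough, textbook-style numbers, not SpecTool data.
BASE_BENZENE_1H = 7.26  # ppm

# increments added to protons ortho, meta and para to the substituent (illustrative)
INCREMENTS_1H = {
    "-NO2": (0.95, 0.26, 0.38),
    "-OCH3": (-0.48, -0.09, -0.44),
    "-CH3": (-0.20, -0.12, -0.22),
}

def ring_proton_shifts(substituent):
    ortho, meta, para = INCREMENTS_1H[substituent]
    return {
        "ortho": BASE_BENZENE_1H + ortho,
        "meta": BASE_BENZENE_1H + meta,
        "para": BASE_BENZENE_1H + para,
    }

# for polysubstituted rings the increments of all substituents would simply be summed
print(ring_proton_shifts("-NO2"))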
Finally, several more demanding tasks can be solved with external programs (written in C, Pascal or Fortran). They can be started within SpecTool (the user does not necessarily realize that he leaves SpecTool) and the results may or may not be imported into the Hypercard environment. Examples of such programs already attached to SpecTool are the molecular formula calculation [5], the estimation of C-13 chemical shifts with automatic selection of the appropriate additivity rule [6], and the calculation of isotope abundances for any combination of elements [7].
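For the last of these, the underlying idea is repeated convolution of single-element isotope patterns. The sketch below uses rounded illustrative abundances and is not the external program referred to in [7].

from collections import defaultdict

# Rounded illustrative isotope patterns: {nominal mass offset: relative abundance}
ISOTOPES = {
    "C": {0: 0.989, 1: 0.011},        # 12C / 13C
    "H": {0: 0.99985, 1: 0.00015},
    "Cl": {0: 0.7577, 2: 0.2423},     # 35Cl / 37Cl
}

def convolve(pattern_a, pattern_b):
    out = defaultdict(float)
    for da, pa in pattern_a.items():
        for db, pb in pattern_b.items():
            out[da + db] += pa * pb
    return dict(out)

def isotope_pattern(formula):
    """formula as {'C': 2, 'H': 4, 'Cl': 2} -> nominal mass-offset pattern."""
    pattern = {0: 1.0}
    for element, count in formula.items():
        for _ in range(count):
            pattern = convolve(pattern, ISOTOPES[element])
    return pattern

# e.g., 1,2-dichloroethane C2H4Cl2 gives the familiar M / M+2 / M+4 chlorine pattern
for offset, abundance in sorted(isotope_pattern({"C": 2, "H": 4, "Cl": 2}).items()):
    if abundance > 0.001:
        print("M+%d  %.3f" % (offset, abundance))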
4. Conclusions
Enormous activity has taken place in recent years to use expert systems to collect scientific data and statements. The purpose of this activity was to use the “expert knowledge” for decision making. There are, however, many cases where automatic decision making is not necessary or, even worse, not adequate. SpecTool, the program presented in this contribution, shows that hypermedia is more comprehensive. It provides a uniform environment for expressing different kinds of knowledge, including direct attachments to spectroscopic databases. Front-ends of such databases can easily be constructed within the hypermedia environment. Wherever sensible, direct attachment to expert systems could also be made. Many problems which were tackled with expert systems could be more appropriately solved with hypermedia. We are convinced that many applications of this kind will emerge in various scientific fields in the coming years.
References
1. Pretsch E, Clerc JT, Seibl J, Simon W. Spectral Data for Structure Determination of Organic Compounds. Springer-Verlag, Second Edition, 1989.
2. Parsaye K, Chignell M, Khoshafian S, Wong H. Intelligent Databases. John Wiley & Sons, Inc., 1989.
3. Apple Computer Inc. Hypercard Script Language Guide: The HyperTalk Language. Addison-Wesley, 1988.
4. Haasnoot CAG, de Leeuw FAAM, Altona C. Tetrahedron 1980; 36: 2783-2792.
5. Fürst A, Clerc JT, Pretsch E. Chemometrics and Intelligent Laboratory Systems 1989; 5: 329-334.
6. Fürst A, Pretsch E. Anal Chim Acta 1990; 229: 17-25.
7. Kubinyi H. Personal communication.
CHAPTER 41
Synergistic Use of Multi-Spectral Data: Missing Pieces of the Workstation Puzzle
C.L. Wilkins, E.R. Baumeister, and C.D. West
Department of Chemistry, University of California, Riverside, Riverside, CA 92521, USA
As a result of recent technological improvements, laboratory scientists now have available high speed computer workstations which provide large-scale data storage, manipulation, and display capabilities. In addition, direct instrument control and data acquisition is commonplace. Parallel improvements in analytical chemistry instruments make possible the concurrent and synergistic use of multi-instrument combinations for organic mixture analysis. In spite of these facts, the full potential of the laboratory workstation is, as yet, unrealized. Accordingly, in this paper we consider the present status of laboratory workstation-instrument combinations, illustrated with recent results from research on direct-linked gas chromatography-infrared-mass spectrometry (GC-IR-MS) in our laboratory. Finally, the prerequisites for fuller realization of laboratory workstation capabilities in the laboratory will be addressed. It is by now well-established that the benefits of complementary use of IR and MS spectral information derived from linked GC-IR-MS for organic mixture analysis include increased speed and reliability of component identification. These advantages were discussed in a recent review [1]. As a consequence of this recognition, commercial instrument systems for GC-IR-MS, based upon use of both lightpipe and matrix-isolation approaches to GC-IR, are now available. To date, applications of GC-IR-MS have included analysis of environmentally important samples [2-8] and joint IR-MS data have been used for analysis of polycyclic aromatic hydrocarbons [9]. Here, the advantages of linked GC-IR-MS will be reviewed briefly and the implications with respect to workstations considered. The vehicle for this discussion is a comparison of results obtained with two such commercial systems in our laboratory.
1. Direct-linked GC-IR-MS
The primary method used for analysis of GC-IR-MS information is the use of separate databases and well-known library search algorithms [2, 10]. The databases most used are disparate in size and content, as shown in Table 1.
TABLE 1
Spectral databases used.

Spectral Type       Phase          Number of Spectra   Source
Mass Spectra        Gas            44,261              NIST(a)
Infrared Spectra    Gas            5,127               Aldrich/Nicolet
Infrared Spectra    Argon-Matrix   5,000               Mattson Instr.

(a) National Institute of Science and Technology
The mass spectral database is about an order of magnitude larger than the common infrared spectral libraries. In this approach, separated mixture component identification is accomplished by separate infrared and mass spectral library searches, followed by comparison of the resulting closest matches for coincidence. When such coincidences are found, it is inferred that the unknown corresponds to the coincident compound (or, if more than one coincidence occurs, to the compound with the closest overall matches). This was the method employed in the first published GC-IR-MS results [11] and is presently the technique employed in the commercial systems. In the early papers it was emphasized that one of the advantages of this particular spectral combination (IR-MS) is the established capability of infrared spectroscopy to provide isomer and functional group information complementing the highly sensitive and information-rich fingerprint patterns provided by mass spectrometry [11, 12]. In the work to be described here, the databases listed in Table 1 were used in the traditional manner on joint IR-MS data to perform separate library searches, followed by subsequent comparisons; in addition, the gas phase libraries served as the source of a single 3,304 member combined library representing the common membership of the two source libraries. A detailed study of the relative merits of the latter approach was published recently [8], so only one representative result will be discussed here.
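The coincidence step can be sketched as follows, assuming the separate IR and MS searches have already produced ranked lists of candidate compound names; the actual commercial search software is of course more elaborate than this.

def coincident_hits(ir_hits, ms_hits):
    """Return compounds present in both ranked hit lists, ordered by the sum of
    their ranks (a crude stand-in for 'closest overall matches')."""
    common = set(ir_hits) & set(ms_hits)
    return sorted(common, key=lambda name: ir_hits.index(name) + ms_hits.index(name))

# illustrative hit lists for one GC peak
ir_hits = ["1,2-dichlorobenzene", "1,4-dichlorobenzene", "chlorobenzene"]
ms_hits = ["1,4-dichlorobenzene", "1,3-dichlorobenzene", "1,2-dichlorobenzene"]
print(coincident_hits(ir_hits, ms_hits))   # both dichlorobenzene isomers coincide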
Figure 1. Block diagram of logical arrangements for GC-IR-MS.

2. Lightpipe-based GC-IR-MS

There are two common logical experimental arrangements for GC-IR-MS. Figure 1 depicts both of these. Due to the nondestructive nature of lightpipe GC-IR, it is possible to arrange a system in which the infrared spectrometer is placed immediately before the mass spectrometer and samples emerging from the gas chromatograph pass through them in a serial fashion. A disadvantage of this arrangement is that it has the potential for degrading the chromatographic resolution, due to the imposition of the lightpipe volume (typically 100 to 150 microliter) between the GC and the mass spectrometer. A second disadvantage is the need to adjust the time scales of the spectral measurements due to the asynchronous nature of the spectral sampling. Finally, because of the greater sensitivity of the mass spectrometer, the serial linkage can result in too much sample being introduced into the MS. Nevertheless, in practice, these disadvantages are rarely limiting and the simplicity of the serial linkage is an argument for its use. Figure 2 is a block diagram of such a serially-linked system which was used to analyze a known 30-component model mixture with the composition specified in Table 2 [8].

Figure 2. Block diagram of the lightpipe-based GC-IR-MS system.

TABLE 2
Constituents of 30-component mixture.*

Compound                    Quantity Injected (nanograms)
1-octene                    30
octane                      impurity
ethyl butyrate              15
3-heptanone                 30
butyl propionate            15
isobutyl methacrylate       20
2-chlorotoluene             80
4-chlorotoluene             80
1,3,5-trimethylbenzene      80
pentachloroethane           25
butyl butyrate              15
1,4-dichlorobenzene         40
α-terpinene                 50
1,2-dichlorobenzene         80
indene                      60
γ-terpinene                 30
3-methylphenol              40
1-undecene                  20
undecane                    10
1,3,5-trichlorobenzene      20
1-bromooctane               20
2,4-dimethylphenol          40
1,2,4-trichlorobenzene      20
4-chlorophenol              40
α,α,α-trichlorotoluene      40
decyl alcohol               10
tridecane                   10
2-methylnaphthalene         40
biphenyl                    40
dodecyl alcohol             20

* Reprinted from reference 8, with permission of the American Chemical Society.

Figure 3 shows the IRD and MSD reconstructed chromatograms for the separation. As Table 3 shows, the use of combined IR-MS information yielded superior results and reliable identification of the thirty compounds in the test mixture. Because the compounds were all represented in the joint library used, it was not necessary to deal with the expected problems caused by absence of an unknown from the reference library. In that case, the best that could be hoped for is
that library searches would return lists of close matches which would be structurally similar to the unknown. Algorithms to automatically detect that circumstance and provide the analyst with partial structural information have not yet been developed and are an obvious need.

Figure 3. Reconstructed IRD and MSD chromatograms for the 30-component mixture.

TABLE 3
Comparison of search results for 30-component mixture.(a)

Search position(b)   Infrared   MSD   PBM(c)   IR-MS
1                    20         19    19       28
2                    4          7     2        1
3                    0          0     6        0
4                    2          0     2        0
5                    0          3     1        1
6-10                 2          0     0        0
>10                  0          0     0        0
No match             2          1     0        0

a) Mixture described in Table 2 analyzed with the GC-IRD-MSD system diagrammed in Figure 2. Experimental details appear in Ref. 8.
b) Position in the library search list where the component's actual identity appears.
c) Probability-based match search, as provided in standard Hewlett Packard software.
Figure 4. Block diagram of the matrix isolation-based GC-IR-MS system.
3. Matrix isolation-based GC-IR-MS
As mentioned above, one possible difficulty in combined GC-IR-MS is the disparity in sensitivity between the mass and infrared spectrometers. Even though lightpipe GC-IR systems have been much improved in recent years, as the example above shows, for routine use they still require tens of nanograms of each mixture component injected. One way to address this mismatch is the matrix isolation approach to GC-IR-MS. Figure 4 is a block diagram of a commercial system that has been used in our laboratory to evaluate the merit of this approach. This system employs a flame ionization detector, in addition to allowing trapping of GC effluent in an argon matrix at temperatures between 12 and 20 K. In parallel, approximately 40% of the GC output is routed to the MSD for mass spectral analysis. Figure 5 is a plot of the FID response for an unknown mixture which was analyzed with this system and determined to be a mixture of dimethylnaphthalenes. This particular mixture contained a wide concentration range of components and served as a practical test of the analytical protocol. Table 4 summarizes the results of the spectral analyses of the separated mixture components. Not surprisingly, the probability-based match software using the MSD data was only capable of indicating that the components were dimethylnaphthalenes. In each case, the top eight hits were dimethylnaphthalene isomers. On the other hand, the argon matrix isolated infrared spectral library searches in every case but one provided the correct identification of the component. In the one remaining case, component three, identification failed because this isomer is not represented in the spectral library used.
Figure 5. Flame ionization detector response for unknown mixture.
TABLE 4
Dimethylnaphthalene isomer search results.

Peak #   MS(a)                 IR(b)                      Peak identity
1        Dimethylnaphthalene   2,6-Dimethylnaphthalene    2,6-Dimethylnaphthalene
2        Dimethylnaphthalene   1,3-Dimethylnaphthalene    1,3-Dimethylnaphthalene
3        Dimethylnaphthalene   Not Identified             Not Identified
4        Dimethylnaphthalene   1,4-Dimethylnaphthalene    1,4-Dimethylnaphthalene
5        Dimethylnaphthalene   1,5-Dimethylnaphthalene    1,5-Dimethylnaphthalene
6        Dimethylnaphthalene   1,2-Dimethylnaphthalene    1,2-Dimethylnaphthalene

a) The NBS/NIH/EPA/MSDC Mass Spectral probability based match (PBM) library (42,621 entries) was utilized and in every case the top eight hits were dimethylnaphthalene isomers.
b) Best match found by searching the Mattson Cryolect library of Matrix Isolated FTIR spectra, containing 5,000 entries.
Figure 6. Matrix isolation GC-FTIR spectrum obtained when 1 nanogram of isobutyl methacrylate is injected (0.5 µL splitless injection in hexane, 4 cm-1 resolution, 1000 scans; absorbance vs. wavenumber).
Turning to the issue of sensitivity, it is clear that the GC-MI-IR-MSD system does provide improved performance; as Figure 6 shows, at the 1 nanogram injected level isobutyl methacrylate provides a spectrum with an excellent signal to noise ratio. Clearly, detection limits are much better than this. Studies in progress will determine what the lowest practical sample quantities will be for efficacious GC-MI-IR-MSD analysis of mixtures. At this juncture, it seems likely that a few hundred picograms per component will be required for non-target compound analysis applications. Here, as with the GC-IRD-MSD system, spectral library size also becomes a factor limiting the generality of library-based analytical paradigms. Accordingly, although this instrument system will provide somewhat increased sensitivity over the lightpipe-based system, it does not obviate the need for new algorithms capable of capitalizing on the computational capabilities of laboratory workstations.
4. Needs for full utilization of workstations
In the GC-IR-MS analytical systems mentioned above, it is obvious that the experimental problems with obtaining such joint data have largely been solved. It is equally apparent
that we have not yet developed algorithms which can take full advantage of the new computational capabilities of modern laboratory workstations. Although there are an increasing number of commercial analytical chemistry instruments employing dedicated workstations (nuclear magnetic resonance spectrometers and high performance double-focussing mass spectrometers, to name but two), the necessary supporting software, documentation, and databases are far from complete. These will be essential if the full power of laboratory computer workstations is to be realized. With respect to databases, it is highly desirable that they incorporate objective, carefully-designed quality indices so that users may assess the quality of the archival data with respect to their own analytical needs. In the areas relevant to the GC-IR-MS applications discussed here, a mass spectral quality index developed by McLafferty under an EPA contract has been in use for some years as a tool for development of the NIST (National Institute of Science and Technology) mass spectral database [13], in which no compound is represented more than once, with the goal of including only the best spectrum available. More recently, an infrared spectral quality index has been proposed [14] and assignments of quality indices to the extant vapor phase infrared libraries completed. Thus, for infrared and mass spectral libraries there is at least some measure of quality information available. A final issue which should be mentioned is the need for larger evaluated databases using quality measures such as those mentioned above and capable of being readily exchanged among their users. This latter requirement suggests a need for agreement upon a common data exchange format for spectral information. Promising progress in this direction has been made in the infrared spectroscopy area with the acceptance and support of most instrument manufacturers for the JCAMP (Joint Committee on Atomic and Molecular Physical Data) ASCII-based exchange format [14]. Currently, the possibility of adapting the same basic format for solution nuclear magnetic resonance spectral exchange and mass spectral data exchange is under study by an International Union on Pure and Applied Chemistry working party and a committee of the American Society for Mass Spectrometry, respectively.
5. Conclusion
In summary, there are a number of remaining issues which must be addressed if workstations are to be implemented to their full potential. Most important is the necessity of investing in algorithm development and adequate software support and documentation. The latter two are the most labor-intensive and the most often slighted by manufacturers of analytical instruments. Equally important for analytical applications of the type considered here is the need for large, high quality, evaluated databases to support algorithm development and new, more sophisticated software.
Acknowledgements
Support from the National Science Foundation under grant CHE-89-11685 is gratefully acknowledged. Additional support under US-Environmental Protection Agency Cooperative agreements CR-81-3714 (Environmental Monitoring Systems Laboratory, Research Triangle, North Carolina) and CR-81-4755 (Environmental Monitoring Systems Laboratory, Las Vegas, Nevada) is also acknowledged. We are also grateful for helpful discussions with Dr. Don Gurka (EMSL, Las Vegas) and Dr. Donald Scott (EMSL, Research Triangle) on many of the issues raised in this paper.
References
1. Wilkins CL. Linked Gas Chromatography-Infrared-Mass Spectrometry. Anal Chem 1987; 59: 571A.
2. Gurka DF, Titus R. Rapid Non-Target Screening of Environmental Extracts by Directly Linked Gas Chromatography/Fourier Transform Infrared/Mass Spectrometry. Anal Chem 1986; 58: 2189.
3. Gurka DF, Hiatt M, Titus R. Analysis of Hazardous Waste and Environmental Extracts by Capillary Gas Chromatography/Fourier Transform Infrared Spectrometry and Capillary Gas Chromatography/Mass Spectrometry. Anal Chem 1984; 56: 1102.
4. Shafer KH, Hayes TL, Braasch TW, Jakobsen RJ. Analysis of Hazardous Waste by Fused Silica Capillary Gas Chromatography/Fourier Transform Infrared Spectrometry and Capillary Gas Chromatography/Mass Spectrometry. Anal Chem 1984; 56: 237.
5. Gurka DF, Pyle SM. Qualitative and Quantitative Environmental Analysis by Capillary Column Gas Chromatography/Fourier Transform Infrared Spectrometry. Environ Sci & Tech 1988; 22: 963.
6. Gurka DF, Titus R. Hazardous Waste Analysis by Direct-Linked Fused Silica Capillary Column Gas Chromatography/Fourier Transform Infrared Spectrometry/Mass Spectrometry. In: Laing WR, Ed. Proc of 28th Conf on Anal Chem in Energy Tech, Oct. 1-3, 1985. Anal Chem Instrum, Lewis Publ Inc, 1986: 17-22.
7. Gurka DF, Betowski LD. Gas Chromatography/Fourier Transform Infrared Spectrometric Identification of Hazardous Waste Extract Components. Anal Chem 1982; 54: 1819.
8. Cooper JR, Wilkins CL. Utilization of Spectrometric Information in Linked Gas Chromatography-Fourier Transform Infrared-Mass Spectrometry. Anal Chem 1989; 61: 1571.
9. Chiu KS, Biemann K, Krishnan K, Hill SL. Structural Characterization of Polycyclic Aromatic Compounds by Combined Gas Chromatography/Mass Spectrometry and Gas Chromatography/Fourier Transform Infrared Spectrometry. Anal Chem 1984; 56: 1610.
10. Laude DA Jr, Johlman CL, Cooper JR, Wilkins CL. Accurate Mass Measurement in the Absence of Calibrant for Capillary Column Gas Chromatography/Fourier Transform Mass Spectrometry. Anal Chem 1985; 57: 1044.
11. Wilkins CL, Giss GN, Brissey GM, Steiner S. Direct-Linked Gas Chromatography-Fourier Transform Infrared-Mass Spectrometer Analysis System. Anal Chem 1981; 53: 113.
12. Crawford RW, Hirschfeld T, Sanborn RH, Wong CM. Organic Analysis with a Combined Capillary GC/MS/Fourier Transform Infrared Spectrometer. Anal Chem 1982; 54: 817.
13. Milne GWA, Budde WL, Heller SR, Martinson DP, Oldham RG. Quality Control and Evaluation of Mass Spectra. Org Mass Spectrom 1982; 17: 547.
14. Griffiths PR, Wilkins CL. Quality Criteria for Digital Infrared Reference Spectra. Appl Spectrosc 1988; 42: 538.
CHAPTER 42
Spectrum Reconstruction in GC/MS. The Robustness of the Solution Found with Alternating Regression
E.J. Karjalainen
Department of Clinical Chemistry, University of Helsinki, Helsinki, Finland
Abstract
Biological samples are only partially resolved by gas chromatography. Computers can be used to isolate the component spectra in mixtures. For this we record full mass spectra as the second dimension of the observations. GC/MS is a favorable data source due to the high information content of the mass spectra. Alternating Regression (AR) is a very fast method for isolating component spectra. AR does not use principal components or factor analysis. It modifies the starting spectra iteratively until the solution converges. Initially the spectra are filled with random numbers. In most cases four iterations of AR are enough to produce a stable solution. Any novel compound found with AR must be verified by repeated analysis under different sets of experimental conditions. The quality of the solution obtained from a single GC/MS run can be estimated by calculating the robustness of resulting spectra and concentration profiles. The results are checked for robustness by repeatedly adding noise to observations and recalculating the results. Stable components in the solution tolerate added noise.
1. Introduction
The practical motivation for this work has been the pressure to produce results from a limited amount of sample in a short time. These pressures on the analytical laboratory were encountered in doping analysis for sports events. The results must be able to stand scrutiny in court; false positive results would be catastrophic for the sportsman and the laboratory. This provided the motivation for a search for methods that would better utilize the data available from GC/MS instruments. The amount of sample is limited, in practice often less than the recommended 50 ml of urine. This restricts the number of repeat
analyses that can be performed. These needs brought into focus deficiencies in the way data from spectroscopic instruments are used. The complete information collected by hyphenated instruments is not generally fully used. A GC/MS instrument operating in a continuous-scan mode produces large data matrices in a short time. The analytical chemist is facing data overload. He cannot digest the data collected by the instrument. The customary reaction to this overload is to ignore most of the data. Instead of trying to analyze every number the chemist looks at slices through the data matrix. He looks at single spectra and makes library searches with them. The slices can be longitudinal, too. He focuses on an interesting ion and follows a single mass number during the run. This is all very useful, but we are not really getting the full information in the observations.

We should see pure spectra from our instruments, not raw spectra of varying mixtures. We should see concentration profiles of compounds, not just ion traces. We are currently making library searches with spectra that contain overlapping species. The library searches would be easier if the spectra were pure. The quality of the spectrum libraries is open to some doubt. We cannot be fully certain that the library spectra are pure.

What is needed? We need computer software that cleans up the raw data into chemical entities. We should get out pure spectra and an indication of how much we can trust the results. We need data reduction. If the two-dimensional spectra are decomposed into the spectra of compounds the information can be greatly reduced. There is a ten-fold reduction in the numbers needed to describe the observations. The large observation matrix can be replaced by two smaller matrices. The first matrix is the spectrum matrix. The second one is the concentration matrix. When these two smaller matrices are multiplied they produce the observation matrix. With pure spectra the problem of identifying unknown substances is easier to solve. Library matches can identify small components hidden under large chromatographic peaks. The computer in the instrument should not be used to overwhelm the chemist with a mountain of raw numbers. The computer should analyze the data and offer them in a more digestible form.

Our work with the data reduction aspects started in the late seventies together with the use of GC/MS instruments. The computers of that time were too small to support serious data reduction. Still it was possible to experiment with new methods of data analysis using minicomputers (Data General Eclipse) and array processors (FPS-100, Floating Point Systems). The current desktop micros have enough power to do complete analyses on full data matrices. What they perhaps lack in raw CPU power is compensated by the large RAM. We routinely run these analyses on a micro that has a 68020 microprocessor and 8 megabytes of RAM (Macintosh II, Apple Computer Co.). After this introduction we shall see how a simple algorithm, Alternating Regression, helps to locate the spectra in a GC/MS run.
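The size of that reduction is easy to see with invented but representative dimensions (the numbers below are illustrative only, not from an actual run):

# Illustrative only: dimensions are invented.
n_masses, n_scans, n_components = 400, 600, 15
raw = n_masses * n_scans                          # numbers in the observation matrix
reduced = n_components * (n_masses + n_scans)     # spectrum matrix + concentration matrix
print(raw, reduced, round(raw / reduced, 1))      # 240000 15000 16.0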
2. The AR algorithm

When confronted with the problem of resolving overlapping chromatographic components, most people have used variations of factor analysis to decompose the observations into factors and loadings [1]. The factor analysis itself is rather straightforward; what is difficult is returning from the factor-based solution to physically meaningful spectra.

We have developed a different approach to the problem. The algorithm is called Alternating Regression (AR). It was initially applied to problems encountered in the doping control of sports events [2]. Gradually the algorithm has been adapted to more general system identification problems [3]. AR is not based on factor analysis; it uses direct approximations to the spectra and concentrations during all phases of the process. The strong point of AR is its high speed. The speed makes it possible to repeat the analysis hundreds of times from many different starting points and with added noise. With enough statistics we get a picture of the reliability of the solutions found with it.

AR repeats a cycle of two sub-problems, both solved by regression analysis. In the first phase of the algorithm the concentrations are solved on the basis of the spectra; in the second phase the spectra are solved on the basis of the concentrations. The spectra and concentration profiles are constrained between the regression steps. The starting point for AR is a spectrum matrix filled with random numbers. We could equally well start with the concentration profiles; starting with the spectra is just a convention used in the programs. All concentrations and spectra are taken to be positive. A minimal form constraint is applied to the chromatographic peaks: it is assumed that each compound has only one peak with a single local maximum, and the concentration peaks are forced to a unimodal shape during the analysis. No shape functions such as Gaussian or Lorentzian peaks are used for the concentration profiles.

We now give a more detailed description of the AR algorithm in terms of matrix calculations. We reconstruct two matrices S' and C whose product P corresponds to the observations O plus the matrix E that contains the error terms. In reality only the matrix O is available; the rest must be constructed:

S'C = P = O + E.
First we define the matrices:

S = spectrum matrix with dimensions k by n
S' = transpose of the spectrum matrix, n by k
C = concentration matrix with dimensions k by m
C' = transpose of the concentration matrix, m by k
O = observation matrix with dimensions n by m (the experimental data)
O' = transpose of the observation matrix, m by n
P = prediction matrix with dimensions n by m
E = prediction error matrix with dimensions n by m

The dimensions are: n = the number of spectral lines, m = the number of scanned spectra in the observation matrix, k = the number of components in the sample (k < n, k < m).

The steps in the algorithm are:

1) Make a guess for k, the number of components.
2) Fill the matrix S' with random numbers.
3) Solve for the concentration matrix C using regression. The solution is based on the generalized inverse, so it can be written in matrix terms as

   C = (SS')^-1 S O

4) Constrain all elements in C to positive values. Constrain each row in C to have just one local maximum.
5) Solve for the spectrum matrix S using regression. The solution process is similar to step 3:

   S = (CC')^-1 C O'

6) Constrain all elements in S to positive values. Scale all spectral vectors to unit length.
7) Calculate the fit by forming the sum of squares (SSE) between the observations O and the prediction P, which is formed as the product of S' and C:

   P = S'C

8) Go back to step 3.
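To make the cycle concrete, the following is a minimal NumPy sketch of the steps above. It is not the authors' program: the tiny ridge term, the crude unimodality projection and all names are illustrative assumptions.

```python
import numpy as np

def unimodalize(row):
    """Crude projection of a concentration profile onto a unimodal shape:
    non-decreasing up to its largest element, non-increasing afterwards."""
    peak = int(np.argmax(row))
    out = row.copy()
    out[:peak + 1] = np.maximum.accumulate(out[:peak + 1])
    out[peak:] = np.maximum.accumulate(out[peak:][::-1])[::-1]
    return out

def alternating_regression(O, k, n_iter=10, seed=None):
    """O: n-by-m observation matrix (n spectral lines, m scans); k: assumed
    number of components. Returns (S, C, sse) with O approximately S.T @ C."""
    rng = np.random.default_rng(seed)
    n, m = O.shape
    ridge = 1e-9 * np.eye(k)                       # keeps the k-by-k systems well conditioned
    S = rng.random((k, n))                         # step 2: random spectra, k by n
    sse = np.inf
    for _ in range(n_iter):
        C = np.linalg.solve(S @ S.T + ridge, S @ O)    # step 3: C = (SS')^-1 S O
        C = np.clip(C, 0.0, None)                      # step 4: non-negativity ...
        C = np.apply_along_axis(unimodalize, 1, C)     # ... and one local maximum per row
        S = np.linalg.solve(C @ C.T + ridge, C @ O.T)  # step 5: S = (CC')^-1 C O'
        S = np.clip(S, 0.0, None)                      # step 6: non-negativity ...
        S /= np.linalg.norm(S, axis=1, keepdims=True) + 1e-12  # ... and unit-length spectra
        sse = float(np.sum((O - S.T @ C) ** 2))        # step 7: fit to the observations
    return S, C, sse                                   # step 8 is the loop itself
```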
The algorithm converges in less than ten cycles; four cycles are enough in most cases. The correct number of components k is found empirically by trying different values for it. If k is too small, the sum of squares SSE remains large. If k is too large, the solutions converge more slowly and give a larger range of results when repeated. Figure 1 shows how the AR algorithm converges: the fitting error is reduced as a function of the iteration number.
Figure 1. The AR algorithm typically converges in less than ten iterations to a stable value (fitting error in per cent versus number of iterations). The figure shows one problem solved twenty times, each time starting from a different set of random numbers. The program isolated five components from a section of 40 GC/MS spectra.
In this example a section of a GC/MS run was solved for five components. A different set of random numbers was chosen 20 times. The fitting error is initially 70-80 per cent; within ten iterations it generally falls below two per cent. One of the runs gets worse at the second iteration but finally converges to a stable result. A permanent error of about one per cent remains, due to the instrument noise. In practical analysis the same solution (within a tolerance) must be obtained from at least ten different starting points, and for this to be practical a high-speed method like AR is required.

AR has some advantages. The process needs no library of spectra or retention times, and the method is useful for new and unexpected components. In cases where new molecules are found, the solution has to be checked very carefully. All the usual precautions of analytical work must be applied: any new component has to be verified by additional experiments, and it should be identified in the form of several different derivatives.

AR does not attempt to fit any global function to the observations. The overall fit between the components found and the original observations is optimized, but only as a by-product of the process. There is no overall objective function, just a collection of
local functions in local regressions. This simple, local nature of the AR algorithm is the reason for its high speed.
3. Tapering the data matrix - the role of windowing

One facet of the AR algorithm deserves special mention. Before the actual analysis the region of interest is chosen: we select a set of spectra and apply a tapering function to them. The tapering function is also called a windowing or apodization function (Fig. 2). The windowing function is familiar to those performing Fourier analysis. Before calculating a Fourier spectrum, a time series is multiplied by a windowing function so that both ends of the series go gradually to zero; this avoids artifacts in the spectrum caused by the sharp cutoffs at both ends of the series. The windowing operation improves the AR process in a similar way. It guarantees that all concentration curves return to zero, which is essential for the stable operation of AR. The windowing is so important that it should perhaps be included in the description of the algorithm itself.

Figure 2. The original data matrix without tapering is shown in (a). The tapering function that multiplies the original data elementwise is shown in (b). Finally, the result of the tapering operation is shown in (c).
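A minimal sketch of the tapering step, assuming a Hanning window (the text does not prescribe a particular window shape):

```python
import numpy as np

def taper(O):
    """Multiply every mass trace by a window that falls to zero at both ends
    of the selected scan range, so all concentration profiles return to zero."""
    n_lines, n_scans = O.shape
    return O * np.hanning(n_scans)[np.newaxis, :]   # elementwise, along the scan axis
```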
4. Problems due to isomers

The AR algorithm relies on the unimodal nature of the chromatographic peak. In reality there are cases where the same mass spectrum is found in several chromatographic peaks. Isomeric structures have different retention times, but their mass spectra can be identical; sugars and steroids are typical compounds with isomeric structures. Variation in the orientation of an OH group changes the overall shape of the molecule. The shape change is enough to change the retention time in GC, but the mass spectra remain identical. To avoid this problem the data should be divided into sections in such a manner that two isomers with identical spectra do not occur in the same segment during analysis by AR. The isomers can be detected by calculating the correlations between all spectra: too high a correlation between non-successive spectra is a warning sign for isomers.
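A hedged sketch of such a correlation check; the correlation threshold and the minimum scan separation are illustrative choices, not values given in the text:

```python
import numpy as np

def isomer_warning(O, threshold=0.95, min_gap=3):
    """Correlate every scan with every other scan and flag near-identical
    spectra that are not adjacent in time; such pairs hint at isomers that
    share a mass spectrum."""
    R = np.corrcoef(O.T)                 # columns of O are scans; R is m by m
    m = R.shape[0]
    return [(i, j, float(R[i, j]))
            for i in range(m)
            for j in range(i + min_gap, m)
            if R[i, j] > threshold]
```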
5. Verification of results from AR

The results from AR should be verified: we must be certain that the results are reproducible, not spurious. We describe in detail how the analysis is performed. The number of components is a parameter that is found empirically. It is varied from a lower limit to an upper limit; in a typical section of forty spectra the number of components is varied from one to ten. The optimization is started from a different point every time, so a typical problem is solved one hundred times from as many starting points: with a given number of components the analysis is repeated ten times. The results are stored on disk. After the AR step the results are compared by a separate program that checks the reproducibility and decides which number of components is used to describe the observations.

The behaviour of the AR algorithm is stable with the optimal number of components. The fit is near its minimum value, the same spectra and concentration profiles are produced from different random starting points, and the fits remain within a narrow interval. When the number of components exceeds the optimal number, the algorithm starts producing different results from different starting points and the fits show a large variation.

When the optimal spectra and concentration profiles have been found, the next step in the post-processing phase is library matching. The spectra are compared with those in the library, and those not in the library are flagged for closer study. Any new spectrum is initially suspect: it must be found again in repeated analyses and under changed chromatographic conditions, and the amount of the new compound in the sample must be sufficient to determine the spectrum precisely.
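A sketch of this verification loop, reusing the alternating_regression function from the earlier sketch; the limits and the number of restarts follow the values quoted above:

```python
import numpy as np

def survey_component_counts(O, k_min=1, k_max=10, restarts=10):
    """Run AR repeatedly for each candidate number of components and record
    the mean and spread of the fit; a stable, reproducible fit suggests the
    right k."""
    summary = {}
    for k in range(k_min, k_max + 1):
        fits = [alternating_regression(O, k, seed=r)[2] for r in range(restarts)]
        summary[k] = (float(np.mean(fits)), float(np.std(fits)))
    return summary
```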
6. Verification of AR results in a single GC/MS run

Often the time available for analysis is limited. This is the case in doping analysis, where the results must be reported within a fixed time span. If a forbidden drug is found, the laboratory must verify that it has been metabolized by the body; otherwise the drug could have been added to the sample later. To get evidence of metabolism the laboratory identifies a number of different metabolites in the sample.
Sometimes it is difficult to find the original unmetabolized drug because the small amount of the original molecule is hidden under a large peak of metabolite. AR is a useful tool for finding metabolites that are hidden under other compounds. In those cases it is necessary to have a method that makes it possible to estimate confidence intervals for the solution. We can use the computer to make a more detailed analysis of the results and get an indication of how reliable the solution is. Robustness means that the solution is resistant to small perturbations in the data: small amounts of added noise do not essentially change the solution that was obtained. If a component changes its spectrum or concentration profile after noise addition, it is an artifact of the analysis, not a real component.

We routinely estimate the robustness in the following manner. We add some extra noise to the original data. Each reading is the number of ions that hit the electron multiplier in the mass spectrometer, so its uncertainty is proportional to the square root of the reading. We simulate this Poisson-like noise by adding or subtracting the square root of the reading from each original reading. The AR analysis is repeated ten times with the perturbed readings and the statistics of the variation in the results are calculated. Typical variation coefficients for the concentration profiles and spectra are one to three per cent of the original solution values. The variation is smaller for components that are well separated from the other components.
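A sketch of the perturbation step under these assumptions; note that the components returned by the repeated runs must still be matched to each other (for example by correlation) before variation coefficients are computed:

```python
import numpy as np

def perturb_counts(O, seed=None):
    """Add or subtract the square root of each reading, mimicking the
    Poisson-like counting noise of the ion detector."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=O.shape)
    return np.clip(O + signs * np.sqrt(np.abs(O)), 0.0, None)

def robustness_runs(O, k, repeats=10):
    """Repeat AR on perturbed copies of the data and return the spectra from
    each run for later matching and statistics."""
    return [alternating_regression(perturb_counts(O, seed=r), k, seed=r)[0]
            for r in range(repeats)]
```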
7. How separate are the components initially?

The success of any mathematical analysis depends on the particular data set, so we should be able to express how difficult a given problem is. We can compare different data sets by analyzing the solutions found for each. If the components are initially well separated, the mathematical analysis is simpler. What we need is a measure of the degree of mutual overlap between components. This measure should describe both the spectra and the chromatographic profiles of each compound in the mixture.

The "overlap index" can be calculated in several different ways. The starting point for the calculation is the formation of the two-dimensional data matrix for each compound separately. The two-dimensional matrix is formed by multiplying the concentration profile with the spectrum; the outer product we obtain is the two-dimensional spectrum of a compound. If five different components are present in a sample, we get six different matrices: the five matrices formed from the five compounds are shown in Figures 3a-3e, and their sum is shown in Figure 3f.

The next step in the calculation estimates how similar the two-dimensional spectra are to each other. There are several possibilities here. We can calculate correlation coefficients between all two-dimensional compound spectra. The first step in this calculation is the transformation of each two-dimensional spectrum into vector format.
Figure 3. The spectral components (a-e) and their sum (f). Each component is the product of a spectrum and the corresponding concentration profile.
One way to do this "pulling out" of a matrix into a vector is to join all its columns into one long vector. We could equally well join the rows; the order is not critical, the only restriction being that the matrix elements are strung together in the same order for all matrices.

Now we have a set of vectors, each corresponding to one compound. Next we can calculate the correlation coefficients between the vectors. We can also calculate the variance of each vector and the covariances between them, and estimate how large a percentage of the total variance in the covariance matrix lies on the diagonal and how much on the off-diagonal elements. The proportion of the off-diagonal elements in the total can be called the "overlap index".
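A sketch of one such overlap index, assuming the spectra S (k by n) and concentration profiles C (k by m) are already available; taking absolute values of the covariances is an illustrative choice, not a definition from the text:

```python
import numpy as np

def overlap_index(S, C):
    """Build the two-dimensional spectrum of each compound as the outer
    product of its spectrum and concentration profile, string each one out
    into a vector, and report the share of the covariance matrix that lies
    off the diagonal."""
    vectors = np.stack([np.outer(S[i], C[i]).ravel() for i in range(S.shape[0])])
    cov = np.abs(np.cov(vectors))        # k-by-k covariance between compound vectors
    return float((cov.sum() - np.trace(cov)) / cov.sum())
```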
8. Discussion

There is a large mathematical literature on system identification problems. Other disciplines offer examples that are just as difficult as spectrum reconstruction from two-dimensional spectroscopies. Although the chemical problem has not been treated in this mathematical literature, it clearly belongs to this wide collection of inverse problems [4]. AR solves a typical inverse problem: from samples of the output we must identify the structure of the system. The output itself is not enough to solve the identification problem, but if we add some a priori constraints the problem becomes solvable; the added constraints narrow the range of possible solutions. In AR we use constraints of two kinds. The first excludes negative values for concentrations and spectral lines. The second restricts all concentration profiles to unimodal shapes. This type of shape constraint cannot be called a shape function, but it is a first step towards one.

Mathematicians have long known the difficulty of inverse problems. They have a name for a typical identification problem of this type: an "ill-posed problem". This means that without extra constraints the solution is far from robust; small changes in the observed data are enough to cause large fluctuations in the solution. The general cure is also well known in the mathematical literature and is called "regularization": in practice, a priori constraints are used to make the solution process behave in a more stable way.

In the chemical problem of spectrum identification the stability of the solution is highly data-dependent. If the overlap index is high, the range of possible solutions remains wide and the solution tends to fluctuate within it; the solutions obtained for highly correlated data are not robust. If we wish to narrow the range of solutions we must use more constraints. We can add more constraints to the shape functions for the chromatographic profiles, for example by requiring that the profiles are "smooth". "Smoothness" then has to be defined in exact mathematical terms.

For best results a global objective function for the spectrum identification problem should be defined. The main component of the objective function is the fit, the sum of squared residuals. Fit alone is not enough, and other components are needed; the objective function should include a component that measures the entropy of the solution. The optimal solution has a certain simplicity, which we can call parsimony.
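The text does not write such an objective function out. Purely as an illustration, a regularized form could balance the residual sum of squares against an entropy term on each normalized concentration profile c_i, with a weight lambda:

```latex
\min_{S \ge 0,\ C \ge 0}\;
  \bigl\lVert O - S^{\mathsf{T}} C \bigr\rVert_F^{2}
  \;+\; \lambda \sum_{i=1}^{k} H(c_i),
\qquad
H(c_i) = -\sum_{j=1}^{m} p_{ij}\,\ln p_{ij},
\quad
p_{ij} = \frac{c_{ij}}{\sum_{j'=1}^{m} c_{ij'}}
```

Minimizing the entropy of the profiles would favour concentrated, parsimonious solutions; whether an entropy term should enter with this sign and weighting is left open by the text.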
To solve such an objective function we are forced to use nonlinear optimization, and the technique has practical difficulties. The number of parameters that can be handled by nonlinear optimization is of the order of one hundred, not the thousands needed in practical situations. A second difficulty is the risk of finding only local optima. Nonlinear optimization is also slow, making for long run times. On the other hand, if we have a good starting point, nonlinear optimization can nearly always improve the solution. AR rapidly produces a good starting point for the computationally heavy nonlinear optimization methods; this is the best combination of the two very different tools. Alternating Regression rapidly produces a solution that is near the optimum; Figure 4 shows such a solution for an application of AR, and it is relatively robust. The slower nonlinear method then takes the result from AR and adjusts it to the precise optimum.

Figure 4. Results from testing the robustness of an AR analysis (concentration, x 10^6, versus spectrum number). The solid line shows the mean value + 2 standard deviations (s.d.) and the dashed line the mean value - 2 s.d.
9. Conclusion

Alternating Regression is a rapid way to decompose overlapping spectra into their components. Used with proper precautions, the method can produce results that are computationally robust and make chemical sense.
References

1. Gemperline PJ, Hamilton JC. Factor Analysis of Spectro-Chromatographic Data. In: Meuzelaar HLC, ed. Computer-Enhanced Analytical Spectroscopy. New York: Plenum Press, 1990: 27-47.
2. Karjalainen EJ, Karjalainen UP. "Mathematical Chromatography": Resolution of Overlapping Spectra in GC/MS. In: Medical Informatics Europe 85, Proceedings. Springer-Verlag, 1985: 572-578.
3. Karjalainen EJ. Isolation of Pure Spectra in GC/MS by Mathematical Chromatography: Entropy Considerations. In: Meuzelaar HLC, ed. Computer-Enhanced Analytical Spectroscopy. New York: Plenum Press, 1990: 49-70.
4. Smith CR, Grandy WT Jr, eds. Maximum-Entropy and Bayesian Methods in Inverse Problems. Dordrecht: D. Reidel Publishing Company, 1985.
Indexes
Author Index
Andries K 97
Barrett AN 55
Barrier R 249
Baumeister ER 467
Blankenship J 259
de Boer JH 85
Bohanec S 393
Borth DM 105
Broad LA 315
Bulliot H 249
Cadish M 455
Casuccio GS 173
Costello S 259
Coulet PR 221
Davies AN 445
Dekeyser MA 105
Deming SN 71
Desmurs JR 249
Dickinson K 307
Doornbos DA 85
Dorizzi RM 285
Duarte JC 211
Duineveld CAA 85
Efremov RG 21
Farkas M 455
Fanie JMT 351
Ferreira EC 211
Folk M 3
Frühauf M 31
Gaál L 237
Gypen LM 199
Haario H 133
Hardin J 3
Hibbert DB 161
Hillig H 445
Hirth B 329
Van Hoof J 97
Hopke PK 9, 173, 179
Jokinen PA 133
Josses P 249
Joux B 249
Karjalainen EJ 477
Kateman G 151, 407
Kennedy R 307
Kuntz ID 117
Last BJ 351
Lee RJ 173
Lewi PJ 97, 199
Lewis RA 117
van der Linden B 407
Linder M 273
Linscheid M 445
Ljubic T 393
Maj SP 293
Maloney TA 315
Mattes DC 301
May M 427
McDowall RD 301
Mershon WJ 173
Métivier P 249
Mi Y 179
Molnár P 237
De Moor G 371
Murphy MT 341
Noothoven van Goor JM 365
Ploquin Y 249
Postma CJ 407
Pradella M 285
Pretsch E 455
Ruland D 415
Ruland R 415
Settle Jr FA 259
Sevens C 371
Smilde AK 85
Smith KP 351
Smith P 307
Smits JRM 151, 407
Sprouse M 259
Subak Jr EJ 315
Summerbell D 55
Taavitsainen V-M 133
Timm U 329
Tusar L 393
Tusar M 393
Vandewalle C 371
Vekemans B 199
West CD 467
Wick P 259
Wilkins CL 467
Yendle PW 351
Zhang YJ 39
Ziegler R 379
Zupan J 393
Subject Index
2D 60, 427  3D 3, 7, 31, 60  3D image analysis 39  3D reconstruction 39, 47  3L Parallel FORTRAN 24
Absorption measurement 266, 269, 270, 271  Absorption spectra 22, 23, 27, 28, 30  ACLTRAN 189  Adaptive control of process reactor 211  Advanced Informatics in Medicine (AIM) 365, 371  AIM 365, 371  AIM project 367  Airborne particles 173, 179  Alliant FX/8 10  Alternating Regression (AR) 477  Ampicillin 211  Analog Concept Learning Translator 189  Animation 3, 5, 456  ANOVA 133, 244  Antiviral drugs 98  Apodization function 481  AR 477  AR algorithm 480  Archiving 309, 329, 333, 338, 339, 340, 367  Array processor 10, 478  ASCII 264, 269, 280, 339, 445, 448, 452, 474  ASDF code 451  Associative searching 455  ASTM 449, 453  At-line analysis 277  Automated manufacturing systems 293  Automated structure generation 120, 121
Backpropagation algorithm 143  Balanced design 139  Bar code reader 283  Basic 269, 287, 288  Best subset regression 139  Bioelectronics 221  Biometrics 137  Biosensors 221  Biplot 101, 137, 205, 224  Block diagrams 428  Bootstrapping 9
C 137, 175, 355, 384, 466  CAD 294, 427  CAE 294, 366, 427  Calorimetric sensors 230  Cambridge Structural Database 119, 121  CANDA project 342  CAPM 294  CAQA 294  CARO 435, 439  CASE 355  Catalytic monoclonal antibody 233  CCITT 372  CCSEM 173, 179  CD-ROM 457  CDC Cyber 200 10  CEN 342, 365, 369  CENELEC 342, 365, 369  Censored data 110  Censored data regression 105, 111  Certification 341, 379, 380  CGI 379, 382  CGM 379  Chemical Abstracts 408  Chemical mass balance 180  Chemical shift 401, 462, 464  Chemometrics 9, 133, 136, 265  Chemoreceptor 233  CIM 293  CLARA 13  Cluster analysis 9, 109, 181  Cluster analysis, hierarchical 187  Color palette 5  Comité Européen de Normalisation (CEN) 369  Composite designs 136  Compression/decompression 377, 451
Computer Aided Design (CAD) 294,427 Computer Aided Engineering (CAE) 294, 366,427 Computer Aided Production Management (CAPM) 294 Computer Aided Quality Assurance (CAQA) 294 Computer Aided Schematics diagram 427 Computer Aided Software Engineering (CASE) 355 Computer graphics 55 Computcr Graphics Interface (CGI) 379,382 Computer Integrated Manufacture (CIM) 293,294 Conccptual schema 295 Conformance Testing Service programme (CTS) 379 Conncction table 395,398 Constraints, apriori 486 Constraints, linear 139,140 Control charts 75 Cray 3. 10 Crystallographic information 445 CT 31 CTSl 379 CTS2 380 Data analysis, exploratory 97 Data compression 451 Database interface 424 Database searching 120 Database system, Paradox 308 Database, Cambridge structural 119, 121 Database, distributed 295 Database, IK spectra 393 Databasc, mass spectra 395 Database, NMK spectra 393 Database, relational 296,354,355,412 Database, spectral 456 Database, structure dependcnt 394 Databases, geodetic 415 DataScope 4 DRM 418 Debugging tools 5 Decision analysis technique 321 Decision making 85 Decision making, multi criteria 93 Diagrams, block 428 Diagrams, CAS 427
Dielectric breakdown 162 Diffusion limited aggregate 163 Distributed database 295 DOCK program 120,124,125 Documentation 427 Doping analysis 477 Dosage forms 85,316 Dose-response models 137 Drug design 117, 118, 130 Drug design, computer aided 118 Drug-receptor interactions 117 Dynamic modelling 21 1 EASYLAB 256,265 EFMI 365,369 EIA 285 Electrode 157, 162.164, 165, 168, 225,226, 227 Electrodeposition 162 Encrypt/decrypt 377 ENFET 234 Entity/relationship model 421 Enzymatic reactions 21 1 Enzyme 30,124,211,221,223,22.5,226, 227 Enzyme field effect transistors (ENFET) 234 EPA 341,445,474 EPAformat 445 Ethernet 308,330 ETSI 369 EUCLIDES 371 European Federation for Medical Informatics (EFMI) 365,369 European Standard for Clinical Laboratory Data Exchange (EUCLIDES) 371 European Workshop for Opcn Systems (EWOS) 365 EWOS 365,369 EX-TRAN 189 Experimental design 80.85, 136,263 Experimental optimization 7 1.76 Expert systems 151,179,262,284,407 Exploratory data analysis 97 External quality assessment 372 External schema 295 Factor analysis 479 Factorial design 71.87, 136, 139,265 FDA 341
FIA 278 Fiber optics 231 Finite element analysis 56 Flexible automation 279 Flexible Manufacturing Systems (FMS) 293 Flow charts 46,428 Flow cytornetry 157 Flow injection analysis (FIA) 278 FLOWCAD 439 FMS 293 FORTRAN 10,28.137,175,189,384,466 FORTRAN, 3L Parallel 24,28 Fourier analysis 173,482 Fourier transform infrared 157 Fractal analysis 174 Fractal dimension 173 Fractals 161 Fractional factorial design 87, 136,265 Frame-based reasoning system 407 Free schematics 432 GC-IR 452 GC-IR-MS 467 GC/MS 477 GCP 333 Gcodctic databases 415 GEOMANAGER 418 Geometric modelling 55 Geometric modelling, 2D model 58 Geometric modelling, 3D model 60 GEONET 418 GKS 379 GKS-3D 379 GLCP 333 Global function 481 GLP 237,273,283,297,329,333,342,351 GLP-journal 337 GMP 342 Good Clinical Practices (GCP) 333 Good Computing Practice 342 Good Laboratory Computing Practice (GLCP) 333 Good Laboratory Practice (GLP) 237,273. 297,329,333,351 Good Manufacturing Practice (GMP) 342 Graphical Kernel System (GKS) 379 Graphics 456 Graphics standards 379 Grid schematics 430
Hausdorff-Bescovitch dimension 161 HDF 6 Health care informatics 365 Heuristic rules 455 Heuristics 151 Hidden layer 143,152 Hierarchical clustering analysis 187 HPLC 211,257 Hypercard 456 Hypermedia 455 HyperTalk 456 Hypertext 456 Hyphenated intruments 467,478 IMACS 367 Immobilization 212,225 Immunoassay 285 IMSL 136,137 In-line analysis 277 Infrared spectra 157,159,404,445,447, 452,455,468 Instrument automation 273 Interferograms 447 International Standards Organisation (ISO) 294 Inverse problem 486 Inverted files 395 IR spectra 157,159,404,445,447,452,455. 468 IR spectral database or library 393,468,474 ISAS 445 IS0 294,372,380 1s09000 342 Isomers 124,212,471,482 Jackknifing 9, 192 JCAMP 474 JCAMP-CS 450 JCAMP-DX 445 k-medoid method 12 k-nearest neighbour 158 Kermit 330 KINLIMS 329 Knowledge bases 264,367 Laboratory Information Management Systems (LIMS) 263,273, 301,307,315,329 Latin squares 139
Lattice-based drug design 130 Layout rules 429 Lead compounds 117 Least median of squares 13 Least-squares adjustment 415 Library search 467,468,471 Library spectra 468,477 Lightpipe 467 LIMS 263,273,301,307,315,329 LIMS, architecture 301,303 LIMS, custom-made 310 LIMS. data collection 303 LIMS, data reporting 303 LIMS, database 303 LIMS, lab management 303 LIMS, selection of 315 Line routing 434 Linear programming 139 Linear regression 286 LIS 285 LMS 13 Magnetic Resonance Images (MRI) 31 Manufacturing Automation Protocols (MAP) 295 MAP 295 Mass spectra 445,455,468 Mass spectral database or library 393,468, 474 Materials Requirements Planning (MRP) 296 MATLAB 133 Matrix isolation 467 Maximum likelihood estimate 110 Medical images 31 Medical informatics 367 Medical Informatics Europe (ME) 370 Medical information systems 371 Megakaryocyte cells 39.47 Mcssage Handling System (MHS) 372 Message Transfer Agents (MTA) 372 MEX-files 137,144 MHS 372 Michaelis-Menten 212,213,226 M E 370 Mixture designs 88, 136 Modified simplex method 86, 87 Molecular graphics 120 Molecular lattices 117
Molecular recognition 223  Monte Carlo simulation 22, 164  Morphogenetic behaviour 56  Morphological analysis 408  Morphology 165  MRI 31  MRP 296  MTA 372  Multi criteria decision making 93  Multi-spectral data 467  Multiblock PLS 199  Multimedia 456  Multiple linear regression 136  NAG 137  Natural Language Processing 408  NCSA Image 5  Neural networks 133, 151  Neuroleptic activity, prediction 202  Neuroleptics 202  Neuron 142, 152  Neuronal unit activity 237  NIR spectra 145  NIST 474  NMR spectra 445, 450, 455  NMR spectral database 393  Nonlinear models 141, 211  Nonlinear optimization 487  Object reconstruction 39  Object-oriented syntax 377  Off-line analysis 277  OMEGA 93  On-line analysis 277  Open system architecture 415  Open Systems Interconnection (OSI) 295  Optimization of spectrophotometric experiments 259  Optimization, experimental 71, 76  Oracle 330  Ordination methods 137  Organic synthesis 249  OSI 295  Overlapping spectra 487
PAC 276 Paradox database system 308 Partial least squares (PLS) 137, 145, 199 Pascal 287,288,466
Pascal, Turbo 237 Pattern matching 41 Patternrecognition 39,41, 151, 152, 153, 156,158,182 Pattern similarity 45 PCA 99,136 Pharmacokinetics 329 PHIGS 379 PLS 137,145,199 PLS biplot 205 PLS, multiblock 199 POEM 93 PolyView 7 Prediction of adverse drug reaction 157 Prediction of neuroleptic activity 202 Prediction of protein structures 157 Principal components analysis 99, 136 Process Analytical Chemistry (PAC) 276 Process control, statistical 73 Process reactor, adaptive control 21 1 Procrustes rotations 139 PROGRESS 16 Prolog 410 Protein engineering 233 Protein structures, prediction 157 QSAR 105,118 Quadratic programming 139 Quality assessment, external 372 Quality assurance 307,344 Quality Assurance Unit 346 Quality control 273,344,351 Quality in manufacturing 80 Quality index 474 Quality standards 342 Quantitative Structure Activity Relations (QSAR) 105,118 Random numbers 17,479 Reaction vessel 25 1 Receptor binding site 117 Receptors 203 Reconstruction, spectrum 477 Registration problem 39 Regression model 110 Regression, AR 477 Regression, best subset 139 Regression, linear 286 Regression, multiple linear 136
Regression, ordinary 112,134,136 Regression, robust 9, 15 Rendering, volume 3 1,36 Resonance Raman 21 Response surface methodology 85 RIA 285 RISC processors 21,356 Robotics 155,259,249,260,273 Robust regression 9,15 Robustness 86,477 RS/1 330 Sample preparation 274.283 Scanning electron microscopy 173 Scanning electron microscopy, computercontrolled 173,181 Scientific visualization 3, 32 Screening reactions 117,212,249 SDLC 351 SEM 179 Sensors 221 Sensors, calorimetric 230 Sequential simplex 86 Serial sections, object reconstruction 39 SGI Personal Iris 7 Simplex method 86 Simplex, sequential 87 Simulated NMR spectrum 401,402 Singular value decomposition 99 SMA 98 SMDformat 452 Software Development Life Cycle (SDLC) 35 1,352 SOP’S 297,329,343,345 Sound 456 Sparse designs 91 SpecTools 455,457 Spectra 145,445,447,450,455 Spectra, IR 445,447,455 Spectra, MS 445,455 Spectra,MR 145 Spectra, NMR 445,450,455 Spectra, overlapping 487 Spectra, simulated NMR 401,402 Spectra, UVWis 445,450,455 Spectral library 468,477 Spectral map analysis (SMA) 98 Spectrophotometricexperiments 259 Spectrum reconstruction 477
Spline smoothing 139 Splines 286 SQL 354 SSADM 297,307,311 Standard Molecular Data (SMD)format 452 Standard Operating Procedures (SOP'S) 297 Standardization, format in spectroscopy 445 Standards for multidimensional experiments 452 Standards in health care informatics 365 Standards, graphics 379 State space model 211 Statistical design 263 Statistical process control 73 Structure editor 395 Structure generation 121 Swuctured Query Language (SQL) 354 Structured Systems Analysis and Design Mcthodology (SSADM) 297,307,311 Substituent selection 106,462,463 Supercomputers 3 , 9 SVD 99 Syntactic analysis 408 Tapering function 481 Target-MCDM 94 Theodolites 417 TICA 407.409
Tolerance-MCDM 94 Tomograms 31 Tools, debugging 5 Tools, SpecTools 455 Transputer array 21 Turbo Pascal 237 Ultrasound 31 UNIFACE 330 User interface 31,282,415 UV/v;s spectra 445,450,455 VaLID system 308 Validation 216,275,309,341,351 Vectorization 10 Verification 351,483 Visualization 3,32 Volume rendering 31.36 Voxel 33 Voxel, color 33 Voxel, transparency 33 VP-Expert 264 Windowing function 481 Workstations 3.28 1,467 X-Windows 7,354,445 X.400 372