TIPS & TRICKS: CAPTCHA That Form. Dealing with misuse of online forms, by Ben Ramsey
CROSSING THE DIVIDE: OBJECT PERSISTENCE IN PHP. Forgetting storage and focusing on functionality, by Theo Spears
An OO Layered Approach to Web Apps. You can more confidently develop code by knowing its place and responsibilities, by Ronel Sumibcay
References in PHP: An In-Depth Look. PHP’s handling of variables explained, by Derick Rethans
Homo Xapian: The Search for a Better Search...Engine. Open-source search technology that you can integrate directly into your PHP scripts, by Marco Tabini
TEST PATTERN: The Construction Industry. Dependencies and Object Construction, by Marcus Baker
PRODUCT REVIEW: Agata 7 Report Generator. A cross-platform database reporting tool, by Peter B. MacIntyre
Exit(0); Tales from the Script, by Marco Tabini
Download this month’s code at: http://www.phparch.com/code/
EDITORIAL
Political Internals

There’s an interesting discussion taking place on the PHP-Internals mailing list as I type this. A couple of days ago, Andi bravely resurrected the PHP 5.1 thread that began many months ago, and it has started a flood of discussion.

The Internals list is a strange beast. It can lie nearly dormant for weeks at a time, and then, overnight it seems, the sleeping giant is awakened with an onslaught of comments and opinions. It’s a really strange feeling to wake up and find a list that usually gets five or fewer posts overnight suddenly dominating my inbox with several dozen loud messages.

The topic du jour, this time around, is once again GOTO support in PHP. What seems like a little more than half of the “voters” (not that many of them actually carry much weight amongst the PHP core developers) are for GOTO support, while the other, slightly more conservative (some might even say “wise”) bunch are against it.

I, for one, am completely undecided on this issue; I see benefits to both sides. Yes, GOTO would be nice for certain types of deeply nested parsing algorithms (I love playing with the PHP tokenizer, for example), and for other things like code generation (many current code-generating packages for other languages make liberal use of GOTO-like constructs). But the other side of the story is that I know that within months of a released GOTO implementation, I’ll get stuck debugging a huge plate of PHP-spaghetti. Some of the code I’ve had to maintain without GOTO has been knotty enough, thank you. So, once again, we’re at the crossroads of power and simplicity.

The really interesting part of this thread, for me, though, is the political stance that people have taken with this discussion. I’ve got a certain amount of respect for an equal amount of zealotry. But don’t cross the line. Fortunately, we haven’t seen much mudslinging yet. I’ll give it another few hours. It’s only a matter of time before someone starts whining about how “[favorite language] has this feature, we must have it!” and is met with a swift verbal kick from a member of the PHP Group, reminding him that “PHP is not [favorite language]!”

It should be interesting to see how this all pans out. Fortunately, I’ve got a fresh beer and an air-conditioned pub to keep me company while I write this and the drama plays out.

As far as this issue goes, I believe it’s the best one since I started editing, three issues ago. We’ve got a lot of really great content this month; especially interesting, for me at least, is the in-depth look at PHP variable internals by Derick Rethans. Security Corner is on a mini-hiatus to make room for the return of Tips & Tricks, which has a new author (whom you’ll meet when you flip the pages).

Summer is here! Well, at least in the northern hemisphere. Enjoy reading this issue on your patio, or while floating in your pool (unless, of course, you’re a PDF subscriber, in which case the pool might not be such a good idea unless you’ve printed a copy).
php|architect™
Volume IV - Issue 6 June, 2005
Publisher Marco Tabini
Editorial Team Arbi Arzoumani Peter MacIntyre Eddie Peloke
Authors Marcus Baker, Peter B. MacIntyre, Ben Ramsey, Derick Rethans, Theo Spears, Ronel Sumibcay, Marco Tabini
php|architect (ISSN 1709-7169) is published twelve times a year by Marco Tabini & Associates, Inc., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada. Although all possible care has been placed in assuring the accuracy of the contents of this magazine, including all associated source code, listings and figures, the publisher assumes no responsibilities with regards of use of the information contained herein or in all associated material.
What’s New?

PHP-MultiShop 0.7
Php-MultiShop.com releases the latest version of their CMS and eCommerce system, version 0.7. The website describes Php-MultiShop as:
"Php-MultiShop is a CMS & eCommerce System, an OpenSource platform to realize a virtual mall that includes various shops and contents. The user will have a global vision of the portal, to read the most interesting content (news, forums, curiosities, suggestions, reviews, cultural or commercial events, fairs, recipes, tourist itineraries,...) and will have the possibility to visit the shop desired. Every shop will have all the functions and the personalization of a traditional e-commerce web-site, as if it were independent from the virtual mall. It will have its own internet domain and could be administrated in full autonomy by its own administrator. At the same time, it can be distinct from the mall and other shops thanks to the personalized graphics, individual style, organization, contents and products, like every shop in a real market place. Besides, being part of a large place able to attract different typologies of visitors and consumers, it will be visible and more easily findable, increasing its audience and potential market. Php-MultiShop is written in PHP, run on Apache webserver and MySQL database server, and is able to run on any PHP and MySQL environment, including Linux, Solaris, BSD, Mac OS X, and Microsoft Windows environments. To realize the portal, the popular CMS PhpNuke is used, and for each shop the efficient osCommerce e-commerce suite."
Check out Php-MultiShop at php-multishop.com.

eZ publish 3.6
ez.no announces:
"eZ systems is proud to announce the release of eZ publish 3.6. This release presents yet another big step forward for eZ publish, with many improvements throughout the system. eZ publish 3.6 is loaded with new features. The most significant new features are:
• Support for database transactions
• Real preview of new content in the administration interface
• HTML caching of static pages
• Improved support for internal links in XML fields
• Vastly improved template syntax
• A developer toolbar to clear cache and enable debug features on the fly"
Visit ez.no for all the latest information or to download.

MySQL 5.0.6
MySQL 5.0.6 has been released and is ready for download. Some changes in this release include:
• The GRANT and REVOKE statements now support an object_type clause to be used for disambiguating whether the grant object is a table, a stored function, or a stored procedure. Use of this clause requires that you upgrade your grant tables.
• Added a --show-warnings option to mysql to cause warnings to be shown after each statement if there are any. This option applies to interactive and batch mode. In interactive mode, \w and \W may be used to enable and disable warning display.
• SHOW VARIABLES now shows the slave_compressed_protocol, slave_load_tmpdir and slave_skip_errors system variables.
• If strict SQL mode is enabled, VARCHAR and VARBINARY columns with a length greater than 65,535 are no longer silently converted to TEXT or BLOB columns. Instead, an error occurs.
Check out http://dev.mysql.com/doc/mysql/en/news-5-0-6.html for more changes.

ZEND Core for IBM Beta
IBM announces the release of the ZEND Core for IBM Beta. IBM describes the core as: "a seamless out-of-the-box, easy to install and supported PHP development and production environment. The product includes tight integration with DB2, the IBM Cloudscape database server, and native support for XML and Web Services, while also supporting increased adoption of Service Oriented Architectures (SOA). It delivers a rapid development and deployment foundation for database driven applications and offers an upgrade path from the easy-to-use, lightweight Cloudscape database to the mission critical DB2, by providing a consistent API between the two."
Get all of the latest information from http://www-306.ibm.com/software/data/info/zendcore/

June 2005 ● PHP Architect ● www.phparch.com
Check out some of the hottest new releases from PEAR.
File_Fstab 2.0.2 File_Fstab is an easy-to-use package which can read & write UNIX fstab files. It presents a pleasant object-oriented interface to the fstab. Features:
• Supports blockdev, label, and UUID specification of mount device.
• Extendable to parse non-standard fstab formats by defining a new Entry class for that format.
• Easily examine and set mount options for an entry.
• Stable, functional interface.
• Fully documented with PHPDoc.
SOAP 0.9.1 Implementation of SOAP protocol and services
File_Archive 1.3.0 This library is strongly object oriented, which makes it very easy to use and keeps your code simple, yet the library is very powerful. It lets you easily read or generate tar, gz, tgz, bz2, tbz, zip, ar (or deb) archives to files, memory, mail or standard output. See http://poocl.la-grotte.org for a tutorial.
Crypt_Blowfish 1.0.1 This package allows you to perform two-way blowfish on the fly using only PHP. This package does not require the Mcrypt PHP extension to work.
Looking for a new PHP Extension? Check out some of the latest offerings from PECL.
big_int 1.0.7 Functions from this package are useful for number theory applications, for example in two-key cryptography. See /tests/RSA.php in the package for an example of a simple implementation of an RSA-like crypto-algorithm. See the http://pear.php.net/packages/Crypt_RSA/ project for a more complex implementation of RSA-like cryptography, which supports key generation, encryption and decryption, and the generation and validation of digital signatures. The package has many bitset functions, which allow you to work with arbitrary-length bitsets. This package is much faster than the BCMath extension bundled with PHP and contains almost all of the functions implemented in the PHP GMP extension, but it does not need any external libraries.
svn 0.1 Bindings for libsvn.
WinBinder 0.41.154 WinBinder is a new extension that allows PHP programmers to build native Windows applications. It wraps the Windows API in a lightweight, easy-to-use library so that program creation is quick and straightforward.
intercept 0.3.0 Allows the user to have a user-space function called when the specified function or method is called.
ingres 1.0 This extension supports Computer Associates's Ingres Relational Database.
Oracle and Zend Partnership
Oracle and Zend Technologies, Inc., the PHP company and creator of products and services supporting the development, deployment and management of PHP-based applications, announced that the companies have partnered to produce Zend Core for Oracle™, a fully tested and supported free download that will deliver tight integration between Oracle Database and Zend's supported PHP environment, enabling developers to get up and running in minutes with PHP and Oracle infrastructure. Scheduled for availability in CQ3, Zend Core for Oracle will deliver reliability, productivity and flexibility to run PHP applications tightly integrated with Oracle Database. Zend will offer support and updates for Zend Core for Oracle, which will be compatible with Zend's existing products such as Zend Platform and Zend Studio. For more information visit: http://www.zend.com/

The Zend PHP Certification Practice Test Book is now available!
We're happy to announce that, after many months of hard work, the Zend PHP Certification Practice Test Book, written by John Coggeshall and Marco Tabini, is now available for sale from our website and most book sellers worldwide! The book provides 200 questions designed as a learning and practice tool for the Zend PHP Certification exam. Each question has been written and edited by four members of the Zend Education Board, the very same group who prepared the exam. The questions, which cover every topic in the exam, come with a detailed answer that explains not only the correct choice, but also the question's intention, pitfalls and the best strategy for tackling similar topics during the exam. For more information, visit http://www.phparch.com/cert/mock_testing.php
TIPS & TRICKS
CAPTCHA That Form Before It Gets Away!
by Ben Ramsey
Abuzz with discussions, arguments, and numerous opinions on solutions to the problem, the PHP community has been focused, lately, on how to prevent weblog comment spam and how to protect one’s forms in general— be they comment forms, e-mail forms, etc. The topic has graced the pages of blogs, and threads on the subject have adorned more than one mailing list. Some say it’s a PHP security problem; others blame the developers. But one thing is certain: it’s just plain annoying.
How can we combat comment spam or verify that those using our forms are actually doing so from our pages and not some remote script out there? I don’t pretend to have the definitive answer, and, in fact, this month’s Tips & Tricks column doesn’t attempt to provide a concrete solution, but I will point out a few erroneous practices, show how they leave forms vulnerable by providing examples of scripts that can misuse your forms, and provide a few “best practices” for securing your forms.

There are several popular methods out there for protecting Web forms. Almost all of them, however, aim to accomplish the same result, which is to determine the difference
between a human and a computer (or automated script). Some scripts embed a token of some sort in the form and set a cookie or session variable. Others provide the user with a CAPTCHA (Completely Automated Turing test to tell Computers and Humans Apart) image of a word or phrase that the user must enter. Some check the Referer header. Still others implement some variant of each of these methods.

The problem is that any script can simulate a valid user (read “human”) interaction with a form, and some feel that, as long as the script is properly simulating a user session, it’s okay. Yet, if your forms are set up improperly, these user-simulating scripts can continually access your script using the same session, potentially flooding you with spam. This month’s Tips & Tricks examines three popular methods of “securing” forms and shows how to keep external scripts from posting to them.

The Embedded Token Method

The simplest and perhaps most user-friendly method of “securing” a Web form is what I’m referring to as the “embedded token” method. The embedded token method is simple because it only requires a few lines of code to implement, and it’s user-friendly because it does not
require any additional action from the user to validate their human identity (there is no word or phrase to type). It simply relies on the presence of a user agent (Web browser) visiting the form. The server either sets a session variable or asks the browser to set a cookie that is then checked against a hidden form field when the user submits the form. Listing 1 illustrates a very basic implementation of this method. The problem with the embedded token method in its most basic form is the assumption that only a Web browser can set a cookie or make use of sessions. This could not be
further from the truth, since the Web server will send a Set-Cookie header to the user agent, which doesn’t necessarily have to be a browser. As long as something can parse and read HTTP response headers and send valid HTTP requests, it is a user agent, even if it’s a script (PHP or otherwise). Listing 2 illustrates a script that uses the PEAR package HTTP_Request to send and receive valid HTTP headers, including the ability to capture the Set-Cookie header and send it back to the server in a valid request. For all intents and purposes, this script is a valid user, and the form treats it as such.

Listing 2 (the opening lines, the form URL, the token pattern, and the final request did not survive in this copy; they are reconstructed here as assumptions)

    <?php
    require_once 'HTTP/Request.php';

    /* GET the form (hypothetical URL) */
    $req = new HTTP_Request('http://example.org/form.php');
    $req->setMethod(HTTP_REQUEST_METHOD_GET);
    $response = $req->sendRequest();

    /* Assumed pattern: a hidden input named "token" */
    $regex = '/<input type="hidden" name="token" value="([^"]*)"/';
    if (preg_match($regex, $req->getResponseBody(), $matches)) {
        $token = $matches[1];
    }

    /* Capture the session cookie set by the server */
    $cookies = $req->getResponseCookies();
    foreach ($cookies as $cookie) {
        if (strcmp($cookie['name'], 'PHPSESSID') == 0) {
            $session_id = $cookie['value'];
        }
    }

    /* POST to the form with the session */
    $req->setMethod(HTTP_REQUEST_METHOD_POST);
    $req->addCookie('PHPSESSID', $session_id);
    $req->addPostData('message', 'I simulated a user!');
    $req->addPostData('token', $token);
    $response = $req->sendRequest();

This script, of course, assumes the presence of the “token” form field and assumes that this form field will never change in any way: the name will always be “token,” and it will always exist in its present form. It uses a regular expression to grab the actual token from the form field and send it back in a subsequent POST action. Now, this regular expression could be much more complex to accommodate changing parameters within the form field, so that it is not so limited in the field that it must find. However, as I see it, there must be a constant in order for this type of simulated form post to work: the field name. If the field name of the token is always constant, then this external post to the form will always work. If you work out a way to randomize the field name, then you can block external scripts from making use of yours. Randomizing the field name may seem like a superfluous extra step to block others from using your scripts, but it could save you from unnecessary spam, flooding, or even being used as a spam e-mail relay.

The Referrer Check Method

Another common approach to blocking scripts from using your forms is to check the Referer header using $_SERVER['HTTP_REFERER']. This is often a suggested method that many believe will completely block external scripts from using your forms. However, just about every server-side scripting language has the ability to modify the HTTP Referer header, and I’m told even some proxies will change it as well. Let’s take our example in Listing 1. It’s simple to modify the code to check the Referer. Just modify the if statement checking for the posted “message” field to include a second check against the Referer header, as shown here:
    if (isset($_POST['message']) &&
        preg_match("/^http:\/\/benramsey\.com/", $_SERVER['HTTP_REFERER'])) {
Now, the script will only process the form if the Referer matches any page from http://benramsey.com. Of course, if the Referer were from http://www.benramsey.com, it would fail, but the regular expression used here is simple; it can be made more complex to allow for other variations of domain names.

Just as the Referer check method is easy to implement, it’s similarly easy to fake a Referer header with PEAR::HTTP_Request. Adding the following line of code to the POST request in Listing 2 will trick the form into thinking that the POST it’s receiving is being sent from http://benramsey.com when, in reality, it could be sent from anywhere on the Web.

    $req->addHeader('REFERER', 'http://benramsey.com');
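The remark that the pattern can be extended to cover other host name variations can be illustrated with a short sketch. It compares the host parsed out of the Referer against an allow-list instead of matching a URL prefix, which also avoids a prefix pattern accepting an address such as http://benramsey.com.example.org/. The function name and the allow-list are hypothetical, and the header remains just as easy to forge, so this only tidies the check; it does not make it trustworthy.

```php
<?php
// Hypothetical helper (not from the column): true when the Referer's host
// is exactly one of the allowed host names.
function referer_host_allowed($referer, array $allowedHosts)
{
    if (!is_string($referer) || $referer === '') {
        return false;
    }
    // parse_url() returns null or false when no host can be extracted.
    $host = parse_url($referer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }
    return in_array(strtolower($host), $allowedHosts, true);
}

// Assumed allow-list covering both spellings of the domain.
$allowed = array('benramsey.com', 'www.benramsey.com');

$ok      = referer_host_allowed('http://www.benramsey.com/contact.php', $allowed); // true
$spoofed = referer_host_allowed('http://benramsey.com.example.org/', $allowed);    // false
```

In a form handler the first argument would be $_SERVER['HTTP_REFERER'] when it is set, and an empty string otherwise.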
The Referer header is not a good safeguard for your scripts. It’s too easy to manipulate, and this is not a fault of PHP; almost every scripting language can do this.

The CAPTCHA

CAPTCHAs are quickly becoming a preferred method of determining whether a form post is from a valid user or a script. Their popularity has also led to great annoyances caused by unfriendly user experiences due to the terrible readability of most CAPTCHA images. Nevertheless, the CAPTCHA seems here to stay.

For the most part, the CAPTCHA image is an effective means of blocking external scripts from using your forms. However, I have seen several implementations that leave much to be desired from the programmer. For example, I have seen scripts that simply embed the actual CAPTCHA phrase in a hidden field. In this case, a script such as the one shown in Listing 2 can easily grab the phrase and return it in a post to the form. This form of security does nothing to hinder external scripts from using your forms. It merely
gives the appearance of tighter control while aggravating your real users who must squint to guess at the CAPTCHA phrases. Never store your CAPTCHA phrase in a hidden field. If you must do so, use md5() and salt to disguise the word or phrase.

Listing 3 uses PEAR::Text_CAPTCHA to create a simple CAPTCHA test. Much like the example from Listing 1, it sets the phrase to a session variable for checking against the posted user input. Instead of placing the phrase in a hidden form field like the token, however, the user is required to enter the word or phrase here. Already, the security is increased because external scripts cannot request this page and grab the phrase from the code as shown in Listing 2.

However, not everything is perfect here. If a malicious user is feeling rather, well, malicious, he can manually access this form on your site through a Web browser and grab the session ID, which is automatically saved to a cookie on his machine. He can also make note of the CAPTCHA phrase and then leave your site without otherwise touching the form. Now, armed with a session ID and phrase, he can use the code in Listing 4 to simulate a normal user posting to your form and entering a proper CAPTCHA phrase. As long as the session ID remains active on the server, the CAPTCHA phrase will work.

This may not seem like a big deal since it’s a lot of work for someone to go through simply to flood your site with posts, but it is an opportunity that you will want to close to outside scripts, and this is easy to do. All you must do is unset the session variable after processing the form.

    unset($_SESSION['phrase']);
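A minimal sketch of that one-time check follows. The function name is hypothetical, and the session is passed in as a plain array so the logic can be read and tried on its own; in a real page it would be $_SESSION after session_start(). As an assumption that goes slightly beyond the text, the phrase is discarded whether or not the answer matches, so a failed guess also needs a fresh CAPTCHA.

```php
<?php
// Hypothetical helper: compare the submitted text with the stored phrase,
// then throw the phrase away so it cannot be replayed.
function check_captcha_once(array &$session, $submitted)
{
    if (!isset($session['phrase'])) {
        return false; // nothing to check against (already used, or never set)
    }
    $expected = $session['phrase'];
    unset($session['phrase']); // one attempt per phrase

    return is_string($submitted) && $submitted === $expected;
}

// Usage with a stand-in for $_SESSION and a made-up phrase.
$session = array('phrase' => 'k7wq2');
$first   = check_captcha_once($session, 'k7wq2'); // true
$second  = check_captcha_once($session, 'k7wq2'); // false: the phrase is gone
```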
An external script will now be able to fool your CAPTCHA exactly once, but the phrase will no longer be valid in the session after its use, so the script cannot continue to post to your form. This seems like a no-brainer, but I’m amazed at how often I see this simple step left out of code examples and actual production code. It’s not a hard thing to do, and it doesn’t take rocket science, but it’s an often-overlooked practice.

The Security Question

Throughout this column, I’ve been referring to these examples as being “insecure” and giving you tips on how to “secure” the code. In reality, these are not true security concerns. Left unchecked, your server or database will not be open to attacks. However, your web site forms may be open to spamming and flooding, and you could potentially be used as an e-mail relay, depending on how your forms are set up. In general, and as a related aside, you should never use a “form mail” script that requires a hidden form field for a To address. Even if the script checks the Referer header, you are vulnerable as a spam relay; it’s happened to me. Instead, always set the To address from the server-side and within the actual PHP code.

Smart Programming

In this column, my debut effort for Tips & Tricks, I’ve given several examples of how external scripts can use your forms even when you’re sure they can’t. Plus, I’ve shown you how to use PEAR::HTTP_Request to simulate a valid user and act as a user agent. I’ve shown more tricks than I have tips, but in the end, being a smart programmer is the key. It is my hope that you’ll take these few tips and expound upon them as you program applications. Being a smart programmer means thinking through the problem and even considering how others may abuse your application. Only then will you be able to tackle real security problems head on. Until next time, be sure to practice safe coding!
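To make the earlier suggestion of randomizing the token’s field name concrete, here is a minimal sketch. Everything in it is an assumption rather than one of the column’s listings: the function names, the tok_ prefix, and the use of random_bytes() and hash_equals(), which need a newer PHP than the versions current when this column was written. The session is passed as a plain array for illustration, and the token is consumed on first use in the same spirit as the CAPTCHA phrase.

```php
<?php
// Hypothetical helper: store a token under a random field name and
// return the hidden input to embed in the form.
function issue_form_token(array &$session)
{
    $field = 'tok_' . bin2hex(random_bytes(8));  // field name changes per form
    $value = bin2hex(random_bytes(16));
    $session['form_token'] = array('field' => $field, 'value' => $value);

    return '<input type="hidden" name="' . $field . '" value="' . $value . '" />';
}

// Hypothetical helper: check the posted data against the stored token,
// discarding the token so it is valid for a single submission.
function consume_form_token(array &$session, array $post)
{
    if (!isset($session['form_token'])) {
        return false;
    }
    $token = $session['form_token'];
    unset($session['form_token']);

    $field = $token['field'];
    return isset($post[$field])
        && is_string($post[$field])
        && hash_equals($token['value'], $post[$field]);
}

// Usage with stand-ins for $_SESSION and $_POST.
$session = array();
$html    = issue_form_token($session);
$post    = array($session['form_token']['field'] => $session['form_token']['value']);
$valid   = consume_form_token($session, $post); // true
$replay  = consume_form_token($session, $post); // false
```

A script that reads every hidden input on the page can still follow along, so this raises the cost of abuse rather than removing the possibility.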
About the Author
Ben Ramsey is a Technology Manager for Hands On Network in Atlanta, Georgia. He is an author, Principal member of the PHP Security Consortium, and Zend Certified Engineer. Ben lives just north of Atlanta with his wife Liz and dog Ashley. You may contact him at [email protected] or read his blog at http://benramsey.com/.
To Discuss this article: http://forums.phparch.com/224
CROSSING THE DIVIDE: OBJECT PERSISTENCE IN PHP
by Theo Spears

Almost any PHP application needs to store some kind of data. While databases provide high performance and reliability, actually using them when writing object-orientated code can prove tedious. In this article, we’ll look at three solutions which help you to forget about storage and focus on functionality.
PHP has come a long way from being a little set of C-based helper code, used to maintain Rasmus Lerdorf’s résumé. Versions 4 and especially 5 have added many of the features needed to build enterprise web applications and make it reasonable to compare PHP with technologies such as Java Servlets and ASP.NET. Foremost amongst these additions has been the improvement of object support. While objects were possible in version 4, they were little more than syntactic sugar. PHP 5 now has comprehensive support for objects and, increasingly, people are using objects to build more complicated PHP scripts.

However, with comparable support for objects, PHP begins to face a problem which has confronted other languages: how are these objects to be stored? Relational SQL-driven databases remain the only widespread form of storage for web applications, but this does not fit smoothly with object-orientated code.
REQUIREMENTS
PHP: 4, 5
OS: Any
Other Software: PEAR
Code Directory: persistence

RESOURCES
DB_DataObject: http://pear.php.net/package/DB_DataObject
Propel: http://propel.phpdb.org/
EZPDO: http://www.ezpdo.net/
Persistence frameworks provide a bridge between object code and the database. To do this they perform a number of tasks. The simplest of these is transferring data from the member variables of objects to tables in the database, and back again. This often requires marshalling the data to the correct format for the database (see the discussion on dates, below, for an example).

A second role performed by persistence frameworks is managing the relationships between different objects. When objects contain references to other objects, these should be translated into foreign key entries or, for many-to-many joins, entries in a join table. Likewise, when loading objects, references should be automatically mapped back into objects. This, in turn, can cause problems with circular references, or with a single object load bringing many other objects from the database. To overcome this, most persistence frameworks use some form of lazy loading, where related objects are only loaded when they are accessed.

Lastly, loading an object from the database every time it is needed is expensive, so many persistence frameworks include some form of cache. This means multiple requests for the same object result in only a single database query, which can heavily reduce database load in complicated applications. In Java and ASP.NET this data is often cached between requests. In PHP, however, because PHP has no reliable framework for sharing memory between requests, this caching has to be performed on each page load. This makes loading performance more important with PHP than with other technologies.

There are several well known frameworks for object persistence with Java or Microsoft .NET, the most popular probably being Hibernate and NHibernate, respectively. There are also an increasing number of similar frameworks for PHP that are in various states of completion and offer varying functionality. Here we will look at three: DB_DataObject, Propel, and EZPDO. In order to demonstrate the strengths and weaknesses of each, we will use each to implement a very simple example that I have put together, to determine how straightforward each one makes coding the various parts.

School Manager

In the first example, we will be making a website for a school to keep track of its students and the classes that they are taking. While maintaining a relatively simple object model, this gives scope for testing some of the more advanced features of the persistence frameworks. Our website must allow people to do the following:
• Add new students and delete or modify existing ones
• Display a list of all students and find students with a given name
• View teacher details
• View class details
• Alter the list of students associated with a class

Of course, any real website would need many more options than these, but in terms of code, most would be similar to the ones above, just working with different objects. From this, I have created an object model which you can see in Figure 1. Notice that Teacher and Student are derived from a common class, Person. Inheritance is an important part of object-orientated programming and the part which is most difficult to translate into databases. Considering the small size of this project, I chose to do no further high-level preparation; the rest of the design will be decided upon as the code is written. For larger projects you should, of course, put more work into these early stages.

[Figure 1]

Installation and Configuration

I am choosing to use MySQL to provide database functionality with all three persistence frameworks.
MySQL was installed in the usual way, and a separate database was created for each framework. As both Propel and EZPDO provide their own SQL to create the database tables, and because the tables they generate are incompatible, using a shared database would be impossible. What you gain in coding speed with general tools like these, you sometimes lose in flexibility.

I then installed the three frameworks. Each took some trial and error to install initially, but once I had identified and installed all the requirements, installation was reasonably straightforward. The simplest was DB_DataObject, which, as a native member of PEAR, simply required the following PEAR packages (including
dependencies): PEAR, Archive_Tar, Console_Getopt, XML_RPC, DB, Date, and DB_DataObject.

Propel can also be installed via the PEAR installer, but it must be fetched from a separate repository by issuing the following commands:

    pear install \
      http://creole.phpdb.org/pear/creole-current.tgz
    pear install \
      http://propel.phpdb.org/pear/propel_runtime-current.tgz
For development you will also need to run:

    pear install \
      http://phing.info/pear/phing-current.tgz
    pear install \
      http://propel.phpdb.org/pear/propel_generator-current.tgz
Propel requires DOM support in your version of PHP. Most PHP users will already have this installed, but with some binary distributions (such as Debian) it may be necessary to install it as an additional module.

EZPDO, by contrast, is installed by downloading a package and extracting it somewhere in your project directory. You may wish to rename the folder to remove the version number so your code does not break when you upgrade to a newer version. EZPDO also requires the PHP XML and SPL modules, which again can be compiled in or installed from packages. It also needs items from PEAR, including a few packages that are in Beta state. To install these, you need to run:

    pear install XML_Util Log FSM
    pear uninstall XML_Parser
    pear -d preferred_state=beta \
      install XML_Parser XML_Serializer
Although not strictly related to installation, there is another hurdle to cross before you are able to use EZPDO: it has automated test suites, but these do not test all database options so, with PEAR DB and MySQL, it is necessary to fix a small bug. You should find a copy of a patch to fix this in the code archive for this article.

We can now configure each framework for our specific project. Each of the three packages has its own configuration file. There is no need to memorise the details for any of the frameworks, as all provide examples which you can copy and paste and then modify to suit your application.

DB_DataObject is, again, the simplest of the three to configure. It uses the PEAR options, so you must call PEAR::getStaticProperty for each property and set an appropriate value on this static property. However, the documentation includes a small script to load all the information from an ini file. This method of implementing your configuration is much simpler, but you might wish to avoid it if you are trying to squeeze out every last bit of performance. There are five settings: the database connection, a string your class names must start with in order to be stored in the database, and three paths to where the classes and schema details are
Listing 2

class Student extends Person {
    /**
     * The school year containing the student
     * @orm integer
     */
    public $year_group;
    /**
     * The tutor for this student
     * @orm has one Teacher
     */
    public $tutor;
    /**
     * The classes the pupil is part of
     * @orm has many class
     */
    public $classes;
}

Listing 3

// Get a list of the students in the database
$student = DB_DataObject::factory("student");
$result = $student->find();
if ( $student->count() == 0 ) {
    message("There are no students");
    return;
}
// Show a header for the table
show_student_list_header();
while ( $student->fetch() ) {
    // Fetch related classes as well
    $student->getLinks();
    // Show each student's details
    show_student_details ( $student );
}
show_student_list_footer();
stored. A sample configuration file for my computer is shown in Listing 1.

Propel is configured through an XML file that specifies where to log information and how to connect to the database. Unlike DB_DataObject and EZPDO, Propel doesn’t require all classes to be stored in a specific location; you include the classes directly, so they can be stored wherever you choose. This is especially useful for large projects, as classes from different modules can be stored in different directories.

EZPDO’s configuration is the least flexible of the three, and it has the most complicated configuration file. Configuration must be stored in the project directory, in a file called config.xml. The most important elements are source_dirs, default_dsn, db_lib and auto_flush. source_dirs controls the directory containing classes to be serialised. Assuming you leave recursive set to true, you can store classes anywhere under this path, although I did not bother with any subdirectories. default_dsn and db_lib control the database connection. On a Unix-like platform, you will probably want to use peardb as your db_lib, and default_dsn is then a standard PEAR database URI. auto_flush controls whether all items are automatically stored to the database when your script ends. Although useful, it will have serious effects on performance, so I recommend turning it off. You may also wish to modify the logging options to either turn off logging altogether, or to log to a database instead of the default file.

“Persistence frameworks provide a bridge between object code and the database.”

The Object Model

Each of the three frameworks has a different way of specifying the layout of classes and tables. With DB_DataObject, you create your tables and then either manually define classes or run a script to automatically generate them, while with EZPDO you define your classes and the tables are automatically created. Propel generates both SQL and classes from a separate XML file that you’ve created.

DB_DataObject fully supports string and numeric types. It also supports date types, although in a more awkward way: it allows you to treat them as if they were strings. Unlike in the other frameworks, dates are not converted into Unix timestamps, so dates before the start of the epoch (January 1st, 1970) are supported. If your application has to deal with dates outside of this range, this may be a critical consideration. Other data types may or may not work properly.

Bearing this in mind, you are free to create your tables however you choose. Once this is done, it is simplest to run the createTables.php script that is included with DB_DataObject, passing it an ini file in the same form
as the one used for configuring the runtime. This will generate classes for all of the tables in your schema. It will not, however, pull out any details about joins or foreign keys. Alternatively, you can manually create classes derived from DB_DataObject and create your own schema file. Personally, I do not recommend this: it is exactly the type of boring work persistence frameworks are meant to save you from.

If you wish to use the join functionality of DB_DataObject, you will need to create a configuration file specifying the links between tables. This is another ini file in the same folder as the schema generated by createTables.php. Its filename takes the form <database name>.links.ini. In this file, there is a section for each table and then entries in the form <column> = <foreign table>:<foreign column> for each relationship. Look at db_dataobject.links.ini in the DataObjects directory of the DB_DataObject project for an example of such a configuration file.

DB_DataObject has no native support for inheritance, although it is possible to share methods between classes by modifying the generated inheritance hierarchy to give them a common superclass. It is impossible to search for a superclass and get all subclasses that match.

As mentioned above, with Propel you instead create an XML file. This has various nodes for each table and column, and since it is XML, it is fairly self-documenting. The only item of note is that to specify a relationship, you must both define a column in the table for the foreign key and also separately define the relationship. Personally, I think XML is a poor format for defining classes and their relationships, especially as no DTD is provided to allow editors to auto-complete. I would suggest that you either generate this XML from another format, or use a GUI XML editor, if you wish to use Propel for anything more than the simplest of object models.

Propel has a very strange form of inheritance.
Essentially, it uses subtractive rather than additive inheritance; the superclass table must specify all fields and each subclass may choose which of those fields to use.
There is no built-in way of simply getting objects of a specific class; you must add code to filter the superclass entries yourself. Because of this, I chose not to use the inheritance feature and instead just used separate classes. Once this XML file is defined, you must run the propel-gen script, passing it the directory containing your schema.xml file. This will, in turn, generate all classes and an SQL file that will create the tables. This file will be placed under the build directory of your project.

Coming from a C# background, I found EZPDO the most intuitive. With EZPDO, you write your classes as if they were not going to be persisted at all. You then add the custom phpdoc @orm tag to each field that should be stored in the database, specifying its type. You can give a name for each table and column, but if you don’t, EZPDO will automatically choose one. This has the added side effect of encouraging you to document your variables. To specify relationships between classes, you use a class name as the type, together with information as to the type and multiplicity of the relationship. EZPDO supports composition (composed_of) relationships, where destroying the parent object destroys the children, and aggregation (has) relationships, where the child can exist without the parent. It also allows one-to-one (one) and one-to-many (many) relationships. For an example of this, see the Student class shown in Listing 2. As you simply define classes, inheritance works as expected, although again there is no way to fetch a superclass and get results for all matching subclass instances.

Let’s Write Some Software

It seems to have taken a long time to get to a position where we are ready to start writing some code, but hopefully, all of our preparation will prove worthwhile by allowing us to write the code more quickly, and with fewer bugs. Working from the list of requirements, let’s start by adding a page to list all of the students.
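Before moving on, the @orm tags described above for EZPDO can be made a little more concrete. The sketch below shows one way doc-comment tags of that kind could be read with PHP’s reflection API. It is an illustration of the idea only, not EZPDO’s own parser: the class ExampleStudent and the function read_orm_tags() are hypothetical names, and the syntax targets a current PHP release.

```php
<?php
/**
 * Hypothetical example class, modelled on the Student class in Listing 2.
 */
class ExampleStudent
{
    /**
     * The school year containing the student
     * @orm integer
     */
    public $year_group;

    /**
     * The tutor for this student
     * @orm has one Teacher
     */
    public $tutor;

    /** Not stored in the database. */
    public $scratch;
}

/**
 * Collect the text that follows "@orm" for each annotated public property.
 * Properties without the tag are left out of the result.
 */
function read_orm_tags(string $className): array
{
    $tags = [];
    $class = new ReflectionClass($className);
    foreach ($class->getProperties(ReflectionProperty::IS_PUBLIC) as $property) {
        $doc = $property->getDocComment();
        if ($doc !== false && preg_match('/@orm\s+([^\r\n*]+)/', $doc, $m)) {
            $tags[$property->getName()] = trim($m[1]);
        }
    }
    return $tags;
}

print_r(read_orm_tags('ExampleStudent'));
```

A mapping layer built this way would treat year_group as an integer column and tutor as a relationship to a Teacher class, while ignoring the untagged property.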
For all three frameworks, the logic is exactly the same: ask the framework for a list of all students and then iterate through that list, displaying each entry in a table.

With DB_DataObject, this is a matter of creating an empty student object, either directly or through the DB_DataObject::factory() function, and then calling its find() method. Using DB_DataObject::factory() has the advantage that the required PHP files will automatically be loaded for you. As we have not set any properties on the object, calling find() gets a list of all students in the database, regardless of their properties. See Listing 3 for some simple pseudo-code. Note that, as we want to show details from the student’s tutor, which is stored in a different table, we have to call getLinks() on each student. This loads the tutor into the _tutor member. We could also have called joinAdd() before
finding the entries, to reduce the number of queries, but this produces a less logical object.

With Propel, we create an empty Criteria object to match all students, and then call StudentPeer::doSelectJoinTeacher() to directly get all students and their tutor details from the database. Note that the name is doSelectJoinTeacher(), not doSelectJoinTutor(); it is based on the name of the foreign table, not the name of the foreign key column in the local table. We can then call getTeacher() to get details of the student’s tutor. See Listing 4 for an example of this.

Although doing this with DB_DataObject and Propel was hardly complicated, EZPDO makes this task the easiest. Here, it is a matter of getting an instance of the persistence manager with the epManager::instance() function, and then calling getAll('Student') on the returned instance. Again, the required files are automatically included. See Listing 5.

Listing 5

// Get a list of all the students in the database
$student_list = epManager::instance()->getAll('Student');
// Check there are any
if ( $student_list == false ) {
    message("There are no students");
    return;
}
// Show a header for the table
show_student_list_header();
foreach ( $student_list as $student ) {
    // Show each student's details
    show_student_details ( $student );
}
show_student_list_footer();

Listing 6

// Populate a class object to use for the search
$class = DB_DataObject::factory('class');
$class->selectAs();
// We also want the size of each class
// so link a student record
$class->groupBy('class.id');
$student = DB_DataObject::factory('link_student_class');
// Objects are passed by handle in PHP 5, so no & is needed here
$class->joinAdd($student, "LEFT");
$class->selectAdd("count(link_student_class.student_id) "
    . "as student_count");
$class_count = $class->find();

There are a few other differences I came across when working with the list of objects that each framework returned. While Propel and EZPDO return an array of objects, DB_DataObject returns an iterator, on which you call fetch() repeatedly. This is slightly less object-like, but hardly a problem. More annoyingly, in terms of increased typing, Propel objects use get and set methods where DB_DataObject and EZPDO use direct assignment. This also means Propel code requires slightly more effort to read for people coming from a C# background, although Java programmers should be used to it.

We also want to be able to look up specific students based on their names. For this, I modified the code that shows all students, so it could optionally be limited to showing only students matching certain criteria. All three frameworks have a simple method for specifying criteria when you simply wish to check for equality. DB_DataObject and EZPDO do this by setting values in an object and then looking for all objects like it, while with Propel, you add constraints to the Criteria object. However, because I want to allow searching on substrings, things are more complicated. This is simplest with Propel, where it is simply a matter of adding a constraint to the Criteria object and specifying that a LIKE match is required. With DB_DataObject, it is necessary to use the whereAdd() method to directly add an SQL WHERE constraint to the request. EZPDO requires the most complicated implementation; it uses its own query language to allow you to constrain the results, so you must call the following to get a list of students matching a name:

    $student_list = epManager::instance()->find(
        "from Student as student where student.full_name like ?",
        $name_template );

We can use exactly the same methods to get a list of all the teachers. In fact, with a little use of polymorphism, we could use exactly the same code to get the list of students or the list of teachers, although to keep things simple in this example, I chose not to.

However, we also want to be able to view details of an individual teacher. To do this, we must select a single teacher entry from the database. This is again done in similar ways with all three frameworks. With DB_DataObject, you use the DB_DataObject::staticGet($class_name, $primary_key) function, while in Propel you must call TeacherPeer::retrieveByPK($primary_key). EZPDO uses the get($class_name, $primary_key) function on an instance of the persistence manager. In all three cases, this returns an instance of the relevant object. As well as showing the teacher’s own properties, I
wanted to show a list of their tutees. To do this with DB_DataObject, the simplest way was to fetch from the database all of the students whose tutor field matched the primary key of the teacher being viewed. This is effectively querying as if there were no persistence framework, just an object wrapper. It is effective, but breaks the object-orientated encapsulation I was hoping for. Propel is far more promising in this regard. It provides a Teacher::getStudents() function that returns an array of all the teacher’s tutees. By iterating through this array, we get a list of student objects. Similarly, EZPDO allows us to access the tutees property as an array of student objects.

Next, let’s get a list of courses. This is slightly more complicated, as I also want to display how many students are enrolled in each course. Here, the abstraction provided by DB_DataObject was rather inadequate. The solution I came up with involved manually adding a join between the course table and the course-student join table, and manually adding a column to select the COUNT() of rows in the database. Although arguably slightly less work than using raw SQL, I still largely had to think in terms of relational databases. For the code I used, refer to Listing 6. Propel again provided a better level of abstraction. Using the getStudentCourseRefJoinCourse() function, I was able to get an array of students which I could count. As this suggests, many of the function names for many-to-many joins in Propel are rather cumbersome,
but you can see the list by looking at the generated code. If you are going to be using a function often, you can always wrap it to have a friendlier name. With EZPDO, it is again just a matter of calling count() on the students member.

Moving on, let’s now look at how we can create and store new objects, in this case for our students. In each case, this is done by creating a new class instance and setting its properties, then telling the framework to store it in the database. With DB_DataObject this is done with the DB_DataObject::factory() function, as mentioned above. Once the object is ready to be added to the database, you call its insert() method to store it. Propel is the simplest of the three when it comes to implementing this. You create the object as you would any other, with the new operator, and then call save() to store it. EZPDO works slightly differently. You create the object with the create() function on the persistence manager, but it is stored by calling the commit() method on the manager, rather than on the object itself. This makes for slightly longer code. If you enabled auto_flush in the configuration file, this step is optional.

EZPDO also works slightly differently in terms of adding relationships. With DB_DataObject and Propel, all relationships are implicitly bidirectional, so setting the tutor for a student automatically adds the student to that teacher’s list of tutees. However, with EZPDO, links in both directions are discrete, so you must also modify the teacher to add a reference to the student. This is the expected behaviour from a purely object-orientated point of view, but it is far less convenient and violates the Don’t Repeat Yourself principle of avoiding data duplication. For an example, look at Listing 7.

Editing entries is done in almost exactly the same way as adding new ones. Instead of creating a new record, you fetch an existing one, as detailed above.
You modify its properties appropriately, and then save it back to the database. With Propel and EZPDO this is exactly the same as saving a new object; with DB_DataObject you must call update() instead of insert().

Deleting is done in a similar manner. With DB_DataObject, you call delete() on the relevant object. You can also create an empty object and use whereAdd() to delete all objects matching specific criteria. With Propel, you can delete a specific object or delete any object matching a provided Criteria object. With EZPDO, you call delete() on the persistence manager to delete an object; however, there is no way to delete all objects matching certain criteria.

Other Considerations

As important as the ease of use of a framework is the quality of the documentation that is provided for it. In this respect, all three frameworks do rather well.
Figure 2: Number of queries per operation

OPERATION                     DB_DataObject   Propel   EZPDO
LIST STUDENTS                           129        2     158
LIST CLASSES                              9        9     117
ADD A NEW STUDENT                         2        5      46
MODIFY AN EXISTING STUDENT                2        6    2518
SHOW A TEACHER                            3        7      87
SHOW A CLASS                             34       19      66
CHANGE MEMBER OF CLASS                   11       51    2138
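The counts in Figure 2 are easier to compare once they are totalled per framework. The following is a small sketch of that arithmetic; the numbers are copied from the figure, while the function name total_queries() and the variable names are hypothetical, and the syntax targets a current PHP release.

```php
<?php
// Query counts copied from Figure 2.
// Column order: DB_DataObject, Propel, EZPDO.
$frameworks = ['DB_DataObject', 'Propel', 'EZPDO'];
$queries = [
    'List students'              => [129, 2, 158],
    'List classes'               => [9, 9, 117],
    'Add a new student'          => [2, 5, 46],
    'Modify an existing student' => [2, 6, 2518],
    'Show a teacher'             => [3, 7, 87],
    'Show a class'               => [34, 19, 66],
    'Change member of class'     => [11, 51, 2138],
];

// Sum each column to get one total per framework.
function total_queries(array $frameworks, array $queries): array
{
    $totals = array_fill_keys($frameworks, 0);
    foreach ($queries as $counts) {
        foreach ($frameworks as $index => $name) {
            $totals[$name] += $counts[$index];
        }
    }
    return $totals;
}

$totals = total_queries($frameworks, $queries);
foreach ($totals as $name => $total) {
    echo str_pad($name, 15) . $total . "\n";
}
```

Across the seven operations this gives 190 queries for DB_DataObject, 99 for Propel and 5130 for EZPDO.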
DB_DataObject provides the standard PEAR API reference, along with a user guide that takes you through each function, with lots of examples. It is worthwhile to read all of the documentation if you plan to use DB_DataObject, as it provides some more efficient methods of performing certain tasks. Propel also provides an extensive API reference and user guide. The user guide walks you through setting up a simple example and then goes on to cover more advanced topics. It is probably the best documentation provided by any of the three frameworks. EZPDO provides a walk-through tutorial on its website that takes you through setting up a simple project, which is included with the EZPDO framework download. Its coverage of more advanced areas is adequate. It may not be as comprehensive as that of Propel, though this may in part be because it is simpler to use. All of its functions are fully documented in the phpdoc scheme, although unfortunately, I could not find a copy of the generated documentation online, so if you want it you may have to generate a copy for yourself.

For people building large sites, another important consideration will be performance. I always question the value of “microbenchmarks,” but Figure 2 shows the number of queries each framework needs for various operations. The results show query counts of the same order for DB_DataObject and Propel in most operations (listing students, at 129 queries against 2, is the main exception), but far higher counts for EZPDO, especially when modifying existing data. Clearly, there is room for improvement in the EZPDO engine, especially where existing data is modified, which seemed to require an absurd number of queries. This may, in part, be due to my use of circular references, but this is not that uncommon a pattern. Hopefully, this is something that will be fixed in later
versions, as EZPDO did feel noticeably slower to me than the other two frameworks.

Lastly, a brief note on version support: DB_DataObject fully supports PHP 4 and 5. Propel primarily supports PHP 5, although work is being done on a PHP 4 version. EZPDO is firmly PHP 5 only and would be extremely difficult to port to PHP 4. While I recommend people use PHP 5 and take advantage of the new features it provides, in some cases this is not feasible and EZPDO may simply not be an option.

Conclusion

I wish to stress that the three frameworks I looked at here are by no means all of the options available for PHP. If there is another framework you like, and think it is better than the ones presented here, we would love to have you come and tell us its merits in the discussion forum. Likewise, if you think there is something crucial I have failed to mention, let everyone else know about it.

Restricted to these three, your choice will inevitably depend on your situation. DB_DataObject does not really do a full translation between the world of objects and the world of databases, and is more just a wrapper around SQL. I chose to include it because of its getLinks() and joinAdd() functionality but these do
not really compare with the power of Propel and EZPDO. It is a good choice if you want a wrapper that avoids the need to write SQL, but are happy to think in a relational manner. EZPDO, by contrast, completely abstracts the database, almost to too great an extent, though I found it the most pleasant of the three alternatives to use. Unfortunately, it has severe performance problems at the moment. Hopefully these will be overcome in later versions, but until then I cannot really recommend it for anything beyond the smallest personal site. This leaves Propel, which, although not as complete as EZPDO, does a good job of providing an object-orientated interface to databases. It is more complete than DB_DataObject and much lighter-weight than EZPDO, so if you can tolerate its XML method of defining your classes, it is the one I would recommend.
About the Author
Theo is a student at a university in the UK, studying Social Sciences. While he claims to prefer C# and ASP.NET, in his spare time he can still, often, be found writing PHP scripts or giving tips on IRC. He can be contacted at [email protected]
To Discuss this article: http://forums.phparch.com/226
FEATURE
An OO Layered Approach to Web Apps
by Ronel Sumibcay
We’ve all heard about the benefits of OOP and that it provides the ability to have more reusable, maintainable, and extendible code. But with a great many PHP developers coming from a background in procedural programming, switching to OOP may seem like an overwhelming task. With the help of a few OO design patterns, and by organizing your code into layers, you will not only have the beginnings of what OOP has to offer, you will also be able to develop a piece of code more confidently by knowing its place and responsibilities in the overall app.
Over the past few years, web applications have become increasingly complex. This has led to the increasing need to develop these applications with more maintainable, reusable, and extendible designs. Object oriented (OO) design patterns provide elegant and refined solutions to these problems, but they only apply to OO languages. Until recently, for OO-based web applications, the natural choices were Java and C#. Many developers adopted these languages to implement their code designs and, wherever necessary, used design patterns to solve their coding problems. Along the way, they collaborated on, shared, and published additional design patterns that were more specific to web applications.

Soon, server-side web languages that started off procedurally incorporated object oriented features. Macromedia’s ColdFusion MX now provides object-based support through ColdFusion Components (CFCs). PHP 5 refines its support by providing additional class modifiers and keywords to more fully support the properties of an object oriented language. We’re at a point now where web apps in PHP can fully benefit from the catalogs of design patterns that were once only available for J2EE web developers.

Many web applications are faced with the same or similar design issues, no matter which programming language they are implemented in. Design patterns put a common name to a design solution. Each design pattern’s name (e.g. Singleton, AbstractFactory, Adapter, etc.) provides a common vocabulary that can be shared and communicated when describing a class or a piece of code. This shared vocabulary makes it easier for a developer to understand how the code works: “Why does this class have a private constructor? Oh, it says here it’s a Singleton. I can probably assume then, that only one instance of this class is used in the whole application.”

Design Patterns

“Design Patterns describe the communication and relationship between objects that are customized to solve a general design problem.” (From the famous book Design Patterns: Elements of Reusable Object-Oriented Software.) In most cases, design patterns are documented in a catalog. A cataloged design pattern has four essential elements:
• The pattern name
• A description of the problem that the pattern solves
• The solution that describes the elements that make up the design, their relationships, responsibilities, and collaborations. The solution is abstract enough that it can be applied in many different situations.
• The consequences describing the results and trade-offs of applying the pattern. Although the pattern may look like a fit, you need to consider the overall impact of introducing this pattern into your system. Patterns are only beneficial if used appropriately. Consider the effects on flexibility, extendibility, and portability.

If you are just moving from procedural to OO, then learning an existing Design Pattern catalog is quite a bit to comprehend. Start off by getting familiar with OOP. One of the best ways to learn OOP is to use already available PHP classes within your app. The PEAR library is an extensive library of useful classes. PHP 5 now includes a potentially powerful set of libraries named the Standard PHP Library (SPL). This library includes the Iterator, which is an implemented design pattern that abstracts looping through different sets of data. The point is that many of these classes have implemented a design pattern in some way or another. As you become familiar with using a class from a library, try to determine if the class is using a design pattern; this may be mentioned in the comments. Look for the pattern in a pattern catalog, and read it
Figure 1
over. You may discover the reason that the class provides certain methods, and perhaps understand the reason why the class is implementing the pattern. As you become familiar with the pattern, the next time you run into a class that implements it, it will be much easier to understand how to use the class, how the class behaves, and what to expect of it. You will be able to identify situations in your own application that may benefit from that same design pattern. Or better yet, discover a new variant of it!

To aid you in this discovery process, we will go over the BusinessObject (BO), the DataAccessObject (DAO), the DataAccessGatewayObject (DAGO), and the TransferObject. These objects are used for the business layer. The overall design uses the Model-View-FrontController. This is a very common pattern used in web application development. In our example, we use Model-View-FrontController for communication between the presentation layer and the business layer. The general idea of this design comes from the Model-View-Controller (MVC). Since its original inception in Smalltalk, the MVC design pattern has been used in so many different ways, under different contexts, that there are now differing opinions and meanings of what MVC really is. The FrontController is a J2EE design pattern for enterprise web applications. As with a few of the patterns in the J2EE design pattern catalog, it solves a general problem for web applications independent of the server-side scripting language being used. There are already a few presentation frameworks that implement this FrontController design in PHP. We’ll discuss these further in the Layers section.

The Layers

Figure 1 illustrates each object and where it lies in each layer. The presentation layer is responsible for presenting the model data and breaking the HTTP request variables into calls to the model layer. This is part of the job of a FrontController, which resides in the presentation layer. The FrontController also drives page flow logic and manages data and communication coming from and going into the model. It is also responsible for composing the pages that are used to display the data. Presentation (or Templating) frameworks such as Mach-II, Fusebox, Smarty, or PHPTAL would make up the presentation layer. A couple of these templating frameworks, such as Mach-II and Fusebox, already provide a mechanism for the FrontController. Fusebox 4, for instance, provides controller logic with a combination of XML files called circuits.xml and fusebox.xml. Mach-II provides a similar approach with a mach-ii.xml file.

The model layer consists of the BusinessObject (BO), the DataAccessGatewayObject (DAGO), and the DataAccessObject (DAO). The TransferObject is a simple object containing only getter and setter methods and is used to encapsulate data being passed between the presentation layer and the model layer. An instance of a TransferObject may be passed to a DAO for persistence. The BO, as it sees fit, may compose and provide different types of TransferObjects to the presentation layer for coarse-grained views into the model.

The DAGO and DAO are always at the back, located closest to the data source. These objects contain functions used to access the data source, e.g. SQL statements, data from a web service, an RSS feed, etc. These objects encapsulate all the logic used to access the data source. This design improves maintainability by having your SQL statements in a central location, as opposed to having them spread throughout your code. In addition, the logic is located at the back of the layers, where it logically belongs.

In some designs, a DAO is all that is used for accessing the data source. For this design, we fine-tune the DAO’s responsibilities by pulling out its aggregate functions and putting them into a DAGO. This pattern is described as a gateway in the Mach-II development guide. Even though it is described there in the context of a Mach-II application, I’ve found that it solves the same general problem in our design with PHP. In most, if not all, situations when building web applications, there are two types of operations being performed on persistent data (as stated in the Data Access section of the Mach-II Development Guide, http://livedocs.macromedia.com/wtg/public/machiidevguide/models.html):
• per-object access: creating, editing, working
Listing 1
<?php
class News
{
    // The opening lines of this listing are reconstructed from the
    // accessors below; property visibility is assumed.
    private $newsID;
    private $publishDate;
    private $body;
    private $heading;
    private $subHeading;
    private $authorFname;
    private $authorLname;
    private $authorEmail;
    private $categoryID;

    public function setID($id) { $this->newsID = $id; }
    public function getID() { return $this->newsID; }

    public function setPublishDate($pdate) { $this->publishDate = $pdate; }
    public function getPublishDate() { return $this->publishDate; }

    public function setBody($b) { $this->body = $b; }
    public function getBody() { return $this->body; }

    public function setHeading($h) { $this->heading = $h; }
    public function getHeading() { return $this->heading; }

    public function setSubHeading($sh) { $this->subHeading = $sh; }
    public function getSubHeading() { return $this->subHeading; }

    public function setAuthorFname($fname) { $this->authorFname = $fname; }
    public function getAuthorFname() { return $this->authorFname; }

    public function setAuthorLname($lname) { $this->authorLname = $lname; }
    public function getAuthorLname() { return $this->authorLname; }

    public function setAuthorEmail($email) { $this->authorEmail = $email; }
    public function getAuthorEmail() { return $this->authorEmail; }

    public function getCategoryID() { return $this->categoryID; }
    public function setCategoryID($catid) { $this->categoryID = $catid; }

    // Build a News object from one row of the News table.
    public static function fromArray($arr)
    {
        $news = new News();
        $news->setID($arr["pk_NewsID"]);
        $news->setHeading($arr["heading"]);
        $news->setPublishDate($arr["publishDate"]);
        $news->setSubHeading($arr["subHeading"]);
        $news->setAuthorEmail($arr["authorEmail"]);
        $news->setAuthorLname($arr["authorLname"]);
        $news->setAuthorFname($arr["authorFname"]);
        $news->setBody($arr["body"]);
        $news->setCategoryID($arr["categoryID"]);
        return $news;
    }
}
?>

Listing 2
<?php
class NewsBO
{
    // The opening lines of this listing are reconstructed;
    // property visibility is assumed.
    private $newsDAO;
    private $newsDAGO;

    public function __construct()
    {
        $this->newsDAO = NewsDAO::getInstance();
        $this->newsDAGO = NewsDAGO::getInstance();
    }

    /**
     * Create a News item record.
     * @param News news item to be inserted
     */
    public function create(News $news)
    {
        $this->newsDAO->create($news);
    }

    /**
     * Update the given News item record.
     * @param News news item to be updated
     */
    public function update(News $news)
    {
        $this->newsDAO->update($news);
    }

    /**
     * Retrieve a News object with the given id.
     * @param mixed newsid (string or integer)
     */
    public function read($newsid)
    {
        return $this->newsDAO->read($newsid);
    }

    /**
     * Delete the News record with the given id.
     */
    public function delete($newsID)
    {
        $this->newsDAO->delete($newsID);
    }

    /**
     * Retrieve all News items.
     * @return Iterator of associative arrays where
     * each array is a News item record.
     */
    public function findAllNews()
    {
        return $this->newsDAGO->findAllNews();
    }

    /**
     * Retrieve all Categories
     * @return Iterator of associative arrays where
     * each array is a Category record.
     */
    public function findAllCategories()
    {
        return $this->newsDAGO->findAllCategories();
    }
}
?>
FEATURE
An OO Layered Approach to Web Apps
in depth with a single row (object)
• aggregated access: reporting, searching, listing multiple rows
Following the bullet points, the DAO is responsible for per-object access. It provides methods for the usual CRUD operations (Create, Read, Update, Delete) on a single record in the database. We use a TransferObject to encapsulate the data being persisted. This object is returned from or passed to the CRUD methods. The DAGO is responsible for providing aggregate
access. It provides methods such as findAllCategories(), findAllNews(), or findNewsByCategoryID(). Since the result set varies and may combine data from different sources, it is not convenient to encapsulate each record’s data in a TransferObject. If we did, we would have to code a different TransferObject for each varying set of data. So what a DAGO method usually returns is an array of associative arrays, or an Iterator.
The BO is the communication front between the controller and the model layer. It provides coarse-grained operations to the controller. The BO uses the DAO and DAGO, but hides these objects and other implementation details from the controller. The controller only knows to communicate with the BO to access and manipulate the model.
Listing 5
<?php
class NewsDAGO
{
    // The opening lines of this listing are reconstructed;
    // visibility of the members and constructor is assumed.
    private static $thisInstance = null;
    private $dbConn;

    private function __construct()
    {
        $this->dbConn =& DataSource::getConnection();
    }

    /**
     * Singleton method for retrieving a NewsDAGO instance
     */
    public static function getInstance()
    {
        if (self::$thisInstance == null) {
            self::$thisInstance = new NewsDAGO();
        }
        return self::$thisInstance;
    }

    /**
     * Retrieve all Categories
     * @return Iterator of associative arrays where
     * each array is a Category record.
     */
    public function findAllCategories()
    {
        $sql = "SELECT * FROM NewsCategory";
        $result = $this->dbConn->query($sql);
        return new ResultSetIterator($result);
    }

    /**
     * Retrieve all News items.
     * @return Iterator of associative arrays where
     * each array is a News item record.
     */
    public function findAllNews()
    {
        $sql = "SELECT * FROM News";
        $result = $this->dbConn->query($sql);
        return new ResultSetIterator($result);
    }
}
?>
The News Example
So, let’s take what we’ve discussed so far and put it into a working example. We’re going to model a news item. Here are the requirements: a news item will have a heading, a subheading, a body, a publish date, and the email address of the author. Each news item must belong to a category.
The News class is a TransferObject that is passed between the controller and the NewsBO for CRUD operations. The News class defines the getters and setters (accessor methods) that map to its column values in the News table. A single instance represents a single row in the database. See Listing 1. This type of object is analogous to a bean in Java. Because of its simplicity, some may argue that it’s not even needed. Why not just use an associative array? Since this object is involved in the data persistence logic, I’ve found that dealing with an object keeps the system honest, in the sense that the object has a type (see the type hinting in Listing 2 for the update() and create() methods). A TransferObject represents a concrete piece of the model and provides a strict interface for manipulating its properties. With the help of type hinting, this enforces a contract between the model and the presentation layer on what they use to communicate with each other. Using a TransferObject also sets you up for future changes: as data becomes more complex, and potentially finer-grained, you can provide a TransferObject that offers a more encapsulated view into the model. It may become an aggregate for other objects containing associated data.
So let’s say you run into a situation where your controller is making a lot of fine-grained calls into the BO in order to retrieve little pieces of the model that it will later use for display. This may become a maintenance headache for anybody who has to go into the controller and follow all of these calls into the BO.
This would be a good point to refactor those fine-grained calls into a single coarse-grained method in the BO. The BO would then be responsible for retrieving the necessary data and composing it into a single TransferObject. That TransferObject would encapsulate all of the little model pieces in a form that the controller and the display can manage more easily.
The NewsBO (BusinessObject) manages access to the NewsDAO (DataAccessObject) and the NewsDAGO (DataAccessGatewayObject). See Listing 2. For simple applications, such as this one, you will see that there are often one-to-one calls between an exposed BO method and the appropriate DAGO or DAO method. You may ask, “Why the extra layer?” There may be situations where additional fine-grained operations need to happen along with the calls to the DAO or DAGO. These can be anything from processing an image upload, to the management of session data, to the composition of an aggregate TransferObject, to optimization/caching
Listing 6
<?php
class ResultSetIterator implements Iterator
{
    // The opening lines of this listing are reconstructed;
    // property visibility is assumed.
    private $resultSet;
    private $numRows;
    private $rowIdx;
    private $currRow;

    public function __construct($rs)
    {
        $this->resultSet = $rs;
        $this->numRows = $this->resultSet->numRows();
        $this->rowIdx = 1;
        $this->currRow = $this->resultSet->fetchRow(DB_FETCHMODE_ASSOC);
    }

    public function __destruct()
    {
        // Free the resultset
        $this->resultSet->free();
    }

    /**
     * Implemented from Iterator
     */
    public function rewind()
    {
        // No rewinds here. We are only moving forward.
    }

    /**
     * Implemented from Iterator
     */
    public function current()
    {
        // Retrieve the current row
        return $this->currRow;
    }

    /**
     * Implemented from Iterator
     */
    public function key()
    {
        // We are just returning the index of the
        // current row.
        return $this->rowIdx;
    }

    /**
     * Implemented from Iterator
     */
    public function next()
    {
        // Point to the next row
        $this->currRow = $this->resultSet->fetchRow(DB_FETCHMODE_ASSOC);
        $this->rowIdx++;
    }

    /**
     * Implemented from Iterator
     */
    public function valid()
    {
        // Return true if we have more rows to go. False
        // if we've reached the end.
        return ($this->rowIdx <= $this->numRows);
    }
}
?>
logic. The BO is responsible for managing this itself (or delegating to other objects), and, since all these implementation details are hidden from the controller, it keeps the model cohesive and loosely coupled to its presentation layer.
The NewsDAO provides the CRUD operations for the News object. We are using type hinting (a PHP 5 addition) with the create() and update() methods. Type hinting is a runtime check that the argument being passed is an instance of the class specified in the function’s parameter list. If it isn’t, an error is generated. In PHP 5, type hinting does not work with native or primitive data types (string, int, etc.); it is, however, a very useful aid in reducing the risk of bugs. It gives a visual cue to the developer using the function that an instance of a particular class is what it expects to receive, and it triggers an error if the function receives anything different. This ensures that the correct object type is being passed, and locks down the contract between the client and the NewsDAO. Hopefully, some time in the future, we will see type hinting for returned function values. See Listing 3.
The NewsDAO and NewsDAGO classes use the PEAR::DB database abstraction layer. Now, take a look at Listing 4. Since we are discussing design patterns, I should point out that the PEAR::DB package is able to support different RDBMSs through the Factory design pattern. It may even be using an AbstractFactory pattern, but finding that out can be an exercise for you. The call that retrieves a database connection through a static method called DB::factory() is a hint that the Factory pattern is being used. The Factory method is a creational design pattern that uses a specialized object for creating other objects with different implementations that conform to a particular interface, or abstraction. Think of the specialized object as a real-world factory that produces different remote controls that all look the same.
They all have the same volume buttons and the same channel buttons. The difference is what they control (what is under the cover). One may control a stereo, another may control a television set. The factory determines what to create based on the situation, or on an argument passed to it. It hides all of the creational logic from the client. The factory method is useful when the client does not know, at run time, how to create what it needs. All it knows is the interface, and how to operate what it’s given. This kind of information hiding allows the client to use an object through its interface, regardless of what is under the cover. The factory method is also very useful in a framework that allows for pluggable implementations of a particular interface. If done correctly, this allows developers to provide an implementation of an interface and plug it into the framework without modifying a single piece of the framework code. With the help of PHP 5’s Reflection
API, this type of pluggable framework can be built fairly easily. The Reflection API allows a developer to reverse-engineer classes, interfaces, functions, and methods, as well as extensions. Among its ways of introspecting a class, the Reflection API provides the ability to create an instance of a class from its class name as a string. So, back to the pluggable framework idea: that string can live in a config file that the framework reads to instantiate your pluggable class. All a third-party developer would need to do is drop their pluggable class in a predetermined location and register it with the framework by adding its class name to the config file.
A Couple of Optimization Techniques
The NewsDAO has been coded with a couple of resource optimization techniques. For one, it is implemented as a Singleton. The Singleton design pattern provides a way to ensure that only one instance of a particular class exists in the whole application. The Singleton design pattern is one of the easiest patterns to implement; however, in a multi-threaded system, it can
Listing 7
<?php
// Load config settings and include used classes
require_once("Config.php");
require_once("patterns/News.class.php");
require_once("patterns/NewsBO.class.php");

// Create a news transfer object and
// set its properties
$news = new News();
$news->setHeading("A Heading");
$news->setSubHeading("A Sub Heading");
$news->setBody("A news body");
$news->setAuthorLname("Jenkins");
$news->setAuthorFname("Spock");
$news->setAuthorEmail("[email protected]");
$news->setCategoryID("1");
$news->setPublishDate(date("Y-m-d"));

// Create a News business object
// to perform some operations on news
$bo = new NewsBO();

// Create a new News record
$bo->create($news);

// Retrieve a News bean and
// change some properties
$anotherNewsInstance = $bo->read($news->getID());
$anotherNewsInstance->setHeading("Changed heading");
$anotherNewsInstance->setBody("Changed body");
$bo->update($anotherNewsInstance);

// Retrieve all news items
$results = $bo->findAllNews();
foreach ($results as $row) {
    echo "row: " . $row["pk_NewsID"] . " ";
}

// Retrieve all categories
$results = $bo->findAllCategories();
foreach ($results as $row) {
    echo "category: " . $row["pk_NewsCategoryID"] . " ";
}

// Retrieve all news items again, and delete them.
$results = $bo->findAllNews();
foreach ($results as $row) {
    $bo->delete($row["pk_NewsID"]);
}
?>
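The Reflection-based plug-in idea described before Listing 7 can be sketched roughly as follows. This is a minimal illustration, not part of the article's code: the NewsSource interface, the RssNewsSource class, the createPlugin() helper, and the config keys are all hypothetical names, and the array stands in for a real config file.

```php
<?php
// Hypothetical plug-in contract; the name NewsSource is an assumption.
interface NewsSource
{
    public function describe();
}

// A third-party class that would be dropped into the plug-in directory.
class RssNewsSource implements NewsSource
{
    private $url;

    public function __construct($url)
    {
        $this->url = $url;
    }

    public function describe()
    {
        return "RSS feed at " . $this->url;
    }
}

// Stand-in for values read from a config file (hypothetical keys).
$config = array(
    'newsSourceClass' => 'RssNewsSource',
    'newsSourceArgs'  => array('http://example.com/feed.xml'),
);

// Instantiate whatever class the config names, via the Reflection API.
function createPlugin(array $config)
{
    $class = new ReflectionClass($config['newsSourceClass']);

    // Refuse classes that do not honor the plug-in contract.
    if (!$class->implementsInterface('NewsSource')) {
        throw new InvalidArgumentException(
            $class->getName() . ' does not implement NewsSource'
        );
    }

    return $class->newInstanceArgs($config['newsSourceArgs']);
}

$source = createPlugin($config);
echo $source->describe(), "\n";
```

The framework code never mentions RssNewsSource by name; swapping the plug-in means changing only the config entry.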
potentially be the most dangerous. If a Singleton object does not properly protect itself against concurrent state changes, you will end up with an object in an invalid state, which can have a devastating effect on your system. The reason we are implementing the NewsDAO as a Singleton is that the NewsDAO is basically a stateless object. There is no state in it that the business logic uses, and there doesn’t need to be. So there is no real need for multiple instances of a NewsDAO to be allocated in a request, and using a single instance avoids unnecessary memory allocation.
Another technique is lazy loading of prepared statements. Assuming that prepared statements need to be compiled somewhere in the database abstraction layer, lazy loading defers the compilation and allocation of each prepared statement until the time it is used, as opposed to creating all of the prepared statements at once in the constructor. Lazy loading allows a developer to delay the allocation of resources until absolutely necessary. So the prepared statements held in the instance variables createPST, readPST, updatePST, and deletePST will not be allocated until their respective CRUD function is called, saving us a little memory and processing time. Since the prepared statements are set as instance variables of a Singleton object, they will be reused over and over without reallocation and recompilation.
The NewsDAGO is responsible for providing aggregate methods for listing news items, searching, etc. The NewsDAGO, which can be seen in Listing 5, has been implemented as a Singleton and with lazy loading of prepared statements for the same reasons as the NewsDAO. Aggregate methods of the NewsDAGO class return an Iterator instance, another feature of PHP 5. The Iterator interface follows the design pattern conveniently named “Iterator”.
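To make the Singleton and lazy-loading techniques above concrete, here is a minimal sketch. It is not the article's NewsDAO (Listing 3): FakeStatement and LazyNewsDAO are hypothetical stand-ins, and the "compilation" of a prepared statement is simulated with a counter so the example runs without a database.

```php
<?php
// Hypothetical stand-in for a compiled prepared statement.
class FakeStatement
{
    public static $compiled = 0;
    public $sql;

    public function __construct($sql)
    {
        $this->sql = $sql;
        self::$compiled++;   // count how often "compilation" happens
    }
}

class LazyNewsDAO
{
    private static $instance = null;
    private $readPST = null;

    // Nothing is prepared up front.
    private function __construct() {}

    public static function getInstance()
    {
        if (self::$instance === null) {
            self::$instance = new LazyNewsDAO();
        }
        return self::$instance;
    }

    public function read($id)
    {
        // Lazy loading: prepare the statement on first use only.
        if ($this->readPST === null) {
            $this->readPST = new FakeStatement(
                'SELECT * FROM News WHERE pk_NewsID = ?'
            );
        }
        // A real DAO would execute the statement here;
        // the sketch just reports what it would run.
        return $this->readPST->sql . ' [' . (int) $id . ']';
    }
}

$dao = LazyNewsDAO::getInstance();
echo FakeStatement::$compiled, "\n";   // nothing compiled yet
echo $dao->read(1), "\n";
echo $dao->read(2), "\n";
echo FakeStatement::$compiled, "\n";   // compiled once, reused afterwards
```

Because the statement lives in an instance variable of the single shared object, every later call to read() within the request reuses it.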
From the GoF book Design Patterns: Elements of Reusable Object-Oriented Software: “The Iterator design pattern provides a way to access the elements of an aggregate object sequentially without exposing its underlying representation. It provides a generic interface for looping.” In PHP, Iterators are generic enough to be used within the foreach loop. I’ve put together a simple implementation of the Iterator interface. The class is called ResultSetIterator, and can be seen in Listing 6. The Iterator interface lends itself to use with the Adapter design pattern, which adapts one object or interface to another, here by implementing the target interface and wrapping the adapted object. In our case, the ResultSetIterator adapts an instance of DB_result into an Iterator. Because the NewsDAGO returns an implementation of Iterator from its aggregate functions, the client calling these functions only needs to know how to use an Iterator, not a DB_result. By introducing this small
“Design Patterns describe the communication and relationship between objects that are customized to solve a design problem.”
piece into the design, we now have the ability to swap out whole implementations of the NewsDAGO without affecting the client code that uses it. Suppose we wanted to switch the datasource from a relational database to a SOAP-based datasource, or to an RSS feed. Say we use the SOAP alternative, resulting in the replacement of the NewsDAGO implementation with a SOAPNewsDAGO implementation. As long as the SOAPNewsDAGO provides the same methods as the original NewsDAGO, the client code would be left unchanged. The aggregate methods would still return an Iterator, but in this case, we would have an Iterator implementation that knows how to loop through an XML result.
You may ask, “Why use an Iterator? Why not just return an array of all the records? It’s generic enough, and everybody knows how to loop through those.” We are using an Iterator for memory optimization. Dumping all the results into an array means that there needs to be enough memory to hold that array. If the result set is fairly large, we may run into a memory limit. The ResultSetIterator fetches each row only as the loop asks for it and does not store or cache the rows. It leaves it up to the programmer who is looping through the data to decide whether all the results need to be saved in memory for further use.
Now that we’re talking about swapping out whole implementations of NewsDAGO, we have just found a use for the Factory pattern in our design. Just as DB::connect() provides an abstracted DB connection to the underlying RDBMS layer, depending on the DSN being passed in, we can do the same for NewsDAGO. By redesigning the NewsDAGO around the Factory pattern, the call to NewsDAGO::getInstance() would return a SOAPNewsDAGO implementation or the usual relational database implementation. This can be decided by a config setting telling it which one to use. We would have to do a little refactoring on the back end, though. We would rename the relational database implementation of the NewsDAGO class to the more specific RDBMSNewsDAGO, and would define an abstract class more generically called NewsDAGO, with a set of predetermined aggregate functions, e.g. findAllNews(), findAllCategories(), etc. The RDBMSNewsDAGO and SOAPNewsDAGO would extend the NewsDAGO abstract class and be forced to provide implementations of the aggregate functions. When the client calls the static function NewsDAGO::getInstance(), it would
decide on and return an instance of RDBMSNewsDAGO or SOAPNewsDAGO depending on a config setting. The client code using the NewsDAGO instances wouldn’t have to change or care about which is being supplied. We would, of course, have to apply the same design to the NewsDAO as well; we wouldn’t want to be persisting data and retrieving aggregate data from two different data sources in the same app.
A Test Run
Since the model layer is decoupled from the presentation, it can be tested independently of a presentation layer, as you can see in Listing 7. We are just walking through the NewsBO operations. First, we instantiate a News TransferObject and prepare it for insertion by setting its properties. Then, we instantiate a NewsBO. We create the news record by calling $bo->create($news) on the NewsBO instance. We then illustrate updating a News record in the datasource. First, we retrieve a News object from the BO by passing in the unique id of the News record: $anotherNewsInstance = $bo->read($news->getID()). We change some of its properties with $anotherNewsInstance->setHeading("Changed heading") and $anotherNewsInstance->setBody("Changed body"), and then we call $bo->update($anotherNewsInstance).

// Retrieve a News bean and
// change some properties
$anotherNewsInstance = $bo->read($news->getID());
$anotherNewsInstance->setHeading("Changed heading");
$anotherNewsInstance->setBody("Changed body");
$bo->update($anotherNewsInstance);
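As an aside to the factory-based NewsDAGO redesign discussed just before the test run, one possible shape for it is sketched below. This is an assumption-laden illustration rather than the article's code: the static $implementation property stands in for the config setting, and both subclasses return canned rows through ArrayIterator instead of talking to a database or a SOAP service.

```php
<?php
abstract class NewsDAGO
{
    // Assumed config switch; a real app would read this from its config file.
    public static $implementation = 'RDBMS';

    private static $instances = array();

    // Factory: decide which concrete NewsDAGO to hand out.
    public static function getInstance()
    {
        $class = self::$implementation . 'NewsDAGO';
        if (!isset(self::$instances[$class])) {
            if (!is_subclass_of($class, 'NewsDAGO')) {
                throw new RuntimeException("Unknown NewsDAGO implementation: $class");
            }
            self::$instances[$class] = new $class();
        }
        return self::$instances[$class];
    }

    // Every implementation must provide the aggregate functions.
    abstract public function findAllNews();
}

class RDBMSNewsDAGO extends NewsDAGO
{
    public function findAllNews()
    {
        // Would run "SELECT * FROM News"; canned rows keep the sketch runnable.
        return new ArrayIterator(array(
            array('pk_NewsID' => 1, 'heading' => 'From the database'),
        ));
    }
}

class SOAPNewsDAGO extends NewsDAGO
{
    public function findAllNews()
    {
        // Would call a web service; canned rows keep the sketch runnable.
        return new ArrayIterator(array(
            array('pk_NewsID' => 1, 'heading' => 'From the web service'),
        ));
    }
}

// Client code is identical whichever implementation is configured.
foreach (array('RDBMS', 'SOAP') as $impl) {
    NewsDAGO::$implementation = $impl;
    foreach (NewsDAGO::getInstance()->findAllNews() as $row) {
        echo $row['heading'], "\n";
    }
}
```

The client loop only relies on getInstance() and on the result being an Iterator, which is the point of the redesign.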
Next, we retrieve the aggregate data, which, in this case, is a set of News records. We do this by calling
$bo->findAllNews(). We then loop through the data, echoing out the primary key.

// Retrieve all news items
$results = $bo->findAllNews();
foreach ($results as $row) {
    echo "row: " . $row["pk_NewsID"] . " ";
}

Remember, we are looping through an Iterator, simply by using a foreach loop. Next, we loop through Category records in the same fashion.

// Retrieve all categories
$results = $bo->findAllCategories();
foreach ($results as $row) {
    echo "category: " . $row["pk_NewsCategoryID"] . " ";
}

The final block of code illustrates looping through and deleting all News records.

// Retrieve all news items again, and delete them.
$results = $bo->findAllNews();
foreach ($results as $row) {
    $bo->delete($row["pk_NewsID"]);
}
You may have noticed that I’ve left a few things out, such as the CRUD and aggregate functions for a NewsCategory. That task can be left to you. The NewsCategory is simple enough that it’s not necessary to create a BO, DAO, and DAGO object for it. Since a NewsCategory isn’t much without News items, you can add the appropriate functions to the existing NewsBO and NewsDAO. The same is true if you wish to add more aggregate functions to the NewsDAGO, such as NewsDAGO::findNewsByCategoryID().
We are also missing a boatload of error handling. PHP 5 now provides an OO approach to error handling using Exceptions. The Exception model allows developers to throw and catch Exceptions as they see fit. When an Exception is thrown, the code following the line where it was thrown will not be executed. If the Exception is not caught, it will trickle all the way up
causing a PHP Fatal Error. This mechanism allows the developer to catch and handle particular types of Exceptions, and so to choose the appropriate course of action for each type. The meat of using Exceptions for error handling is that developers can define their own errors by extending the Exception class. Exception types can be specific to a related set of classes, which cuts down the time it takes to hunt down a particular error or bug. An instance of Exception encapsulates enough information to let the developer find the point in the code where the error occurred. Exception handling can also be overused, making code more difficult to read. Used correctly, Exceptions can be a powerful aid in keeping your OO application rock solid.
Conclusion
That’s an awful lot of code for persisting and listing news items. With OOP, that is mostly the case. OO code is more complex than procedural code, but in the end, it’s easier to maintain because it’s better organized. There is no way around it. You may have to think things through a little bit more when developing your web app, but doing it right will save you maintenance time in the future. What we did, with the help of a few design patterns, was reorganize the code in a way that makes it more manageable, maintainable, extendable, and reusable. We also acquainted ourselves with a few features of PHP 5, like Iterator and type hinting. What we have is a design with a model layer independent of any presentation layer. We can take any templating framework and integrate the model into it.
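As a footnote to the error-handling discussion above, the following sketch shows a class-specific Exception being thrown and caught. The NewsNotFoundException type, the readNews() helper, and the in-memory $store array are hypothetical; they only stand in for what a DAO might do when a record is missing.

```php
<?php
// Hypothetical exception type specific to the news data-access classes.
class NewsNotFoundException extends Exception {}

// Stand-in for a DAO read; the array replaces the database.
function readNews(array $store, $id)
{
    if (!isset($store[$id])) {
        throw new NewsNotFoundException("No News record with id $id");
    }
    return $store[$id];
}

$store = array(1 => 'A Heading');

try {
    echo readNews($store, 1), "\n";
    echo readNews($store, 99), "\n";   // throws; the next line never runs
    echo "not reached\n";
} catch (NewsNotFoundException $e) {
    // The exception object records where the error occurred.
    echo 'Caught: ', $e->getMessage(), ' (line ', $e->getLine(), ")\n";
}
```

Catching the specific type leaves any other kind of Exception free to propagate to a more general handler.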
About the Author
Ronel Sumibcay is a Senior Web Developer at Red Door Interactive. His experience in server-side web development started off in 1998 with Java where he co-developed the first JSP parsing engine for the JRun Servlet Container. He now enjoys working with PHP and “all of its object oriented goodness”.
To Discuss this article: http://forums.phparch.com/227
FEATURE
References in PHP: An In-Depth Look
by Derick Rethans
PHP’s handling of variables can be non-obvious, at times. Have you ever wondered what happens at the engine level when a variable is copied to another? How about when a function returns a variable “by reference?” If so, read on.
Every computer language needs some form of container to hold data: variables. In some languages, those variables have a specific type attached to them. They can be a string, a number, an array, an object, or something else. Examples of such statically typed languages are C and Pascal. Variables in PHP do not have this restriction. They can be a string on one line and a number on the next. Converting between types is also easy to do, and often even automatic. These loosely typed variables are one of the properties that make PHP such an easy and powerful language, although they can sometimes also cause interesting problems. Internally, in PHP, those variables are all stored in a similar container, called a zval container (also called a “variable container”). This container keeps track of several things that are related to a specific value. The most important things that a variable container holds are the value of the “variable” and its type. Python is similar to PHP in this regard, as it also labels each value with a type. The variable container contains a few more fields that the PHP engine uses to keep track of whether a value is a reference or not. It also keeps a reference count for its value. Variables are stored in a symbol table, which is quite analogous to an associative array. This array has keys that represent the names of the variables, and those keys point to variable containers that contain the value (and type) of the variables. See Figure 1 for an example of this.
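The analogy between a symbol table and an associative array can be observed from userland. The sketch below uses get_defined_vars(), which returns the variables of the current scope as a name-to-value array; the function name show_symbols() and the variables inside it are just illustrative, and the array is a view for inspection rather than the engine's internal structure itself.

```php
<?php
function show_symbols()
{
    $greeting = "this is";
    $answer   = 42;

    // The local scope, exposed as an associative array of name => value.
    return get_defined_vars();
}

$table = show_symbols();
foreach ($table as $name => $value) {
    // Each entry carries a value and, with it, a type.
    echo '$', $name, ' (', gettype($value), ') = ', var_export($value, true), "\n";
}
```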
Reference Counting
PHP tries to be smart when it deals with copying variables, as in $a = $b. Using the = operator is also called an “assign-by-value” operation. While assigning by value, the PHP engine will not actually create a copy of the variable container; it will merely increase the refcount field in the variable container. As you can imagine, this saves a lot of memory when you have a large string of text or a large array. Figure 2 shows how this “looks”. In step 1 there is one variable, $a, which
REQUIREMENTS
PHP: 4.3.0+
OS: Any
Other Software: N/A
Code Directory: references
Figure 1
contains the text “this is” and has (by default) a reference count of 1. In step 2, we assign variable $a to variables $b and $c. Here, no copy of the variable container is made; the refcount value is simply incremented by 1 for each variable that is assigned to the container. Because we assign two more variables here, the refcount goes to 2 after the first assignment and ends up being 3 after the second. Now, you might wonder what would happen if the variable $c gets changed. One of two things happens, depending on the value of the refcount. If the value is 1, then the container simply gets updated with its new value (and possibly its type, too). If the refcount value is larger than 1, a new variable container gets created containing the new value (and type). You can see this in step 3 of Figure 2. The refcount value of the variable container that is linked to the variable $a is decreased by one, so that the variable container that belongs to variables $a and $b now has a refcount of 2, and the newly created container has a refcount of 1. When unset() is called on a variable, the refcount value of the variable container linked to that variable is decreased by one. This happens when we call unset($b) in step 4. If the refcount
Figure 2
value drops below 1, the PHP engine will free the variable container. The variable container is then destroyed, as you can see in step 5.
Passing Variables to Functions
Besides the global symbol table that every script has, every call to a user-defined function creates a symbol table where the function locally stores its variables. Every time a function is called, such a symbol table is created, and every time a function returns, this symbol table is destroyed. A function returns either by using the return statement, or implicitly because the end of the function has been reached. In Figure 3, I illustrate exactly how variables are passed to functions. In step 1, we again assign the value “this is” to the variable $a. We pass this variable to the do_something() function, where it is received in the variable $s. In step 2, you can see that this is practically the same operation as assigning a variable to another one (like we did in the previous section with $b = $a), except that the variable is stored in a different symbol table (the one that belongs to the called function) and that the reference count is increased twice, instead of the usual once. The reason for this is that the function’s stack also contains a reference to the variable container. When we assign a new value to the variable $s in step 3, the refcount of the original variable container is decreased by one and a new variable container is created, containing the new value. In step 4, we return the variable with the return statement. The returned variable gets an entry in the global symbol table and the refcount value is increased by 1. When the function ends, the function’s symbol table will be destroyed. During the destruction, the engine will go over all variables in the symbol table and decrease the refcount of each variable container. When the refcount of a variable container reaches 0, the variable container is destroyed.
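The observable effect of this bookkeeping is that assignments and function arguments behave as independent copies, even though the engine delays the actual copying until a write happens. The sketch below only demonstrates that visible behavior; it does not inspect refcounts, and the exact internal handling of containers may differ between engine versions. The function body is an illustrative assumption, not taken from Figure 3.

```php
<?php
function do_something($s)
{
    // The write lands in a container of the function's own,
    // so the caller's variable is not affected.
    $s .= " a test";
    return $s;
}

$a = "this is";
$b = $a;                 // assign-by-value: shares the container at first
$b .= " changed";        // writing to $b separates it from $a

$c = do_something($a);   // pass-by-value: $a is left alone

var_dump($a, $b, $c);
```

After the script runs, $a still holds its original string, while $b and $c each hold their own modified value.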
As you can see, thanks to PHP's reference counting mechanism the variable container is again not copied when it is returned from the function. If the variable $s had not been modified in step 3, then variables $a and $b would still point to the same variable container, which would have a refcount value of 2. In this situation, no copy of the variable container that was created with the statement $a = "this is" would have been made.

Introducing References

References are a method of having two names for the same variable. A more technical description would be: references are a method of having two keys in a symbol table pointing to the same zval container. References can be created with the reference assignment operator =&.
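As a small illustration of "two names for the same variable", the sketch below creates a reference with =& and writes through each name in turn. The values are arbitrary assumptions chosen for the example.

```php
<?php
$a = "this is";
$b =& $a;              // $b is now a second name for the same variable

$b = 42;               // a write through $b ...
var_dump($a);          // ... is visible through $a as well

$a = "changed again";  // and the other way round
var_dump($b);
```

Note that =& is a different operator from &=, which performs a bitwise AND followed by an assignment.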
June 2005
●
PHP Architect
●
www.phparch.com
FEATURE
References in PHP: An In-Depth Look
Figure 3
Figure 4
Figure 4 gives a schematic overview of how references work in combination with reference counting. In step 1, we create a variable $a that contains the string "this is". Then, in step 2, we create two references ($b and $c) to the same variable container. The refcount increases normally for each assignment, making the final refcount 3 after both assignments by reference ($b =& $a and $c =& $a), but because the reference assignment operator is used, the other value, is_ref, is now set to 1.

This value is important for two reasons. The second one I will divulge a little later in this article; the first is what happens when we assign a new value to one of the three variables that all point to the same variable container. If the is_ref value is set to 0 when a new value is set for a specific variable, the PHP engine will create a new variable container, as you could see in step 3 of Figure 2. But if the is_ref value is set to 1, the PHP engine will not create a new variable container and will simply update the value to which the variable names point, as you can see in step 3 of Figure 4. The exact same result would be reached if the statement $a = 42 were used instead of $b = 42. After the variable container is modified, all three variables $a, $b
and $c will contain the value 42.

In step 4, we use the unset() language construct to remove a variable, in this case variable $c. Using unset() on a variable means that the refcount value of the variable container that the variable points to gets decreased by 1. This works exactly the same for referenced variables. There is one difference, though, that shows in step 5. When the reference count of a variable container reaches 1 and the is_ref value is set to 1, the is_ref value is reset to 0. The reason for this is that a variable container can only be marked as a referenced variable container when there is more than one variable pointing to it.

Mixing Assign-by-Value and Assign-by-Reference

Figure 5

Something interesting, and perhaps unexpected, happens if you mix an assign-by-value call and an assign-by-reference call. This is shown in Figure 5. In the first step we create two variables, $a and $b, where $a is assigned by value to $b. This creates a situation where there is one variable container with is_ref set to 0 and refcount set to 2. This should be familiar by now. In step 2 we proceed by assigning variable $c by reference to variable $b. Here, the PHP engine will create a copy of the variable container. The variable $a keeps
pointing to the original variable container, but the refcount is, of course, decreased to 1, as there is only one variable pointing to this variable container now. The variables $b and $c point to the copied container, which now has a refcount of 2 and is_ref set to 1. You can see that in this case, using a reference does not save you any memory; it actually uses more memory, as it had to duplicate the original variable container. The container had to be copied because otherwise the PHP engine would have no way of knowing how to deal with the reassignment of one of the three variables: two of them, $b and $c, are references to the same container, while the other is not supposed to be a reference. With a single container that has refcount set to 3 and is_ref set to 1, it would be impossible to tell them apart. That is the reason why the PHP engine needs to create a copy of the container when you do an assignment-by-reference in this situation.

If we switch the order of assignments (first we assign $a by reference to $b and then we assign $a by value to $c), something similar happens. Figure 6 shows how this is handled. In the first step we assign the string "this is" to the variable $a and then we proceed to assign $a by reference to variable $b. We now have one variable container where is_ref is 1 and refcount is 2. In step 2, we assign variable $a by value to variable $c. Now a copy of the variable container is made so that the PHP engine can handle modifications to the variables correctly, for the same reasons as stated in the previous paragraph. But if you go back to step 2 of Figure 2, where we assign the variable $a to both $b and $c by value, you see that no copy is made there.

Passing References to Functions

Variables can also be passed by reference to functions. This is useful when a function needs to modify the value of a specific variable when it is called.
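A minimal sketch of such a function follows. It is modelled on the do_something() function mentioned in the text, but the body and the string values are assumptions made for illustration, since the figure itself is not reproduced here.

```php
<?php
// The ampersand in the declaration makes $s an alias of the caller's variable.
function do_something(&$s)
{
    $s = "was";     // this write reaches the caller's variable
    return $s;      // the return itself is still by value
}

$a = "this is";
$b = do_something($a);

var_dump($a);       // modified by the function
var_dump($b);       // holds the returned value

// $b is an ordinary variable, not a reference to $a:
$b = "something else";
var_dump($a);       // unaffected by the write to $b
```

Because the function returns by value, the caller ends up with two independent variables: $a, which the function modified in place, and $b, which can be changed without affecting $a.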
The script in Figure 7 is a slightly modified version of the script that you have already seen in Figure 3. The only difference is the ampersand (&) in front of the $s variable in the declaration of the function do_something(). This ampersand instructs the PHP engine that the variable to which it is applied is going to be passed by reference and not by value. A different name for a passed-by-reference variable is an "out variable". When a variable is passed by reference to a function, the new variable in the function's symbol table points to the old container and the refcount value is increased by 2 (one for the symbol table, and one for the stack). Just as in a normal assignment-by-reference, the is_ref value inside the variable container is also set to 1, as you can see in step 2. From here on, the same things happen as with a normal reference, like in step 3,
Figure 6
where no copy of the variable container is made if we assign a new value to the variable $s. The return $s; statement is basically the same as the $c = $a statement in step 2 of Figure 6. The global variable $a and the local variable $s are both references to the same variable container, and the logic dictates that if is_ref is set to 1 for a specific container and this container is assigned to another variable by value, the container needs to be duplicated. This is exactly what happens here, except that the newly created variable is created in the global symbol table by the assignment of the return value of the function with the statement $b = do_something($a).

Returning by Reference

Another feature in PHP is the ability to "return by reference". This is useful, for example, if you want to select a variable for modification with a function, such as selecting an array element or a node in a tree structure. In Figure 8 we show how returning by reference works by means of an example. In this example (step 1), we define a $tree variable (which is actually not a tree, but a simple array) that contains three elements. The three elements have key values of 1, 2 and 3, and each of them points to a string containing the English word for the key's value (i.e. one, two and three). This array gets passed to the find_node() function by reference, along with the key of the element that the
find_node() function should look for and return. We need to pass by reference here; otherwise we cannot return a reference to one of the elements, as we would be returning a reference to an element of a copy of $tree. When $tree is passed to the function it has a refcount of 3 and is_ref is set to 1. Nothing new here. The first statement in the function, $item =& $node[$key], causes a new variable to be created in the symbol table of the function, which points to the array element whose key is 3 (because the variable $key is set to 3). In step 3 you see that creating $item by assigning it by reference to the array element causes the refcount value of the variable container that belongs to the array element to be increased by 1. The is_ref value of that variable container is now 1, too, of course.

The interesting things happen in step 4, where we return $item (by reference) to the calling scope and assign it (by reference) to $node. This causes the refcount of the variable container to which the third array key points to be set to 3. At this point $tree[3], $item (from the function's scope) and $node (global scope) all point to this variable container. When the symbol table of the function is destroyed (in step 5), the refcount value decreases from 3 to 2. $node is now a reference to the third element in the array.

If the variable $node had not been assigned by reference to the return value of the find_node()
Figure 7
Figure 8
function, but had instead been assigned by value, then $node would not have been a reference to $tree[3]. In this case, the refcount value of the variable container to which $tree[3] points is 1 after the function ends, but for some strange reason the is_ref value is not reset to 0 as you might expect. My tests did not find any problems with this, though, in this simple example. If the function find_node() had not been a "return-by-reference function", then again the $node variable would not be a reference to $tree[3]. In this case, the is_ref value of the variable container would have been reset to 0.

Finally, in step 6, we modify the value in the variable container to which both $node and $tree[3] point.

Please note that it is harmful not to accept a reference from a function that returns a reference. In some cases, PHP will get confused and cause memory corruptions which are very hard to find and debug. It is also not a good idea to return a static value by reference, as the PHP engine has problems with that too. In PHP 4.3, both cases can lead to very hard-to-reproduce bugs and crashes of PHP and the web server. In PHP 5, this all works a little better: here you can expect a warning, and it will behave "properly". Hopefully, a backported fix for this problem makes it into a new minor version of PHP 4: PHP 4.4.

The Global Keyword

PHP has a feature that allows the use of a global variable inside a function: you can make this connection with the global keyword. This keyword will create a reference between the local variable and the global one. Figure 9 shows this in an example. In steps 1 and 2, we create the variable $var and call
Figure 9
Figure 10
Listing 1

the function update_var() with the string literal "one" as the sole parameter. At this point, we have two variable containers. The first one is pointed to by the global variable $var, and the second one by the $val variable in the called function. The latter variable container has a refcount value of 2, as both the variable on the stack and the local variable $val point to it. The global $var statement in the function creates a new variable in the local scope, which is created as a reference to the variable with the same name in the global scope. As you can see in step 3, this increases the refcount of the variable container from 1 to 2 and also sets the is_ref value to 1.

In step 4, we unset the variable $var. Contrary to what some people expect, the global variable $var does not get unset, as the unset() was done on a reference to the global variable $var and not on that variable itself. To re-establish the reference, we employ the global keyword again in step 5. As you can see, we have re-created the same situation as in step 3. Instead of using global $var we could just as well have used $var =& $GLOBALS['var'], as it would have created the exact
same situation.

In step 6, we continue by assigning the function's $val argument to the $var variable. This changes the value to which both the global variable $var and the local variable $var point; this is what you would expect from a referenced variable. When the function ends, in step 7, the reference from the variable in the scope of the function disappears, and we end up with one variable container with a refcount of 1 and an is_ref value of 0.

Abusing References

In this section, I will give a few examples that show you how references should not be used. In some cases these examples might even cause memory corruption in PHP 4.3 and lower.

Example 1: "Returning static values by reference". In Figure 10, we have a very small script with a return-by-reference function called definition(). This function simply returns an array that contains some elements. Returning by reference makes no sense here, as the exact same things would happen internally if the variable container holding the array were returned by value, except that in the intermediate step (step 3) the is_ref value of the container would not be set to 1, of course. If the $def variable in the function's scope had been referenced by another variable, something that might happen in a class method where you do $def = $this->def, then the return-by-reference properties of the function would have copied the array, because this creates a situation similar to step 2 of Figure 5.

Example 2: "Accepting references from a function that doesn't return references". This is potentially dangerous; PHP 4.3 (and lower) does not handle this properly. In Listing 1, you see an example of something that is not going to work properly. This function was implemented with performance in mind, trying not to copy variable containers by using references. As you should know after reading this article, this is not going to buy you anything. There are a few reasons why it doesn't work. The first reason is that the PHP internal function preg_split() does not return by reference; actually, no internal function in PHP can return anything by reference. So, assigning the return value by reference from a function that doesn't return a reference is pointless. The second reason why there is no performance benefit here is the same one as in Example 1: you're returning a static value, not a reference to a variable, so it does not make sense to make the split_list() function return by reference.

Conclusion

After reading this article, I hope that you now fully understand how references, refcounting, and variables work in PHP. It should also have explained that assigning by reference does not always save you memory; it's better to let the PHP engine handle this optimization. Do not try to outsmart PHP yourself here, and only use references when they are really needed.

In PHP 4.3, there are still some problems with references, for which patches are in the works. These patches are backports from PHP 5-specific code, and although they work fine, they will break binary compatibility, meaning that compiled extensions no longer work after those patches are put into PHP. In my opinion, those hard-to-reproduce memory corruption errors should be fixed in PHP 4 too, though, so perhaps this
creates the need for a PHP 4.4 release. If you’re having problems, you can try to use the patch located at http://files.derickrethans.nl/patches/ze1-returnreference-20050429.diff.txt
The PHP Manual also has some information on references, although it does not explain the internals very well. The URL for the section in PHP’s Manual is http://php.net/language.references
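As a recap of two of the cases covered in this article, the sketch below runs the mixed assignments of Figures 5 and 6 and a return-by-reference function in the style of find_node(). It is only a sketch: it shows the observable results rather than the refcount and is_ref values, and the concrete values (42, 'drie') are assumptions made for the example.

```php
<?php
// Figure 5: by value first, then by reference.
$a = "this is";
$b = $a;        // by value
$c =& $b;       // by reference: $b and $c now share one value
$c = 42;        // changes $b as well, but not $a

// Figure 6: by reference first, then by value.
$x = "this is";
$y =& $x;       // $x and $y share one value
$z = $x;        // $z is an independent copy
$y = 42;        // changes $x as well, but not $z

// Return by reference: select an array element for modification.
function &find_node($key, &$node)
{
    $item =& $node[$key];
    return $item;
}

$tree = array(1 => 'one', 2 => 'two', 3 => 'three');
$node =& find_node(3, $tree);   // accept the reference with =&
$node = 'drie';                 // this write lands in $tree[3]

var_dump($a, $b, $x, $z, $tree[3]);
```

In the first two cases the variable that was assigned by value is never affected by writes made through the reference pair, which is the behaviour that forces the engine to copy the container.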
About the Author
Derick Rethans provides solutions for Internet related problems. He has contributed in a number of ways to the PHP project, including the mcrypt extension, bug fixes, additions and leading the QA team. He now works as developer for eZ systems A.S.. In his spare time he likes to work on SRM: Script Running Machine and Xdebug, watch movies and travel. You can reach him at [email protected]
To Discuss this article: http://forums.phparch.com/228
FEATURE
Homo Xapian: The Search for a Better Search...Engine by Marco Tabini
Tired of fighting with full-text search in MySQL? Do you need to create a professional-quality search engine and don’t want to have to deal with all the details? Then look no further than Xapian, the open-source search technology that you can integrate directly into your PHP scripts.
Late last year, I found myself with a bit of spare time on my hands, and decided to take a hard look at building a decent search engine for the PHP mailing lists. I've never been a great fan of the one that is available on the PHP website and, frankly, I can't stand the fact that after all these years Google still confuses the word "PHP" with the extension .php. There's nothing quite as fun as having to look for the solution to an annoying scripting problem that's been nagging you for two hours by wading through ten thousand advertisements for Viagra peddlers and other less-than-reputable websites, all of which seem to be using PHP. Therefore, I thought that it might be a good idea to look into a possible solution to my mailing list pet peeve. Being the lazy guy that I am, the first one that came to mind was to simply set up a database and take
advantage of MySQL's full-text search functionality to provide the core of the engine. What I did not take into account, unfortunately, was the fact that the PHP mailing lists encompass an amount of data that MySQL, frankly, isn't ready to handle. Not that I blame MySQL, mind you. After all, it wasn't built for handling vast amounts of text, but as a
REQUIREMENTS
PHP: 4.x
OS: UNIX/Linux/OSX
Other Software: Xapian
URL: http://www.xapian.org
Code Directory: xapian
simpler, all-purpose search engine; the fact that the data has to be stored inside the database is, in itself, a major handicap, since that causes the DBMS to build very large files and rapidly exhaust the operating system's ability to manipulate them.

Back to square one. This time I was determined not to make the mistake of being lazy again, so I set out to build a search system that still relied on MySQL as the data engine but performed most of the search-related functionality itself. Its basic working principle was very simple: it started by parsing a mail message using the mailparse extension and then broke down its body using str_word_count(). I then assigned each word a globally unique identifier (by creating a word table inside my database), and did the same thing for the message, linking the two sets of data as appropriate so that, at runtime, I could extract the messages that contain a specific set of words with a simple query.

This approach did work a bit better (at least, I didn't run out of disk space and could keep all the messages outside of the database itself), but not by much. For one thing, MySQL would still be stubbornly slow when it came to indexing the amount of data involved in the mailing lists; additionally, the search results were rather less than useful, since the system was completely unable to perform any sort of "fuzzy" search. In short, this method might work for a forum system, but it didn't work for me.

A Solution That Can Actually Work

Back to square one, again. By this time, my spare-time luxury had evaporated, so I had to leave the system as it was, with the result that, while perfectly working from a technical perspective, it will never see the light of day (unless the PHP team manages to get their hands on a supercomputer powerful enough to run it).
It wasn't until a lot later, while I was discussing a book proposal with Marcus Baker, the author of our Test Pattern column, that he mentioned an open-source search technology called "Xapian," which calls http://www.xapian.org its home. Xapian is a system built with exactly one purpose: search. Unlike a general-purpose database, it doesn't try to classify the data it stores, delegating that task to a different application layer built specifically for such a purpose. This level of specialization has allowed the developers of Xapian to implement some features that, for lack of a more technical term, can only be defined as "extremely cool":

• Full support for a complete range of Boolean operators: you can easily build (or let your users build) complex queries that include Boolean operators like AND, OR and NOT
• Probabilistic searches: Xapian is capable of ranking search results based on the importance of "relevant" words, so that documents that more closely match a query tend to be returned near the top of the result set
• Phrase and proximity search: the search engine gives higher importance to documents in which search terms appear closer to each other, since they are more likely to be relevant to the search query. Additionally, you can search for specific sentences, rather than just for words.
• Stemming: this feature allows the search engine to determine the "root" of a word and allow queries to match a broader set of terms. For example, the word "checking" will match terms like "check," "checks" and "checker."
• Relevance feedback: this allows a user to "tag" the documents returned by a search that are relevant to his or her needs and then run the query again. The engine will analyze the user's preferences and build a new result set that more closely matches the tagged documents.

Naturally, I won't leave out the best feature of all: compatibility with PHP.

Building Xapian

Before being able to use Xapian, you will have to build it, something that can only be done in a UNIX-like environment, since there is no way to build on Windows unless you use Cygwin.

If your operating system is among those that Xapian supports natively, installation is going to be very easy: Gentoo users will just have to run emerge xapian, while Debian users can use apt-get to retrieve pre-built binary packages directly from the project's website, which also provides the same in RPM form for RedHat flavours of Linux.

For the rest of us who can't afford the luxury of binary packages, there is, of course, the option of compiling from source. This has always been, at least for me, completely painless. All you need to do is download the source package, uncompress it, and run configure followed by make:

    tar xzf xapian-core-0.9.0.tar.gz
    cd xapian-core-0.9.0
    ./configure
    make install
Once the library is installed, you must compile the extension that will allow you to actually access Xapian’s functionality from your scripts. The Xapian team provides the xapian-bindings package, which includes bindings for well over ten languages, including PHP 4.x. Unfortunately, there is no PHP 5 package at this time, due at least in part to the fact that the PHP extension is built using SWIG, an open-source wrapper generator which only supports version 4 of our language. The use
of SWIG has two further side effects: first, the resulting extension tends to be rather large (it is about 2MB on my system); in addition, it can only be compiled as a dynamic module and it cannot be phpize'd.

It's important to keep one thing in mind: the entire Xapian system is based on C++ and relies heavily on its features, such as namespaces. SWIG doesn't translate this into PHP very well, and, therefore, you will have to use Xapian through procedural code instead of an object-oriented approach. This is not a big problem, except that you need to keep in mind a couple of things (ok, maybe three):

First, whereas you would normally access class members by referencing their respective namespaces, in PHP you will actually need to prefix function names with them; for example, the C++ function WritableDatabase::add_document() becomes WritableDatabase_add_document() in PHP.

Second, constructors and destructors are declared (and must be called) explicitly in PHP. For example, a new Query
object is created by calling the function new_Query(), while the same is destroyed using delete_Query(). Do remember that PHP won't call your destructors automatically; it's up to you to do so where necessary.

Finally, class members that are called as methods of an object must receive the object as their first parameter. This means, for example, that if you want to call the method add_document() on an instance of WritableDatabase, you will have to call WritableDatabase_add_document() instead and pass your object to it as the very first parameter.

I realize that this may sound a little daunting but, believe me, if there are daunting tasks in using Xapian from PHP, dealing with naming conventions is not one of them. In the end, this approach is no different from passing a file pointer as the first parameter to fwrite(). The only real difficulty is that you will find a few differences between the official documentation, which was written for C++ users, and your function names. However, the differences are consistent
<?php
$search_strings = array (
    "To be or not to be",    // the key of this first entry is not preserved in the source
    "Spartacus" => "I am Spartacus",
    "Star Wars" => "Luke, I am your father");

// The following adds a document to the database.
// Note how each word is converted to lowercase before
// being added as a search term.
function add_document ($db, $doc_data, $key)
{
    // Format 1 makes str_word_count() return the words as an array
    $words = str_word_count ($doc_data, 1);

    // Create a new document
    $doc = new_document();

    // Set the key as our metadata
    document_set_data ($doc, $key);

    echo "Indexing '$doc_data' with key '$key'\n";

    // Index each word separately, recording the
    // position at which it appears in the source string
    foreach ($words as $k => $v) {
        document_add_posting ($doc, strtolower ($v), $k + 1);
    }

    // Add the document and return the ID
    return writabledatabase_add_document ($db, $doc);
}

// Load the Xapian extension
dl ('xapian.so');

// Open database
$db = new_WritableDatabase ("db", DB_CREATE_OR_OPEN);

// Index the documents
foreach ($search_strings as $k => $v) {
    $id = add_document ($db, $v, $k);
    echo "Document has ID #$id\n";
}

// Close database
delete_writabledatabase ($db);
?>
throughout, so once you've figured out how things work, you shouldn't have any further problems.

Xapian, Quartz and Databases

Xapian is built to be capable of relying on a number of different database backends that provide the necessary data storage and retrieval mechanisms for the search engine to function. The reason for this is that different applications have different needs. For example, an in-memory database backend provides maximum performance at the cost of memory usage and the inability to permanently save the database. For the most part, however, you will find yourself
<?php
$search_strings = array (
    "He kept rolling down the stairs",
    "Living la vida loca",
    "Click the cat has seven lives");

// The following adds a document to the database.
// Note how each word is converted to lowercase before
// being added as a search term.
function add_document ($db, $doc_data, $key)
{
    global $stemmer;

    // Format 1 makes str_word_count() return the words as an array
    $words = str_word_count ($doc_data, 1);

    // Create a new document
    $doc = new_document();

    // Set the key as our metadata
    document_set_data ($doc, $key);

    echo "Indexing '$doc_data' with key '$key'\n";

    // Index each word separately, recording the
    // position at which it appears in the source string
    foreach ($words as $k => $v) {
        document_add_posting ($doc, stem_stem_word ($stemmer, strtolower ($v)), $k + 1);
    }

    // Add the document and return the ID
    return writabledatabase_add_document ($db, $doc);
}

// Load the Xapian extension
dl ('xapian.so');

// Open database
$db = new_WritableDatabase ("db", DB_CREATE_OR_OPEN);

// Create stemmer
$stemmer = new_stem ("english");

// Index the documents
foreach ($search_strings as $k => $v) {
    $id = add_document ($db, $v, $k);
    echo "Document has ID #$id\n";
}

// Close database
delete_writabledatabase ($db);
?>
using Quartz, Xapian's main database engine. Quartz offers a very robust data storage system that is highly scalable and supports reader/writer concurrency, meaning that you can update the database while an arbitrary number of search queries are being executed (although you can't have two processes update it at the same time).

Regardless of what database backend you choose, there is one bit of good news: you won't have to mess around with anything even remotely approaching the SQL language. In fact, you will only have to interact with the database through Xapian methods, which, if you've ever had to insert data from a fifty-field form into a SQL database, can only be an improvement.

Writing the Indexer

In the Xapian model, a search engine is composed of two different entities: an indexer and a searcher. As its name implies, the former's role is that of adding documents to the search engine's database so that queries can later be run against them.

Before being able to add data to a database, we have to ensure that we actually have a database. This is accomplished by initializing one, or simply opening it if it already exists, through a call to new_WritableDatabase(). The first parameter of this function is the name of the directory in which the database files are to be stored; the listings in this article also pass DB_CREATE_OR_OPEN as a second parameter, so that the database is created if it does not exist yet. Clearly, this directory should be in a location that is not directly accessible from the web, but to which your webserver and indexer's processes both have the appropriate access rights.

Once the database has been opened, we can move on to actually creating entries in it. Xapian defines a document as the smallest possible indexable entity. Note that "document" doesn't necessarily mean a text file: you could be indexing all sorts of data, even of a non-textual nature, if you wanted to.
The database engine assigns each document a unique ID, and we are allowed to store an arbitrary amount of data in it that will be returned to us when the document is retrieved by a query against the database. This allows us to quickly find the resource associated with the document by storing a token that identifies an external entity. For example, if we are indexing PDF files, we could save the file path, while if we are crawling a remote website, we can store the URL of each page.

It is not a good idea to store the actual resources in the database, for a number of reasons, chiefly the fact that the Xapian database was designed as a repository of search metadata and not as a general-purpose database. Keeping your documents in it will serve no purpose other than bloating its files, since you can't perform any operation on the data, even retrieving it, without going through a search query.

Finally, each document contains an arbitrary number of search terms, which will be used by the search engine
to match results against a query. It's important to understand that search terms are used by the search engine without applying any semantic rules to them. This gives Xapian maximum flexibility by allowing you to use any sort of data, including binary data, as a search term, but it does bring up one interesting problem that you must be aware of when dealing with text.

Since no semantic rules are attached to the search terms, textual data will be treated literally and in a case-sensitive way by the engine. This means that "Marco" will be different from "marco," and "search" will be different from "searching." Thus, without proper intervention, arbitrary search queries from an end user are likely to fare rather poorly. Luckily, it's very easy to solve this problem, as we'll see in a moment.

A new document is created by calling the new_document() function. We can then add data and search terms to it by calling document_set_data() and document_add_posting() respectively. Finally, we can insert the document into the database by calling WritableDatabase_add_document(). This can be optionally followed by a call to WritableDatabase_flush(), which will cause the document to be written to disk and made immediately available for searching.

This simple process can be repeated as many times as needed: you simply create a new document for every item that needs to be indexed. Once you've processed all of your documents, you need to call delete_WritableDatabase() in order to remove your exclusive write lock on the database; failing to do so will result in the database becoming inaccessible for writing. In case of a crash, removing the lock is simple enough (all you have to do is delete a single file in your database directory), but an automated process that doesn't delete its locks will effectively stop working until a system administrator can manually intervene.

Listing 1 shows a very simple indexer.
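Before going through the listing in detail, the case-sensitivity problem and the word positions mentioned above can be tried out without Xapian at all. The sketch below uses only str_word_count() and strtolower(); the helper name terms_with_positions() is hypothetical and is not part of the Xapian bindings. It returns the lowercased terms keyed by their 1-based position, which is the information an indexer would hand to document_add_posting().

```php
<?php
// Hypothetical helper: split a string into lowercased search terms,
// keyed by the 1-based position of each word in the source string.
function terms_with_positions($text)
{
    $terms = array();

    // Format 1 makes str_word_count() return the words as an array.
    foreach (str_word_count($text, 1) as $k => $word) {
        $terms[$k + 1] = strtolower($word);
    }

    return $terms;
}

print_r(terms_with_positions("Luke, I am your father"));
```

Lowercasing at indexing time only helps if the search terms typed by the user are lowercased in the same way before the query is built.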
We start (on line 42) by loading the Xapian extension, which, if you compiled it properly, will already be stored in the default extension location for your PHP installation. We then follow the procedure that I outlined earlier in this section. Note how each word is converted to lowercase before being added to the appropriate document as a search term. Also, I used the str_word_count() function to break down each source string into its component words, and ensured that the script will add the latter to its document in the same order in which they appear in the former. This will make it possible for the search engine to calculate a proper proximity map when queries are executed later on.

Building the Searcher

It’s now time to move on to the portion of our application that actually performs the search. The procedure that we must follow here is extremely simple—first, we must open our database in read-only mode (remember, there can only be one writer at any given time, but the system supports an arbitrary number of concurrent readers). Next, we build a query based on user input and execute it against the database. Finally, we retrieve the appropriate result set from the latter and ask the search engine to also return an estimate of the total number of hits.

Building the query is definitely the most complicated task at hand. A query is composed of a set of search terms joined together by Boolean operators. Sounds complicated? It can be. Consider, for example, the following search string:

    "Marco Tabini" PHP
If we typed this string in a search engine like Google, we’d expect it to execute a search for the string “Marco Tabini” in proximity of the word “PHP.” In reality, what we are telling the search engine is that we want to execute a query like the following:

    (MARCO and TABINI) or PHP
You probably weren’t expecting the or operator right after the closed parenthesis—but, as far as Xapian is concerned, its use only causes the search engine to perform a probabilistic match, which is what we are after.

Once you have determined how your search terms should be joined together, you can create a new query by using the new_Query() function. In this context, a query is simply the search for a given search term—for example, the search string above contains three different queries: one for the word “MARCO,” one for “TABINI” and one for “PHP.” Additional queries can be tacked on to the first one by using the new_Query_from_query_pair() function, which also allows you to specify a Boolean operator used to glue the two queries together.

As you can see, this process is actually quite simple—the difficulty lies in interpreting user input in order to produce a well-formed query. Luckily, Xapian offers a query parser that can be used for the purpose—we’ll look at it later on and, for the moment, limit ourselves to probabilistic searches using the or operator only.

Once all the search terms have been built into a final query, this must be inserted into an enquire context, which can then be used to execute it by calling the enquire_get_mset() function. The latter allows you to choose which rows should be returned, much like a limit clause would on a MySQL query.

Query results are returned using an Iterator. This is a very convenient template construct in C++ that becomes… a little less convenient in your PHP scripts. With a proper OOP implementation (particularly in PHP 5, where iterators are part of the Standard PHP Library), this approach would allow us to extract the data using nothing more than a simple foreach loop—but we are stuck with procedural code in PHP 4 and, therefore, we must navigate the result set using a slightly less elegant while loop instead.

Let’s take a look at the example in Listing 2. As before, this script essentially follows the process that I just outlined. Note that, once again, we are converting each search term to lowercase before adding it to our query—this keeps everything within our application nice and consistent. After executing the query, on line
34 we output the number of estimated matches by calling mset_get_matches_estimated() and then enter a while() loop, where we cycle through our result set. For each document that our query returned, we extract the unique document ID (which you’ll rarely need since you can store your own arbitrary metadata), followed by the match probability that the search engine calculated and, finally, the metadata that we associated with the document itself, which we then use to retrieve our original text string. Figure 1 shows the search engine at work. As you can see, I searched for the string “I am,” which returned two matches as expected.

Using a Stemming Algorithm

Even though we have a functioning search engine, it is far from being flexible. For example, our current efforts at neutralizing certain semantically insignificant differences in the search terms are very limited: while we are
converting every word to lowercase, that doesn’t help us one bit when it comes to performing a search that is based on the root of a word rather than on one of its inflected forms. For example, a user looking for “debug” will expect the search engine to match words like “debugged,” “debugger” and “debugging”—all requirements of which our application is blissfully unaware.

Xapian, however, is a classy tool and, therefore, it comes bundled with a stemmer—a utility capable of applying the appropriate rules to reduce a word to its root (or “stem”) in a given language. The stemmer that comes with Xapian must be invoked separately (so that the semantic rules that apply to text don’t have to interfere with other types of data) and supports a variety of different languages. Its use is incredibly simple: all you need to do is instantiate a new stemmer by calling the new_stem() function, and then call the stem_stem_word() function on every word that needs to be indexed or used as a search term before passing it on to Xapian itself.

As you can see in Listings 3 and 4, this changes very little as far as our scripts are concerned. However, when we go to execute a query, the results change quite dramatically. In Figure 2, for example, you can see that searching for the word “roll” returns one entry that actually contains the word “rolling,” thus making our search engine much more powerful.

It is important to keep in mind that, like all software that deals with human languages, the stemmer is far from perfect. Consider, for example, the result set shown in Figure 3. As you can see, I was searching for the word “life” and would have reasonably expected a match against the string “Click the cat has nine lives,” since “lives” is the plural of “life.” Unfortunately, this word does not follow the normal rules of the English language and, therefore, the stemmer is incapable of dealing with it in a proper way.
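To see why rule-based stemming trips over irregular words, it can help to look at a deliberately naive suffix stripper. The function below is a toy written for this illustration only: toy_stem() is a hypothetical name, its rules are invented here, and it is not the algorithm that Xapian's stemmer applies.

```php
<?php
// Toy suffix stripper for illustration only. It is NOT the algorithm used
// by Xapian's stemmer; toy_stem() and its rules are assumptions.
function toy_stem($word)
{
    $word = strtolower($word);
    foreach (array('ing', 'ed', 'er', 's') as $suffix) {
        $length = strlen($suffix);
        // Only strip when at least three characters would remain.
        if (strlen($word) - $length >= 3 && substr($word, -$length) === $suffix) {
            $word = substr($word, 0, -$length);
            break;
        }
    }
    // Undo consonant doubling ("debugg" becomes "debug"), but leave "ll" alone.
    if (preg_match('/([^aeioul])\1$/', $word)) {
        $word = substr($word, 0, -1);
    }
    return $word;
}

foreach (array('debug', 'debugged', 'debugger', 'debugging', 'life', 'lives') as $word) {
    echo $word, ' => ', toy_stem($word), "\n";
}
```

The regular forms of “debug” all collapse into one term, while “life” and “lives” stay apart because no simple suffix rule connects them. Whatever function is used, it has to be applied to both the indexed words and the query words, otherwise the two sides will never meet.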
On the other hand, searching for “lives” matches both the string above and “Living la vida loca”—proof that, while not perfect, the stemmer does work in most cases.

Taking Advantage of the Query Parser

As I mentioned earlier, Xapian includes a query parser capable of interpreting a complex set of syntactical rules (shown in Figure 4) and returning a query object ready to be fed to the search engine. You will probably find that the full functionality offered by the parser is actually overkill for all but the most advanced of users, but this particular little utility can still come in very handy when you don’t want to write your own parser—an operation that, in PHP, can be as dull as it is complicated.

Figure 4

    OPERATOR    DESCRIPTION
    AND         Matches documents that are matched by both operands
    OR          Matches documents that are matched by either operand
    NOT         Matches documents that are matched only by the left operand
    XOR         Matches documents that are matched by one of the operands, but not by both
    ( and )     Allows for the sub-grouping of expressions
    + and -     Unary operators. Match documents that contain all the terms prefixed by a plus sign and none of the terms prefixed by a minus sign. For example: +Marco -Tabini
    NEAR        Matches documents in which the two operands are within ten words of each other
    " "         Allows for phrase search

Using the query parser in your script is extremely easy. All you need to do is instantiate it and initialize it with a stemmer (if you need to use one), then feed it the search string and obtain a query object in return—that’s it!

Let’s take a look at how this is done in practice. Listing 5 shows our little search engine modified so that the query parser has replaced the original parser written in PHP. As you can see, the new_queryparser() function creates a new query parser instance, which is then initialized by providing it with a stemmer (queryparser_set_stemmer()) and a stemming strategy (queryparser_set_stemming_strategy()). In this case, we are instructing the parser to stem all words unless they end in a dot; this parameter can also be set to zero (meaning that no words are ever stemmed) or to two (in which case all words are always stemmed). Finally, a call to queryparser_set_database() ensures that the query parser is accessing our database, while queryparser_parse_query() interprets the search string and returns a query object that we can handle in the usual manner.

Where to Go From Here

Despite the fact that even by just implementing the material that I presented in this article you will end up with a rather powerful and very scalable search engine, we barely scratched the surface of what Xapian can do.

One of the most interesting aspects of the search engine is the ability to execute a query on multiple databases at the same time, returning a seamless result set that incorporates documents from all the various schemas. You can even run a query through multiple machines across a network—each machine can return its own results, which will be combined with all the others into a single set. These distributed execution features alone give you nearly unlimited scalability.

In terms of performance, Xapian’s specialization pays off: if you’re used to the speed of MySQL’s full-text engine, particularly with a very large data set, you’ll be blown away. On an adequately-equipped machine, Xapian provides all the robustness needed by a large-scale enterprise application with none of the costs normally associated with such software.

The one area where you’ll find Xapian a bit frustrating is documentation. While the entire API is fully documented, it is not always easy to figure out exactly how things work, because there are very few examples. Still, I had no problem grasping the basics and, with a bit of research, I managed to get my search engine going in no time at all. Considering how much time I wasted trying to build a search engine with tools that were not nearly as good as they claimed to be, the effort that it takes to get Xapian up and running is minimal—particularly if you consider the results.
About the Author
Marco is the Publisher of (and a frequent contributor to) php|architect. When not posting bogus bugs to the PHP website just for the heck of it, he can be found trying to hack his computer into submission. You can write to him at [email protected].
To Discuss this article: http://forums.phparch.com/229
TEST PATTERN
The Construction Industry by Marcus Baker
We write slices of applications. The PHP architecture involves writing code on a page by page basis. At the start of every page we have to create every object and at the end of the script they are all torn down. This constant setting up and tearing down makes object construction very important to the PHPer. It’s not always an easy task, though: if one object can create another, you have introduced a dependency just as much as if one object uses another. Dependencies can make your code tangled, so what are the options?
When we talk about dependencies, we usually think of one object using another—but there is another equally important type of dependency. Unless we are going to create every object in the top level script, some of our other objects will be creating other objects too. This process is called instantiation. Like any dependency, you want the connection to be as minimal as possible and carefully chosen. In PHP, we do a lot of construction, so it’s worth knowing the options.

This is going to be a simple list, working through the most common ways one object can create another. I am only going to list object oriented methods, but in spite of the simplicity, there are a lot of patterns. Not a very exciting topic? Well the times they are ’a changin’, so bear with me.

Because I cannot give an exhaustive example for each pattern, I am going to use the same example for each one. Not only that, but I am going to abbreviate it heavily and ask your imagination to fill in the gaps. To help you out though, I will describe the example I am going to use throughout very carefully now.

We have a template class, called Template, that can have substitutions added to a specially marked up HTML file. We won’t worry too much about the substitution details except to say that these will mostly be simple name and value pairs. The main part is the Template::paint() method, as this will actually print
the combined content to the browser. It’s not that simple though, because we would like a string such as “R&B's legacy to the world” to be sent to the browser as “R&amp;B's legacy to the world”. Being sticklers for clean code, we want strict XHTML, but we don’t want all of that tedious conversion code cluttering up our nice simple Template class. Because of this, we’ll move all of this XHTML functionality into a class called XhtmlWriter. It will be a simple class with just a single XhtmlWriter::write() method. When you send a string to this method, the string pops onto your browser all neatly converted. The Template::paint() method will use the XhtmlWriter rather than print directly.

I hope that you can already visualize how to implement these two classes (they are not complicated, after all) and that we can move on to the bigger issue. The big question is: how will the Template get hold of an instance of the writer?
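For readers who prefer to see it written down, here is one possible shape for the writer. It is a minimal sketch rather than the class from the code archive: the use of htmlspecialchars() and the decision to leave quotes untouched are assumptions made for this example.

```php
<?php
// Sketch only: one possible shape for the writer described above.
class XhtmlWriter
{
    // Escapes markup-significant characters, then sends the text to the
    // browser. Quotes are left alone (ENT_NOQUOTES), which is enough for
    // text that sits between tags rather than inside an attribute.
    function write($text)
    {
        echo htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
    }
}

$writer = new XhtmlWriter();
$writer->write("R&B's legacy to the world");
```

Everything the Template needs from the writer is that single write() call, which is what makes the choice of how to obtain the instance the interesting part.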
REQUIREMENTS
PHP: 5
OS: Any
Other Software: None
Code Directory: construction
1. Just Use “new”

The simplest approach is to create the writer at the point of need. In our case this is in the paint() method itself...

    class Template {
        ...
        function paint() {
            $writer = new XhtmlWriter();
            ...
        }
    }
This makes using the Template very easy...

    $template = new Template();
    $template->set('name', 'Marcus');
    $template->paint();
Because the creation of the writer is sealed inside the paint() method, as programmers we are completely unaware of it, and so the client code is very clean. This is the tightest possible interaction between the two objects. The writer lives just long enough to do its job, and no longer.

Unfortunately, this means we have no control over this process. What if we want a simple plain text writer, or perhaps a Rich Text Format (RTF) writer? More importantly, what happens when we want to test the Template class? Any test will cause the template object to start writing out text, probably messing up the display of our test tools, and it’s not so easy to test output once it has been sent. With this implementation, there is nothing we can do; it’s XHTML or nothing.

2. Parameterization

The lightest touch is to have the two classes in contact for the duration of the method call only. We do that by passing one object into another...

    $template = new Template();
    $template->set('name', 'Marcus');
    $template->paint(new XhtmlWriter());
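For this call to work, paint() has to accept the writer as an argument. A minimal sketch of what that might look like follows; StringWriter is a hypothetical test double, and the {name} placeholder syntax and default template text are assumptions made for this example.

```php
<?php
// Sketch only. StringWriter is a hypothetical test double, and the
// "{name}" placeholder syntax is an assumption made for this example.
class XhtmlWriter
{
    function write($text)
    {
        echo htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
    }
}

class StringWriter
{
    public $buffer = '';

    // Collects the text instead of printing it, so a test can inspect it.
    function write($text)
    {
        $this->buffer .= $text;
    }
}

class Template
{
    private $source;
    private $values = array();

    function __construct($source = 'Hello {name}')
    {
        $this->source = $source;
    }

    function set($name, $value)
    {
        $this->values['{' . $name . '}'] = $value;
    }

    // The writer is in contact with the template for this call only.
    function paint($writer)
    {
        $writer->write(strtr($this->source, $this->values));
    }
}

$template = new Template();
$template->set('name', 'Marcus');

$capture = new StringWriter();
$template->paint($capture);           // nothing reaches the browser
$template->paint(new XhtmlWriter());  // prints: Hello Marcus
```

The same Template object serves both a test and the live page; only the argument changes.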
It is now the caller’s responsibility to create the writer. The good news is that we can create any writer that we want. The bad news is that we will have to do it for every call to Template::paint(), which will get annoying if there are a lot of them. It will be especially unfortunate if one part of our code creates an XhtmlWriter and another part creates an RtfWriter by mistake. The burden is pushed to the client code.

Despite these minor pitfalls, this is still the simplest way to go. The code is so explicit and emphatic it is very difficult to introduce bugs. I always use this approach unless I have a good reason not to.

3. Passing to the Constructor

This is an object oriented programmer’s mainstay...

    $template = new Template(new XhtmlWriter());
    $template->set('name', 'Marcus');
    $template->paint();
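On the receiving side, the constructor only has to store the writer for later. A minimal sketch, in which the Writer interface, the body of RtfWriter and the {name} placeholder syntax are all assumptions made for illustration:

```php
<?php
// Sketch only. The Writer interface, the RtfWriter body and the "{name}"
// placeholder syntax are assumptions made for this example.
interface Writer
{
    function write($text);
}

class XhtmlWriter implements Writer
{
    function write($text)
    {
        echo htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
    }
}

class RtfWriter implements Writer
{
    function write($text)
    {
        // Crude stand-in for real RTF generation.
        echo '{\rtf1 ', $text, '}';
    }
}

class Template
{
    private $writer;
    private $values = array();

    // The writer is supplied once and kept until paint() needs it.
    function __construct(Writer $writer)
    {
        $this->writer = $writer;
    }

    function set($name, $value)
    {
        $this->values['{' . $name . '}'] = $value;
    }

    function paint()
    {
        $this->writer->write(strtr('Hello {name}', $this->values));
    }
}

$template = new Template(new XhtmlWriter());
$template->set('name', 'Marcus');
$template->paint();
$template->paint();  // same writer, no need to pass it again
```

Swapping `new XhtmlWriter()` for `new RtfWriter()` on the first line of the usage is the only change needed to get different output from the same Template class.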
Here we pass the writer into the Template constructor. In PHP 5, objects are passed around as handles, so the Template ends up holding the very same writer that we created. This way the Template gets a writer that it keeps, tucked away, until it is needed. This is obviously handy if we have to call Template::paint() more than once, as we only have to specify the writer once for both calls. This mechanism is very common; a number of patterns implement this arrangement: Strategy, Observer, Bridge, Adapter and Proxy for example.

We didn’t actually have to use the constructor for this. We could have used a setter instead, but when you do that, you have to make sure that the setter is called before the first use of the object. Being forced to do one thing before another, and having to remember to do it yourself is called temporal coupling. By using the constructor, we make sure that the writer is always created before it is used.

Because the Template holds a reference to the writer rather than a copy, we maintain a ghostly connection with it. You could take the Template object over to one side of the room and the XhtmlWriter over to another, metaphorically speaking of course, and Template::paint() would still write messages to the writer. This is the essence of the Observer pattern. The writer is observing the template here, waiting for messages.

Another way of looking at it is to think of the writer modifying the behaviour of the template. Right now, it is equivalent to an “XhtmlTemplate”, but if we plug in an RtfWriter we will, in effect, get an “RtfTemplate” object, instead. This is the essence of the Strategy pattern (covered in both the January and February issues of PHP Architect).

The only downside is that the connection is always something that has to be thought about. However, any real world code has to have some connections and, done this way, everything is controlled and very obvious. This technique gets a big thumbs-up.

4. An Internal Factory Method

If you are still hankering after the very clean interface of our first example, then this may be for you.
Instead of using the new keyword directly, we can use an internal factory method called Template::createWriter(). We still use new, but we’ve created an additional intercept around it...

    class Template {
        ...
        function paint() {
            $writer = $this->createWriter();
            ...
        }

        protected function createWriter() {
            return new XhtmlWriter();
        }
    }
A factory method is really any method that creates another object. What’s the point of this? Well, a problem with the hard coded version was that we could not change the writer. Now we can change the writer by subclassing the Template...

    class RtfTemplate extends Template {
        function createWriter() {
            return new RtfWriter();
        }
    }
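The same override works for any replacement writer. As a sketch, with RecordingWriter and TestableTemplate as hypothetical names and a hard-coded message standing in for real template output, a subclass can hand back a writer that records instead of printing:

```php
<?php
// Sketch only. RecordingWriter and TestableTemplate are hypothetical names.
class XhtmlWriter
{
    function write($text)
    {
        echo htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
    }
}

class Template
{
    function paint()
    {
        $writer = $this->createWriter();
        $writer->write('Hello Marcus');
    }

    // The single place where the concrete writer class is named.
    protected function createWriter()
    {
        return new XhtmlWriter();
    }
}

class RecordingWriter
{
    public $written = array();

    function write($text)
    {
        $this->written[] = $text;
    }
}

class TestableTemplate extends Template
{
    public $writer;

    function __construct()
    {
        $this->writer = new RecordingWriter();
    }

    // Hands back the recorder, so nothing is printed while testing.
    protected function createWriter()
    {
        return $this->writer;
    }
}

$template = new TestableTemplate();
$template->paint();
```

After the call, `$template->writer->written` holds whatever paint() tried to send, ready for inspection.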
That was why we marked the method as protected. It was so that we could knock it out when subclassing. Although this adds some measure of flexibility, it is actually a fairly limited technique. Single inheritance means that we can only apply this trick once. If Template already inherits from another class, then things will get tangled pretty quickly. If in doubt, I prefer one of the passing techniques above.

One case where the subclassing trick is useful, though, is for making legacy code test-friendly. An essential first step in testing a class is to isolate it from any other class. You don’t want any side effects from your tests and you want complete control of the environment in which the class is running. This leads to a chicken and egg situation. You don’t want to edit the code until it is under tests, but you cannot get it under tests because of all of the strands and dependencies. A way out is to replace all of the construction using new with the little factory methods. This is a trivial change that can be safely done by eye. When it comes time to test the Template, you can now subclass it in such a way that a fake version of a writer is used instead.

5. The Big Guns: AbstractFactory

This is a highly factored solution and will solve even the most complicated cases. Creating the factory method internally required subclassing to change it, so instead we’ll pass an object that contains the factory method directly into our class...

    $template = new Template(new WebServer());
    $template->set('name', 'Marcus');
    $template->paint();

    class Template {
        ...
        function paint() {
            $writer = $this->environment->createWriter();
            ...
        }
    }
The $this->environment variable in the Template::paint() method is the WebServer object previously passed into the Template constructor. This implements the Strategy approach, as discussed earlier, but adds an extra factory step. For our trivial example, this is obviously an insanely complex solution and complete overkill. To have AbstractFactory make sense we must imagine additional complications.
We’ll pretend that there is another processing step in our application and that it takes the $_POST parameters, wrapped up in a Request object. Just like our Template class, the other imaginary code uses our factory class to create all of its objects. Thus, our WebServer class currently looks like this...

    class WebServer {
        function createWriter() { ... }
        function createRequest() { ... }
    }
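The elided method bodies might be filled in along the following lines. This is a sketch under stated assumptions: the Request class and its get() method are invented for the example, as is the hard-coded message that Template writes.

```php
<?php
// Sketch only. The Request class and all method bodies are assumptions;
// the article leaves them elided.
class XhtmlWriter
{
    function write($text)
    {
        echo htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
    }
}

class Request
{
    private $parameters;

    function __construct($parameters)
    {
        $this->parameters = $parameters;
    }

    function get($name)
    {
        return isset($this->parameters[$name]) ? $this->parameters[$name] : null;
    }
}

class WebServer
{
    function createWriter()
    {
        return new XhtmlWriter();
    }

    function createRequest()
    {
        return new Request($_POST);
    }
}

class Template
{
    private $environment;

    function __construct($environment)
    {
        $this->environment = $environment;
    }

    function paint()
    {
        // Template never names a concrete writer class.
        $writer = $this->environment->createWriter();
        $writer->write('Hello Marcus');
    }
}

$template = new Template(new WebServer());
$template->paint();
```

Every class that needs a writer or a request asks the environment object for one, so the decision about which concrete classes to use lives in exactly one place.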
To make things easy for ourselves, pretend that we instantiate this object once, at the start of the web page. Now look at this class...

    class CommandLine {
        function createWriter() { ... }
        function createRequest() { ... }
    }
It creates a TextWriter and gets the request data from the command line arguments. As far as other objects are concerned though, these are details. All they care about is that they get some abstract kind of factory. Think about that for a second. We can have our code change from a web based application into a command line tool by changing one class name. This is the real power of object orientation, but complexity-wise, it comes at a price.

6. Static Factory

So far, we have been passing a lot of objects around. Another way to make the factory method available is to have it static...

    class WebServer {
        static function createWriter() {
            return new XhtmlWriter();
        }
    }

    class Template {
        ...
        function paint() {
            $writer = WebServer::createWriter();
            ...
        }
    }
Of course the static method could be a global function as well, as they are pretty much equivalent. The only advantage this approach gives is that you could change the XhtmlWriter class name without affecting the rest of the code. Well you can do that with a search and replace operation, so this benefit is a mirage. The problem with static methods is that they fix the class name, robbing you of polymorphism. You can create a different class depending on the parameters, but the set of available options is sealed.

7. The Evil Global

As if statics were not bad enough, I am going to cover something even worse...

    global $writer;
    $writer = new XhtmlWriter();

    class Template {
        ...
        function paint() {
            global $writer;
            ...
        }
    }
Once you resort to globals, the chance of an accident heads rapidly to 100% on any project that is remotely complicated. Besides some other method overwriting the global, you have to make sure it is properly set up, once only, before you use it.

The usual reason for using a global is to cut down on the clutter of passing the object around. Is passing the object, or at least a factory for that object, much work? You think so? Have you actually tried it? In a PHP web application, each page request usually touches only a slice of the application. The need to keep objects global should be rare.

The real evil of globals, though, is the combination of action at a distance and the availability of the object to every other part of the code. Once the value has been changed, you have no way of knowing which method in which class did the deed. Globals also create havoc with testing, as one run will interfere with the next. To prevent accidents, programmers will come up with all sorts of naming conventions to flag the danger, not realizing that they are creating more work than if they just added some intelligent design. When the disaster occurs anyway, debugging is a nightmare.

8. The Singleton

This is a fiendishly clever pattern to get around the limitations of a global variable, and yet still have a single instance that is accessible to all. Basically, it is a static method that only returns one instance that is, itself, static.

There are all sorts of reasons why you might want to guarantee something is only created once and then reused. A database connection is an example when used in a simple system. The connection has to open a transaction on construction and close it again on destruction. If you want all of your queries going out in a single transaction, it makes sense to ensure that a second one cannot be created to spoil the show. Here is a typical PHP5 implementation:

    class XhtmlWriter {
        ...
        static function instance() {
            static $writer;
            if (! isset($writer)) {
                $writer = new XhtmlWriter();
            }
            return $writer;
        }
    }
The first time we make a call to XhtmlWriter::instance(), we receive a freshly created XhtmlWriter instance, as a result of the method block execution. Because the $writer variable is static, it survives the first invocation and is still around when we ask for it again. This trick of keeping a variable alive inside a code block is reminiscent of a “closure”. We keep passing back references to this single variable, a variable that cannot be directly accessed except by the code block that created it. By getting the only instance this way, we ensure that it cannot be overwritten and that it is guaranteed to be instantiated correctly...

    class Template {
        function paint() {
            $writer = XhtmlWriter::instance();
            ...
        }
    }
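A short, self-contained check makes the sharing visible. The $calls counter is a hypothetical property added only so that the example has some state to observe.

```php
<?php
// Sketch only: shows that repeated calls hand back one shared object.
// The $calls property is hypothetical, added just to have visible state.
class XhtmlWriter
{
    public $calls = 0;

    static function instance()
    {
        static $writer;
        if (! isset($writer)) {
            $writer = new XhtmlWriter();
        }
        return $writer;
    }
}

$first = XhtmlWriter::instance();
$second = XhtmlWriter::instance();
$first->calls++;

var_dump($first === $second);  // bool(true)
echo $second->calls, "\n";     // 1
```

A change made through one variable shows up through the other, because both refer to the same object. That is the whole point of the pattern, and it is also why state left behind by one piece of code is still there for the next.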
A clever little pattern, but the static method makes this pattern very inflexible. You have to rewrite the code to change the type of instance. It also makes the application difficult to test, because that single instance is impossible to reset between each test. Code that you cannot test is certainly broken, and the Singleton, these days, has a bad reputation.

9. The Registry

This is a major upgrade to the Singleton, but is much more complicated to code. Essentially the Registry is a singleton itself, but acts as a glorified hash into which you can stash other global resources. For example, we could start our application with:

    $registry = Registry::instance();
    $registry->setWriter(new XhtmlWriter());
This is similar to just making the writer a global, right? Well sort of, but when we go through a method call, we can add additional protection. For example, we can have the setter throw an exception if the object is accidentally set up twice. In our template we pick up the XhtmlWriter instance the same way:

    class Template {
        ...
        function paint() {
            $registry = Registry::instance();
            $writer = $registry->getWriter();
            ...
        }
    }
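The Registry class itself is not shown, so here is a minimal sketch of how it might be written, including the guard against setting the writer twice. Everything beyond the instance(), setWriter() and getWriter() names is an assumption, including the exception messages.

```php
<?php
// Sketch only. The article does not show the Registry class itself, so
// everything beyond instance(), setWriter() and getWriter() is an assumption.
class XhtmlWriter
{
}

class Registry
{
    private $writer = null;

    static function instance()
    {
        static $registry;
        if (! isset($registry)) {
            $registry = new Registry();
        }
        return $registry;
    }

    function setWriter($writer)
    {
        // Refuse a second assignment instead of silently overwriting.
        if ($this->writer !== null) {
            throw new Exception('Writer has already been registered');
        }
        $this->writer = $writer;
    }

    function getWriter()
    {
        if ($this->writer === null) {
            throw new Exception('No writer has been registered');
        }
        return $this->writer;
    }
}

$registry = Registry::instance();
$registry->setWriter(new XhtmlWriter());

try {
    Registry::instance()->setWriter(new XhtmlWriter());
} catch (Exception $exception) {
    echo $exception->getMessage(), "\n";
}
```

Compared with a bare global, the accidental second assignment is caught at the moment it happens rather than discovered later as a mysterious change of behaviour.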
The Registry is not a particularly clever pattern, but it is an effective one. By replacing the global keyword with your own class, you can create your own system just for your application. For example, we could set a writer factory of some kind, rather than just a single writer, if that is what we wanted. Alternatively, we could add a method to allow a test version of the writer to be used when testing.

The power of the Registry pattern comes from its customization, but we are still messing with essentially global information. If you must have global data, then at least do it with a Registry. If you catch yourself using the Registry in more than a few places though, you should pinch yourself and try to pass some of this information around. Your code will be easier to understand if you do.

10. Dependency Injection

A dependency injector is basically a replacement for the new operator. You give it a class name, and it returns an instance of that class. I have had to write a demonstration version of the tool here, because most of the DI tools are for Java right now (e.g. “Pico”, “Spring” or “Needle” for Ruby). Some PHP versions are in the works, but are unfinished at the time of writing. My very minimal framework, nicknamed “Phemto”, exists only as Listing 1. Here it is in action:

    $injector = new Injector();
    $injector->register('XhtmlWriter');
This sets up the dependency injector to deliver our writer:

    class Template {
        ...
        function paint() {
            $injector = new Injector();
            $writer = $injector->create('XhtmlWriter');
        }
    }
If this was all there was to it, then this would be a complete waste of time. Luckily, there are two wrinkles that make this a very powerful technique. The first is that the request for an instance can be very flexible. Suppose our XhtmlWriter had been written like this:

    class XhtmlWriter implements Writer {
        ...
    }
The injector is a heavy user of reflection and will know that by registering the XhtmlWriter, we actually have a class that can fulfill the role of a Writer interface. This means that we can write our Template::paint() method like so:

    class Template {
        ...
        function paint() {
            $injector = new Injector();
            $writer = $injector->create('Writer');
            ...
        }
    }
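The lookup by interface name can be sketched on its own, away from the injector. In this illustration register_class() is a hypothetical helper and the registry is a plain array; the only library feature relied upon is PHP's ReflectionClass.

```php
<?php
// Sketch only: how registering a class can make it findable by interface.
// register_class() is a hypothetical helper, separate from the Injector.
interface Writer
{
    function write($text);
}

class XhtmlWriter implements Writer
{
    function write($text)
    {
        echo htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
    }
}

function register_class(array $registry, $class)
{
    $registry[$class] = $class;
    $reflection = new ReflectionClass($class);
    // Every interface the class implements becomes a lookup key as well.
    foreach ($reflection->getInterfaceNames() as $interface) {
        $registry[$interface] = $class;
    }
    return $registry;
}

$registry = register_class(array(), 'XhtmlWriter');
echo $registry['Writer'], "\n";  // XhtmlWriter
```

Registering a different class that implements Writer would change what the 'Writer' key resolves to, without any change to the code that asks for it.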
In other words, we can wire up our program using only interfaces. Replacing our XhtmlWriter with another can be done at run time without any obvious change of code, which is handy for testing, if nothing else. The Template part of the code doesn’t even need to know about the XhtmlWriter class, even though it was responsible for instantiating it. The Template knows about the injector, but we can tackle that next.

Listing 1

    <?php
    class Injector {
        private $registry = array();

        // Registers a class under its own name and under every interface
        // that it implements.
        function register($class) {
            $this->registry[$class] = $class;
            $reflection = new ReflectionClass($class);
            foreach ($reflection->getInterfaces() as $interface) {
                $this->registry[$interface->getName()] = $class;
            }
        }

        function create($interface) {
            if (! isset($this->registry[$interface])) {
                throw new Exception("No class registered for interface $interface");
            }
            $class = $this->registry[$interface];
            $dependencies = $this->findConstructorDependencies($class);
            return $this->instantiate($class, $dependencies);
        }

        // Builds and evaluates a "new" expression with one freshly created
        // object per constructor dependency.
        private function instantiate($class, $dependencies) {
            $objects = array();
            $code = "";
            for ($i = 0; $i < count($dependencies); $i++) {
                $objects[$i] = $this->create($dependencies[$i]);
                if ($i > 0) {
                    $code .= ", ";
                }
                $code .= "\$objects[$i]";
            }
            $code = "\$object = new $class($code);";
            eval($code);
            return $object;
        }

        // Reads the type hints of the constructor parameters.
        private function findConstructorDependencies($class) {
            try {
                $reflection = new ReflectionClass($class);
                if ($constructor = $reflection->getConstructor()) {
                    $interfaces = array();
                    $parameters = $constructor->getParameters();
                    for ($i = 0; $i < count($parameters); $i++) {
                        $interfaces[$i] = $parameters[$i]->getClass()->getName();
                    }
                    return $interfaces;
                }
            } catch (Exception $exception) {
            }
            return array();
        }
    }
    ?>

In case the power of this is still dipping below your radar, here is another trick. Remember our third example, where we passed a new writer in the constructor of Template? First, a reflection-friendly Template class definition:

    class Template {
        function __construct(Writer $writer) { ... }
        ...
    }
Here is “constructor based dependency injection” in action:

    $injector = new Injector();
    $injector->register('Template');
    $injector->register('XhtmlWriter');
    $template = $injector->create('Template');

That last line didn’t just create a Template instance, it also automatically created an XhtmlWriter instance to fulfill the requirement of the constructor. In other words, we have been able to instantiate a Template without any knowledge of its dependencies at all, yet we still retain full control of all of its components, and this is all figured out by the injector. Because the dependencies were dealt with on construction, the injector is now invisible to the Template code as well. This is serious decoupling, and it has the power to manage large applications and frameworks. To describe dependency injection as a hot topic right now would be something of an understatement.

About the Author
Marcus Baker works at Wordtracker (www.wordtracker.com) as Head of Technical, where his responsibilities include the development of applications for mining Internet search engine data. His previous work includes telephony and robotics. Marcus is the lead developer of the SimpleTest project, which is available on Sourceforge. He's also a big fan of eXtreme programming, which he has been practising for about two years.
June 2005 ● PHP Architect ● www.phparch.com
To Discuss this article:
http://forums.phparch.com/209
PRODUCT REVIEW
Agata 7 Report Generator
by Peter B. MacIntyre
Figure 1
This month I am reviewing the product called Agata Report Generator. It is an open source product with a home in Brazil. If you look at the credits page for this product, you will see this list of the supporting cast of tools and products used to build it:
• PHP (www.php.net);
• PHP-GTK (gtk.php.net);
• PEAR (pear.php.net);
• PhpDocWriter (http://phpdocwriter.sourceforge.net);
• FPDF (www.fpdf.org);
• BarCode128 (GuinuX);
• JPGraph (www.aditus.nu/jpgraph);
• Ximian Icons (http://www.ximian.com);
PRODUCT INFORMATION
PHP: 4+
OS: Any
Product Version: 7
Price: FREE!
Web Address: http://www.agata.org.br
So, as you can see, this is a rather unique product that has been built with quite a few other open source tools. What does their web site have to say in summary of what this product can do?

Agata Report is a cross-platform database reporting tool with graph generation and a query tool that allows you to get data from PostgreSQL, MySQL, SQLite, Oracle, DB2, MS-SQL, Informix, InterBase, Sybase, or Frontbase and export that data as PostScript, plain text, HTML, XML, PDF, or spreadsheet (CSV) formats through its graphical interface. You can also define levels, subtotals, and a grand total for the report, merge the data into a document, generate address labels, or even generate a complete ER-diagram from your database.

Wow, that is a mouthful. I looked at this product a few years ago, when I was developing a rather complex PHP system, and at the time it seemed to be quite immature. Now I am looking at it again, and it seems to have grown up quite nicely. Figure 1 shows the start-up screen after the product has been installed. Notice that it is a command line (DOS in Windows) screen that starts up the product. This is because the product is built with the PHP-GTK module (one of the tools listed above).

Figure 2
Figure 3
The 50 Cent Tour
After the user responds to the opening screen with a language selection and a possible theme, the interface window appears (Figure 2). This is a relatively clean interface and not too difficult to figure out on your own. I say figure out on your own because the most glaring point against this product is the complete lack of any documentation. In an exchange of a few emails with the product "owner" (Pablo), I was informed that documentation is not where the major effort on the product is going. He acknowledges that there is little to no documentation and is focusing on stabilizing version 7 for the time being. If you go to the Agata web site you will see an invitation for someone to write the documentation, so any readers of this article with some spare time could contribute to a good open source project in this manner.

Now, let's get back to the main user interface screen. The first thing you will have to do is connect to a database. Since this is a reporting engine, it needs a source of data on which to report. The File -> Connect to Database menu option takes you to the screen shown in Figure 3. Here, you make your connection by providing the requested connection information, and you are returned to the main screen. Once this connection has been established, the main screen is populated with table names from the database to which you are connected.

Now that the connection has been established, it's time to select information on which to build a report. I chose my small database that contains a practice schedule for the Junior High School basketball team that I coached. Figures 4 and 5 show some of the report definition already built. I added page headers and footers (these don't show up in the report preview screen), and group and page totals. As you can see, what this interface is really doing is building a SQL SELECT statement for use in the report. Figure 5 shows that the grouping bands and sub-totaling are done separately. This is different from the SQL GROUP BY clause, so be careful of that distinction. I liked this interface, although it did take me a little time to play with and figure out because the methods used to navigate through the features were not consistent; sometimes you double-click, sometimes you use a drop-down box, and sometimes you select a field and click a button. Figure 6 shows the preview of my defined report.

Figure 4
Figure 5
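The distinction between report grouping bands and the SQL GROUP BY clause can be sketched in a few lines of PHP. This is only an illustration of the idea, not how Agata implements it: the rows, column names and both functions are hypothetical, standing in for the result of the generated SELECT statement.

```php
<?php
// Hypothetical practice-schedule rows, already sorted by week.
$rows = array(
    array('week' => 1, 'drill' => 'Layups',      'minutes' => 20),
    array('week' => 1, 'drill' => 'Free throws', 'minutes' => 15),
    array('week' => 2, 'drill' => 'Scrimmage',   'minutes' => 40),
);

// What GROUP BY does: collapse each group to one summary value.
// The individual drills are no longer available afterwards.
function groupBy(array $rows) {
    $totals = array();
    foreach ($rows as $row) {
        if (!isset($totals[$row['week']])) {
            $totals[$row['week']] = 0;
        }
        $totals[$row['week']] += $row['minutes'];
    }
    return $totals;
}

// What a grouping band does: keep every detail row and add a
// subtotal line each time the group value changes.
function bandedReport(array $rows) {
    $lines = array();
    $current = null;
    $subtotal = 0;
    foreach ($rows as $row) {
        if ($current !== null && $row['week'] !== $current) {
            $lines[] = "  Subtotal week $current: $subtotal";
            $subtotal = 0;
        }
        $current = $row['week'];
        $subtotal += $row['minutes'];
        $lines[] = "Week {$row['week']}: {$row['drill']} ({$row['minutes']} min)";
    }
    if ($current !== null) {
        $lines[] = "  Subtotal week $current: $subtotal";
    }
    return $lines;
}

print_r(groupBy($rows));
echo implode("\n", bandedReport($rows)), "\n";
```

The banded version only works if the query returns rows ordered by the grouping column, which is why a report tool keeps the SELECT statement and the bands as separate concerns.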
Summary
This product has other features that I am not covering in this review. There are many ways to preview and save your designed reports, namely XML, OpenOffice, PDF, CSV, TEXT, and HTML. There is even a way to tie these report definitions into a PHP application. So, be sure to test these features out when you are looking into the product for yourself. Other features I noticed were graphical reports (done with the help of the JPGraph PHP library), and both label and bar code generation.

This product is getting ever closer to becoming a prime time challenger to Crystal Reports, but it is not quite there yet. Now, don't get me wrong; this application is very useful and versatile and is a great testament to the integration of PHP extension tools being used in concert. It certainly has come a long way in maturity since I first laid eyes on it a few years ago. Kudos go out to Pablo, who is the main developer of this product, for conceiving it and maintaining it in all its complexities. The feature set appears to be lending itself to a robust set of tools; perhaps the developers have bitten off a little more than they can chew?

My main problem with this product, as I mentioned earlier, is that it has a great need for some quality documentation. At the very least, it should have a short tutorial on how to connect to a database and how to define some of the more basic report types, so that the prospective user does not have to give up three days to trial and error.
Figure 6
I give this product 3 out of 5 stars.
About the Author
Peter MacIntyre lives and works in Prince Edward Island, Canada. He has been an editor with php|architect since September 2003. Peter’s web site is at http://paladin-bs.com
EXIT(0);
Tales from the Script
by Marco Tabini
After much consideration (most of it in my own bed, where I finally slept again last night for the first time in two weeks), I have come to the conclusion that scientists at Miami International Airport must have run out of mice. The rodent shortage, in turn, must have forced them to shift their experiments to the third most intelligent life form on the planet Earth: man (the other one being, as all fans of Douglas Adams know, dolphins). This has to be the only reasonable explanation for the absolute maze that I had to walk through to get from gate A1 to A26 and catch my connecting flight on the way back to Toronto from Cancun.

Not that there is anything reasonable about air travel anyway. These days, airport procedures seem to be designed by some bureaucrat who, having drunk way too much one night, just happened to fall on his keyboard in a particularly clever way and type something that, while linguistically accurate, makes no sense whatsoever to anyone but another bureaucrat.

Don’t let the apparent closeness (of the asort() kind) of A1 to A26 fool you, either: the two gates are a good thirty minutes’ walk away from each other and, most likely, reside in two different area codes. In fact, they probably reside in two different countries, which explains why, fresh from an international flight, a customs officer’s rummage through ten days of your dirty clothes, and a pleasant half-hour stroll through the hermetically sealed “connecting flight corridor,” you have to go through security all over again. Take off your shoes, pull out the laptop from the bag, show that your camera is not really a thinly disguised atomic device, and you’re well on your way to another three hours of over-regulated cattle-class flying.

And in case you’re traveling with your brother and you’re both carrying an LCD projector of the same brand and model, you must really be twins, no matter if your passports say you were born three years apart (yes, this really happened; I kid you not).

Honestly, if you happen to read this and live in Florida, please write your congressman: the federal government should, at the very least, share the results of the experiment with the rest of us.
Meanwhile, back in Mexico, science and technology came together in a large jug of industrial cleaner that I found lying around while I was doing my rounds at php|tropics. According to the label (in Spanish, of course), that particular chemical can be used either to clean stainless steel surfaces or to disinfect fruits and vegetables (but not to clean porous materials like porcelain, which could get badly corroded).

That bit of scientific breakthrough, however, is nothing compared to the fact that the Airport Authority of Cancun has answered the question “how can we afford a $500,000 automated parking system?” by hiring some guy who simply stands there and pulls up a chain across the road until you’ve shown him your validated parking ticket. Unless you want your car ripped in half like a slightly oversized can of sardines, this method is just as effective as, and likely much cheaper than, your typical American (or Canadian) solution.

I love traveling. It’s what makes life worthwhile.
php|a