Secure your applications against Email Injection
Tips on Output Buffering
KOMODO - reviewed
and much more...
CONTENTS

Columns

EDITORIAL

php|news

TEST PATTERN
Why is it Taking so Long?
Lead times and the rationale behind them
by MARCUS BAKER

SECURITY CORNER
Email Injection
by CHRIS SHIFLETT

TIPS & TRICKS
Output Buffering
by BEN RAMSEY

PRODUCT REVIEW
Komodo
The Web Development IDE for All Platforms?
by PETER MacINTYRE

exit(0);
2006: A Look Forward
by MARCO TABINI

Features

2005 Look Back
Reflecting on last year’s events in the PHP world
by DERICK RETHANS

PDFlib’s Block Tool
Templating PDFs for Maximum Reusability
by RON GOFF

FPDI in Detail
Importing existing documents with Free PDF Import
by JAN SLABON

i18n
Internationalize Your Web applications with less PHP code
by CARL McDADE
Download this month’s code at: http://www.phparch.com/code/
WRITE FOR US!
If you want to bring a php-related topic to the attention of the professional php community, whether it is personal research, company software, or anything else, why not write an article for php|architect? If you would like to contribute, contact us and one of our editors will be happy to help you hone your idea and turn it into a beautiful article for our magazine. Visit www.phparch.com/writeforus.php or contact our editorial team at [email protected] and get started!
EDITORIAL
PLATFORM DIVERSITY
In the past five (or so) years especially, the desktop landscape has changed dramatically. Desktops have traditionally been dominated by Windows, but alternatives are making their way into both the office and the home. Apple’s hit operating systems in the OS X series, and other chic products (like the iPod), have not only fueled the sales of Macintosh computers, but have opened consumers’ minds to the reality that there are alternatives to Windows. Microsoft still holds the market firmly, but more and more users are making the “switch” to Mac (and, to a much lesser extent, to alternatives like Linux).

This diversity, while good, can cause portability problems, and as I’ve touched on in past issues, developers can no longer target a single browser, but must become more and more aware of standards and cross-browser/cross-platform compatibility issues. For the most part, developers seem to have the browser issue under control. I personally never use Internet Explorer for anything but testing (I’m a Firefox fanboy), and it’s very rare that I still run into sites that simply won’t work with Firefox. Even in cases where it seems I’m out of luck, I can often spoof the User-Agent header and get a working site. Since Firefox is available on many platforms, it seems that the HTML issue is (mostly) behind us. I say “mostly” because standards-compliance and portability are things that we always need to strive for.

If you’ve tried to distribute a printable, offline-viewable, and well laid out document in the past, you know that HTML doesn’t cut it. There’s little provision for the features that are necessary to build a professional document (there is hope with CSS, though). This often leaves websites delivering “richer” documents, such as MS Word documents or RTF files. The distribution of proprietary format documents leads to its own set of problems, primarily document creation and portability. Have you tried to build a Word document from your non-Windows Web server? It’s not fun. Equally tedious is trying to get that document to render properly in different versions of Word, on different platforms; worse is the rendering in non-Microsoft applications, such as OpenOffice.

Enter PDF. Now, PDF is certainly not new technology. It does, however, seem to be becoming more and more the de facto standard for document distribution. PDF is no stranger to php|architect readers: if you’re not reading this on paper, you’re reading a PDF, and we’ve brought you much PDF-centric content in the past, but we’ve certainly not drained the PDF knowledge pool. This month, we’re happy to focus on PDF once again, but this time with a twist: using PHP to modify existing PDFs, through various means.

It’s also our pleasure to be running Derick Rethans’ PHP Look Back, 2005. Marco will touch more on this in exit(0). On that note, we at php|architect wish you and your business a happy and successful 2006. Here’s to another great year of PHP!
Volume 5 - Issue 1

Publisher: Marco Tabini
Editor-in-Chief: Sean Coates
Editorial Team: Arbi Arzoumani, Peter MacIntyre, Eddie Peloke
Graphics & Layout: Aleksandar Ilievski
Managing Editor: Emanuela Corso
News Editor: Leslie Hill
[email protected]
Authors: Marcus Baker, Ron Goff, Peter B. MacIntyre, Carl McDade, Ben Ramsey, Derick Rethans, Chris Shiflett, Jan Slabon

php|architect (ISSN 1709-7169) is published twelve times a year by Marco Tabini & Associates, Inc., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada.

Although all possible care has been placed in assuring the accuracy of the contents of this magazine, including all associated source code, listings and figures, the publisher assumes no responsibilities with regards of use of the information contained herein or in all associated material.

php|architect, php|a, the php|architect logo, Marco Tabini & Associates, Inc. and the Mta Logo are trademarks of Marco Tabini & Associates, Inc.
Ilia Alshanetsky announces the release of PHP 5.1.2 RC1.
“I’ve just packaged PHP 5.1.2RC1, the first release candidate for the next 5.1 version. A small holiday present for all PHP users, from the PHP developers. This is primarily a bug fixing release with its major points being:
• Many fixes to the strtotime() function, over 10 bugs have been resolved.
• A fair number of fixes to PDO and its drivers
• New OCI8 that fixes large number of bugs backported from head.
• A final fix for Apache 2 crash when SSI includes are being used.
• A number of crash fixes in extensions and core components.
• XMLwriter & Hash extensions were added and enabled by default.”
Get all the info at http://ilia.ws/archives/97-PHP-5.1.2RC1-Released!.html
FUDforum 2.7.4RC1 Released
The FUDforum team has announced the latest release of their open source forum package, version 2.7.4 RC1. Some of the new features include:
• Added subscribed forum filter to message navigator
• Added handling for in-lined attachments in mailing list import
• Added the ability to supply custom signature to message synchronized from the forum back to mailing list or a news group
• Added support for allowing the user to select how many threads they want to see per page
• Much more…
Visit FUDforum.org for all the latest info.
8 • php|architect • Volume 5 Issue 1
ez.no is proud to announce the release of eZ components. ez.no announces:
“eZ components is an enterprise ready, general purpose PHP platform. As a collection of high quality independent building blocks for PHP application development, eZ components will both speed up development and reduce risks. An application can use one or more components effortlessly, as they all adhere to the same naming conventions and follow the same structure. All components are based on PHP 5.1, except for the ones that require the new Unicode support that will be available from PHP 6 on.”
Need to speed up your development? Check out ez.no for more info.
xajax 0.2
xajaxproject.org announces the release of version 0.2. What is it? The site describes it as: “an open source PHP class library that allows you to easily create powerful, web-based, Ajax applications using HTML, CSS, JavaScript, and PHP. Applications developed with xajax can asynchronously call server-side PHP functions and update content without reloading the page.”
To start working with xajax, visit xajaxproject.org.
SQLiteManager 1.2.0RC2
If SQLite is the db of choice for your PHP application, you may be interested in the latest release of SQLiteManager. SQLiteManager.org lists the features as:
• Management of several databases (creation, access or upload)
• Management of the attached databases
• Create, edit and delete tables and indexes
• Insert, edit, delete records in these tables
• Management of views; create views from SELECTs
• Management of triggers
• Management of user defined functions
• Manual request and from file, it is possible to define the format of the requests, sqlite or MySQL; a conversion is done in order to directly import a MySQL database in SQLite
• Importing of records from a formatted text file
• Export of structure and the data
• Choice of several display skins
Check out SQLiteManager.org to start managing your SQLite DB, today.
php|architect Releases New PDFlib Book
We are proud to announce the release of our latest book in the “Nanobooks” series, called Beginning PDF Programming with PHP and PDFlib. Authored by Ron Goff, this book provides a thorough introduction to the great capabilities provided by the PDFlib library for the creation and manipulation of PDF files. The book features a foreword by Thomas Merz, the original author of PDFlib and founder of PDFlib GmbH, and tackles topics like PDF file creation, fonts, text, shapes and much more, including PDFlib’s Block Tool, which allows for the manipulation of existing PDF documents. For more information, visit http://www.phparch.com/pppp
Check out the hottest new releases from PEAR.

Image_Color2 0.1.4
PHP 5 color conversion and basic mixing. Currently supported color models:
• CMYK - Used in printing
• Grayscale - Perceptively weighted grayscale
• Hex - Hex RGB colors i.e. #abcdef
• HSL - Used in CSS3 to define colors
• HSV - Used by Photoshop and other graphics packages
• Named - RGB value for named colors like black, khaki, etc.
• WebsafeHex - Just like Hex but rounds to websafe colors

Config 1.10.5
The Config package provides methods for configuration manipulation.
• Creates configurations from scratch
• Parses and outputs different formats (XML, PHP, INI, Apache...)
• Edits existing configurations
• Converts configurations to other formats
• Allows manipulation of sections, comments, directives...
• Parses configurations into a tree structure
• Provides XPath like access to directives

MDB2_Drivers
MDB2 drivers were released for:
• SQLite
• postgreSQL
• mysqli
• mysql
• Oracle

MDB2 2.0.0RC3
PEAR MDB2 is a merge of the PEAR DB and Metabase PHP database abstraction layers. Note that the API will be adapted to better fit with the new PHP 5-only PDO before the first stable release. It provides a common API for all supported RDBMS. The main difference to most other DB abstraction packages is that MDB2 goes much further to ensure portability. Among other things MDB2 features:
• An OO-style query API
• A DSN (data source name) or array format for specifying database servers
• Datatype abstraction and on demand datatype conversion
• Various optional fetch modes to fix portability issues
• Portable error codes
• Sequential and non sequential row fetching as well as bulk fetching
• Ability to make buffered and unbuffered queries
• Ordered array and associative array for the fetched rows
• Prepare/execute (bind) emulation
• Sequence emulation
• Replace emulation
• Limited sub select emulation
• Row limit support
• Transactions support
• Large Object support
• Index/Unique Key/Primary Key support
• Autoincrement emulation
• Module framework to load advanced functionality on demand
• Ability to read the information schema
• RDBMS management methods (creating, dropping, altering)
• Reverse engineering schemas from an existing DB
• SQL function call abstraction
• Full integration into the PEAR Framework
• PHPDoc API documentation

MDB2_Schema 0.4.1
PEAR::MDB2_Schema enables users to maintain RDBMS independent schema files in XML that can be used to create, alter and drop database entities and insert data into a database. Reverse engineering database schemas from existing databases is also supported. The format is compatible with both PEAR::MDB and Metabase.

Validate_ptBR 0.5.2
Package contains locale validation for ptBR such as:
• Postal Code
• CNPJ
• CPF
• Region (Brazilian states)
• Phone Number
• Vehicle plates

Fileinfo 1.0.3
This extension allows retrieval of information regarding the vast majority of files. This information may include dimensions, quality, length, etc. Additionally, it can also be used to retrieve the mime type for a particular file and, for text files, the proper language encoding.

GDChart 0.2.0
The GDChart extension provides an interface to the bundled gdchart library. This library uses the (bundled) GD library to generate 20 different types of graphs, based on supplied parameters. The extension provides an OO interface to gdchart, exposing the majority of options via properties and complex (array) options via a series of methods. To use the current version of the extension, PHP 5.0.0 is required; an older, PHP 4-only version can be downloaded from CVS by checking out the extension with the PECL_4_3 tag.

yaz 1.0.6
This extension implements a Z39.50 client for PHP using the YAZ toolkit.

pecl_http 0.21.0
It eases handling of HTTP URLs, dates, redirects, headers and messages, provides means for negotiation of clients’ preferred language and charset, as well as a convenient way to send any arbitrary data with caching and resuming capabilities. It provides powerful request functionality, if built with CURL support. Parallel requests are available for PHP-5 and greater.
PHP-5 classes: HttpUtil, HttpMessage, HttpRequest, HttpRequestPool, HttpDeflateStream, HttpInflateStream
PHP-5.1 classes: HttpResponse

Xdebug 2.0.0beta5
The Xdebug extension helps you debug your scripts by providing a lot of valuable debug information. The debug information that Xdebug can provide includes the following:
• stack and function traces in error messages with:
  • full parameter display for user defined functions
  • function name, file name and line indications
  • support for member functions
• memory allocation
• protection for infinite recursions
Xdebug also provides:
• profiling information for PHP scripts
• script execution analysis
• capabilities to debug your scripts interactively with a debug client
FEATURE

2005 PHP LOOK BACK

A new year is upon us, and as is customary in the PHP world, it is time to reflect on the events of the past year. Derick Rethans, a PHP internals developer, has been publishing a PHP Look Back for a few years, now, and this year, we saw it fitting to publish it, here. Happy 2006!

by DERICK RETHANS

Welcome to the fourth installment of the PHP Look Back. Just as in previous years, we’ll look back on PHP development discussions, bloopers and accomplishments of the last year. This is not supposed to be a fully objective review of last year: note that the opinions in this article are those of the author, and not of the PHP development team (nor of php|architect).
January

January was a quiet month, with not much going on. After about 8 months [001], we finally added [002] a PIC/non-PIC detection mechanism to the configure script, which selects non-PIC object generation for supported platforms (Linux and FreeBSD). Non-PIC code is about 30% faster, as measured in earlier benchmarks.
TO DISCUSS THIS ARTICLE VISIT: http://forum.phparch.com/281
A week later, Leonardo [003] was wondering whether we planned on adding type hints for scalar types to PHP. As PHP is a weakly-typed language, this is not something we wanted to add, although we did add support for an “array” type hint, later in the year.

With PHP 5.1’s new GOTO execution method (added last August), variable name lookups are cached internally. This caused some problems for Xdebug [004], as it needs some information to find out which variables are used in a specific scope. Andi committed [005] a patch that made Xdebug work properly, again.

Michael started working on his HTTP extension (which generates way too many commit mails ;-) and encountered a problem with a naming clash [006] between PEAR’s HTTP class and his PECL extension. Greg responded [007], and said that this problem will be solved when PEAR 1.4 comes out, with its channel support.
February

Andi started discussions in February by pointing out a date for the first beta of PHP 5.1: March 1st. He declared that “both PDO and Date should be included in the default distribution” [008] and others suggested that XML Reader [009] should be included by default, as well. In reply to Andi, Rasmus mentioned [010] that he would like to see the filter extension included, as well. The discussion about this extension quickly transitioned to data mangling of input request variables, and how they could not be influenced by the script authors, but only by the system administrator. In the end, this discussion made way for the topic of operator overloading [011], where certain people kept reiterating that operator overloading is a “good thing” [012]. Andrei tried to stop this discussion by being funny [013], but it didn’t work very well [014].

Around the same time, Wez announced [015] the first beta of PDO (PHP Data Objects). Wez wanted people to test [016] PDO, and of course, over the next couple of months, there were various PDO-related concerns [017] and issues raised.

Another discussion in February was about auto boxing [018] in PHP. Auto boxing is the encapsulation of all primitive types as objects. Naturally, people asked why [019] we would want to have this, and no sound reason was given. In the end, this discussion suggested that phpDocumentor [020] should handle type determining, instead. Having a doc block [021] parsing extension to the reflection API would be nice, although a bit hard.

We also had an often-recurring discussion [022] on why the GPL [023] is a bad idea for PECL [024] extensions. John added the first version [025] of XMLRPCi to CVS; why he chose this silly name is still unknown. Jani wrote about a problem with overwriting globals [026], an issue that, later in the year, warranted a new PHP release, and Greg introduced [027] PEAR 1.4, with channel support.

Halfway through the month, Marcus [028] mentioned a few things that should go into PHP 5.1; most notably the __toString() fix, which unfortunately did not actually make it into the release. Type hinting with “= NULL” did make it in [029], though. Martin Sarsale reported [030] an issue with references and segfaults, something which had been annoying us at eZ systems [031] for quite some time, too. This issue got fixed in PHP 4.4, albeit not without a little bickering (more about that later).
March

In March, Ilia proposed [032] a patch that adds a special token that tells PHP’s parser to abort parsing when the token is encountered. This allows us to attach binary data to the end of a PHP script, which is highly useful for one-script installers, such as the one that FUDForum [033] uses. On the 14th of the month, Zeev released the first RCs [034] of both 5.0.4 and 4.3.11. We also encountered further reference issues [035].

The same guy that mailed tons of “fixes” to the internals list, last June [036], was back with more [037] patches. Andrei, once again, pointed out [038] that it is a good idea to check with an extension’s maintainer before applying patches, and Greg published [039] the package2.xml documentation. Lukas, once more, pointed out [040] the weird naming scheme that new extensions seem to be getting, and luckily Debian’s PHP packages got rid [041] of some of the insanity that was present in previous [042] releases by not always building in ZTS mode. Unfortunately, their packages still force PIC mode for the libraries.

A user brought up the idea of an upload meter patch [043], again, and although we all seemed to remember [044] that the original patch was rejected [044], no one could find the original thread [046] where this was discussed. Last year’s Look Back discussed this too, and there, the reason was mentioned [047]. In the last week of the month, we had some fuss [048] about “FreeBSD doing stupid things [049]” regarding their naming of auto tools executables [050].
April

April started with a suggestion [051] by Zeev to change the way that __autoload() works, by allowing multiple instances of this magic function. In the end, we didn’t end up implementing this, and as Lukas described [052], “Frameworks should provide __autoload() helper methods, but should never implement the function itself. It’s up to the end user to do this.” (This is exactly how we implemented it for the eZ components [053]).

Andi wanted to release PHP 5.1 Beta 1 [054] really soon, but, as Jani mentioned [055], there were quite a few things that were still not fully ready, and thus the suggestion to call it “Alpha” [056] was made, instead. During this thread, some pet-features [058] were brought up [059]. Kamesh, from the Netware porting team, found another reference issue [060].

Marcus added the File [061] class to his SPL extension, causing a small stir: the new class clashed with any application that already defines its own File class. Although this is a valid point, projects defining a “File” class should know better, and would be wise to prefix their class names. This same issue will pop up later in the year.

A last, somewhat larger, discussion erupted when a question [062] about whether APC could be used as a content cache was posted to the list. Rasmus found it an interesting idea [063], although this functionality can also be accomplished in user space. In the last point of the thread, Rasmus mentioned [064] that APC will soon support PHP 5.
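Lukas’ advice can be sketched in a few lines of PHP. This is only an illustration, not how any particular framework did it: the class names ExampleFramework_Autoload and Example_Greeter are made up, and the sketch registers the helper through SPL’s spl_autoload_register() rather than defining __autoload() directly, which assumes a PHP version that ships that function.

```php
<?php
// Hypothetical framework code: it offers a helper method, but it does
// not install an autoloader on its own.
class ExampleFramework_Autoload
{
    // Illustrative map of class name => file path.
    private static $classMap = array();

    public static function addClass($className, $file)
    {
        self::$classMap[$className] = $file;
    }

    // Helper that the end user may choose to hook into autoloading.
    public static function load($className)
    {
        if (!isset(self::$classMap[$className])) {
            return false; // not one of ours; other autoloaders may try
        }
        require_once self::$classMap[$className];
        return class_exists($className, false);
    }
}

// End-user code: the application decides whether and how to hook in.
$file = tempnam(sys_get_temp_dir(), 'cls');
file_put_contents(
    $file,
    '<?php class Example_Greeter { public function hi() { return "hi"; } }'
);
ExampleFramework_Autoload::addClass('Example_Greeter', $file);

spl_autoload_register(array('ExampleFramework_Autoload', 'load'));

$greeter = new Example_Greeter(); // first use triggers the helper
echo $greeter->hi(), "\n";
unlink($file);
```

The point of the split is that two libraries written this way can coexist in one application, because neither claims the single autoload hook for itself.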
May

May had a slow start, and things only got interesting at the end of the month. The first discussion that came up was Ilia’s removal of dangling commas from enums, something that “was in c language from the first day [065].” Apparently, GCC 4 is “becoming worse and worse [066],” but luckily, we can still just ignore the warnings [067].

After a small private discussion with Dmitry about Marcus’ and my reference fix patch [068], he came to the conclusion that this patch breaks binary compatibility and that this problem warrants a PHP 4.4 release. As this reference problem has been affecting many users, and definitely eZ over the past months, I wrote an email [069] to the list stating that it is “totally irresponsible” not to release a fix for such a grave bug. Zeev [070] also said that “we should probably not fix this at all in the 4.x tree” because of the hassles that accompany “breaking module binary compatibility.” He also seemed to think that the bug can easily be worked around.

Other users were a bit happier [071] that we finally nailed this bug, and Jani replied to Zeev that the magnitude [072] of this bug is pretty high. Rasmus added that he “will be deploying the patch and happily breaking binary compatibility [073]” as soon as the patch is ready. Breaking binary compatibility is only a “burden on the maintainers of these packages” (of the various distributions). Wez thought that “the only logical move forward is a 4.4 branch and release [074].” In the end, the Zeev almighty was “tired of going through the reasons again and again [075]” and noted that “everyone appears to prefer the upsides to the downsides.” This resulted in the creation of the PHP_4_4 branch [076] in the first week of June.
June

Wez added a new patch to our CVS server that allows us to block access [077] to specific branches; with this, we closed the PHP_4_3 branch for good. A week later, I announced 4.4.0RC1 [078], which features the reference bug fix.

Andi wrote another PHP 5.1 mail [079], which spawned a nice long discussion on adding goto [080] to PHP, and comparing goto to exceptions. Magnus smartly added [081] that “people are talking about hypothetical messy code because of goto” and that they forget that you don’t have to use a language construct simply because it is available. The same thread also went into a branch that discussed [082] the ifsetor() language construct. After Andi returned, he decided not to do anything with goto or ifsetor() [083], and that it was now the time to branch, so that we can merge the Unicode support that was developed in parallel by mostly Andrei and Dmitry, although Rasmus was “pretty sure the current discussions will pale in comparison to the chaos that will be created when the Unicode stuff goes into HEAD! [084]”

Johannes wondered when the new date stuff [085] was going in; it was added a week later, just before PHP 5.1 beta 2. Lukas suggested that we add [086] the public keyword to PHP 4.4 for forward compatibility. Rasmus again wondered about “the reasoning ... for not having var be a synonym for public in PHP 5 [087].” Andi mentioned [088] that this “was meant to help people find vars so that they can be explicit about the access modifiers” when moving to PHP 5.

A few days later, Andi read a blog posting [089] which described how PHP 4.4 is breaking backwards compatibility by issuing an E_STRICT in cases where developers abuse return-by-reference. This, however, was not actually the case [090].

Yasuo started a long thread [091] on the allow_url_fopen setting and claimed it was dangerous [092]. The main result of this thread seemed to be that we wanted to split the setting into two different privileges: one that allows remote opening of URLs and one to allow include() on remote URLs. However, this is something we could not yet change. The last thread of the month was by Andi, writing about the PHP 5.1 release process [093].
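For readers who did not follow the ifsetor() branch of that thread, the pattern it was meant to shorten looks roughly like the sketch below. The exact semantics proposed on the list are an assumption here (“use the variable if it is set, otherwise a default”), and the $request array is made-up sample data.

```php
<?php
// Hypothetical request data, for illustration only.
$request = array('page' => '3');

// The long-hand pattern an ifsetor()-style construct would abbreviate:
// use the value when it is set, otherwise fall back to a default,
// without triggering an "undefined index" notice.
$page  = isset($request['page'])  ? $request['page']  : '1';
$limit = isset($request['limit']) ? $request['limit'] : '20';

echo $page, ' ', $limit, "\n";
```

Since plain PHP can already express this, the construct would mainly have saved typing, not added a new capability.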
July

In July, Jessie suggested [094] a String extension that declares only one class: String. This class is meant to prevent copying of the string’s data for most operations (which is currently done with PHP’s string functions). Most of the other developers were against it, for different reasons: “String is such a generic name for a non-core class [095]” and “the savings gained by this will be more than offset by OO overhead [096],” so we will not let “this get anywhere near the core [097].”

In the same week, I made more changes to the date extension [098] that allow users to more easily select the timezone that they want, instead of having to rely on the TZ environment variable. This is also needed because the TZ environment variable [099] can most likely not be used in a thread safe way, and it is certainly not portable [100]. Also in the same week, I proposed an API for new Date and Timezone functionality [101]. After some pressure [102], I added [103] an OO API, too. Near the end of the month, I committed the implementation of the new date functionality [104]. It was, however, #ifdef-ed out to facilitate discussions at a later date.

Jessie came up with Yet Another Namespace Proposal [105], and tried to come up with a solution for all the previous problems we had with the implementation. He also made several patches [106] that added namespaces to PHP. We had some more fuss [107] about PHP 4.4 breaking BC, where some people didn’t see [108] why we had to implement this fix. Unfortunately, there were some quirks [109] that we still had to sort out. In this same month, Rasmus released APC 3.0.0 [110], which came with PHP 5.1 support and numerous fixes.

August

August started with a discussion on instanceof [111] being “broken,” as it raises a fatal error in the case where the class that is being checked for doesn’t exist. Andi declared “if you’re referencing classes/exceptions in your code that don’t exist, then something is very bogus with your code [112]” and “the only problem is if the class does not exist in your code base, in which case, your application should blow up! [113]”

I raised a question about whether the new PHP with Unicode should be called PHP 5.5 or PHP 6.0 [114]. Andi (and the majority) wanted to go “with PHP 6 and aim to release it before Perl 6 [115].” After PHP_5_1 was branched, Andrei merged the Unicode branch and gave us some instructions on how to get started with it [116]. He also introduced the general ideas behind the implementation [117]. PHP 5.1 RC1 was finally rolled, about half way through the month, followed by PHP 5.0.5 RC2 [118], a week later.

During the development of the eZ components [119], we discovered various things in PHP’s OO model that we wanted to see changed. One of those issues was described in the Property Overloading RFC [120]. Unfortunately, not everybody could be convinced [121], and no changes were made. I will try again though :). The other issue that we raised was that failed typehints throw a fatal error [122], while that is not strictly necessary. Instead of throwing exceptions [123] in this case, the discussion turned towards adding a new error mode [124] (E_RECOVERABLE [125]) that will be used for non-engine-corrupting fatal errors at the language level, which is exactly the case with failed typehints.
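To make the “failed typehint” case concrete, here is a small sketch. It does not show the 2005 behaviour: it assumes PHP 7 or later, where a failed hint is reported as a catchable TypeError, which is one way the “recoverable” idea eventually surfaced. The function name count_items() is made up for the example.

```php
<?php
// Hypothetical function using the "array" type hint.
function count_items(array $items)
{
    return count($items);
}

echo count_items(array('a', 'b')), "\n"; // a valid call

try {
    count_items('not an array'); // violates the type hint
    $outcome = 'no error';
} catch (TypeError $e) {
    // Assumption: PHP 7+, where the engine stays in a sane state and
    // the script can carry on after the failed hint.
    $outcome = 'recovered';
}

echo $outcome, "\n";
```

The script keeps running after the bad call, which is the property the thread was after: a mistake at the language level that does not have to take the whole request down.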
The longest thread of the month was started by Rasmus when he posted his PHP 6 [126] wish list, which featured controversial changes such as “removing magic_quotes” and “making identifiers case-sensitive,” to which most developers quickly agreed [127]. Following his initial wish list, the crowd went wild and started suggesting all kinds of weird changes, such as “Radically change all of the operator syntaxes [128],” adding

The filter extension, which I’ve been developing for quite some time, did not make it into PHP 5.1...

September

In September, Antony committed [132] an upgraded OCI8 extension which fixes a lot of bugs [133]. We also decided to play a bit nicer with version_compare(), regarding naming [134] release candidates. Zeev wanted to roll [135] PHP 5.0.5 but there was an issue [136] with the shutdown order.

The reference issues returned, too. The first one [137] turned out to be an incorrect merge to the PHP 5.0 branch, where suddenly some of the notices turned into errors [138]. The second one [139] is simply a small change in behaviour, which previously created memory corruption. Rasmus explained the issue a bit more [140], once again. Ilia tried to implement a clever fix [141] which turned out to be a problem later on.

Pierre started a discussion on supporting Unicode in identifiers, something he didn’t want to see. PHP already supports using UTF-8 encoded characters [142] in identifiers, so removing this feature will break BC unnecessarily. Besides breaking BC, many people simply want to use their own language for writing code, as Tex [143] writes.

Zeev made another attempt at PHP 5.1.0 RC2 [144] with the latest PEAR being the only thing missing. Marcus brought up the issue of __toString() again, and finally managed to get it into CVS, but unfortunately not in time for PHP 5.1.

Stanislav [146] noticed some problems with detecting time zones, as the new date/time code did not try to attempt detection in favour of the new date.timezone setting [147]. After some discussion, we came up with a solution [147], which was then implemented. It should guess the timezone correctly in most cases, even on Windows. I also added support for an external timezone database [149].
October
In October, I noticed some weird notices [150] with “make install-pear,” without a clue as to why they were showing up. This discussion turned into a “why does PEAR not support PHP 5.1” thread [151]. In the end, Greg managed to nail down the weird notices, though. I also noticed a commit by Dmitry [152] that ignores “&” when $this is passed. I pointed out that this should not be supported (in PHP 5), as it doesn’t really make sense that people won’t see a warning/notice/error when they’re doing something silly. Dmitry explained [153] that disallowing it would break code, but he also writes that by “using ‘=& $this’, a user can break the $this value”, which is something we definitely should prevent. He suggested [154] we make this an E_STRICT warning, and Andi suggested [155] we escalate this to an E_ERROR in PHP 6, but neither of those things happened. A week later, Piotr[156] asked for a tarball of our CVS to make it “possible to convert it to Subversion repository ... so browsing the repositories would be much easier.” We wondered [157] why he needed that, as we offer our own browser[158], already. Matthias [159] said that we “do not want to set off yet another discussion about the changes 4.4 brought,” but that is exactly what he did. Again, there was something wrong with his code, and thus the warning is legal. After resolving the timezone issues, last month, we were surprised by a message from Zeev. He simply missed [161] the conclusion in the “lengthly thread.” As a result of the negative comments on the PHP 4.4.0 release, Lukas, Ilia and I set up a routine [162] for involving some of the better-known projects in the PHP 4 [163] and PHP 5 [164] release processes. As part of this effort, we send out [165] a mail to all participating projects whenever we have a release candidate to test. I raised [166] some concern regarding our current Unicode implementation because of maintenance issues. In part of my mail, I also indicated that I wanted “to clean up PHP 6 for real, [167]” after private discussions with Marcus and Ilia. Behind the scenes, we prepared some material to organize a PHP Developers Meeting to discuss the Unicode implementation and the extended “PHP 6 Wishlist.” I also committed [168] a patch that allows typehints for classes to work with = NULL[169]. Another guy raised the issue of “that new isset()-like language construct, [170]” but this ended up going nowhere, as people were suggesting very Perl-like [171] operators. Jani replied to this thread with “How about a good ol’ beating with a large trout?[172]” On the last day of the month, we released PHP 4.4.1[173] which addresses some of the reference issues we’ve seen in PHP 4.4.0.
November
In November, we prepared to finally release PHP 5.1, and one of the efforts was to make an upgrade guide [174] for people switching to PHP 5.1. Sean noticed [175] a problem with the parameter parsing API’s automatic type conversion. Like Andrei [176], many people think that “passing ‘123abc’ and having it interpreted as 123” is still wrong. Dmitry implemented [177] support for “= null” as default to array type hinting, something that I did not do [178] on purpose because “= array()” is the logically correct way of doing this. Andi agreed [179] with me on this. Ilia implemented, in PHP 5.1RC5 [180], one of the items that was on the outcome list of the PHP Developers Meeting: adding a notice that warns people that using curly braces [181] to address a character in a string is now deprecated in favour of the [] operator, contrary to the current explanation in our manual. {} and [] are exactly the same thing [182] and “having two constructs for the same behaviour is silly and leads to confusing, hard to read code.” The outcome of this discussion was the removal of the notice in PHP 5.1, and the likely conclusion is that the curly-brace syntax is not going to be removed. Another change that was made in PHP 5.1RC6 was the creation of the “Date” class, which caused quite a stir after the release of PHP 5.1[183]. The reason to introduce it in 5.1 was simply to make sure that no applications were going to break if we introduced the Date class later in the 5.1.x series. Unfortunately a lot of projects, including PEAR, never heard of “prefixing” class names, causing class name clashes. Marcus described the problem as “PEAR ignores coding standards, [184]” but others suggested that we rename the internal class [185] to something silly like php_date. Andrei [186] asked “what does renaming really buy us? The only purpose of introducing this class in RC6, as far as I can tell, was to reserve the ‘Date’ name for future use.” Now that we know about this issue, it’s time for PEAR to start prefixing its classes, so that we finally can do the right thing and add our Date (and Timezone) classes. This code has been around for months now, and I’m quite tired of waiting for it to be in a release where I can use it. We ended up reverting the change that claimed the Date and Timezone classes, and released 5.1.1 with this change. After the PDM I posted [187] the meeting notes [188] to the list. Most of the outcome was well appreciated, except the curly braces idea which has already been discussed. With these notes, we hope to make PHP 6 a success. The notes also spawned numerous [189] polls [190] on the symbol to use for separating namespaces from class names/function names. We also discussed our version of a goto: labeled [191] breaks [192]. The filter extension [193], which I’ve been developing for quite some time, did not make it into PHP 5.1, although it is a good idea [194] to add it, now, with an “experimental” status, so that this wanted extension gets more testing. Perhaps for PHP 5.1.2…
December December was a quiet month with little action. Ilia proposed [195] a plan for PHP 5.1.2 and released PHP 5.1.2RC1[196], Zeev committed [197] Dmitry’s re-implementation of the FastCGI API and some user[198] was whining about our “official” IRC channel (which doesn’t exist). That was it for 2005 (as far as PHP internal development is concerned)! I hope you enjoyed reading this, and have a happy new year. Extra thanks go to Ilia, for being the release master, Dmitry for maintaining the engine, Jani for hunting down bug reports, Andrei for his work on Unicode, Mike for his enormous stream of useless commit messages ;-), and to all others who made PHP happen this year.
DERICK RETHANS provides solutions for Internet-related problems. He has contributed in a number of ways to the PHP project, including the mcrypt, date and input-filter extensions, bug fixes, additions and leading the QA team. He now works as project leader for the eZ components project for eZ systems A.S. In his spare time he likes to work on xdebug, watch movies, travel and practice photography. You can reach him at [email protected].
2005 Look Back 046 http://beeblex.com/php.internals/15567 047 http://beeblex.com/php.internals/13792
If you’ve been developing for any length of time, you’ve probably been tasked with generating PDFs at some point. In this article, we’ll discuss the process of combining data from many sources into a single PDF—from installation of the block tool, to creating the blocks in Adobe Acrobat, and then finally working with the blocks via PDFlib.
by Ron Goff
The PDFlib Block Tool, available for use only with PDFlib Personalization Server (PPS), helps create PDF documents derived from large amounts of variable data. Before the block tool was added, it was a difficult process to place variable data, images, and even other PDFs into precise areas of a PDF that had been designed previously. Now, adding variable data is very simple and helps create great dynamic pieces for just about any application.
Installing the Block Tool
Currently, the block tool plug-in for Adobe Acrobat is only available on the Windows and Macintosh (both Mac OS 9 and Mac OS X) platforms. On either platform, you must also have Version 6 or 7 of Adobe Acrobat Professional or Adobe Acrobat Standard, or the full version of Adobe Acrobat 5. Other versions of Adobe Acrobat (Acrobat Reader and Acrobat Elements) and all other PDF creation tools do not work with the block tool plug-in. (Check the PDFlib web site for an up-to-date list of supported PDF authoring tools.)

CODE DIRECTORY: pdflib
TO DISCUSS THIS ARTICLE VISIT: http://forum.phparch.com/280
Windows OS Installation
If you’re using Windows, you can use the block tool installer provided by PDFlib to get the plug-in installed correctly into your version of Adobe Acrobat 5, 6, or 7. The installer places the correct files into the Acrobat plug-ins folder, which is typically found at C:\Program Files\Adobe\Acrobat 6.0\Acrobat\plug_ins\PDFlib. The Windows version of the block tool is compatible only with PPS version 6.0.1.
PHPLib’s Block Tool
FIGURE 1
FIGURE 2
Mac OS Installation You can install the block tool in either Mac OS 9 or OS X. If you own Adobe Acrobat 5, place the files that comprise the block tool into the Acrobat plug-in directory, typically located at /Applications/Adobe Acrobat 5.0/Plug-Ins/. If you’re using Adobe Acrobat version 6 or version 7, save the files that comprise the block tool into a new directory and then locate the Acrobat program, which is usually found at /Applications/Adobe Acrobat 6.0 Professional. Using the Finder, click once on the Acrobat application to select it and then choose “File > Get Info” from the menu bar. Locate the triangle next to the words “Plugins.” Expand the triangle, select “Add,” and then locate the folder that contains the block tool plug-in files.
The New and Improved Block Tool
If you’ve used previous versions of the block tool, you’ll notice that the new version is much more user friendly. The export and import features have also been updated, making it much quicker to apply blocks from previously formatted PDFs.
FIGURE 3
Creating Blocks
After you install the block tool, you should see a new menu called “PDFlib Blocks” in Acrobat’s main menubar. You should also see a new icon that resembles [=]). This is the block tool. (See the top of Figure 1.) You use the block tool icon to create regions that you can fill with variable data. When you click the block tool icon and hover over the PDF, your cursor turns into a crosshair. To create a block, click the mouse and hold it while dragging your cursor. As you drag your cursor, a lightly-outlined box should appear. (See Figure 1.) When you’re satisfied with the size of the box, release the mouse button. A menu like the one shown in Figure 3 appears. The menu controls all of the properties of the block, including the formatting of the data that will be contained in the block (data that you will add via PDFlib).

FIGURE 4

There are three types of blocks that can be created:
• The first and default type of block is text. It handles any type of text, whether it’s a single line of text or many lines of text.
• The second type of block is image. As its name implies, an image block is a container for the dynamic placement of images within the PDF.
• The third and last type is PDF, which is able to contain other PDFs.

Each block has general properties (see Figure 2) and type-specific properties. General properties set attributes such as the placement of the block, its background and border colours, and its orientation, to name just a few. Some of the sections that follow describe the type-specific properties.

FIGURE 5

So what do you do with blocks? As you might have inferred already, you use blocks to mix dynamic content amid static content. A designer can create a PDF, include static text and images, and then place blocks wherever dynamic content should appear. Your application “fills in the blanks,” so to speak, and because blocks retain properties such as typeface, font size, color, kerning, and other settings, the block, once filled, looks exactly like the rest of the document, just as the designer intended. Using blocks, the application that generates each PDF document need not format anything. However, if you want to customize a block on-the-fly, you can. Pre-defined block attributes can be overwritten by your code.
Editing Block Settings
FIGURE 6
To change a block property, select the block you want to configure and then navigate to find the property you want to change. For example, Figure 3 shows how to edit the textflow property, which can be either true or false (hence, the dropdown menu). The purpose of most properties is obvious, but be careful with attributes that specify font names. Unless you’re running Acrobat on the same machine as your PDFlib application, it’s likely that the set of fonts on the two machines (say, your desktop and the server, respectively) will differ. Be sure to use the names of fonts that are installed on your server.
Text Flow Settings If you want a block to flow (automatically wrap and justify) arbitrary amounts of text, set the textflow property to true. Once set to true, an additional button named TextFlow appears next to the existing button labeled Text. Click on TextFlow to examine and set specific variables (such as leading and indents) that control how text flows in the block. All other text attributes—those for one line of text or a flow of text—remain in the same pane as the textflow property.
Mac OS X “Tiger”
If you’re using a very recent version of Mac OS X, you can find Acrobat’s plug-ins folder by control-clicking the Acrobat application and selecting “Show Package Contents”.
Image Settings By changing the block option to image, you can use PDFlib to place images dynamically in a PDF. There are far fewer options for an image block than for a text block. The options screen for an image block is shown in Figure 5. The defaultimage attribute names a default image to place if the image specified by PDFlib is unavailable. The dpi setting, or the number of dots per inch, is used to override the dpi of an image. PDFlib will use the default dpi value of the image if it is available, or 72 dpi if this option isn’t set. If necessary, you can set the horizontal and vertical dpi independently by supplying two values instead of one, first horizontal dpi and then vertical dpi. The scale property controls the scaling of the image. You can supply one value to scale horizontally and vertically equally, or supply two values, one for the horizontal and another for the vertical scale factor.
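The “one value or two” convention used by dpi and scale is easy to mirror in application code that prepares or validates such settings. The following is a minimal sketch, assuming a hypothetical helper named normalize_pair() (it is not part of PDFlib) and assuming that the two values are separated by whitespace:

```php
<?php
// Hypothetical helper, not part of PDFlib: expands a "one value or two
// values" setting (as described for dpi and scale) into a
// [horizontal, vertical] pair.
function normalize_pair(string $value): array
{
    // Split on whitespace and drop empty pieces.
    $parts = preg_split('/\s+/', trim($value), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($parts as $part) {
        if (!is_numeric($part)) {
            throw new InvalidArgumentException("Not a number: $part");
        }
    }

    if (count($parts) === 1) {
        // A single value applies to both axes.
        return [(float) $parts[0], (float) $parts[0]];
    }
    if (count($parts) === 2) {
        // Two values: horizontal first, then vertical.
        return [(float) $parts[0], (float) $parts[1]];
    }

    throw new InvalidArgumentException('Expected one or two numeric values.');
}
```

Calling normalize_pair('72') yields the same value for both axes, while normalize_pair('300 150') keeps the horizontal and vertical values apart.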
FIGURE 7
PDF Settings The settings for a PDF block are very similar to the settings for an image block, as shown in Figure 6. defaultpdf specifies a default PDF to place if the PDF document that PDFlib names cannot be found. defaultpdfpage specifies which page of the default PDF to place if the default PDF must be used. scale controls the scaling of the PDF. As with an image, you can specify one value to apply to both axes or you can provide two values, one for horizontal scaling and another for vertical scaling.
FIGURE 8
Custom Settings When using any type of block, you can specify custom attributes. Custom attributes do not affect the output when using PDFlib, but can be retrieved by PDFlib for interpretation by your code. Custom attributes are good for passing information to the PDFlib program, or even for just better record keeping. As an example, say that you want to create a text block that’s limited to ten characters or less. Create the text block, add a custom property named length, set it to 10, and then retrieve the value via PDFlib at runtime. Your code can verify the length of a string before filling the block and react accordingly, perhaps truncating the string or asking the user to provide a new value.
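The length check described above might be sketched as follows. Reading the custom property from the PDF is deliberately left out; the value 10 is simply assumed to have been retrieved already, and the helper name fit_to_length() is hypothetical:

```php
<?php
// Hypothetical helper: enforces a maximum length taken from a custom
// block property. Retrieving the property via PDFlib is not shown here.
function fit_to_length(string $text, int $maxLength): string
{
    if ($maxLength < 0) {
        throw new InvalidArgumentException('Maximum length must not be negative.');
    }
    // strlen() counts bytes, which matches characters for single-byte
    // encodings such as winansi.
    if (strlen($text) <= $maxLength) {
        return $text;
    }
    // Too long: truncate. Asking the user for a new value would be the
    // alternative reaction mentioned in the text.
    return substr($text, 0, $maxLength);
}

$maxLength = 10; // assumed value of the custom "length" property
$safeText  = fit_to_length('All the pie in the sky', $maxLength);
```

Whether to truncate silently or reject the input is an application decision; the custom property only carries the limit.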
FIGURE 9
The PDFlib Blocks Menu To make setting up blocks easier, the “PDFlib Blocks” menu has a few handy tools. You can export and import blocks to re-use complex blocks, you can align elements, and more.
Exporting
The “Export” feature is a huge timesaver when dealing with multiple PDFs that require the same types of blocks. Once you’ve finished setting up blocks in a single “master” PDF, you can export those blocks and then import them over and over again into other PDFs. There are several different settings in the “Export” dialog (see Figure 7):
• You can export blocks from all pages of the PDF or from a subset of them.
• You can export blocks to a new PDF or to an existing PDF. Selecting “New File on Disk” creates a blank PDF with the blocks set in the new file. If you want to export blocks to a document that you already have opened in Adobe Acrobat, select “Open Document” and click “Choose” to see a list of all open documents. If you choose “Replace Existing Files”, the block tool will overwrite the target file with blank pages with the blocks in the proper place.
• The next option is “Export Which Blocks?” This section allows you to control which blocks are exported. You can export all blocks (depending on the number of pages you choose in the first section) or just the blocks that you highlight before exporting. You can also choose to delete the blocks that exist on the target PDF.
Importing
You can import blocks from another PDF using the import option in the “PDFlib Blocks” menu. When you choose “Import,” you will be presented with a screen to choose the file that contains the blocks you want to import (Figure 8). After you choose the appropriate file, you can determine which pages the blocks should be applied to.

Alignment Options
The alignment option in the “PDFlib Blocks” menu allows you to align two blocks. To align, choose a block. It should turn pink, reflecting that it’s your primary choice. Then choose another block; it should turn blue, indicating that it’s your secondary choice. When you select “Align,” the blue block should align with the pink block. Figure 9 shows two blocks, Block_1, the secondary block, left-aligned to the primary block, Block_0. The “Size” alignment option only works when more than one block is selected. You can change all secondary blocks (blue) to be either the same width or height as the primary block (pink). The “Center” alignment option aligns all blocks selected either horizontally or vertically, and even both horizontally and vertically.

Defining Blocks and Detecting Settings
Two other time savers are available in the “PDFlib Block” menu: one creates a block from a placed object like an image, and another creates blocks that automatically detect the font settings and font color of the font that the block is being created over. Click on “Click Object to Define Block” and then click on an object such as an image to create a block of the same dimension in the exact same position. Or, if you click on “Detect Underlying Font and Color” before you create a block, the block’s font settings are automatically set to match the style and size of the text below the new block. This feature is especially useful when dealing with a lot of text and specific colors. (You may have to adjust the font name to match a font located on the server running PDFlib.)

Using Blocks
As you might imagine, working with blocks from within your code makes placing text, images, and PDFs into a dynamic PDF far simpler than writing code to control the pointer, stroke text line-by-line, and so on. With blocks, formatting is separated from your code, leaving all of the aesthetics to the designer creating the PDF. Better yet, a change to the design of the page doesn’t (necessarily) necessitate tweaking your code.

Setting up the dynamic PDF document is similar to what’s been shown in prior chapters, except you need to pull in the PDF that contains the blocks. First, specify the basic information:

    // Load the PDFlib extension if it is not already available
    if (!extension_loaded('pdf')) {
        dl('libpdf_php.so');
    }
    $p = PDF_new();
    PDF_begin_document($p, "", "");
    PDF_set_info($p, "Creator", "block_tool.php");
    PDF_set_info($p, "Author", "Ron Goff");
    PDF_set_info($p, "Title", "Block Tool");
Next, pull in the PDF page that contains the blocks, place it into memory, and create a new blank page:

    $block_file = "block_file.pdf";
    $blockcontainer = PDF_open_pdi($p, $block_file, "", 0);
    // Page standard 8.5 x 11
    PDF_begin_page_ext($p, 612, 792, "");

Continuing, call up the actual page that you want to use. In the line of code below, the 1 (numeral one) refers to page one of the PDF that contains the blocks.

    $page = PDF_open_pdi_page($p, $blockcontainer, 1, "");

If you want to use another page from the “template” PDF, just specify that page number instead of 1. Finally, the page with blocks is “copied” to the new page in the new PDF.

    PDF_fit_pdi_page($p, $page, 0.0, 0.0, "adjustpage");
The adjustpage option adjusts the size of the new page to match the page size of the template PDF. adjustpage overrides any page settings that have been set previously. From here, you are ready to use the blocks.
Text Blocks
Whether working with a line of text or a text flow, text is easy to fill in: just specify the name of the block and the text to render and call PDF_fill_textblock().

    $block = "Block_1";
    $text = "All the pie in the sky wasn't enough to fill my plate";
    PDF_fill_textblock($p, $page, $block, $text, "encoding=winansi");
The block name, here Block_1, is the name that was assigned to the block when it was created in the template PDF. (Block names are unique and the default name is Block_#, but a block name can be any string of alphanumeric characters.) Notice that there are no extra formatting options. Whatever text you “insert” assumes the formatting of the block.
Form Conversion
You may be familiar with the Adobe Acrobat “Form Tool,” a great way to create fillable areas of your PDF. So, why not just use forms to define variable data placement? Because the form tool is limited: it cannot specify advanced font settings, whereas the block tool has been designed specifically to customize all aspects of your text. However, if you have a PDF that used the form tool to define areas for text, there is an option within the “PDFlib Blocks” menu to convert your pre-made forms into blocks (Figure 5.4).

If you want to override a block’s formatting, you can. Where encoding=winansi appears, add the options that you want to override. For example, to override the font size, specify encoding=winansi fontsize=12. You should also enable embedding as needed. You can enable embedding by adding embedding=true as in encoding=winansi embedding=true.
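When several overrides are combined, building the option string by hand gets error-prone. The following sketch assumes a hypothetical helper, build_optlist(), which is not part of PDFlib; it simply joins key/value pairs into the space-separated key=value form used in the examples above. Values that contain spaces are not handled here.

```php
<?php
// Hypothetical helper, not part of PDFlib: turns an associative array
// into an option string such as "encoding=winansi fontsize=12".
function build_optlist(array $options): string
{
    $pairs = [];
    foreach ($options as $key => $value) {
        // Booleans are written out as the words true/false.
        if (is_bool($value)) {
            $value = $value ? 'true' : 'false';
        }
        $pairs[] = $key . '=' . $value;
    }
    return implode(' ', $pairs);
}

$optlist = build_optlist([
    'encoding'  => 'winansi',
    'fontsize'  => 12,
    'embedding' => true,
]);
// $optlist can then be passed as the last argument of PDF_fill_textblock().
```

Keeping the overrides in an array also makes it easy to start from a set of defaults and merge in per-block changes before the string is built.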
Image Blocks
The process of placing an image in an image block resembles that of placing the image “manually”: the image is loaded and then placed.

    $block4 = "Block_4";
    $image_load = "image.jpg";
    $image = PDF_load_image($p, "auto", $image_load, "");
    PDF_fill_imageblock($p, $page, $block4, $image, "");
    PDF_close_image($p, $image);
In this example, the image image.jpg is placed in Block_4 using the function PDF_fill_imageblock().
PDF Blocks
The steps to place a PDF document within the dynamically generated PDF are similar to the steps required to set up a page to work with blocks. You identify which block you want to “fill,” identify the PDF and the page you want to extract from, and then fill the named block with that content.

    $block5 = "Block_5";
    $pdf_load = "basic_pdf.pdf";
    $pdf = PDF_open_pdi($p, $pdf_load, "", 0);
    $pdf_fill = PDF_open_pdi_page($p, $pdf, 1, "");
    PDF_fill_pdfblock($p, $page, $block5, $pdf_fill, "");
    PDF_close_pdi($p, $pdf);

PDF_open_pdi() opens the PDF, while PDF_open_pdi_page() loads the correct page. The function PDF_fill_pdfblock() puts it all together, placing the actual PDF onto the page. Finally, close the open PDF by calling PDF_close_pdi(), which frees the resources consumed by the open PDF.
Closing the Page
After you’ve filled all of the appropriate blocks on the open page, you must close that page.

    PDF_close_pdi_page($p, $page);

This call closes the imported template page (the handle returned by PDF_open_pdi_page()), not the output document. Once it has been called, you can finish the current page and start a new one, or end the entire document.
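Pulling the steps from the preceding sections together, the overall sequence might look like the sketch below. It needs the commercial PDFlib extension with block support, so it returns null when the extension is missing. The function name render_from_template() and its parameters are hypothetical, and the calls PDF_end_page_ext(), PDF_end_document(), PDF_get_buffer() and PDF_delete() do not appear in the snippets above; they are assumed here from the PDFlib API.

```php
<?php
// Sketch only: fills text blocks on page 1 of a template PDF and returns
// the generated document as a string. Function name, file name and
// block names are hypothetical.
function render_from_template(string $templateFile, array $texts): ?string
{
    if (!extension_loaded('pdf')) {
        return null; // PDFlib is not available; nothing to do.
    }

    $p = PDF_new();
    PDF_begin_document($p, "", "");        // empty file name: build in memory

    $doc = PDF_open_pdi($p, $templateFile, "", 0);
    PDF_begin_page_ext($p, 612, 792, "");  // 8.5 x 11 inches
    $page = PDF_open_pdi_page($p, $doc, 1, "");
    PDF_fit_pdi_page($p, $page, 0.0, 0.0, "adjustpage");

    // $texts maps block names to the text that should fill them.
    foreach ($texts as $block => $text) {
        PDF_fill_textblock($p, $page, $block, $text, "encoding=winansi");
    }

    PDF_close_pdi_page($p, $page);         // release the imported page
    PDF_end_page_ext($p, "");              // finish the output page
    PDF_close_pdi($p, $doc);               // release the template document
    PDF_end_document($p, "");              // finish the output document

    $pdf = PDF_get_buffer($p);
    PDF_delete($p);
    return $pdf;
}
```

A call such as render_from_template('block_file.pdf', array('Block_1' => 'Some text')) would then produce one filled page; error handling is omitted to keep the sequence readable.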
Putting It All Together
A complete example using the PDF_fill_textblock() function can be seen in Listing 1. The PDFlib block tool is easy to use and provides for complex layouts without extensive programming. Using blocks, a designer can assign where dynamic text, images, and even PDFs are to be placed, yielding a much more professional result.
RON GOFF is the technical director/senior programmer for Conveyor Group (www.conveyorgroup.com), a Southern-California based web development firm. He is the author of several articles for PHP|Architect magazine and other online publications. Ron lives in California with his wife Nadia and 2 children. You can contact him at [email protected].
Most PHP developers know about the ability to create PDF documents on the fly. When looking at the wide range of PHP classes or APIs, every product has its own advantages and disadvantages: some of them are very expensive and others are free, but don’t offer the same functionality as the expensive ones. The main difference between the free and commercial libraries is the ability to use external documents. PDFlib has supported this through its PDI interface, but the free classes didn’t support external documents, until I released FPDI for FPDF, which gives you the same muscle, but for free!
by JAN SLABON
PDF documents, or better stated, the PDF format, have reached widespread popularity over the past few years, and this momentum continues. A very strong example of this is a recent ISO standard, which is based on PDF 1.4 and defines a PDF derivative for the long-term preservation of electronic documents. PDF has become a real standard! In fact, the dynamic generation of PDF documents is an important issue today, and will continue to be so in the future. While it’s quite simple to build PDF documents on desktop PCs, their dynamic generation on a webserver, especially when using a language like PHP, can prove very difficult. On the Internet, you’ll find several PDF APIs that will allow you to create PDF documents with PHP. Some are delivered as PHP extensions, and some are “simple” PHP classes. Years ago, I came across a PHP class going by the name of FPDF, written by Olivier Plathey (http://www.fpdf.org). I was absolutely amazed by its capabilities, its easy usage and that the “F” in “FPDF” stands for “Free.”

PHP: 4.2+
OTHER SOFTWARE: FPDF 1.53 and FPDI 1.1
CODE DIRECTORY: fpdi
TO DISCUSS THIS ARTICLE VISIT: http://forum.phparch.com/279
FPDI in Detail When I was working with FPDF, I was often challenged with a situation where I had to rebuild a whole document, programmatically. As you can imagine, this part was very frustrating, tedious, and time consuming. A digital version of your document is sitting right in front of you, and you just cannot use it. Similarly, I ran into additional problems when dealing with vector based graphics and FPDF. There was no real way to import such things, except by converting them to bitmaps and using the Image() method of FPDF. I’m sure I don’t have to explain the drawbacks to this workaround. When I found an article in php|architect (Vol. 3, Issue 5) where Marco Tabini described how to parse a PDF and update it with some simple content, I got the idea to implement this technique into FPDF—which resulted in a library which was also named with 4 simple chars: FPDI (Free PDF Import). I released my new library under the Apache Software License 2.0, which allows you to use it in your commercial or non-commercial projects. The project homepage can be found at http://fpdi.setasign.de. The article by Marco is freely available as a monthly sample, at http://www.phparch.com/issuedata/articles/article_110.pdf. In this article, I’ll introduce you to FPDI, explain how it was born, and cover its internal workings. I will assume that you have some knowledge of FPDF, and have a bit of experience with the Portable Document Format, itself. If not, just download FPDF, and run the tutorials that Olivier provided in the package. This article will not tell you how to use FPDF, but will delve deeper into the details of the PDF structure and how FPDI extends FPDF, bringing out the ability to import single pages of existing PDF documents—not just modifying existing documents. This feature is not that clear to most people out there. 
At this point I could tell you much about the structure of a PDF document, but as I already mentioned, the whole idea is based on another article, where everything you need to know about parsing a PDF is already described. I will cover some details about that issue later in this article. I want to make it clear why I chose the “import single pages” method, instead of “really modifying/updating” a PDF. To put it simply: “It is much easier.” You can look at a PDF document as a collection of single objects which are linked to each other. Pages, images, font descriptions, and document information are all single objects and can be identified by a unique ID. The PDF format is more flexible than just assigning objects by simple IDs, though: it allows one to define named relations. For example, these relations can be used to put an image into a content stream of a PDF page. You have to set up a resource dictionary, where you define the name of the image and its real object relation. After this, you can simply refer to the image by using the name you provided in the content stream. Because FPDF, like any other PDF generator, uses named relations, and these in turn follow naming conventions, you have to pay attention when updating a PDF. If you’ve read Marco’s article, you’ll remember that there’s a part in it where he searches for the next available font name. This check would have to be built into FPDF before every piece of code where FPDF creates a named relation. Another disadvantage of updating documents is that you cannot remove single pages, or reuse an existing page in an easy way. Importing pages, by contrast, allows us to reuse, resize, crop or rotate pages. We can also avoid naming conventions, because every imported page has its own kind of namespace in the new document, as you’ll see below.
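To make the idea of a named relation more concrete, here is a small illustrative sketch that builds the two PDF fragments involved: the resource dictionary entry that maps a name to an object, and the content stream operators that refer to the object by that name. The helper names, the name I1 and the object number 12 are made-up example values; this is not code taken from FPDF.

```php
<?php
// Illustrative only: the name "I1" and object number 12 used below are
// example values, and both helpers are hypothetical.

// Resource dictionary entry: maps local names to indirect object references.
function xobject_resources(array $names): string
{
    $entries = [];
    foreach ($names as $name => $objectId) {
        $entries[] = sprintf('/%s %d 0 R', $name, $objectId);
    }
    return '/XObject << ' . implode(' ', $entries) . ' >>';
}

// Content stream operators that paint the XObject registered under $name,
// scaled to $w x $h and positioned at ($x, $y).
function paint_xobject(string $name, float $x, float $y, float $w, float $h): string
{
    return sprintf('q %.2F 0 0 %.2F %.2F %.2F cm /%s Do Q', $w, $h, $x, $y, $name);
}

$resources = xobject_resources(['I1' => 12]);
$content   = paint_xobject('I1', 50, 600, 200, 100);
```

The content stream only ever mentions /I1; which object that name stands for is decided by the resource dictionary of the page (or form) the stream belongs to. This is why two pages can both use the name I1 for different images without clashing.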
The Basics
While I was studying the PDF reference to find a good solution for importing pages, I came across a technique with the spooky name of “form XObjects”. I’m sure that everyone who stumbles upon this term thinks about conventional “forms” like those that we use in HTML, or on paper. In this case, “form” has another meaning: it corresponds to the notation of forms in the PostScript language. A form XObject can be compared with a kind of layer. It is a self-contained description of any sequence of graphics objects. Its whole structure is very similar to the structure of a single page in a PDF document. The form XObject has its own resource dictionary, where named relations are defined. So, it seemed to be the perfect solution for my problem: if I could create form XObjects, I most certainly would be able to convert pages into them. But, form XObjects have more advantages than simply preparing FPDF for PDF import. For example, they can be reused at any time in a PDF document, where the viewer application can cache the rendered results to optimize the execution. It sounded like a kind of template to me, so I began extending FPDF with this feature, which resulted in a PHP class called fpdf_tpl. This class redirects all output made by FPDF into containers which will be used as form XObjects, so one can reuse any output created with FPDF, at any time. As already stated, this class has more to offer than merely preparing FPDF for FPDI. You can reuse a template multiple times in a document, whereas it only needs to be written once to the resulting document, which leads to less memory usage and processing time in your script.
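As a rough illustration of what such a container looks like once it is written to the file, the sketch below wraps a buffer of content stream operators in a form XObject. It is not the fpdf_tpl implementation; the function name form_xobject() and the object number are hypothetical, and only the dictionary keys come from the PDF reference.

```php
<?php
// Illustrative only, not the fpdf_tpl implementation: serializes a buffer
// of content stream operators as a form XObject.
function form_xobject(int $objectId, float $w, float $h, string $buffer): string
{
    $out  = sprintf("%d 0 obj\n", $objectId);
    $out .= "<< /Type /XObject /Subtype /Form /FormType 1\n";
    $out .= sprintf("/BBox [0 0 %.2F %.2F]\n", $w, $h);  // bounding box of the form
    $out .= "/Resources << >>\n";   // would hold the form's own named relations
    $out .= sprintf("/Length %d >>\n", strlen($buffer));
    $out .= "stream\n" . $buffer . "\nendstream\nendobj\n";
    return $out;
}

// A 100 x 50 form that strokes a rectangle along its bounding box.
$object = form_xobject(7, 100, 50, "0 0 100 50 re S");
```

The object is written to the document once. Every page that registers it in its resource dictionary under some name can then paint it with the Do operator as often as needed, which is where the savings in file size come from.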
        $this->Image('images/php-a.png', 100, 5, 100);
        $this->SetDrawColor(0);
        $this->SetLineWidth(0.3);
        $this->Rect($this->lMargin+.15, 31, $width-0.3, $this->h-31-10, 'D');
        $this->SetXY($this->lMargin+.15, 31+.15);
        if (is_null($content))
            $content = file_get_contents(__FILE__);
        $this->SetFont('Courier','',6);
        $this->MultiCell($width-.3, 2.5, $content);
    }

    // For debugging purpose
    function pdf($orientation='P',$unit='mm',$format='A4') {
        $this->_startTime = microtime();
        parent::fpdf_tpl($orientation,$unit,$format);
    }

    // For debugging purpose
    function Close() {
        $this->_endTime = microtime();
        $this->_writingTime = true;
        $this->AddPage();
Examples of its use are: the generation of headers and/or footers, table headers which could be repeated on every page, a background grid for large tables, text in front of or behind a template, etc. If you take a look at Listing 1 and Figure 1, you’ll see a sample script which demonstrates the use of templates. You turn templates on and off by setting the $pdf->useTPLs property to true or false; the visual result is the same. This demo has no real meaning, but it shows how much the file size and processing time decrease if you’re using templates. My tests gave me a processing time of only 0.0766 seconds when using templates, and 3.649 seconds without them! The same was true for the buffer size: with templates it only takes up 14.5 KB; without templates, approximately 1.2 MB. I hope that the main advantage of fpdf_tpl is now clear.

Let’s skip ahead and take a deeper look at this class. The class holds all created templates in an array named $this->tpls, where each entry describes a single template as an array with special keys. The main entries in each template array are x, y, w, h and buffer. All other entries are just used to save other information, and are prefixed with o_. A new property named $this->res is used to assign resources like fonts, images, or other templates, to the template or the page. The assignment of resources to single pages is left in for testing purposes, and will be removed in the next release of fpdf_tpl.
So, we’ll only take a look at the tpl key in $this->res. This array is needed to rebuild the form XObject’s resource dictionary with the named relations which are used in the template. To redirect the output made by FPDF, I used a simple flag, $this->intpl, and extended the _out() method. I had to take special care because a form XObject cannot include internal or external links or, more precisely, any kind of annotation. FPDF uses a single, global resource dictionary for all pages and creates this within the _putresources() method. I extended this method to make it call _puttemplates(), which will create all necessary template objects. After the objects are created and written, the named relations to them will be written to the main resource dictionary. All created templates are usable on every page! Unfortunately, using the global resource dictionary isn’t the best solution because it’ll introduce problems when interpreting or extracting pages of a document, as you will see later. With the fpdf_tpl class, I’ve built the basis for FPDI. Now we have to convert the pages of an already existing PDF document, but we have to parse it first to get the desired information.
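To make the redirection idea more concrete, here is a minimal sketch of how output could be captured into a template buffer and later wrapped as a form XObject. It is not the fpdf_tpl source: the class TemplateRecorder, the function formXObject() and the resource name /TPL1 are made up for illustration, and the sketch ignores unit scaling and PDF object numbering.

```php
<?php
// Hypothetical stand-in for the template mechanism; not the real fpdf_tpl code.
class TemplateRecorder
{
    /** Recorded templates: id => array with the keys x, y, w, h and buffer. */
    public $tpls = [];
    /** Output that was written while no template was open. */
    public $pageBuffer = '';
    /** Flag: true while output has to be redirected into a template. */
    private $intpl = false;
    private $current = 0;

    public function beginTemplate($x, $y, $w, $h)
    {
        $this->current = count($this->tpls) + 1;
        $this->tpls[$this->current] = [
            'x' => $x, 'y' => $y, 'w' => $w, 'h' => $h, 'buffer' => '',
        ];
        $this->intpl = true;
        return $this->current;
    }

    public function endTemplate()
    {
        $this->intpl = false;
    }

    /** Counterpart of an extended _out(): the flag decides where a line goes. */
    public function out($s)
    {
        if ($this->intpl) {
            $this->tpls[$this->current]['buffer'] .= $s . "\n";
        } else {
            $this->pageBuffer .= $s . "\n";
        }
    }
}

/** Wrap a recorded template as the body of a form XObject. */
function formXObject(array $tpl, $resources = '')
{
    // /BBox holds two corner points, so width and height are added to x and y.
    $bbox = sprintf(
        '[%.2F %.2F %.2F %.2F]',
        $tpl['x'], $tpl['y'], $tpl['x'] + $tpl['w'], $tpl['y'] + $tpl['h']
    );
    return "<< /Type /XObject /Subtype /Form /FormType 1\n"
        . "/BBox " . $bbox . "\n"
        . "/Resources << " . $resources . " >>\n"
        . "/Length " . strlen($tpl['buffer']) . " >>\n"
        . "stream\n" . $tpl['buffer'] . "\nendstream";
}

$rec = new TemplateRecorder();
$id = $rec->beginTemplate(0, 0, 100, 50);
$rec->out('0 0 100 50 re S');                // rectangle outline, stored once
$rec->endTemplate();
$rec->out('q 1 0 0 1 10 10 cm /TPL1 Do Q');  // the page only refers to the template by name

$xobject = formXObject($rec->tpls[$id]);
echo $xobject, "\n";
```

Because the page stream only contains the short /TPL1 Do call, using the same template on many pages adds very little to the document, which fits the size difference measured above.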
Parsing the Original Document
I owe a lot of credit to Marco’s article, because the parsing of an existing document was nearly completely covered in it. I adapted all parsing functions into a single class, pdf_parser, and added support for reading streams. Let’s take a quick look at the structure and how the parsing has to be done.

The first task that the parser has to do is to read the xref-table of the PDF document. This is done by the pdf_parser::pdf_read_xref() method. The xref-table is similar to a table of contents. It gives us information about the objects used in the document, and their byte-offset positions in the file. At the end of the xref-table, we’ll find the file trailer dictionary; the entries in this dictionary lead us to the catalogue dictionary of the file. The catalogue dictionary is the root of all objects in the document’s object hierarchy, and in it we’ll find the reference to the first page tree node of the document’s page tree, which is exactly what we’re searching for: all single pages used in the existing document. The parser has to follow the whole page tree to get the exact page count and to collect other information on the pages, which is done by read_pages() in the extended class, fpdi_pdf_parser, and results in an array stored as the $this->pages property. The keys of $this->pages are the desired page numbers, starting at zero, where each entry holds the related page object. After this task is done, we have enough information about the source document for now.

While I was implementing this code, I got stuck on some problems, and it took me several days (and nights) to fix them. A great problem for me was the determination of the line ending in a file. Normally, this task is handled by the PHP configuration directive auto_detect_line_endings, but as a PDF file can have multiple updates by different programs (on different operating systems), the line endings can be mixed. To overcome this issue, I’ve written a wrapper for fgets() which is used as a fallback function if fgets() returns incorrect data. This wrapper function also enables the class to be used with PHP versions older than 4.3, where auto_detect_line_endings was introduced. For the same reason, I also created wrapper functions for strspn() and strcspn(), so that FPDI should run with PHP 4.2+. During my testing (with hundreds of PDF files), I found several minor bugs in the parsing process; some are fixed and some are so rare that they can be ignored for now.
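The mixed line endings are easier to picture with a small example. The following is only a sketch of the idea, not FPDI’s actual fgets() wrapper: the function name readLineAnyEol() is an assumption, and it reads character by character, which is simple but slow.

```php
<?php
/**
 * Read one line from a stream, accepting "\n", "\r" or "\r\n" as terminator.
 * Hypothetical helper for illustration; this is not FPDI's own wrapper.
 */
function readLineAnyEol($handle)
{
    $line = '';
    while (($c = fgetc($handle)) !== false) {
        if ($c === "\n") {
            return $line;
        }
        if ($c === "\r") {
            // Swallow the "\n" of a "\r\n" pair; otherwise step back one byte.
            $next = fgetc($handle);
            if ($next !== false && $next !== "\n") {
                fseek($handle, -1, SEEK_CUR);
            }
            return $line;
        }
        $line .= $c;
    }
    // End of file: return the unterminated rest, or false when nothing is left.
    return $line === '' ? false : $line;
}

// A made-up file fragment that mixes all three line endings.
$h = fopen('php://memory', 'w+');
fwrite($h, "%PDF-1.4\rxref\r\n0 1\ntrailer");
rewind($h);

$lines = [];
while (($l = readLineAnyEol($h)) !== false) {
    $lines[] = $l;
}
fclose($h);
print_r($lines);
```

Each of the four lines comes back without its terminator, regardless of which program (and operating system) wrote that part of the file.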
Let’s Convert a Page to a Form XObject
First, we’ll take a deeper look at a page object found in $this->pages of a parser object. A PDF object is represented internally as an array, in a specified structure, as Marco defined in his article. For demonstration purposes, we use the demonstration PDF shipped with FPDI:

    $pdf =& new fpdi();
    $pdf->setSourceFile('classes/pdfdoc.pdf');
    echo "<pre>";
    print_r($pdf->current_parser->pages[0]);
You can see the output in Listing 2. At first glance it seems very odd, but everything makes sense! Every entry at any level is built as an array with at least the keys 0 and 1, where 0 describes the type of the value in key 1. All other keys are used to define special attributes of that value. The types are defined as constants in pdf_parser.php. For example, the 0 key in the lowest level is 9, which is defined as a PDF object. This object’s value is a dictionary (5), in this case a page dictionary, with tokens that each have their own value types. To import a page, FPDI offers a method called ImportPage() which is close to the BeginTemplate() method of fpdf_tpl. As we’ve seen, a template entry in $this->tpls contains main entries like x, y, w, h and buffer. If we take a closer look at Listing 2, we can see a relationship between these entries and the page dictionary. /MediaBox is an array (6) of exactly 4 entries, whose value types are numeric (1). FPDI takes the first entry’s value as x, the second as y, the third as w and, not surprisingly, the last one as h. This is actually a bug in the current release of FPDI: the last 2 values are also coordinates. The real values for the width and the height have to be calculated as the distance from the first to the third and from the second to the fourth value. This bug has been overlooked for a long time, because it only manifests itself if the MediaBox’s x- or y-value is other than 0. It’ll be fixed in the future!

To resolve the MediaBox’s data, the extended parser for FPDI is shipped with a getPageBox() method. This method is needed because the MediaBox (or any other box) can also be a reference to another PDF object, or the value can be inherited from a parent node in the page tree. This method makes sure that the correct values will be resolved. Currently, FPDI supports only PDFs that contain a MediaBox; there are other boxes in the PDF specification, e.g. a CropBox or a TrimBox. If your PDF uses other boxes instead of a MediaBox, the results of FPDI might not be as expected. Also, if another box is used, you can ignore the bug described in the paragraph above.

The next task is to fill the buffer of our template with the content stream of the imported page. There’s one important difference between a PDF page and a form XObject: a page can have multiple content streams, while a form XObject can only have one. Because of this, we have to concatenate all content streams of a page into one single stream. To do this, there’s a method called getPageContent() in the extended parser (fpdi_pdf_parser). Each of the resolved streams can be encoded with different filters. The most commonly used filter is FlateDecode, which can be decoded with the zlib functions if they are enabled in the PHP installation. I’ve also written 2 more decoders for the LZWDecode and ASCII85Decode filters. With these 3 filters, FPDI should handle nearly all documents which have encoded page content streams; until now there have been no bug reports related to an absent filter. The decoding of the content streams is done by the rebuildContentStream() method in the extended parser class.
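The MediaBox correction described above is plain arithmetic, so a short sketch may help. The helper name boxToRect() and the sample numbers are assumptions for illustration; this is not the code of getPageBox().

```php
<?php
/**
 * Turn a box array [x1, y1, x2, y2] (two opposite corners) into a position
 * and a size. Hypothetical helper, not part of FPDI.
 */
function boxToRect(array $box)
{
    list($x1, $y1, $x2, $y2) = $box;
    return [
        'x' => min($x1, $x2),
        'y' => min($y1, $y2),
        'w' => abs($x2 - $x1),   // width is a distance, not the third value itself
        'h' => abs($y2 - $y1),   // height is a distance, not the fourth value itself
    ];
}

// A roughly A4-sized box whose lower-left corner is not at 0,0.
$rect = boxToRect([10, 20, 605, 862]);
printf("x=%d y=%d w=%d h=%d\n", $rect['x'], $rect['y'], $rect['w'], $rect['h']);
```

With an origin of 0,0 the third and fourth values happen to equal the width and height, which is why reading them directly goes unnoticed for most documents.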
After decoding all streams, they can simply be concatenated into a single one and assigned to the buffer key in the desired template array. The next step is to resolve the resources which are used in the content streams we want to import. These can be relations to images, fonts or other form XObjects. The resources are normally defined as named relations in the page dictionary, or in a parent node in the page tree. To resolve them, the extended parser offers a _getPageResources() method, which returns the desired resource data of the page. The method will not resolve the resource’s own data, but only information like its name and the objects it references in the original document. The real import of these resources into the new document will be done automatically in the extended _puttemplates() method. Because these resources have their own unique identifiers in their source document, FPDI has to reassign new identifiers to the objects at runtime.

FPDI in Detail
FIGURE 2
FIGURE 3

LISTING 3
<?php
// (the beginning of the listing, where $pdf is created, is missing)
$pagecount = $pdf->setSourceFile('pdfs/article_110.pdf');
#$pagecount = $pdf->setSourceFile('pdfs/thumbnails.pdf');
$pdf->AddPage();
$x = $pdf->lMargin;
$y = $pdf->tMargin;
for ($i = 1; $i <= $pagecount; $i++) {
    // import page no. $i
    $tplidx = $pdf->ImportPage($i);
    // use the imported page
    $size = $pdf->useTemplate($tplidx, $x, $y, 250);
    // draw a border around the used page
    $pdf->Rect($x, $y, $size['w'], $size['h'], 'D');
    // if it's the third page in a row do a
    // pagebreak and reset the x- and y-values.
    if ($i % 3 == 0) {
        $pdf->AddPage();
        $x = $pdf->lMargin;
        $y = $pdf->tMargin;
        continue;
    }
    $x += 270;
    $y += 100;
}
$pdf->Output('thumbnails.pdf', 'D');
$pdf->closeParsers();
?>

All of the data which will be copied from the original document to the new document will be written by the pdf_write_value() method, which accepts an array in the same structure that you see in Listing 2. If pdf_write_value() reaches an object reference (8), it’ll reassign a new unique id (if one does not exist), and push the original object identifier onto a stack. This stack will be processed in the _putOobjects() method, recursively. If _putOobjects() sends data to pdf_write_value() which also includes object references, the stack will be filled again. FPDI will not write duplicates of object references: it will “remember” previously written objects of a specific file. FPDI will, however, follow every object reference it finds.

This behaviour is particularly important to the programmer, even if you want to import only a single page of a very large file. As I’ve already stated, the PDF structure allows the creator to define a single, global resource dictionary, as FPDF does, where all used resources of the document are defined. FPDI will not recognize which of these resources are really in use on the imported page. Just think about the following example: we create a 100 page PDF with FPDF, where each page shows one unique image. Now, we want to import page number 40 into a new document with FPDI. Because FPDF uses such a global resource dictionary, FPDI will resolve that dictionary as the resource dictionary of the single page, and will copy all of the images into the new document, even if the page only shows one image! So, don’t be surprised if you re-import pages of PDFs made by FPDF.
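The renumbering with a stack, as described above, can be pictured with a miniature model. Everything in this sketch is an assumption made for illustration: the function copyReachableObjects(), the ['ref' => id] notation for an object reference, and the tiny object table are far simpler than the arrays that pdf_parser produces.

```php
<?php
// Hypothetical miniature of the copy step; not FPDI's pdf_write_value().
function copyReachableObjects(array $sourceObjects, $startId)
{
    $newIds  = [];   // original id => new id; this "memory" prevents duplicates
    $stack   = [];   // original ids that still have to be written
    $nextId  = 1;
    $written = [];   // new id => value with rewritten references

    // Rewrite a value; every reference gets its new id and is queued once.
    $rewrite = function ($value) use (&$rewrite, &$newIds, &$stack, &$nextId) {
        if (is_array($value) && isset($value['ref'])) {
            $old = $value['ref'];
            if (!isset($newIds[$old])) {
                $newIds[$old] = $nextId++;
                $stack[] = $old;
            }
            return ['ref' => $newIds[$old]];
        }
        if (is_array($value)) {
            return array_map($rewrite, $value);
        }
        return $value;
    };

    $rewrite(['ref' => $startId]);
    while ($stack) {
        $old = array_pop($stack);
        // Writing an object may push further references onto the stack.
        $written[$newIds[$old]] = $rewrite($sourceObjects[$old]);
    }
    ksort($written);
    return $written;
}

$source = [
    7  => ['Type' => 'Page', 'Font' => ['ref' => 12], 'Image' => ['ref' => 30]],
    12 => ['Type' => 'Font', 'Descriptor' => ['ref' => 13]],
    13 => ['Type' => 'FontDescriptor'],
    30 => ['Type' => 'Image', 'Mask' => ['ref' => 30]],   // refers to itself
    99 => ['Type' => 'Unused'],                           // never referenced
];

$copied = copyReachableObjects($source, 7);
echo count($copied), " objects copied\n";
```

Note that the sketch, like the behaviour described above, copies everything that can be reached through a reference, whether or not the page content actually uses it. Only object 99, which nothing points to, is left behind.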
Using FPDI
Now we should know how FPDI and fpdf_tpl work internally. It’s time to take a look at some examples. Listing 3 shows code which creates a thumbnail overview, similar to the one in Marco’s original article. As you can see, the usage is very simple. The first step is to call setSourceFile() with the desired PDF file, which will return the page count of the document. Next, we simply use a for loop to import each page. As you can see, the useTemplate() method nicely returns the dimensions of the imported page, so we can use this data to draw a border around it. You can see the results in Figure 2. To demonstrate FPDI’s flexibility, you can try to re-import this generated document by changing the filename to thumbnails.pdf and then take a look at Figure 3.

I already suggested that FPDF normally cannot work with vector based graphics, like a logo. But, as a PDF document can have vector based information, we can use FPDI to do the job. Let’s go back to the first example of fpdf_tpl. I used a PNG image as the php|architect logo. If we zoom in, we’ll see that the image gets a bit distorted (see Figure 4): it isn’t a vector image, so it doesn’t scale. To use an imported page in a template, it is necessary to import it before the call to beginTemplate(), as you can see in Listing 4. This results in a much better quality page, as you can see in Figure 5.

If you’re currently reading a PDF issue of this magazine, you’ll see that the document is personalized with your name and email. With FPDI and FPDF, you can get similar results. Just import a pre-existing page, and render personalized information on top of the imported data. In Listing 5, you’ll find an example of how you can personalize a PDF with FPDI; the result can be viewed in Figure 6.

FIGURE 4
FIGURE 5
FIGURE 6

There’s something you need to know about creating such personalized documents: you should always keep in mind that FPDI does not, and never will, manipulate an existing document, but will create a completely new one with its own structure. I should also mention that all dynamic content like links, PDF form elements, or any other annotation will get lost during the import process, because they are not part of the content stream of a page. So, this personalization will only work with simple PDF files.

Another point to mention is the size of the original document. Because FPDI has to rebuild the whole document, it must decode every content stream and hold them in memory. It will need a lot of computing power and memory for this task, which results in a long processing time for the script. The limits of a standard PHP installation can be reached much faster than you think!

If you take a closer look at the PDF version of php|a, you’ll see that it is also protected with your personal password (the same as your phparch.com account). PDF allows this, but it cannot be implemented with FPDI alone.
Some time ago, the protection extension for FPDF was written by Klemen Vodopivec, and I was involved as a beta tester and bug hunter, which was a long time before I thought about FPDI. Protection is an essential extension for FPDF; I think it’s the most commonly used one. It gives users and programmers a feeling of security. I’ve received several emails from users who want to mix both extensions to create protected PDFs with FPDF and FPDI, which in the end resulted in an FPDI_Protection extension, which you can also download from the FPDI project homepage. FPDI_Protection’s task is simple: it must encrypt output made by FPDF’s _putstream() and _textstring() methods, and also by FPDI’s pdf_write_value() method. There is only one particularly tricky part that you must pay attention to: strings which are HEX-encoded, instead of plain strings. These values have to be converted to plain text first, then encrypted, and then converted back to HEX values. To use FPDI_Protection in our example, we simply have to extend our pdf class from FPDI_Protection instead of FPDI. Now, we can simply use the SetProtection() method to add the protection/encryption features to our resulting PDFs.
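The HEX round trip mentioned above (HEX to plain bytes, encrypt, back to HEX) can be sketched as follows. This is an assumption-laden illustration, not the FPDI_Protection source: plain RC4 is used as a stand-in cipher, the key is made up, and the way a real PDF security handler derives its keys is left out entirely. The function names rc4() and encryptHexString() are hypothetical.

```php
<?php
// Plain RC4, used here only as a stand-in cipher for the sketch.
function rc4($key, $data)
{
    // Key scheduling
    $s = range(0, 255);
    $j = 0;
    $keyLen = strlen($key);
    for ($i = 0; $i < 256; $i++) {
        $j = ($j + $s[$i] + ord($key[$i % $keyLen])) % 256;
        $t = $s[$i]; $s[$i] = $s[$j]; $s[$j] = $t;
    }
    // Keystream generation, XORed onto the data
    $i = 0;
    $j = 0;
    $out = '';
    $len = strlen($data);
    for ($n = 0; $n < $len; $n++) {
        $i = ($i + 1) % 256;
        $j = ($j + $s[$i]) % 256;
        $t = $s[$i]; $s[$i] = $s[$j]; $s[$j] = $t;
        $out .= chr(ord($data[$n]) ^ $s[($s[$i] + $s[$j]) % 256]);
    }
    return $out;
}

/** Encrypt a PDF hex string such as "<48656C6C6F>" and return a hex string again. */
function encryptHexString($hexString, $key)
{
    // Strip the angle brackets and any white space between the digits.
    $hex = preg_replace('/\s+/', '', trim($hexString, '<>'));
    if (strlen($hex) % 2 === 1) {
        $hex .= '0';                       // a missing final digit counts as 0
    }
    $plain  = pack('H*', $hex);            // HEX -> raw bytes
    $cipher = rc4($key, $plain);           // encrypt the raw bytes
    return '<' . strtoupper(bin2hex($cipher)) . '>';   // raw bytes -> HEX
}

$encrypted = encryptHexString('<48656C6C6F>', 'made-up key');
echo $encrypted, "\n";
// RC4 is symmetric, so applying it again restores the original string.
echo encryptHexString($encrypted, 'made-up key'), "\n";
```

Encrypting the hex digits themselves, instead of the bytes they stand for, would produce a string that no viewer can decrypt, which is why the conversion step matters.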
LISTING 5

Future and Dreams
I’ve already mentioned some problems and bugs in FPDI, but have you ever found software without bugs? Probably not... I have some plans for the coming releases, which include not only bug fixes but also improvements. At the top of my list is the handling of PDFs that contain boxes other than the aforementioned MediaBox. This missing feature is sadly FPDI’s most reported problem. If you’ve run into this problem, you can work around it by simply reprinting the PDF through the Adobe PDF printer, which is shipped with Adobe Acrobat, or (maybe) some other PDF printer; I haven’t tested the others.

Another missing feature that I have not yet mentioned in this article is the handling of rotated pages. A PDF page can be defined as rotated, whereas the coordinate system isn’t. FPDI does not currently care about the rotation, and will import such a page as it is: rotated. This means that it will be shown rotated in the resulting document, whereas it is displayed correctly in the original document. For now, you can use the rotation script at http://www.fpdf.org/en/script/script2.php to correct this behaviour, but FPDI will automatically fix this for you in the next release. Another problem that I already described was the copying of unused resources. Maybe in the future FPDI will remove the unneeded resources automatically, too.

As you can see, there are several things on my to-do list, but I want to take the opportunity to write a little about the question I have been asked most often since releasing FPDI: “Can I replace placeholders in an existing PDF with new text with FPDI?” No, you can’t: not with FPDI, nor with any other program, without preparing the original documents. A PDF cannot be compared with a file in a structural language like HTML, even though a PDF can be a simple text file without any binary data. There is a way that will work with very raw PDF files, but it cannot be generalized. The requirements for such files are that every content stream which outputs a text string is unencoded, that the text string is plain text (not encoded), and that the font used is either a) one of the 14 standard fonts, or b) completely embedded in the original document.

Now, these requirements aren’t too strict, but a PDF can be created in various ways, and you usually don’t have much of a say in how a particular PDF is built. For example, the text string can be split into various small pieces, because the program that created the PDF used kerning pairs for layout purposes. These individual pieces, or even the whole text string, can be written as HEX-encoded strings. Generally, only a subset of the font is embedded (only the characters that are actually used in the document are included). In this case, even the full version of Acrobat itself cannot change text strings in the document. The only program I know of that will produce suitable PDFs is FPDF, but it would not make sense to build your templates in FPDF and replace strings in them afterwards. This intention is a dream and it looks like it will remain so forever. Don’t waste your time on finding a solution for this. If it was technically possible, someone would have already implemented the solution.
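A small experiment shows why a simple search-and-replace fails so easily. The three content-stream fragments below are made up for this sketch, as is the function countReplacements(); they all render the same word, but only the first one contains it as a single literal string. Compressed streams, which are the common case, would not even show this much.

```php
<?php
// Three hypothetical content-stream fragments that all render "PLACEHOLDER".
$streams = [
    'plain'  => 'BT /F1 12 Tf 72 700 Td (PLACEHOLDER) Tj ET',
    'kerned' => 'BT /F1 12 Tf 72 700 Td [(PLA) -30 (CEHO) 15 (LDER)] TJ ET',
    'hex'    => 'BT /F1 12 Tf 72 700 Td <504C414345484F4C444552> Tj ET',
];

/** Naive replacement of a literal string; returns the number of hits. */
function countReplacements($stream, $search, $replace)
{
    str_replace('(' . $search . ')', '(' . $replace . ')', $stream, $hits);
    return $hits;
}

$results = [];
foreach ($streams as $name => $stream) {
    $results[$name] = countReplacements($stream, 'PLACEHOLDER', 'John Doe');
    printf("%-6s %d replacement(s)\n", $name, $results[$name]);
}
```

Only the plain variant is found. The kerned and HEX-encoded variants look identical on screen, yet the naive replacement never sees them.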
JAN SLABON is the author of FPDI and lives in Helmstedt, Germany. He focuses on developing custom PHP solutions for end customers and web development companies all over the world. You can contact him at [email protected]
FEATURE
i18n
Internationalize Your Web Application with Less PHP Code
If you are looking to internationalize a web application, then you should try this simple technique which uses less PHP code, and consists mainly of easy-to-maintain HTML.
by Carl McDade

Making a web application support multiple languages can be a large job. It is a job that many do not like, and one that a lot of open source projects have avoided until now. Now it seems everyone is jumping on the multi-lingual train and using all sorts of PHP gadgetry to make it happen. Check a few open source projects’ to-do lists and you will likely find something to do with internationalization listed. In this article, I will show you one of the easier methods of internationalizing your code using very little PHP and ordinary HTML files. This method is fast, easy to maintain and as cross-platform as you can get. Before we get started on that, though, we need to go over some points that will make it easier to understand why globalization is necessary.
PHP: 4.3+
OTHER SOFTWARE: Macromedia Dreamweaver 2004 MX
CODE DIRECTORY: i18n
TO DISCUSS THIS ARTICLE VISIT: http://forum.phparch.com/282

Globalization Explained
Globalization, abbreviated by the little-used g11n, is the application of business practices and processes to take a business or a software product to a global market. If you want to know why globalization is important, then you only have to take a look at the following statistics. As you can see in Figure 1, the internet is outgrowing its American roots and the default language is not necessarily English. Language is only part of the picture. You have to take into account that the countries that make up a great percentage of internet users do not all use the same currency, and possibly not the same date format. If your software is going to grow with the growing internet market, then globalization is the key to its success. Now that you are excited about the prospect of people all over the world using your program, let’s take a look at the steps involved with making it useful on a global scale.
Internationalization Explained
There are several reasons why the i18n process of programming should be done at the beginning of the development cycle. Doing so significantly decreases the amount of necessary code, and it removes the need to extend the product or make compromises later on in development. In many cases, a little forethought will ensure that the developer does not have to rewrite all of the code. Instead, he will simply need to write a few files to make the existing software adaptable to a different market.
When there is less code to write, fewer programmers will be needed to work on internationalization. Good internationalization support means that your programming resources can be used to improve the software in other areas. The size of the end user market increases, and the software becomes more globally popular because it is usable by a more diverse customer base. Using simple text, an end user can easily localize the product to a specific region.
Internationalization is abbreviated i18n because there are 18 letters between the “i” and the “n” in the word. Internationalization is the process of designing software or a web application to handle different linguistic and cultural conventions without rewriting the codebase. Internationalization is only important if you are going to be distributing your software or web application. If you are not doing so, or you are borrowing code from somewhere, then localization might be more important.
Localization Explained
Localization (also known as L10n) is the process of adapting your software to the requirements of a target locale. A locale describes the language and the country (or region) of a particular group of users. In software development, a locale is mostly referred to by its abbreviated form. Examples of abbreviations used in software are en_US, which stands for “United States English”, and en_GB, which stands for “British English”. Making sure the locale can be easily changed is the most important part of internationalizing software. When you build or change an application so that it can be localized to multiple languages and countries, this process is called internationalization. Remember, a web application can be localized without being internationalized. You just have to translate all of the interfaces and content into the language of choice.

There are two phases to the localization process of a web application. The first phase is the translation of the user interface, the part that controls the events and presentation of the resources. The second phase is the translation of the text, media files or documents: the so-called content being delivered by the presentation layer of the application. I will be talking mostly about the first phase of the process, and will touch on the second when necessary.

Internationalization of a program includes a few tasks that should be planned out ahead of time. If careful attention is paid to these items at the beginning of development then there is less to debug later on.
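Since so much depends on the locale being easy to change, here is a small sketch of how locale abbreviations could be taken apart and matched against the translations an application offers. The functions parseLocale() and chooseLocale() and the list of supported locales are assumptions for illustration, and only the simple language_REGION form is understood.

```php
<?php
/**
 * Split a locale identifier such as "en_US" into its parts.
 * Hypothetical helper; returns null for anything it does not understand.
 */
function parseLocale($locale)
{
    if (!preg_match('/^([a-z]{2,3})(?:[_-]([A-Za-z]{2}))?$/', $locale, $m)) {
        return null;
    }
    return [
        'language' => $m[1],
        'region'   => isset($m[2]) ? strtoupper($m[2]) : null,
    ];
}

/** Pick the best supported locale: exact match, then same language, then default. */
function chooseLocale($requested, array $supported, $default)
{
    if (in_array($requested, $supported, true)) {
        return $requested;
    }
    $wanted = parseLocale($requested);
    if ($wanted !== null) {
        foreach ($supported as $candidate) {
            $parts = parseLocale($candidate);
            if ($parts !== null && $parts['language'] === $wanted['language']) {
                return $candidate;
            }
        }
    }
    return $default;
}

$supported = ['en_US', 'sv_SE'];
echo chooseLocale('en_GB', $supported, 'en_US'), "\n";   // same language, other region
echo chooseLocale('de_DE', $supported, 'en_US'), "\n";   // nothing fits, use the default
```

Keeping this decision in one place means that adding a translation later only requires extending the list of supported locales.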
Encodings & Code Pages
When building web-enabled applications, you need to encode the page using either UTF-8 or UTF-16, and send it with the appropriate headers. It is very important to have some test content on hand, and to test the HTML page in the web browsers of choice, to make sure that they react to the headers and encoding. The localized text should appear properly with very little (or no) user configuration. The single most important element in internationalizing a web application is the page content-type:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
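On the PHP side, the same information can be sent as a real HTTP header in addition to the meta tag. The following is a minimal sketch under a few assumptions: the function renderPage() and the sample text are made up, and the script file itself must be saved as UTF-8.

```php
<?php
// Minimal sketch: announce UTF-8 in both the HTTP header and the HTML head.
function renderPage($title, $body)
{
    // Passing the charset explicitly keeps multi-byte text intact while escaping.
    $t = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $b = htmlspecialchars($body, ENT_QUOTES, 'UTF-8');
    return "<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n"
        . "<title>" . $t . "</title>\n</head>\n"
        . "<body><p>" . $b . "</p></body>\n</html>\n";
}

// The header has to go out before any output; skip it on the command line.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$page = renderPage('Välkommen', 'Grüße, 日本語, русский');
echo $page;
```

Opening this page in several browsers is a quick way to produce the test content recommended above: if the header and encoding are right, all three scripts show up without any user configuration.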
Plural Text
Plural forms are the nemesis of a software developer. Plurals, together with the gender characteristics and social hierarchy expressed in a language, add up to a real challenge. The best thing to do is to minimize the usage of text and to design flows so that the same phrase or text can be used multiple times.
• There are 0 Comments
• There is 1 Comment
• There are 3 posts
FIGURE 1
WORLD INTERNET USAGE AND POPULATION STATISTICS

WORLD REGIONS | POPULATION (2005 EST.) | POPULATION % OF WORLD | INTERNET USAGE, LATEST DATA | % POPULATION | USAGE % OF WORLD
AFRICA | 896,721,874 | 14.0 % | 23,917,500 | 2.7 % | 2.5 %
ASIA | 3,622,994,130 | 56.4 % | 332,590,713 | 9.2 % | 34.2 %
EUROPE | 804,574,696 | 12.5 % | 285,408,118 | 35.5 % | 29.3 %
MIDDLE EAST | 187,258,006 | 2.9 % | 16,163,500 | 8.6 % | 1.7 %
NORTH AMERICA | 328,387,059 | 5.1 % | 224,103,811 | 68.2 % | 23.0 %
LATIN AMERICA/CARIBBEAN | 546,723,509 | 8.5 % | 72,953,597 | 13.3 % | 7.5 %
OCEANIA / AUSTRALIA | 33,443,448 | 0.5 % | 17,690,762 | 52.9 % | 1.8 %
WORLD TOTAL | 6,420,102,722 | 100.0 % | 972,828,001 | 15.2 % | 100.0 %

NOTES: (1) Internet Usage and World Population Statistics were updated on November 21, 2005
Making plurals like this can be avoided by using different wording, which makes localization simpler by removing grammatical differences.
• New Comments 0
• Comments to date 1
• Number of posts 3
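The difference between the two wordings shows up clearly in code. In this sketch the label table, the Swedish wording and both function names are my own illustrative assumptions; the point is only that the label-plus-number form needs no grammar rules at all.

```php
<?php
// Hypothetical label tables; the English wording mirrors the lists above.
$labels = [
    'en' => ['new_comments' => 'New Comments',    'posts' => 'Number of posts'],
    'sv' => ['new_comments' => 'Nya kommentarer', 'posts' => 'Antal inlägg'],
];

/** Sentence approach: every language needs its own singular/plural rules. */
function pluralSentence($count)
{
    return $count == 1 ? 'There is 1 Comment' : "There are $count Comments";
}

/** Label-plus-number approach: one code path serves every language. */
function countLabel(array $labels, $lang, $key, $count)
{
    return $labels[$lang][$key] . ' ' . $count;
}

echo pluralSentence(0), "\n";
echo pluralSentence(1), "\n";
echo countLabel($labels, 'en', 'new_comments', 0), "\n";
echo countLabel($labels, 'sv', 'new_comments', 1), "\n";
```

The first function already contains an English grammar rule, and it would need a different rule for every additional language. The second one only needs a translated label.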
Dates
Usually, this is where a coder has to show some talent for business logic, or get help from a group. The internationalization of dates is an area where software companies start guarding their secrets. The date format problem can be compounded by the location of the client using the software and the location of the webserver that the software is being run on.
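One common way to keep the server location out of the picture is to store every timestamp in UTC and convert it only for display. The sketch below assumes a current PHP version (it uses DateTimeImmutable, which is much newer than PHP 4.3), and the format table and the function formatForUser() are made up for illustration.

```php
<?php
// Hypothetical per-locale patterns; a real application would need more than three.
$dateFormats = [
    'en_US' => 'm/d/Y',
    'de_DE' => 'd.m.Y',
    'ja_JP' => 'Y/m/d',
];

/** Convert a stored UTC time to the user's time zone and date notation. */
function formatForUser($utcTime, $locale, $timezone, array $formats)
{
    $moment  = new DateTimeImmutable($utcTime, new DateTimeZone('UTC'));
    $local   = $moment->setTimezone(new DateTimeZone($timezone));
    $pattern = isset($formats[$locale]) ? $formats[$locale] : 'Y-m-d';
    return $local->format($pattern);
}

$stored = '2006-01-15 23:30:00';   // what the server saved, in UTC
echo formatForUser($stored, 'en_US', 'America/New_York', $dateFormats), "\n";
echo formatForUser($stored, 'de_DE', 'Europe/Berlin', $dateFormats), "\n";
echo formatForUser($stored, 'ja_JP', 'Asia/Tokyo', $dateFormats), "\n";
```

The same stored moment falls on January 15 for the reader in New York and on January 16 for the readers in Berlin and Tokyo, so both the notation and the calendar day depend on who is looking.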
Database Encoding
Database encoding and Unicode support are musts. Without them, a coder can never tell when the database is going to store or return incorrectly encoded text, seemingly at random. The fact that MySQL, the most popular web database, now supports Unicode makes things much easier. You now only have to make sure that Unicode support is enabled and ready to go.
Search
If you build search functionality into your application, then how the data is stored is critical, since all searches will likely be based on SQL statements influenced by the language and calendar system being used. The sorting and ordering of database information must also be internationalized; otherwise the search data returned may be invalid or irrelevant. Do not forget to code your PHP to allow for Unicode strings. It does not do any good to go through all the trouble of preparing a Unicode-enabled database and flexible SQL statements when the PHP code cannot insert or retrieve resources in Unicode format.
The PHP Language
It is important to remember that PHP, unlike Java, does not yet have native multi-byte character (or, more simply put, Unicode) support. In PHP, a character is the same as a byte, so there are exactly 256 different possible characters. Since a string is a series of characters, this limits how a string is interpreted. As long as the string contains only a combination of those 256 characters, things are okay. But the internet is a very large place, and some languages contain far more than 256 characters. Japanese, where the number of characters is in the thousands, is a good example.

There is, however, a way to encode and decode strings to and from UTF-8, a Unicode encoding which allows a much larger set of characters by storing a character in more than one byte. The PHP utf8_encode() and utf8_decode() functions convert strings between ISO-8859-1 and UTF-8. There are also a number of other conversion routines to help with multibyte characters. When using routines like utf8_encode() on their own, the manipulation of strings cannot be trusted to the default single-byte string handlers in PHP. This is where the mbstring extension comes into play. mbstring contains functions that are sensitive to multibyte encodings and allow splitting, splicing, searching and other kinds of string handling. As of this writing, the mbstring extension is not enabled in a default installation of PHP. This means that developers and end users who want to run software that requires the mb_* functions should check their PHP configuration. There are still many shared hosting companies and server administrators that are unaware of the importance of the mbstring extension.
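The gap between bytes and characters is easy to demonstrate. In this sketch the helper names utf8Length() and utf8Left() are assumptions; both use mbstring when it is available and otherwise fall back to PCRE in UTF-8 mode, so the example also runs on installations without the extension.

```php
<?php
// The word "日本語" written as explicit UTF-8 bytes, so the example does not
// depend on the encoding of this source file.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

/** Count the characters of a UTF-8 string. */
function utf8Length($s)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($s, 'UTF-8');
    }
    return preg_match_all('/./su', $s, $m);
}

/** Return the first $n characters without cutting a multi-byte sequence in half. */
function utf8Left($s, $n)
{
    if (function_exists('mb_substr')) {
        return mb_substr($s, 0, $n, 'UTF-8');
    }
    preg_match('/^.{0,' . (int) $n . '}/su', $s, $m);
    return $m[0];
}

echo strlen($text), " bytes\n";                 // the single-byte view
echo utf8Length($text), " characters\n";        // the multi-byte view
echo bin2hex(substr($text, 0, 2)), "\n";        // cuts the first character apart
echo bin2hex(utf8Left($text, 1)), "\n";         // keeps the first character whole
```

The byte-oriented substr() returns only two of the three bytes of the first character, which is exactly the kind of damage that shows up later as garbled text in a browser or database.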
Using Open Source to Get a Jumpstart
If you are not creating a new PHP application from scratch, using an open source application may take care of most of the internationalization steps involved in building a website. The popular content management systems all use one of the three techniques listed in the next section. Although a content management system’s i18n support may be transparent, knowing the underlying techniques it uses can be a deciding factor in choosing a pre-made application as a base for your own projects. Knowing what a CMS uses to internationalize will also influence your choice of shared hosting, or what should be installed on your own server to support the software.
Internationalization Techniques
There are very few techniques used in internationalizing a PHP web application. Listed here are the three most popular:
• Text definition files written in PHP, using constants
• Using PHP gettext to extract and do string substitution
• Using a database to store and retrieve translated text
The above techniques all have their place and are useful. They also have many things in common. They are not simple enough for lazy web developers like me. The storage method for the localized text or resources is not always readily accessible, and how the resources are stored determines how difficult they are to read and manipulate. Two of the techniques in the list do not allow for easy visual formatting of the HTML code within resources while they are being translated. Being able to see the visual formatting is important, as it influences the words and choices made when translating text for a web application. Frequently, when doing a translation, it is necessary to see the wording in context with a list, line break, paragraph or the direction in which the text is read.
Let’s take a look at each one of these techniques so that we get a baseline for comparison with the new technique I will be showing later in this article. I will also use some of the more popular open source software as reference examples of the techniques. I go through these alternate methods because I feel you have to be familiar with the other, more difficult techniques in order to see how easy it can be.
Text Definition Files
This is my personal favorite because of its simplicity and the fact that it works in the widest range of server environments. There is usually no need to do any pre-investigation of the server or shared host before installing software using this method. This is probably the most popular technique used. The reason for this is the reliability and ease of implementation. Distribution and sharing of both the original text and the translated resources is easy and fast. Some of the more popular open source content management systems that use this technique are Xoops, Joomla and PHPnuke.
Disadvantages of Text Definition Files
Duplication of defined variables can easily occur, and these files can be hard to read at times. Like gettext(), this technique does not allow for easy formatting of HTML code. Using a visual editor to edit, copy and paste helps with this, but there is still room for improvement, as I will explain later. This technique also exposes the translator to the PHP source code and the temptation to “fix” things as they translate. If there is a duplicate constant, do they delete it or change the name? It might seem a minor thing, but what if the constant contains an entire page of help text that suddenly does not show?
When the application is updated and additional strings are added, there is no built-in way to determine which strings are new and whether they are present in every language. What happens if a newly added string is not yet translated into a specific language? You have to write a script that checks for the presence and location of each variable.
Text definition files suffer from a lack of readability if not formatted properly. Formatting is critical, as there are no readers or other tools to help with the maintenance of the files. The use of double or single quotes becomes a factor. Choosing one or the other means that some of the text will have to be escaped to prevent PHP parsing errors. So, while this method is very simple in itself, it does require a bit more code to implement properly. Typically, a file will contain text as shown here.
define('_ERROR_1', 'You cannot use single quotes (\' \') '
    . 'in the text you are sending.');
define("_ERROR_2", "You cannot use double quotes (\" \") "
    . "in the text you are sending.");
Choices about the type of variable to use need to be made when writing a definition file. The PHP define() function has the advantage of being slightly more readable, while array elements have the plus of performance and the ability to use the array index to create groups, which increases the amount of text that is reusable over the entire application.
$language['the_index'] = 'This is some translatable text';
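With the array form, the check for missing translations mentioned earlier can be sketched in a few lines. This is only an illustration: the two arrays, their keys and the function names findMissingKeys() and translate() are assumptions, and in practice each array would be loaded from its own language file.

```php
<?php
// Hypothetical language arrays; normally each comes from its own definition file
$english = array('save' => 'Save', 'cancel' => 'Cancel', 'help' => 'Help');
$swedish = array('save' => 'Spara', 'cancel' => 'Avbryt');

// List the indexes present in the reference language but absent from a translation
function findMissingKeys(array $reference, array $translation)
{
    return array_keys(array_diff_key($reference, $translation));
}

// Look up a string, falling back to the reference language, then to the index itself
function translate($key, array $translation, array $reference)
{
    if (isset($translation[$key])) {
        return $translation[$key];
    }
    if (isset($reference[$key])) {
        return $reference[$key];
    }
    return $key;
}

$missing = findMissingKeys($english, $swedish);
echo 'Untranslated: ', implode(', ', $missing), "\n";
echo translate('help', $swedish, $english), "\n";
```

Falling back to the reference language keeps a newly added string visible to the user until the translator catches up, instead of leaving an empty space on the page.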
A bit of advice: leave grammatical logic to the translator. Creating or finding a localization scheme that properly covers plurals is a difficult task, and many times the coder comes to a point where they will try to use PHP to create some translation logic. Plurals can turn an elegant and simple solution into a coding nightmare. This usually happens when the coder decides to introduce grammar and plurals to the application to make it “easier” to translate. Take a look at the following code.
<?php
$language = array(
    'en_US' => array(
        'I am X years and Y months old.' => 'I am %d years and %d months old.'),
    'es_US' => array(
        'I am X years and Y months old.' => 'Tengo %2$d meses y %1$d años.')
);
?>
This was a simple array of strings before the coder decided to allow for word plurals and grammar. By doing this, the translator is forced to know PHP. The legibility of the text and the context become lost in the code. When doing this, the coder may also introduce errors into the text. The coder should save their energy for the internationalization of business logic and date formats, and try to keep program logic separated from language-specific terms. Text definition files are not really meant to deal with complicated language structure. In situations like this, the better option is to allow for variances in text by using multi-dimensional arrays to group plurals.
$language['the_index'][0] = 'This is some translatable text';
$language['the_index'][1] = 'This is some translatable text '
    . 'in the standard plural form';
$language['the_index'][2] = 'This is some translatable text '
    . 'in the gender specific plural form';
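One way the grouped array above might be read back is sketched here. The constant names and the lookupText() helper are hypothetical; the only thing taken from the article is the idea that index 0 holds the base text and the higher indexes hold the variants supplied by the translator.

```php
<?php
// Hypothetical names for the form indexes used in the grouped array
define('FORM_BASE', 0);
define('FORM_PLURAL', 1);
define('FORM_PLURAL_GENDERED', 2);

$language = array();
$language['the_index'][FORM_BASE] = 'This is some translatable text';
$language['the_index'][FORM_PLURAL] = 'This is some translatable text '
    . 'in the standard plural form';

// Return the requested form, or the base text if the translator did not supply it
function lookupText(array $language, $index, $form = 0)
{
    if (!isset($language[$index][0])) {
        return $index; // unknown index: show the index itself
    }
    if (isset($language[$index][$form])) {
        return $language[$index][$form];
    }
    return $language[$index][0];
}

echo lookupText($language, 'the_index', FORM_PLURAL), "\n";
echo lookupText($language, 'the_index', FORM_PLURAL_GENDERED), "\n";
```

The code only selects a form by name; which wording belongs in each slot stays entirely in the translator's hands.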
Directory Structure
The directory structure for this type of system does not have to be elaborate, but it should have some standard and memorable path mapping to make coding and troubleshooting easier. A slightly modified version of the typical gettext() hierarchy works nicely. Whatever the choice, it should include separate subdirectories for each language. The reason for this is that I have found that, frequently, a specialty file or extension may be needed in the localization of a web application. I also recommend that the directory and file names be similar or follow some type of naming scheme that eases the dynamic writing of paths and SQL statements.
/languages
    /en_En
        en_En.php
    /sv_SV
        sv_Sv.php
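A naming scheme pays off when the path is built at run time. The sketch below assumes, for simplicity, that each file is named exactly after its directory; the function name languageFilePath() and the list of allowed codes are illustrative, not part of any existing application.

```php
<?php
// Build the path to a language file, accepting only known language codes.
// Assumption: the file carries the same name as its directory.
function languageFilePath($baseDir, $language, array $allowed, $default)
{
    if (!in_array($language, $allowed, true)) {
        $language = $default; // unknown or malicious input falls back to the default
    }
    return rtrim($baseDir, '/') . '/' . $language . '/' . $language . '.php';
}

$allowed = array('en_En', 'sv_SV');
echo languageFilePath('/languages', 'sv_SV', $allowed, 'en_En'), "\n";
echo languageFilePath('/languages', '../../etc/passwd', $allowed, 'en_En'), "\n";
```

Checking the requested code against a fixed list matters because the value often arrives from a URL or cookie, and it ends up inside a path that is passed to include.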
Setting up definition files
Below are examples of typical definition files. As you can see, creating one of these leaves a lot of room for error on the part of the coder. This particular code does something which I consider to be an internationalization mistake: the authors have used placeholders in the strings. This is not a developer- or translator-friendly mechanism, because it hard-codes the context and removes any possibility of reusing the phrase. It also makes it necessary to hunt down the string that will be used in the placeholder. When creating translations, a non-coder may be forced to remove or adjust what is considered to be PHP code. As mentioned earlier, text should be as generic and simple as possible to make this type of thing unnecessary. Using placeholders is a form of string concatenation, something that should be avoided when globalizing software.
// %s is your site name
define('_US_NEWPWDREQ', 'New Password Request at %s');
define('_US_YOURACCOUNT', 'Your account at %s');
define('_US_MAILPWDNG', 'mail_password: could not update '
    . 'user entry. Contact the Administrator');
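To see what the translator is up against, here is roughly how such placeholder strings get filled in by the application. The site name is a made-up value; sprintf() and its numbered %1$d / %2$d specifiers are standard PHP.

```php
<?php
define('_US_NEWPWDREQ', 'New Password Request at %s');

// The value that ends up in the placeholder lives somewhere else in the code
$siteName = 'Example Site'; // hypothetical value
$subject = sprintf(_US_NEWPWDREQ, $siteName);

// Numbered specifiers let a translation reorder the values,
// but the translator now has to understand sprintf() syntax
$english = sprintf('I am %d years and %d months old.', 30, 6);
$spanish = sprintf('Tengo %2$d meses y %1$d años.', 30, 6);

echo $subject, "\n", $english, "\n", $spanish, "\n";
```

Nothing in the definition file tells the translator what %s or %2$d will contain, which is the context problem described above.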
Some other PHP software uses this format. Take note of the use of numbered indexing, which makes matching the strings to their location in the program easier.
$txt[342] = 'Una palabra por línea';
$txt[343] = 'Coincidir todas las palabras';
$txt[344] = 'Coincidir con cualquier palabra';
$txt[345] = 'Coincidir como frase';
$txt[346] = 'Buscar -Todo- Sólo miembros';
Advantages of Text Definition Files
Defining variables to hold text strings is the simplest and most developer-friendly method of internationalizing a web application. It requires no special tools for creation and maintenance. The technique does not demand a great amount of server resources, such as hard drive space or memory.
PHP gettext
The PHP gettext() method of localizing a web application is a blessing for those who have finished a web application and want to internationalize it afterwards. Many open source PHP applications, like Drupal and Gallery2, rely on the gettext extension.
Disadvantages of gettext
There are several problems with the use of this function, though:
• gettext() isn’t thread-safe, so it is not advisable in a multi-threaded environment
• gettext() relies on setlocale(), but that depends on which languages are installed on the system, and in this case UTF-8 is a very tricky setting to use
I personally dislike gettext() because once you change the default language template you have to review and re-compile all the secondary languages. It is very difficult to design and program around gettext because of this factor. The addition or modification of PHP code pages that contain text which needs localization requires going through a multi-step process over and over again. This repetitive process can lead to mistakes, which can waste even more time.
In open source web applications, where things are being changed due to security, bug fixes or regular version upgrading, you run the risk of losing your translation in part or entirely. There may simply be no translation files for the code that you are using, which may force you into learning about the systems involved and trying to find a translator on short notice.
Finally, it is difficult, if not impossible, to reuse translated text when using gettext. The text extraction process works on a per-file, per-occurrence basis. So, when creating a translation, you may find yourself writing several instances of the same text, or writing a similar translation with only minor differences for many files. This is costly if you are paying for a translation. “Time is money,” as they say.
In a large application, where the text is stored in a PO file and there are similar occurrences of the same text, it is difficult to find the text string for just that element on the page that you are looking to change. Message IDs are no indicator of the location of the string being swapped via gettext().
PO files themselves are strange things that require some programming knowledge and careful usage. Although they can be altered manually in a text editor, using a program like Poedit is the preferred method. This is a limitation for many, because Poedit is not a cross-platform program. Poedit has no Macintosh version, which leaves those users out. This is saddening, since many Mac users are writers or work in the news media. They are the ones most likely to need, or to provide, translation services.
Computer assisted translation, CAT, is also very difficult to set up and use with PO files. The CAT programs that do this well are very expensive. These shortcomings are probably the reason that Word files are the standard file format for translators. After the translation texts are completed, a PO file must be compiled into an MO file for use by PHP.
Directory Structure
gettext requires that the resource files have a specific structure and that the information about this structure be set into the PHP code.
/locale
    /en
        /LC_MESSAGES
            messages.po
            messages.mo
Multiple languages are set up in an identical hierarchy.
/locale
    /en
        /LC_MESSAGES
            messages.po
            messages.mo
    /sv_SV
        /LC_MESSAGES
            messages.po
            messages.mo
Setting a Locale (and Other Requirements)
Setting a locale is a requirement for gettext(). These are the main instructions that PHP needs if it is to find the resources for translated text.
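A minimal sketch of that setup, assuming the directory layout shown above and a text domain called messages. The function name initTranslations() is hypothetical; putenv(), setlocale(), bindtextdomain(), bind_textdomain_codeset() and textdomain() are the standard PHP calls. The function reports failure rather than guessing when the extension or the locale is missing.

```php
<?php
// Point gettext at the translation files for one locale.
// Returns false if the gettext extension or the requested locale is unavailable.
function initTranslations($locale, $domain, $localeDir)
{
    if (!function_exists('bindtextdomain')) {
        return false; // gettext extension is not loaded
    }
    putenv('LC_ALL=' . $locale);
    if (setlocale(LC_ALL, $locale) === false) {
        return false; // this locale is not installed on the server
    }
    bindtextdomain($domain, $localeDir);        // where the locale directories live
    bind_textdomain_codeset($domain, 'UTF-8');  // encoding of the returned strings
    textdomain($domain);                        // default domain, i.e. messages.mo
    return true;
}

// 'sv_SV' follows the directory name used in this article; the name the
// server actually knows the locale by may differ
$ready = initTranslations('sv_SV', 'messages', './locale');
echo $ready ? "gettext is ready\n" : "falling back to the original text\n";
```

Because setlocale() depends on what is installed on the host, checking its return value is the quickest way to find out whether a shared server can serve a given language at all.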
Designating and Extracting Strings
The PHP code needs to be set up to accommodate the extraction of strings, so that PHP can find the strings that are to be translated. This is done by using the gettext() function on strings:
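In its simplest form, the call looks like the sketch below. The message text is made up, and the small stand-in function is an assumption added only so the sketch also runs on a server without the gettext extension.

```php
<?php
// Stand-in for servers without the gettext extension (assumption for this sketch):
// it simply returns the original text
if (!function_exists('gettext')) {
    function gettext($message)
    {
        return $message;
    }
}

// The original text doubles as the lookup key; when no translation
// is found, gettext() returns the text unchanged
$greeting = gettext('Welcome to my site');
echo $greeting, "\n";
```

Keeping the literal string inside the gettext() call is what allows the extraction tool to find it later.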
The text string in the above code can be extracted and set into a PO file using a command line function that will hunt for instances of gettext() and set the strings into indexed messages for each occurrence:
$ xgettext -n *.php
After extraction, the PO file to be translated should look like Listing 1.
Creating the MO Files
In any case, either you or the volunteers will translate the PO file, and then you will need to convert the file into a binary file that gettext actually understands. For that, you would use the following command:
$ msgfmt messages.po
The line above will create a messages.mo file, which you should save in the appropriate directory: locale//LC_MESSAGES/
Plurals and ngettext()
Plural form is the toughest part of text translation, especially if you have lots of text where plurals are needed. In this case, you will need ngettext() and not the simpler gettext().
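A short sketch of ngettext() in use. The message texts and the function name describeFileCount() are invented for illustration, and the stand-in at the top is an assumption that mimics what ngettext() does when no translation is loaded.

```php
<?php
// Stand-in for servers without the gettext extension (assumption for this sketch)
if (!function_exists('ngettext')) {
    function ngettext($singular, $plural, $count)
    {
        return ($count == 1) ? $singular : $plural;
    }
}

// ngettext() picks the singular or plural message for the given number;
// sprintf() then places the number into the chosen message
function describeFileCount($count)
{
    return sprintf(ngettext('%d file deleted', '%d files deleted', $count), $count);
}

echo describeFileCount(1), "\n";
echo describeFileCount(3), "\n";
```

With a translation loaded, the rule for choosing a form comes from the PO file of each language, so languages with more than two plural forms are handled by the translator rather than by the PHP code.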
Advantages of gettext
The gettext method of internationalization is not as popular as the other two methods. The reason for this is that it places a heavy burden on the developers and the end users. In most OSS projects, the developers are responsible for providing the original translation files. After this is done via extraction scripts, the files need to be translated once again and possibly merged with previous translation versions by the translator. The translator can be the end user, a volunteer, or even another development team member. The bottom line is that gettext requires a lot of resources to maintain and support. In a large project with lots of volunteers, or a medium sized company, this is not really a hindrance. But for the lone developer or small group, the burden is large. Using gettext also does not spare the developer the job of hunting down text strings and formatting them to use the gettext() function, in the same way that you would have to do if definition files were used. The best thing about the gettext method of internationalization is that the developer does not have to think up unique names for variables. In a large application this can be a tremendous advantage over other techniques.
Database Storage
At first look, working with the database method of storing translated text seems like a joy. I admit I had fun using the Mambelfish component for the Mambo CMS when doing a translation of a website. A database gives what the other techniques seem to lack: order. Relational database systems were built to give power to how information is related, and use these relationships to organize the information into an easily accessible source.
Internationalization of a program includes a few tasks that should be planned out ahead of time.
Disadvantages of Using a Database
When internationalizing a web application, distribution of the resources to be translated is very important. Getting the work to the translators is necessary, and there must be a system in place for getting the finished translations to the end users of the product. So far, I have not found one commercial or open source product that offers localization resources in the form of SQL scripts or native database files. As a result, translations are done repeatedly by each end user of that product. Frequently, internationalization using a database is mixed with the other techniques to make up for this shortcoming.
I first came across this problem when I found that I wanted to reuse my translation for several different website installations, or borrow one for a language I did not know. Even though the exportation and importation of database tables was not difficult, I found the need to maintain an archive of translations because I am the lazy type of coder. I don’t like doing repetitive tasks that are not part of innovation. Even if I were not so lazy, there are no repositories of MySQL translation tables for Mambelfish, which is used in Joomla, or for any other open source CMS project. Asking for exports from someone else’s database on the Mambo and Joomla CMS forums proved to be less than successful. If a repository for database tables did exist, there would still be the problem of not being able to browse the translation beforehand to check its quality.
There is always a bit of uncertainty associated with storing information in a database, which is why backups are so important. When you start moving information from one database or database server to another, things can rapidly start to fail or acquire bugs. In my experience, you just never know if the encoding is going to be correct after the move. Even when the server configurations are identical, there may be some things that just do not work. You just never know until you determine which part of the chain is responsible for an incorrect encoding bug. Was it the PHP code, the HTML, or the data source? You are just very happy if everything works.
I use a lot of web hosting located in the United States, but frequently, my clients are in Sweden or another European country. There have been times where the web host has not installed a UTF-8 character set. The Swedish alphabet only has three characters more than its English counterpart, so fixing any problems was easy. But I do not envy any web developer who has to solve this with any of the Cyrillic alphabet languages. This technique of using a database as a resource for translation strings works well when it works.
Using computer assisted translation tools is obviously difficult, if not impossible, with the database method. You are reduced to using cut and paste operations within a web based interface or a database front end program like MySQL Administrator or Microsoft Enterprise Manager. Caution must be taken when doing this, as inputting text this way may work fine and produce the right results at first glance, but when the actual web application is used to retrieve the text, the encoding may be different from what you expect.
MySQL 4.1
MySQL 3.x and MySQL 4.0.x do not have Unicode support. The default character encoding is called latin1 and is single-byte. This may not seem like much of a problem at first glance because, while the database itself is not aware of the actual encoding, a varchar field still manages to output the strings in much the same way that they were put into the database. But in some cases, you may see incorrect characters when directly accessing the database with code that does not take this into account. Searching or ordering will sometimes not work correctly. These inconsistencies are due to the fact that, even though two, three or four bytes should actually represent one character, MySQL interprets them as one character per byte. I have personally had experiences with the Swedish characters äåö being stored as varchar but being displayed differently by different versions of phpMyAdmin, the PHP database administration tool, when exploring a database with these characters.
Many people wondered why I got so excited that MySQL was finally going to support Unicode with version 4.1. This is because with Unicode support (UTF-8), a more elegant internationalization plan can be implemented. Different character sets can also be set per column, table or database, which means data from many languages can be stored without using elaborate coding routines to encode and decode strings. It also means ordering, searching, indexing and similar string-related functions in MySQL work correctly.
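When tracking down which link in the chain mangled the text, it helps to test what actually came back from the database. The sketch below is one possible check, not a complete diagnosis: the function name isValidUtf8() is an assumption, and the byte strings stand in for values read from a varchar column.

```php
<?php
// True when the bytes form valid UTF-8 (PCRE's /u modifier rejects invalid sequences)
function isValidUtf8($text)
{
    return preg_match('//u', $text) === 1;
}

$asUtf8   = "\xC3\xA4\xC3\xA5\xC3\xB6"; // "äåö" stored as UTF-8: 6 bytes
$asLatin1 = "\xE4\xE5\xF6";             // "äåö" stored as latin1: 3 bytes

echo strlen($asUtf8), ' bytes, valid UTF-8: ';
var_dump(isValidUtf8($asUtf8));

echo strlen($asLatin1), ' bytes, valid UTF-8: ';
var_dump(isValidUtf8($asLatin1));
```

A value that fails this check but displays correctly on a latin1 page was most likely stored in the single-byte encoding, which narrows the search to the connection or the table rather than the HTML.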
Advantages of Using a Database
The greatest advantage of using a database to store the resources for localization is the convenience. An interface can be built to group the translation tasks into a single area. You don’t have to dig into the file system to find the proper resource file that holds the text strings that you want to translate. If done right, you are usually presented with a user interface within a few clicks and only have to make a few simple choices before entering the
Najib said “السلام عليكم” (as-salaam alaykum) to me.
This is the help text for my own idea module possible to see the line ends in Dreamweaver because of the syntax editor.
my own button
send
my own idea text
This is some other text in the module this to check for paragraph and line breaks***
en styke till
add some more HTML here lägga till mer HTML text här
this works nicely a in both design and code mode of Dreamweaver.
var bra med öäå också
possibly want to have the headers in so that unicode can be used in the editors. They are easy enough to remove
my own translation scheme
This is the translation scheme for my own idea module which does not make room for HTML yet. but the best thing about it is that translations can be done in a simple HTML editor or a visual editor like Dreamweaver.
my own reset button
reset
my own more text
Detta ar nagra text på svenska
a new button
button text
translation. This method also allows for making small changes quickly. The translator is kept totally separate from the underlying PHP code. Though database resource storage is suited to content translation for the most part, there are situations where it shines when used on the user interface. Dynamic menu systems are a good example of where this technique is a must. In the Joomla content management system, database translation tables are used in coordination with text definition files to make localization easier. The database tables feed the more dynamic presentation layer, while the definition files deal with the administration areas, which are not changed often (or at all). Searching resources stored in a database results in much more relevant information being returned, because more relevant information can be stored. Dates, titles, categories and strings can be searched in the localized language. This is very hard to do when translations are stored in other formats.
HTML Definition Files
Here is the technique you have been waiting for. It is simple, user friendly and editable without using a database, special tools or exposing the translator to PHP source code. The code is short, easily modified to suit various needs, and PHP makes using this technique easy.
Creating Resources
First, you need to create a simply formatted HTML document using tags to show the names of the variables to be created as separate text blocks while in a visual HTML editor. Comment tags are used to designate text blocks to be translated and loaded into PHP variables. When finished and formatted, your HTML file should look like Listing 2.
The PHP Code
Let’s look at two examples. The first uses an array to store the translated text; the second, a set of defined constants. Both of these methods have some minor drawbacks. When using an array, if the array is large, with the number of elements in the millions, and it is accessed multiple times, then a performance problem may occur. But in most situations, a dynamic web page will access a single array element no more than twice to get the needed texts. Calling an array into your code may require you to set it as a global. Arrays have the benefit of being organized, and duplicates can be weeded out easily. When using constants, you always run the risk of name collisions. The plus side to using constants is that they are easily written to a cache table and do not need to be set as global to be called within your PHP code. Rather than getting into benchmarking and other aspects of performance, I will just say that you should weigh the pluses and minuses and choose the method that seems best for your application. The code to process the HTML into PHP data can be seen in Listing 3.
Editing the Text
The best thing about this technique is that any text or HTML editor can be used. These are available on most popular desktop operating systems. The translator is not bound by the restrictions of a program like Poedit. The text is also seen in a familiar format. When using Dreamweaver in code mode, editing the translation file is easy and straightforward. After setting up the translation file in the code view, using Dreamweaver in design mode makes translating and editing the text even easier. You can also see and edit the comment tags in design mode, as shown in Figure 2.
Disadvantages
Yes, there are disadvantages to using this method, but they are the same as those of the typical definition file described earlier. Some of the problems in using text definition files are solved by changing the storage method and avoiding the use of PHP within the resource files.
Computer Assisted Translation
Although much CAT software does not like HTML, this is not really a big problem when using the technique described here. You can easily use a WYSIWYG editor and then cut and paste the translation text into the CAT program.
Benefits
Why use the technique I’ve described here? There are many reasons but here are a few of the strong ones. HTML is universally used and accepted with a very shallow learning curve. Translators, developers, programmers, webmasters and designers can easily see the HTML text and know what is going on, thus making it easier to maintain a good translation and share it. HTML pages can be checked in a web browser for proper encoding. CSS can be used to create a more visually pleasing text at the time of translation. In cases of right-to-left or top-to-bottom languages the technique can show the text in the proper read direction while editing. There is less PHP code, fewer server resources and reduced maintenance to worry about. I hope that many PHP developers will start using this simple technique in the future, as it makes everyone’s job easier.
Both commercial and open source projects can benefit from this type of internationalization technique. It goes very quickly, and resources previously spent on internationalization maintenance can be used to make improvements elsewhere in the project. The time to get the software to market becomes shorter and better defined.
You might think that delivering an English version of an application is good enough, but is it really? The software market may carry your work across national borders—if it does, an English version is only the beginning.
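To make the steps under Creating Resources and The PHP Code concrete, here is a minimal sketch of both variants, array and constants. It is not the article's Listing 3. The marker layout is an assumption made for this example: each block starts with a comment holding the variable name and ends with an end comment, while the heading tags are there only for the person editing the file. The function names parseHtmlResource() and defineFromResource() are likewise hypothetical.

```php
<?php
// Read blocks of the assumed form  <!--name-->text<!--end-->  into an array
function parseHtmlResource($html)
{
    $strings = array();
    $pieces = explode('<!--', $html);
    array_shift($pieces); // drop everything before the first comment
    foreach ($pieces as $piece) {
        // Separate the array key name from the text that follows the comment
        $parts = explode('-->', $piece, 2);
        if (count($parts) < 2) {
            continue; // unterminated comment, ignore it
        }
        $key = trim($parts[0]);
        if ($key === '' || $key === 'end') {
            continue; // end markers carry no text of their own
        }
        $strings[$key] = trim($parts[1]);
    }
    return $strings;
}

// Second variant: turn the same data into constants instead of array elements
function defineFromResource(array $strings, $prefix)
{
    foreach ($strings as $key => $text) {
        $name = $prefix . strtoupper($key);
        if (!defined($name)) { // guard against name collisions
            define($name, $text);
        }
    }
}

// A small resource file; in practice this would be read with file_get_contents()
$html = <<<HTML
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<h3>my_own_button</h3>
<!--my_own_button-->send<!--end-->
<h3>my_own_more_text</h3>
<!--my_own_more_text--><p>add some more HTML here</p><!--end-->
</body></html>
HTML;

$language = parseHtmlResource($html);
echo $language['my_own_button'], "\n";

defineFromResource($language, '_');
echo constant('_MY_OWN_MORE_TEXT'), "\n";
```

Because the markers are ordinary HTML comments, the file still opens cleanly in a browser or visual editor, and any HTML inside a block is carried through to the PHP variable untouched.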
LISTING 3
FIGURE 2
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Untitled Document
',$text);

foreach ($preVar as $preVar_1){

    //Start separation of array item key name and array item value;
    $preVar_2 = explode('-->', $preVar_1);

    //Separate out the names for the array keys;
    $preVar_3 = explode('