    chomp @users;    # remove linefeeds
    @fields = quotewords(':', 0, @users);
    close PASSWD;
}
print "@fields";
Chapter 18
The keep parameter determines whether quotes and backslashes are removed once their work is done, as real shells do, or whether they are retained in the resulting list of words. If false, quotes are removed as they are parsed; if true, they are retained. The keep flag is almost, but not quite, Boolean. If set to the special case of 'delimiters', both quotes and the characters that matched the word separator are kept:

    # emulate 'shellwords' but keep quotes and backslashes, and also store
    # the matched whitespace as tokens too
    @words = quotewords('\s+', 'delimiters', @lines);
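As a quick sketch of the effect of each keep value (the sample string is our own invention, not from Text::ParseWords itself):

```perl
#!/usr/bin/perl
# keep_demo.pl - a sketch of the three keep settings for quotewords;
# the sample string is invented for illustration
use warnings;
use strict;

use Text::ParseWords;

my $line = 'alpha:"beta:gamma":delta';

# keep = 0: quotes do their job and are then discarded
my @plain = quotewords(':', 0, $line);
print join('|', @plain), "\n";     # alpha|beta:gamma|delta

# keep = 1: quotes (and backslashes) survive into the word list
my @kept = quotewords(':', 1, $line);
print join('|', @kept), "\n";      # alpha|"beta:gamma"|delta

# keep = 'delimiters': the matched separators become tokens too
my @all = quotewords(':', 'delimiters', $line);
print scalar(@all), " tokens\n";
```

Note that in 'delimiters' mode the colons appear in the list as tokens in their own right, interleaved with the words.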
Batch Parsing Multiple Lines

The /etc/passwd example above works, but assembles all the resultant fields of each user into one huge list of words. Far better would be to keep the set of words found on each individual line in a separate list. We can do that with the nested_quotewords subroutine, which returns a list of lists, one for each line passed in. Here is a short program that uses nested_quotewords to do just that:

    #!/usr/bin/perl
    # password.pl
    use warnings;
    use strict;

    use Text::ParseWords;

    @ARGV = ('/etc/passwd');
    my @users = nested_quotewords(':', 0, <>);

    print scalar(@users), " users: \n";
    foreach (@users) {
        print "\t${$_}[0] => ${$_}[2] \n";
    }
This program prints out a list of all the users found in /etc/passwd, along with their user IDs. When run, it should produce something resembling the following:

    > perl password.pl
    16 users:
            root => 0
            bin => 1
            daemon => 2
            adm => 3
            lp => 4
            sync => 5
            shutdown => 6
            halt => 7
            mail => 8
            news => 9
            uucp => 10
            operator => 11
            games => 12
            gopher => 13
            ftp => 14
            nobody => 99
Text Processing and Document Generation
As it happens, in this case we could equally well have used split with a colon as the split pattern, since quotes do not usually appear in a password file. However, the principle still applies.
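To see where the difference would bite, here is a sketch with an invented passwd-style line whose comment field contains a quoted colon; split breaks the quoted field apart, while quotewords keeps it whole:

```perl
#!/usr/bin/perl
# split_vs_quotewords.pl - a sketch of why quotewords is safer than
# split when fields may be quoted; the sample line is invented
use warnings;
use strict;

use Text::ParseWords;

my $line = 'smith:x:100:100:"Smith: John":/home/smith:/bin/sh';

my @by_split = split /:/, $line;            # breaks the quoted field
my @by_quote = quotewords(':', 0, $line);   # keeps 'Smith: John' whole

print scalar(@by_split), " fields via split\n";       # 8
print scalar(@by_quote), " fields via quotewords\n";  # 7
```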
Parsing a Single Line Only

We mentioned earlier that there are four subroutines supplied by Text::ParseWords. The fourth is parse_line, and it is the basis of the other three. This subroutine parses a single line only, but is otherwise identical in operation to quotewords and takes the same parameters, with the exception that the last can only be a scalar string value:

    @words = parse_line('\s+', 0, $line);
The parse_line subroutine provides no functional benefit over quotewords, but if we only have one line to parse, for example a command-line input, then we can save a subroutine call by calling it directly rather than via quotewords or shellwords.
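For example, here is a sketch of parsing a single command line (the command and its arguments are invented for illustration):

```perl
#!/usr/bin/perl
# parseline_demo.pl - a sketch of parse_line on a single command line;
# the sample command is invented for illustration
use warnings;
use strict;

use Text::ParseWords;

my $cmdline = 'convert --title "My Picture" image.png';

# quotes group '800 600'-style arguments into one word, then vanish
my @args = parse_line('\s+', 0, $cmdline);
print join('|', @args), "\n";    # convert|--title|My Picture|image.png
```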
Formatting Paragraphs with 'Text::Wrap'

The Text::Wrap module provides text-formatting facilities to automate the task of turning irregular blocks of text into neatly formatted paragraphs, organized so that their lines fit within a specified width. Although it is not particularly powerful, it provides a simple and quick solution. It supplies two subroutines: wrap handles individual paragraphs, and is ideally suited for formatting single lines into a more presentable form; fill handles multiple paragraphs, and will work on entire documents.
Formatting Single Paragraphs

The wrap subroutine formats single paragraphs, taking one or more lines of text of indeterminate length and converting them into a single paragraph. It takes three parameters: an initial indent string, which is applied to the first line of the resulting paragraph; a following indent string, applied to the second and all subsequent lines; and finally a string or list of strings. Here is how we could use wrap to generate a paragraph with an indent of five spaces on the first line and an indent of two spaces on all subsequent lines:

    $para = wrap('     ', '  ', @lines);
Any indentation is permissible. Here is a paragraph formatted (crudely) with HTML tags to force the lines to conform to a given line length instead of following the browser's screen width:

    $html = wrap("<p> ", "<br> ", $text);
If a list is supplied, wrap concatenates all the strings into one before proceeding; there is no essential difference between supplying a single string and supplying a list. However, existing indentation, if there is any, is not eliminated, so we must take care to deal with this first if we are handling text that has already been formatted to a different set of criteria. For example:

    map {$_ =~ s/^\s+//} @lines;    # strip leading whitespace from all lines
The list can be of any origin, not just an array variable. For example, take this one-line reformatting application:

    print wrap("\t", "", <>);    # reformat standard input/ARGV
Text::Wrap handles tabs by expanding them into spaces, a task it delegates to Text::Tabs, documented previously. When formatting is complete, spaces are converted back into tabs where possible and appropriate. See the section on Text::Tabs for more information.
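As a reminder of what that expansion looks like, here is a small sketch using Text::Tabs directly (the sample strings are invented for illustration):

```perl
#!/usr/bin/perl
# tabs_demo.pl - a minimal sketch of the Text::Tabs expansion that
# Text::Wrap relies on; the sample strings are invented
use warnings;
use strict;

use Text::Tabs;    # exports expand, unexpand, and $tabstop (default 8)

my $tabbed = "a\tb";
my $spaced = expand($tabbed);      # tab padded out to the next tab stop
print length($spaced), "\n";       # 9 characters: 'a', 7 spaces, 'b'

# unexpand performs the reverse conversion, for runs of spaces that
# end exactly on a tab stop
print unexpand($spaced) eq $tabbed ? "round trip ok\n" : "differs\n";
```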
Customized Wrapping

The Text::Wrap module defines several package variables to control its behavior, including the formatting width, the handling of long words, and the break text.
Formatting Width

This is the most obvious variable to configure. The number of columns to format is held in the package variable $Text::Wrap::columns, and has a default value of 76, which is the polite width for things like e-mail messages (allowing a couple of '>' quoting prefixes to be added in replies before 80 columns is reached). We can change the column width to 39 with:

    $Text::Wrap::columns = 39;
All subsequent calls to wrap will now use this width for formatting.
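A quick sketch (with invented sample text) to confirm the new width takes effect:

```perl
#!/usr/bin/perl
# narrow.pl - a small sketch showing $Text::Wrap::columns in action;
# the sample text is invented for illustration
use warnings;
use strict;

use Text::Wrap qw(wrap);

$Text::Wrap::columns = 39;

my $text = "The quick brown fox jumps over the lazy dog, " x 3;
my $para = wrap('', '', $text);

# every resulting line now fits within the 39-column limit
foreach my $line (split /\n/, $para) {
    printf "%2d: %s\n", length($line), $line;
}
```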
Long Words

Words that are too long to fit on a line are left as-is (URLs in text documents are a common culprit). This behavior can be changed to a fatal error by setting the $Text::Wrap::huge variable with the following line:

    $Text::Wrap::huge = 'die';

This will cause wrap to die with an error of the form can't wrap '...', where ... is the paragraph text. Unfortunately, wrap does not support anything resembling a user-defined subroutine to handle long words, for example by hyphenating them, though since Text::Wrap is not a very complicated module, creating our own customized version to do this would not prove too difficult.
Break Text

Although it is not that likely that we would want to change it, we can also configure the break text, that is, the character or characters that separate words. In fact, the break text is a regular expression, defined in the package variable $Text::Wrap::break, and its default value is \s, for any whitespace character. As the break text is presumed to lie between words (that is, the actual meaningful content), it is removed from the paragraph when reformatting takes place. That means that while it might seem appealing to specify a break text pattern of \s|- to allow wrap to break on hyphenated words, we would lose the hyphen whenever this actually occurred.
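Here is a quick sketch of that pitfall, with invented sample text and a deliberately narrow width so that the break lands on the hyphen:

```perl
#!/usr/bin/perl
# hyphen_break.pl - a sketch of wrapping on hyphens via
# $Text::Wrap::break; the sample text is invented for illustration
use warnings;
use strict;

use Text::Wrap qw(wrap);

$Text::Wrap::columns = 10;
$Text::Wrap::break   = '[\s-]';    # treat hyphens like whitespace

# 'well-known' is now broken at the hyphen, and the hyphen itself is
# consumed as break text, so it vanishes from the output
my $out = wrap('', '', 'a well-known example');
print $out, "\n";
```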
Debugging

A limited debugging mode can be enabled by setting the variable $Text::Wrap::debug:

    $Text::Wrap::debug = 1;
The columns, break, and huge variables can all be imported from the Text::Wrap package by specifying them in the import list:

    use Text::Wrap qw($columns $huge $break);

    $columns = 39;
    $huge = 'die';
As with any module symbols we import, this is fine for simple scripts but is probably unwarranted for larger applications – use the fully qualified package variables instead.
Formatting Whole Documents

Whole documents can be formatted with the fill subroutine. This first chops the supplied text into paragraphs, by looking for indented lines, indicating the start of a new paragraph, and blank lines, indicating the end of one paragraph and the start of another. Having determined where each paragraph starts and ends, it feeds the resulting lines to wrap, before merging the resulting wrapped paragraphs back together. The arguments passed to fill are the same as those for wrap. Here is how we would use it to reformat paragraphs into unindented and spaced paragraphs:

    $formatted_document = fill("\n", "", @lines);

If the two indents are identical, fill automatically adds a blank line to separate each paragraph from the previous one. Therefore, the above could also be achieved with:

    $formatted_document = fill("", "", @lines);

If the indents are not identical then we need to add the blank line ourselves:

    # indent each new paragraph with a tab, paragraphs are continuous
    $formatted_document = fill("\t", "", @lines);

    # indent each new paragraph with a tab, paragraphs are separated
    $formatted_document = fill("\n\t", "", @lines);
All the configurable variables that affect the operation of wrap also apply to fill, of course, since fill uses wrap to do most of the actual work. It is not possible to configure how fill splits text into paragraphs. Note that if fill is passed lines already indented by a previous wrap operation, then it will incorrectly detect each new line as a new paragraph (because it is indented). Consequently we must remove misleading indentation from the lines we want to reformat before we pass them to fill.
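A brief sketch of that caveat, with invented sample text: stripping the old indentation first keeps fill from treating every indented line as a new paragraph:

```perl
#!/usr/bin/perl
# refill.pl - a sketch of reflowing previously wrapped text with fill;
# the sample text is invented for illustration
use warnings;
use strict;

use Text::Wrap qw(fill);

# output of an earlier wrap: every line carries old indentation
my @lines = (
    "    This paragraph was wrapped",
    "  once already and every line",
    "  carries old indentation.",
);

# remove the misleading indentation before reflowing
s/^\s+// foreach @lines;

# fill now sees one paragraph, not three
my $reflowed = fill('', '', join(' ', @lines));
print $reflowed, "\n";
```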
Formatting on the Command Line

Although it is a somewhat simplistic module, Text::Wrap's usage is simple enough for it to be used on the fly on the command line:

    > perl -MText::Wrap -e "print fill('','',<>)" -- textfile ...
Here we have used -- to separate Perl's arguments from filenames to be fed to the formatter. We can supply any number of files at once, and redirect the output to a file if we wish.
Matching Similar Sounding Words with 'Text::Soundex'

The Text::Soundex module is different in nature to the other modules in the Text:: family. Whilst modules such as Text::Abbrev and Text::ParseWords are simple solutions to common problems, Text::Soundex tackles a different area entirely. It implements a version of the Soundex algorithm, developed for the US Census in the latter part of the 19th century and popularized by Donald Knuth of TeX fame. The Soundex algorithm takes words and converts them into tokens that approximate the sound of the word. Similar sounding words produce tokens that are either the same or close together. Using this, we can generate Soundex tokens for a predetermined list of words, say a dictionary or a list of surnames, and match queries against it. If the query is close to a word in the list, we can return the match even if the query is not exactly right (misspelled, for example).
Tokenizing Single Words

The Text::Soundex module provides exactly one subroutine, soundex, that transforms one word into its Soundex token. It can also accept a list of words and will return a list of the tokens, but it will not deal with multiple words in one string:

    print soundex "hello";          # produces 'H400'
    print soundex "goodbye";        # produces 'G310'
    print soundex "hilo";           # produces 'H400' - same as 'Hello'
    print soundex qw(Hello World);  # produces 'H400W643'
    print soundex "Hello World";    # produces 'H464'
The following short program shows the Soundex algorithm being used to look up a name from a list, given an input from the user. Since we are using Soundex, the input doesn't have to be exact, just similar:

    #!/usr/bin/perl
    # surname.pl
    use warnings;
    use strict;

    use Text::Soundex;

    # define an ABC of names (as a hash for 'exists')
    my %abc = (
        "Hammerstein" => 1,
        "Pineapples"  => 1,
        "Blackblood"  => 1,
        "Deadlock"    => 1,
        "Mekquake"    => 1,
        "Rojaws"      => 1,
    );

    # create a token-to-name table
    my %tokens;
    foreach (keys %abc) {
        $tokens{soundex $_} = $_;
    }
    # test input against known names
    print "Name? ";
    while (<>) {
        chomp;
        if (exists $abc{$_}) {
            print "Yes, we have a '$_' here. Another? ";
        } else {
            my $token = soundex $_;
            if (exists $tokens{$token}) {
                print "Did you mean $tokens{$token}? ";
            } else {
                print "Sorry, who again? ";
            }
        }
    }
We can try out this program with various different names, real and imaginary, and produce different answers. The input can be quite different from the name as long as it sounds approximately right:

    > perl surname.pl
    Name? Hammerstone
    Did you mean Hammerstein? Hammerstein
    Yes, we have a 'Hammerstein' here. Another? Blockbleed
    Did you mean Blackblood? Mechwake
    Did you mean Mekquake? Nemesis
    Sorry, who again?
Tokenizing Lists of Words and E-Mail Addresses

We can produce a string of tokens from a string of words if we split up the string before feeding it to soundex. Here is a simple query program that takes input from the user and returns a list of tokens:

    #!/usr/bin/perl
    # soundex.pl
    use warnings;
    use strict;

    use Text::Soundex;

    while (<>) {
        chomp;          # remove trailing linefeed
        s/\W/ /g;       # zap punctuation, e.g. '.', '@'
        print "'$_' => '@{[soundex(split)]}'\n";
    }

We can try this program out with phrases to illustrate that, as a guide, accuracy does not have to be all that great:

    > perl soundex.pl
    definitively inaccurate
    'definitively inaccurate' => 'D153 I526'
    devinatovli inekurat
    'devinatovli inekurat' => 'D153 I526'

As well as handling spaces, we have also added a substitution that converts punctuation into spaces first. This allows us to generate a list of tokens for an e-mail address, for example. By concatenating and storing these with a list of e-mail addresses, in the same way as the name example we showed earlier, we could easily implement a simple e-mail-matching program. Used correctly, we could handle e-mails sent to pestmaster instead of postmaster, for example.
The 'Soundex' Algorithm

As the above examples illustrate, Soundex tokens consist of an initial letter, which is the same as that of the original word, followed by three digits that represent the sound of the first, second, and third syllables respectively. Since 'one' has one syllable, only the first digit of its token is non-zero. On the other hand, 'seven' has two syllables, so it gets two non-zero digits. Comparing the two results, we can notice that both 'one' and 'seven' contain a 5, which corresponds to the syllable containing n in each word. The Soundex algorithm has some obvious limitations though. In particular, it only resolves words up to the first three syllables. However, this is generally more than enough for simple 'similar sounding' matches, such as the surname matching for which it was designed.
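For the curious, the classic algorithm is compact enough to sketch in a few lines of Perl. This is an illustrative reimplementation, not the actual Text::Soundex source, and real implementations differ in small details such as the treatment of H and W:

```perl
#!/usr/bin/perl
# my_soundex.pl - an illustrative sketch of the classic Soundex
# algorithm; not the actual Text::Soundex implementation
use warnings;
use strict;

sub my_soundex {
    my $word = uc shift;
    $word =~ tr/A-Z//cd;                  # keep letters only
    return undef unless length $word;

    my $first = substr $word, 0, 1;

    # map each letter to its digit code; vowels and H, W, Y map to 0
    (my $codes = $word) =~
        tr/AEIOUHWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;

    $codes =~ s/(.)\1+/$1/g;              # collapse adjacent duplicates
    $codes = substr $codes, 1;            # drop the first letter's own code
    $codes =~ tr/0//d;                    # discard the zero (silent) codes

    return substr $first . $codes . '000', 0, 4;
}

print my_soundex("hello"), "\n";      # H400
print my_soundex("goodbye"), "\n";    # G310
print my_soundex("hilo"), "\n";       # H400
```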
Handling Untokenizable Words

In some rare cases, the Soundex algorithm cannot find any suitable token for the supplied word. In these cases it usually returns nothing (or, to be more accurate, undef). We can change this behavior by setting the variable $Text::Soundex::soundex_nocode:

    $Text::Soundex::soundex_nocode = 'Z000';    # a common 'failed' token
    print soundex "=>";                         # produces 'Z000'
If we change the value of this variable, we must be sure to set it to something that is not likely to genuinely occur. The value Z000 is a common choice, but it matches many words, including Zoo. A better choice might be Q999, but there is no code that is absolutely guaranteed not to occur. If we do not need to conform to the Soundex code system (that is, we are not passing the results to something else that expects valid Soundex tokens as input), then we can simply define an impossible value like _NOCODE_ or ?000, which soundex cannot generate.
Other Text Processing Modules

As well as the modules in the Text:: family, several other Perl modules outside the Text:: hierarchy involve text processing or combine text processing with other functions. Several of the Term:: modules involve text processing in relation to terminals. For instance, Term::Cap involves generating ANSI escape sequences from capability codes, while Term::ReadLine provides input line text-processing support. These modules are all covered in Chapter 15.
Documenting Perl

Documentation is always a good idea in any programming language, and Perl is no exception. In fact, because Perl is so good (or, as some might have it, so bad) at allowing us to create particularly indecipherable code, documentation is all the more important. There is also such a thing as too much documentation; the more there is, the more we have to keep up to date with the actual code. If we are documenting every line of code, it is probably a sign we need to spend more time making the code easier to understand and less time writing commentary to make up for it.
Like most programming languages, Perl supports simple comments. However, it also attempts to combine the onerous duties of commenting code and documenting software into one slightly less arduous task through the prosaically named pod syntax.
Comments

Perl comments are, as we would expect, very simple and traditional. In the style of C++ (but not C) and too many shells to count, they consist of a leading #, plus whatever it is we want to say:

    # like this
Anything after the # is ignored by Perl's interpreter. Comments may be placed on a line of their own, or after existing Perl code. They can even be placed in the middle of statements:

    print 1 *    # depth
          4 *    # width
          9      # height
    ;
Perl is smart enough not to try to interpret a # inside a string as the start of a comment, but conversely we cannot comment here documents. We also cannot comment out multiple lines at once with C-style /*...*/ comments, but we can use pod for that if we want.
plain old documentation

pod is a very simple markup syntax for integrating documentation with the source code that it is documenting. It consists of a series of special one-line tokens that separate pod from anything else in a file (presumably code), and also allows us to define simple structures like headings and lists. In and of itself, pod does nothing more than regular comments do, except give us the ability to write multi-line comments. However, its simple but flexible syntax also makes it very simple to convert into 'real' documents like text files and HTML documents. Indeed, Perl comes with the pod2text and pod2html tools for this very purpose. The elegance of pod is that if we want to read the documentation for a particular module or library file, we can just run pod2text against the module file to generate the documentation. The perldoc utility is just a friendlier interface to the same pod translation process. Java programmers will be right at home with pod; it is basically a similar (and incidentally, prior) idea to javadoc, which also encourages self-documenting code.
pod Paragraphs

Pod allows us to define documentation paragraphs, which we can insert into other documents; most usually, but by no means exclusively, Perl code. The simplest sequence is =pod ... =cut, which simply states that all the following text is to be taken as pod paragraphs, until the next =cut or the end of the file:

    ...
    $scalar = "value";

    =pod

    This is a paragraph of pod text embedded into some Perl code. It is not
    indented, so it will be treated as normal text and word wrapped by pod
    translators.

    =cut

    print do_something($scalar);
    ...
Within pod, text is divided into paragraphs, which are simply blocks of continuous text, just as they are anywhere else. A paragraph ends and a new one begins when a completely empty line is encountered. All pod tokens absorb the paragraph after them, on the same line or immediately following. Some tokens, such as headings, use this paragraph for display purposes. Others, like =pod and =cut, simply ignore it, so we can document the pod itself:

    =pod this is just a draft document

    mysubname - this subroutine doesn't do much of anything at all and is
    just serving as an example of how to document it with a pod paragraph
    or two

    =cut end of draft bit
The only partial exception to this is =cut. While translators will absorb text immediately following =cut, Perl itself will not, so we cannot spread a 'cut comment' across more than one line.
Paragraphs and Paragraph Types

If a paragraph is indented, then we consider it to be preformatted, much in the same way that the HTML <pre> tag works. The following example shows three paragraphs, two of which are indented:

    =pod

        This paragraph is indented, so it is taken as is and not
        reformatted by translators like:
            pod2text - the text translator
            pod2html - the HTML translator
            pod2man  - the UNIX manual page translator
        Note that 'as is' also means that escaping does not work, and that
        interpolation doesn't happen. What we see is what we get.

    This is a second paragraph in the same =pod..=cut section. Since it is
    not indented it will be reformatted by translators.

        Again indented
        This is a third paragraph
        Useful for Haikus

    =cut
Headings

Headings can be added with the =head1 and =head2 tokens, which provide first and second-level headings respectively. The heading text is the paragraph immediately following the token, and may start (and end) on the same line:

    =head1 This is a level one heading

Or:

    =head2 This is a level two heading
Or:

    =head2
    As is this
Both heading tokens start a pod section in the same way that =pod does, if one is not already active. pod sections do not nest, so we only need one =cut to get back to Perl code. In general we tend to use the first form, but it is important to leave an empty line if we do not want our heading to absorb the following paragraph:

    =head1 This heading has accidentally swallowed up this paragraph,
    because there is no separating line.
    =head2 Worse than that, it will absorb this second level heading too,
    so this is all one long level one heading, and not the two headings
    and paragraphs as we actually intended.

    =cut
This is how we should really do it:

    =head1 This is a level one heading

    This is a paragraph following the level one heading.

    =head2 This is a level two heading

        This is a preformatted paragraph following the level 2 heading.

    =cut
How the headings are actually rendered is entirely up to the translator. By default, the text translator pod2text indents paragraphs by four spaces, level-two headings by two, and level-one headings by none; crude, but effective. The HTML translator pod2html uses <h1> and <h2> tags, as we might expect.
Lists

Lists can be defined with the =over...=back and =item tokens. The =over token starts a list, and can be given a value such as 4, which many formatters use to determine how much indentation to use. The list is ended by =back, which is optional if the pod paragraph is at the end of the document. =item defines the actual list items, of which there should be at least one, and we should not use this token outside an =over...=back section. Here is an example four-item list; note again the empty lines that separate the items from the paragraphs between them:

    =over 4

    =item 1

    This is item number one on the list

    =item 2

    This is item number two on the list

    =item 3
    This is the third item

    =item 4

    Guess...

    =back
Since the =over and =item tokens can take values, and because the only kind of value pod knows is the paragraph, we can also say:

    =over
    4

    =item
    1

    ...
In practice we probably do not want to do this, but the same caveats about headings and paragraphs also apply here. Note that =over works like =pod, and starts a pod section if one is not active. The numbers after the =item tokens are purely arbitrary; we can use anything we like for them, including meaningful text. However, to make the job of pod translators easier, we should stick to a consistent scheme. For example, if we number items, we should do it consistently, and if we want to use bullet points then we should use something like an asterisk. If we want named items, we can do that too. For example, a bullet-pointed list with paragraphs:

    =over 4

    =item *

    This is a bullet pointed list

    =item *

    With two items

    =back
A named items list:

    =over 4

    =item The First Item

    This is the description of the first item

    =item The Second Item

    This is the description of the second item

    =back
A named items list without paragraphs:

    =over 4

    =item Stay Alert

    =item Trust No one

    =item Keep Your Laser Handy

    =back
pod translators will attempt to do the best they can with pod lists, depending on what it is they think we are trying to do, and the constraints of the document format into which they are converting. The pod2text tool will just use the text after the item name. The pod2html tool is subject to the rules of HTML, which has different tags for ordered, unordered, and descriptive lists (<ol>, <ul>, and <dl>), so it makes a guess based on what the items look like. A consistent item naming style will help it make a correct guess. If =over is used outside an existing pod section then it starts another one. The =back ends the list but not the pod section, so we also need to add a =cut to return to Perl code:

    =over 4

    =item *

    This is item 1

    =item *

    This is item 2

    =back

    =cut
Translator-Specific Paragraphs

The final kind of pod tokens are the =for and =begin...=end tokens. The former takes the name of a specific translator and a paragraph, which is rendered only if that translator is being used. The paragraph should be in the output format of the translator, that is, already formatted for output. Other translators will entirely ignore the paragraph:

    =for text
    This is a paragraph that will appear in documents produced by the
    pod2text format.

    =for html But this paragraph will appear in HTML documents
Again, like the headings and item tokens, the paragraph can start on the next line as in the first example, or immediately following the format name, as in the second. Since it is annoying to have to type =for format for every paragraph in a collection of paragraphs we can also use the pair of =begin...=end markers. These operate much like =pod...=cut but mark the enclosed paragraphs as being specific to a particular format:
    =begin html

    Paragraph1

    Paragraph2

    =end
If =begin is used outside an existing pod section then it starts one. The =end ends the format-specific section but not the pod section, so we also need to add a =cut to return to Perl code, just as for lists:

    =begin html

    A bit of HTML document

    =end

    =cut
The =begin and =end tokens can also be used to create multi-line comments, simply by providing =begin with a name that does not correspond to any translator. We can even comment out blocks of code this way:

    =begin disabled_dump_env

    foreach (sort keys %ENV) {
        print STDERR "$_ => $ENV{$_}\n";
    }

    =end

    =begin comment

    This is an example of how we can use pod tokens to create comments.
    Since 'comment' is not a pod translator type, this section is never
    used in documents created by 'pod2text', 'pod2html', etc.

    =end
Using pod with 'DATA' or 'END' tokens

If we are using either a __DATA__ or an __END__ token in a Perl script, then we need to take special care with pod paragraphs that lie adjacent to them. pod translators require at least one empty line between the end of the data and a pod directive for the directive to be seen; otherwise it is missed by the translation tools. In other words, write:

    ...
    __END__

    =head1 ...

And not:

    ...
    __END__
    =head1
Interior Sequences

We mentioned earlier that pod paragraphs can either be preformatted (indicated by indenting) or normal. Normal paragraphs are reformatted by translators to remove extraneous spaces, newlines, and tabs. The resulting paragraph is then rendered to the desired width, if necessary. In addition to this basic reformatting, normal paragraphs may also contain interior sequences; specially marked out sections of the text with particular meanings. Each sequence consists of a single capital letter, followed by the text to treat specially within angle brackets. For example:

    =pod

    This is a B<paragraph> that uses I<italic> and B<bold> markup using
    the BE<lt>textE<gt> and IE<lt>textE<gt> interior sequences. Here is an
    example code fragment: C<substr $text,0,1> and here is a filename:
    F</etc/passwd>. All these things are of course represented in a style
    entirely up to the translator. See L<perlpod> for more information.

    =cut
Note that to specify a real < and > we have to use the E<lt> and E<gt> sequences, reminiscent of the &lt; and &gt; HTML entities.
Pod Tools and Utilities

Having documented our code using pod, it is time to generate some documents from it. Perl provides a collection of modules in the Pod:: family that perform various kinds of translations from pod to other formats, plus utility modules for checking syntax. We can use these modules programmatically in our own documentation utilities, but more commonly we simply use one of the utilities that Perl provides as standard. All of these tools support both long and short option names, in the manner of the Getopt::Long module (which is what they all use to process the command line), so for the pod2usage tool we can say -v or --verbose, or a number of things in between. We have used the long names below, but all of them can be abbreviated to one extent or another.
Translator Tools

We have already mentioned pod2text and pod2html a few times. Perl also comes with other translators, and some utilities based on pod too. All the translators take an input and an optional output file as arguments, plus additional options to control their output format. Without either, they take input from standard input and write to standard output. The list of pod translators supplied with Perl is as follows:

❑   pod2text – translates pod into plain text. If the -c option is used and Term::ANSIColor is installed (see Chapter 15), colors will also be used.

❑   pod2html – translates pod into HTML, optionally recursing and processing directories and integrating cross-links between pages.

❑   pod2roff, pod2man – both these tools translate pod into UNIX manual pages using the Pod::Man module.
As always, for more details on these translators we consult the relevant perldoc page, using the now familiar command line and substituting our choice of translator:

    > perldoc <translator>

The pod2man translator also supplies font arguments, which only apply to troff, an -official argument for official Perl documentation, and a -lax argument, which is meant to check that all parts of a UNIX manual page are present (unfortunately, at present it does not do this). See:

    > perldoc pod2man

for details on these. Note that there are plenty of additional pod translation tools available from CPAN, including translators for RTF/Word, LaTeX, PostScript, and plenty of other formats. Chances are that, if it is a halfway popular format, there will be a pod translator for it.
Retrieval Tools

In addition to the translators, Perl provides three tools for extracting information from pods selectively. Of these, perldoc is by far the most accomplished. Although not strictly a translator, perldoc is a utility that makes use of translators to provide a convenient Perl documentation lookup tool. We covered it in detail back in Chapter 1.

To retrieve usage information from a given Perl script, we can use the pod2usage tool. For example:

    > pod2usage myscript.pl

The tool searches for a SYNOPSIS heading within the file and prints it out using pod2text. A verbosity flag may be specified to increase the returned information:

    -v 1    (default) SYNOPSIS only
    -v 2    OPTIONS and ARGUMENTS too
    -v 3    All pod documentation

A verbosity of 3 is equivalent to using pod2text directly. If the file is not given with an absolute pathname, then -pathlist can be used to provide a list of directory paths to search for the file in.

A simpler and more generic version of pod2usage is podselect. This tool attempts to locate a level 1 heading with the specified section title, and returns it from whichever files it is passed:

    > podselect -s='How to boil an egg' *.pod

Note that podselect does not do any translation, so its output needs to be directed to a translator for rendering into reasonable documentation.
pod Verification

It is easy to make simple mistakes with pod, omitting empty lines or forgetting =cut, for example. Fortunately, pod is simple enough to be easy to verify as well. The podchecker utility scans a file looking for problems:

    > podchecker poddyscript.pl
If all is well then it will return:

    poddyscript.pl pod syntax OK.

Otherwise, it will produce a list of problems which we can then go and fix, for example:

    *** WARNING: file does not start with =head at line N in file poddyscript.pl

This warning indicates that we have started the pod documentation with something other than a =head1 or =head2, which the checker considers suspect. Likewise:

    *** WARNING: No numeric argument for =over at line N in file poddyscript.pl
    *** WARNING: No items in =over (at line 17) / =back list at line N in file poddyscript.pl

This indicates that we have an =over...=back pair which not only does not have a number after the =over, but does not even contain any items. The first is probably an omission. The second indicates that we might have bunched up our items so they all run into the =over token, like this:

    =over
    =item item one
    =item item two
    =back
If we had left out the blank line before the =back we would instead have got:

*** ERROR: =over on line N without closing =back at line EOF in file poddyscript.pl
In short, podchecker is a useful tool and we should use it if we plan to write pod of any size in our Perl scripts. The module that implements podchecker is called Pod::Checker, and we can use it with either filenames or filehandles supplied for the first two arguments:

# function syntax
$ok = podchecker($podfile, $checklog, %options);
# object syntax
$checker = new Pod::Checker %options;
$checker->parse_from_file($podpath, $checklog);
Both file arguments can be either filenames or filehandles. The pod file defaults to STDIN and the check log to STDERR, so a very simple checker script could be:

use Pod::Checker;
print podchecker() ? "OK" : "Fail";
The options hash, if supplied, allows one option to be defined: enable or disable the printing of warnings. The default is on, so we can get a verification check without a report using STDIN and STDERR:

$ok = podchecker(\*STDIN, \*STDERR, 'warnings' => 0);
The actual podchecker script is more advanced than this, of course, but not by all that much.
Programming pod

Perl provides a number of modules for processing pod documentation – we mentioned Pod::Checker just a moment ago. These modules form the basis for all the pod utilities, some of which are not much more than simple command-line wrappers for the associated module. Most of the time we do not need to process pod programmatically, but in case we do, here is a list of the pod modules supplied by Perl and what each of them does:
Module               Action
Pod::Checker         The basis of the podchecker utility. See above.
Pod::Find            Search for and return a hash of pod documents. See 'Locating pods' below.
Pod::Functions       A categorized summary of Perl's functions, exported as a hash.
Pod::Html            The basis for the pod2html utility.
Pod::Man             The basis for both the pod2man and the functionally identical pod2roff utilities.
Pod::Parser          The pod parser. This is the basis for all the translation modules and most of the others too. New parsers can be implemented by inheriting from this module.
Pod::ParseUtils      A module containing utility subroutines for retrieving information about and organizing the structure of a parsed pod document, as created by Pod::InputObjects.
Pod::InputObjects    The implementation of the pod syntax, describing the nature of paragraphs and so on. In-memory pod documents can be created on the fly using the methods in this module.
Pod::Plainer         A compatibility module for converting new style pod into old style pod.
Pod::Select          A subclass of Pod::Parser and the basis of the podselect utility, Pod::Select extracts selected parts of pod documents by searching for their heading titles. Any translator that inherits from Pod::Select rather than Pod::Parser will be able to support the Pod::Usage module automatically.
Pod::Text            The basis of the pod2text utility.
Pod::Text::Color     Convert pod to text using ANSI color sequences. The basis of the -color option to pod2text. Subclassed from Pod::Text. This uses Term::ANSIColor, which must be installed; see Chapter 15.
Pod::Text::Termcap   Convert pod to text using escape sequences suitable for the current terminal. Subclassed from Pod::Text. Requires termcap support, see Chapter 15.
Pod::Usage           The basis of the pod2usage utility; this uses Pod::Select to extract usage-specific information from pod documentation by searching for specific sections, for example, NAME, SYNOPSIS.
Using Pod Parsers

Translator modules, which is to say any module based directly or indirectly on Pod::Parser, may be used programmatically by creating a parser object and then calling one of the parsing methods:

parse_from_filehandle($fh, %options);
Or:

parse_from_file($infile, $outfile, %options);
For example, assuming we have Term::ANSIColor installed, we can create ANSI color text documents using this short script:

#!/usr/bin/perl
# parseansi.pl
use warnings;
use strict;

use Pod::Text::Color;

my $parser = new Pod::Text::Color(
    width    => 56,
    loose    => 1,
    sentence => 1,
);

if (@ARGV) {
    $parser->parse_from_file($_, '-') foreach @ARGV;
} else {
    $parser->parse_from_filehandle(\*STDIN);
}
We can generate HTML pages, plain text documents, and manual pages using exactly the same process from their respective modules.
Writing a pod Parser

Writing our own pod parser is surprisingly simple. Most of the hard work is done for us by Pod::Parser, so all we have to do is override the methods we need to replace in order to generate the kind of document we are interested in. Particularly, there are four methods we may want to override:
- command – Render and output pod commands.
- verbatim – Render and output verbatim paragraphs.
- textblock – Render and output regular (non-verbatim) paragraphs.
- interior_sequence – Return a rendered interior sequence.
By overriding these and other methods we can customize the document that the parser produces. Note that the first three methods display their result, whereas interior_sequence returns it. Here is a short example of a pod parser that turns pod documentation into an XML document (albeit without a DTD):
#!/usr/bin/perl
# parser.pl
use warnings;
use strict;

{
    package My::Pod::Parser;

    use Pod::Parser;
    our @ISA = qw(Pod::Parser);

    sub command {
        my ($parser, $cmd, $para, $line) = @_;
        my $fh = $parser->output_handle;
        $para =~ s/[\n]+$//;
        my $output = $parser->interpolate($para, $line);
        print $fh "<pod:$cmd> $output \n";
    }

    sub verbatim {
        my ($parser, $para, $line) = @_;
        my $fh = $parser->output_handle;
        $para =~ s/[\n]+$//;
        print $fh "<pod:verbatim> \n $para \n \n";
    }

    sub textblock {
        my ($parser, $para, $line) = @_;
        my $fh = $parser->output_handle;
        print $fh $parser->interpolate($para, $line);
    }

    sub interior_sequence {
        my ($parser, $cmd, $arg) = @_;
        my $fh = $parser->output_handle;
        return "<pod:int cmd=\"$cmd\"> $arg ";
    }
}

my $parser = new My::Pod::Parser();

if (@ARGV) {
    $parser->parse_from_file($_) foreach @ARGV;
} else {
    $parser->parse_from_filehandle(\*STDIN);
}
To implement this script we need the output filehandle (since the parser may be called with a second argument), which we can get from the output_handle method. We also take advantage of Pod::Parser to do the actual rendering work by using the interpolate method, which in turn calls our interior_sequence method. Pod::Parser provides plenty of other methods too, some of which we can override as well as or instead of the ones we used in this parser, see: > perldoc Pod::Parser
for a complete list of them. The Pod::Parser documentation also covers more methods that we might want to override, such as begin_input, end_input, preprocess_paragraph, and so on. Each of these gives us the ability to customize the parser in increasingly finer-grained ways. We have placed the parser package inside the script in this instance, though we could equally have had it in a separate module file. To see the script in action we can feed it with any piece of Perl documentation – the pod documentation itself, for example. On a typical UNIX installation of Perl 5.6, we can do that with:

> perl parser.pl /usr/lib/perl5/5.6.0/pod/perlpod.pod

This generates an XML version of perlpod that starts like this:

<pod:head1>NAME perlpod - plain old documentation
<pod:head1>DESCRIPTION A pod-to-whatever translator reads a pod file paragraph by paragraph, and translates it to the appropriate output format. There are three kinds of paragraphs: <pod:int cmd="L">verbatim|/"Verbatim Paragraph", <pod:int cmd="L">command|/"Command Paragraph", and <pod:int cmd="L">ordinary text|/"Ordinary Block of Text".
<pod:head2>Verbatim Paragraph A verbatim paragraph, distinguished by being indented (that is, it starts with space or tab). It should be reproduced exactly, with tabs assumed to be on 8-column boundaries. There are no special formatting escapes, so you can't italicize or anything like that. A \ means \, and nothing else.
<pod:head2>Command Paragraph All command paragraphs start with "=", followed by an identifier, followed by arbitrary text that the command can use however it pleases. Currently recognized commands are
<pod:verbatim>
=head1 heading
=head2 heading
=item text
=over N
=back
=cut
=pod
=for X
=begin X
=end X

By comparing this with the original document we can see how the parser is converting pod tokens into XML tags.
Locating pods

The UNIX-specific Pod::Find module searches for pod documents within a list of supplied files and directories. It provides one subroutine of importance, pod_find, which is not imported by default. This subroutine takes one main argument – a reference to a hash of options including default search locations. Subsequent arguments are additional files and directories to look in. The following script implements a more or less fully-featured pod search based around Pod::Find and Getopt::Long, which we cover in detail in Chapter 14.

#!/usr/bin/perl
# findpod.pl
use warnings;
use strict;

use Pod::Find qw(pod_find);
use Getopt::Long;

# default options
my $verbose = undef;
my $include = undef;
my $scripts = undef;
my $display = 1;

# allow files/directories and options to mix
Getopt::Long::Configure('permute');

# get options
GetOptions(
    'verbose!' => \$verbose,
    'include!' => \$include,
    'scripts!' => \$scripts,
    'display!' => \$display,
);

# if no directories specified, default to @INC
$include = 1 if !defined($include) and !(@ARGV or $scripts);

# perform scan
my %pods = pod_find({
    -verbose => $verbose,
    -inc     => $include,
    -script  => $scripts,
    -perl    => 1,
}, @ARGV);

# display results if required
if ($display) {
    if (%pods) {
        foreach (sort keys %pods) {
            print "Found '$pods{$_}' in $_\n";
        }
    } else {
        print "No pods found\n";
    }
}
We can invoke this script with no arguments to search @INC, or pass it a list of directories and files to search. It also supports four arguments to enable verbose messages, disable the final report, and enable Pod::Find's two default search locations. Here is one way we can use it, assuming we call it findpod.pl:

> perl findpod.pl -iv /my/perl/lib 2> dup.log

This command tells the script to search @INC in addition to /my/perl/lib (-i), produce extra messages during the scan (-v), and to redirect error output to dup.log. This will capture details of any duplicate modules that the module finds during its scan. If we only want to see duplicate modules, we can disable the output and view the error output on screen with:

> perl findpod.pl -i --nodisplay /my/perl/lib

The options passed in the hash reference to pod_find are all Boolean and all default to 0 (off). They have the following meanings:

Option      Action
-verbose    Print out progress during scan, reporting all files scanned that did not contain pod information.
-inc        Scan all the paths contained in @INC.
-script     Search the installation directory and subdirectories for pod files. If Perl was installed as /usr/bin/perl then this will be /usr/bin, for example.
-perl       Apply Perl naming conventions for finding likely pod files. This strips likely Perl file extensions (.pod, .pm, etc.), skips over numeric directory names that are not the current Perl release, and so on. Both -inc and -script imply -perl.
The hash returned by pod_find contains the file in which each pod document was found as the key, and the document title (usually the module package name) as the value. This is the reverse arrangement to the contents of the %INC hash, but contains the same kinds of keys and values.
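Because the key/value arrangement is the reverse of %INC, a single reverse turns the result into a title-to-path lookup. A minimal sketch using hand-built sample data (the paths are invented for illustration, not taken from a real installation):

```perl
use strict;
use warnings;

# sample data in the shape pod_find returns: file path => document title
# (these paths are invented for illustration)
my %pods = (
    '/usr/lib/perl5/Pod/Find.pm'    => 'Pod::Find',
    '/usr/lib/perl5/Pod/Checker.pm' => 'Pod::Checker',
);

# invert the hash to look documents up by title, %INC-style
my %path_of = reverse %pods;

print $path_of{'Pod::Find'}, "\n";
```

In list context reverse flips the flattened key/value list, so assigning it back to a hash swaps keys and values in one step.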
Reports – The 'r' in Perl

Reports are a potentially useful but often overlooked feature of Perl that date back to the earliest versions of the language. In short, they provide a way to generate structured text such as tables or forms using a special layout description called a format. Superficially similar in intent to the print and sprintf functions, formats provide a different way to lay out text on a page or screen, with an entirely different syntax geared specifically towards this particular goal. The particular strength of formats comes from the fact that we can describe layouts in physical terms, making it much easier to see how the resulting text will look and making it possible to design page layouts visually rather than resorting to character counting with printf.
Formats and the Format Datatype

Intriguingly, formats are an entirely separate data type, distinct from scalars, arrays, hashes, typeglobs, and filehandles. Like filehandles, they have no prefix or other special syntax to express themselves, and as a consequence format names often look like filehandles, which can occasionally be confusing.
Formats are essentially the compiled form of a format definition: a series of formatting or picture lines containing literal text and placeholders, interspersed with data lines that describe the information used to fill the placeholders, and optional comment lines. As a simple example, here is a format definition that defines a single picture line consisting mainly of literal text and a single placeholder, followed by a data line that fills that placeholder with some more literal text:

# this is the picture line
This is a @<<<<< justified field
# this is the data line
"left"
To turn a format definition into a format we need to use the format function, which takes a format name and a multi-line format definition, strongly reminiscent of a here document, and turns it into a compiled format. A single full stop on its own defines the end of the format. To define the very simple format example above we would write something like this:

format MYFORMAT =
This is a @<<<<< justified field
"left"
.
Note that the trailing period is very important, as it is the end token that defines the end of the implicit here document. A format definition will happily consume the entire contents of a source file if left unchecked. To use a format we use the write function on the filehandle with the same name as the format. For the MYFORMAT example above we would write:

# write the filled-out format to the filehandle 'MYFORMAT'
write MYFORMAT;
This requires that we actually have an open filehandle called MYFORMAT and want to use the format to print to it. More commonly we want to print to standard output, which we can do by either defining a format called STDOUT, or assigning a format name to the special variable $~ ($FORMAT_NAME with the English module). In this case we can omit the filehandle and write will use the currently selected output filehandle, just like print:

$~ = 'MYFORMAT';
write;
We can also use methods from the IO:: family of modules, if we are using them. Given an IO::Handle-derived filehandle called $fh, we can assign and use a format on it like this:

$fh->format_name('MYFORMAT');
$fh->format_write();
We'll return to the subject of assigning formats a little later on. The write function (or its IO::Handle counterpart format_write) generates filled-out formats by combining the picture lines with the current values of the items in the data lines to fill in any placeholder present, in a process reminiscent of, but entirely unconnected to, interpolation. Once it has finished filling out, it sends the results to standard output.
If we do not want to print output we can instead make use of the formline function. This takes a single picture line and generates output from it into the special variable $^A. It is the internal function that format uses to generate its output, and we will see a little more of how to use it later. There is, strangely, no string equivalent of write in the same way that printf has sprintf, but it is possible to create one using formline. Format picture lines are usually written as static pieces of text, which makes them impossible to adjust to cater for different circumstances like calculated field widths. As an alternative, we can build the format inside a string and then eval it to create the format, which allows us to interpolate variables into the format at the time it is evaluated. Here is an example that creates and uses a dynamically calculated format associated with the STDOUT filehandle:

#!/usr/bin/perl
# evalformat.pl
use warnings;
use strict;

# list of values for field
my @values = qw(first second third fourth fifth sixth penultimate ultimate);

# determine maximum width of field
my $width = 0;
foreach (@values) {
    my $newwidth = length $_;
    $width = $newwidth if $newwidth > $width;
}

# create a format string with calculated width using '$_'
my $definition = "This is the \@" . ('<' x ($width-1)) . " line\n" . '$_' . "\n.\n";

# define the format through interpolation
eval "format STDOUT = \n$definition";

# print out the field values using the defined format
write foreach @values;
The advantage of this approach is it allows us to be more flexible, as well as calculate the size of fields on the fly. The disadvantage is that we must take care to interpolate the \n newlines, but not placeholders and especially not variables in the data lines, which can lead to a confusing combination of interpolated and non-interpolated strings. This can make formats very hard to read if we are not very careful.
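As mentioned above, formline makes it possible to build the missing string-returning counterpart to write. Here is a minimal sketch; the name swrite is our own invention, not a Perl builtin:

```perl
use strict;
use warnings;

# fill out a single picture line and return the result as a string;
# formline accumulates its output in the special variable $^A
sub swrite {
    my ($picture, @values) = @_;
    local $^A = '';
    formline($picture, @values);
    return $^A;
}

# picture in single quotes so '@' and '$' are not interpolated
print swrite('Name: @<<<<<<<<< Id: @###' . "\n", 'alice', 42);
```

The left-justified field pads 'alice' to ten characters and the numeric field right-justifies 42, so the output line has a fixed width regardless of the values supplied.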
Formats and Filehandles

Formats are intimately connected with filehandles, and not just because they often look like them. Formats work by being directly associated with filehandles, so that when we come to use them all we have to do is write to the filehandle and have the associated format automatically triggered. It might seem strange that we associate a format with a filehandle and then write to the filehandle, rather than specifying which format we want to use when we do the writing, but there is a certain logic behind this mechanism. There are in fact two formats that may be associated with a filehandle; the main one is the one that is used when we write, but we can also have a top-of-page format that is used whenever Perl runs out of room on the current page and is forced to start a new one. Since this is associated with the filehandle, Perl can use it automatically when we use write rather than needing to be told.
Defining the Top-of-Page Format

Perl allows two formats to be associated with a filehandle. The main format is used whenever we issue a write statement. The top-of-page format, if defined, is issued at the start of the first page and at the top of each new page. Pagination is determined by the special variables $= (the length of the page) and $- (the number of lines left on the current page). Each time we use write the value of $- decreases. When there is no longer sufficient room to fit the results of the next write, a new page is started, a new top-of-page format is written and only then is the result of the last write issued. The main format is automatically associated with the filehandle of the same name, so that the format MYFORMAT is automatically used when we use write on the filehandle MYFORMAT. The top-of-page format can similarly be associated by giving it the name of the filehandle with the text _TOP appended. For instance, to assign a main and top-of-page format to the filehandle MYFORMAT we would use something like this:

format MYFORMAT =
...main format definition...
.

# define a format that gives the current page number (held in $%)
format MYFORMAT_TOP =
This is page @<<<
$%
-----------------------
.
Assigning Formats to Standard Output

Since standard output is the filehandle most usually associated with formats, we can omit the format name when defining formats. Here is a pair of formats defined explicitly for standard output:

format STDOUT =
The magic word is "@<<<<<<<<"
$word
.

format STDOUT_TOP =
Page @>
$%
----------
.
We can however omit STDOUT for the main format and simply write:

format =
The magic word is "@<<<<<<<<"
$word
.
This works because standard output is the default output filehandle. If we change the filehandle with select then format creates a format with the same name as that filehandle instead. The write function also allows us to omit the filehandle; to write out the formats assigned to standard output (or whatever filehandle is currently selected) we can simply put: write;
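Putting these pieces together, here is a self-contained sketch that defines a format, attaches it to an in-memory filehandle via select and $~, and captures the output so it can be inspected (the format name MAGIC is our own choice):

```perl
use strict;
use warnings;

my $word;

format MAGIC =
The magic word is "@<<<<<<<<"
$word
.

# write into an in-memory filehandle so we can examine the result
my $out = '';
open my $fh, '>', \$out or die "open: $!";

my $old = select $fh;    # make $fh the current output filehandle
$~ = 'MAGIC';            # ...and attach our format to it
$word = 'xyzzy';
write;
select $old;             # restore the previous default filehandle
close $fh;

print $out;
```

The nine-character field pads 'xyzzy' with four trailing spaces, so the captured line keeps its fixed width.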
Determining and Assigning Formats to Other Filehandles

We are not constrained to defining formats with the same name as a filehandle in order to associate them. We can also find their names and assign new ones using the special variables $~ and $^. The special variable $~ ($FORMAT_NAME with use English) defines the name of the main format associated with the currently selected filehandle (which will be standard output unless we have issued a select statement). For example, to find out the name of the format associated with standard output we would write:

$format = $~;
Likewise, to set the current format we can assign to $~:

# set standard output format to 'MYFORMAT'
$~ = 'MYFORMAT';

use English;
$FORMAT_NAME = 'MYFORMAT';    # more legibly
Note that the variable is set to the name of the format as a string, not to the format itself, hence the quotes.
The special variable $^ ($FORMAT_TOP_NAME with use English) performs the identical role for the top-of-page format:

# save name of current top-of-page format
$topform = $^;

# assign new top-of-page format
$^ = 'MYFORMAT_TOP';

# write out main format associated with standard out, using top-of-page format if necessary
write;

# restore original top-of-page format
$^ = $topform;
Setting formats on other filehandles using the variables $~ and $^ requires special maneuvering with select. select makes a different filehandle the default filehandle, and as a by-product causes all the special variables such as $~ to refer to it. The original filehandle is returned, so we can use that to restore the default filehandle once we have finished:

# set formats on a different filehandle
$oldfh = select MYHANDLE;
$~ = 'MYFORMAT';
$^ = 'MYFORMAT_TOP';
select $oldfh;
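The select juggling is easily hidden inside a small helper subroutine (our own invention, not a standard function), so callers never see the handle switching:

```perl
use strict;
use warnings;

# set the main and top-of-page formats on an arbitrary filehandle,
# leaving the currently selected filehandle as we found it
sub set_formats {
    my ($fh, $main, $top) = @_;
    my $old = select $fh;
    $~ = $main;
    $^ = $top if defined $top;
    select $old;
}
```

After set_formats($fh, 'MYFORMAT', 'MYFORMAT_TOP'), a plain write $fh will use both formats.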
A better way to handle this without resorting to select is to use the IO::Handle module, or IO::File if we want to open and close files too. This provides an altogether simpler object-oriented way of setting up reports:

$fh = new IO::File ("> $outputfile");
...
$fh->format_name('MYFORMAT');
$fh->format_top_name('MYFORMAT_TOP');
...
write $fh;
# or
$fh->format_write();
Format Structure

Having seen how to assign and use formats it's now time to take a look at actually defining them. As we have already explained in brief, formats consist of a collection of picture and data lines, interspersed with optional comments, combined into a here-style document that is ended with a single full stop. Of the three, comments are by far the simplest to explain, but have no effect on the results of the format. They resemble conventional Perl comments and simply start with a # symbol, as this example demonstrates:

format FORMNAME =
# this is a comment. The next line is a picture line
This is a picture line with one @<<<<<<<<<<.
# this is another comment.
# the next line is a data line
"placeholder"
# and don't forget to end the format with a '.':
.
Picture and data lines take a little more explaining. Since they are the main point of using formats at all, we will start with picture lines.
Picture Lines and Placeholders

Picture lines consist of literal text intermingled with placeholders, which the write function fills in with data at the point of output. If a picture line does not contain any placeholders at all it is treated as literal text, and is simply printed out. Since it does not require any data to fill it out it is not followed by a data line. This means that several picture lines can appear one after the other, as this static top-of-page format illustrates:

format STATIC_TOP =
This header was generated courtesy of Perl formatting
See Chapter 18 of Professional Perl for details
------------------------------------------
.
Placeholders are defined by either an @ or a ^, followed by a number of <, |, >, or # characters that define the width of the placeholder. Picture lines that contain placeholders must be followed by a data line (possibly with comments in between) that defines the data to be placed into the placeholder when the format is written. Formats do not support the concept of a variable-width placeholder. The resulting text will always reserve the defined number of characters for the substituted value irrespective of the actual length of the value, even if it is undefined. It is this feature that makes formats so useful for defining structured text output – we can rely on the resulting text exactly conforming to the layout defined by the picture lines. For example, to define a ten-character field that is left justified we would use:

This is a ten character placeholder: @<<<<<<<<<
$value_of_placeholder
Note that the @ itself counts as one of the characters, so there are nine < characters in the example, not ten. To specify multiple placeholders we just use multiple instances of @, and supply enough values in the data line to fill them. This example has a left, center, and right justified placeholder:

This picture line has three placeholders: @<<<@|||@>>>
$first, $second, $third
The second example defines three four character wide placeholders. The <, |, and > characters define the justification for fields more than one character wide; we can define different justifications using different characters as we will see in a moment. Programmers new to formats are sometimes confused by the presence of @ symbols. In this case @ has nothing to do with interpolation, it indicates a placeholder. Because of this, we also cannot define a literal @ symbol by escaping it with a backslash, that is an interpolation feature. In fact the only way to get an actual @ (or indeed ^) into the resulting string is to substitute it from the data line:

# this '@' is a placeholder:
This is a literal '@'
# but we can get a literal '@' by substituting one in on the data line:
'@'
Simple placeholders are defined with the @ symbol. The caret ^ or 'continuation' placeholder however has special properties that allow it to be used to spread values across multiple output lines. When Perl sees a ^ placeholder it fills out the placeholder with as much text as it reasonably can and then truncates the text it used from the start of the string. It follows from this that the original variable is altered and that to use a caret placeholder we cannot supply literal text. Further uses of the same variable can then fill in further caret placeholders. For example, this format reformats text into thirty-eight columns with a > prefix on each line:

format QUOTE_MESSAGE =
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
.
This creates a format that processes the text in the variable $message into four lines of forty characters, fitting as many words as possible into each line. When write comes to process this format it uses the special variable $: to determine how and where to truncate the line. By default it is set to ' \n-' to break on spaces, newlines or hyphens, which works fine for most plain text. There are a number of problems with this format – it only handles four lines, and it always fills them out even if the message is shorter than four lines after reformatting. We will see how to suppress redundant lines and automatically repeat picture lines to generate extra ones with the special ~ and ~~ strings shortly.
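The consuming behavior of caret placeholders is easy to watch with formline directly. Each call bites off as many leading words of the variable as fit the field; a minimal sketch using a 15-character field:

```perl
use strict;
use warnings;

my $message = 'the quick brown fox jumps over the lazy dog';

$^A = '';
# each pass consumes the leading words of $message that fit the field,
# so the loop ends once the whole message has been moved into $^A
formline('> ^<<<<<<<<<<<<<<' . "\n", $message) while length $message;
print $^A;
```

After the loop $message is empty; the text has been transferred, line by line, into $^A, with each line carrying the '> ' prefix from the picture.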
Justification

It frequently occurs that the width of a field exceeds that of the data to be placed in it. In these cases, we need to decide how the format will deal with the excess, since a fixed width field cannot shrink (or grow) to fit the size of the data. A structured layout is the entire point of formats. If the data we want to fill the placeholder is only one character wide, we need no other syntax. As an extreme case, to insert six single character items into a format we can use:

The code is '@@@@@@'
# use first six elements of @digits, assumed to be from 0 to 9.
@digits
For longer fields we need to choose how text will be aligned in the field through one of four justification methods, depending on which character we use to define the width of the placeholder:

Placeholder   Alignment                  Example
<             Left justified.            @<<<<
>             Right justified.           @>>>>
|             Center justified.          @||||
#             Right justified numeric.   @####
The <, |, and > justification styles are mostly self-explanatory; they align values shorter than the placeholder width to the left, center, or right of the placeholder. They pad the rest of the field with spaces (note that padding with other characters is not supported; if we want to do that we will have to generate the relevant value by hand before it is substituted). If the value is the right length in any case, then no justification occurs. If it is longer, then it is truncated on the right irrespective of the justification direction. The numeric # justification style is more interesting. With only # characters present it will insert an integer based on the supplied value – for an integer number it substitutes in its actual value, but for a string or the undefined value it substitutes in 0, and for a floating point number it substitutes in the integer part. To produce a percentage placeholder for example we can use:

Percentage: @##%
$value * 100
If however we use a decimal point character within the placeholder then the placeholder becomes a decimal placeholder, with floating-point values point-justified to align themselves around the position of the point:

Result (2 significant places): @####.##
$result
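A quick check of the decimal style using formline, which fills picture lines into $^A without printing anything:

```perl
use strict;
use warnings;

$^A = '';
# '@####.##' behaves like sprintf '%8.2f': five characters before the
# point (including the @ itself), then two decimal places
formline('Result: @####.##' . "\n", 3.14159);
formline('Result: @####.##' . "\n", 42);
print $^A;
```

The two values come out right-aligned on the decimal point, with the integer 42 filled out to 42.00, which is what makes this style convenient for columns of figures.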
This provides a very simple and powerful way to align columns of figures, automatically truncating them to the desired level of accuracy at the same time. If the supplied result is not a floating-point number then the fractional places are filled in with 0, and for strings and undefined values the ones column is also filled in with 0. The actual character used by the decimal placeholder to represent the decimal point is defined by the locale, specifically the LC_NUMERIC value of the locale. In Germany for instance, the conventional symbol to separate the integer and fractional parts is a comma, not a point. Formats are in fact the only part of Perl that directly accesses the locale in this way, possibly because of their long history; all other parts of the language adhere to the use locale directive. Although deprecated in modern Perl, we can also use the special variable $# to set the point character. The final placeholder format is the * placeholder. This creates a raw output placeholder, producing a complete multi-line value in one go, and consequently can only be placed after an @ symbol; it makes no sense in the context of a continuation placeholder since there will never be a remainder for a continuation to make use of. For example:

> @* <
$multiline_message
In this format definition the value of $multiline_message is output in its entirety when the format is written. The first line is prefixed with a > and the last is suffixed with <. No other formatting of any kind is done. Since this placeholder has variable width (and indeed, variable height) it is not often used since it is effectively just a poor version of print that happens to handle line and page numbering correctly.
Data Lines

Whenever a picture line contains one or more placeholders it must be immediately followed by a data line consisting of one or more expressions that supply the information to fill them. Expressions can be literal numeric or string values, variables, or compound expressions:

format NUMBER =
Question: What do you get if you multiply @ by @?
6, 9
Answer: @#
6*9
.
Multiple values can be given either as an array or a comma separated list:

The date is: @###/@#/@#
$year, $month, $day
If insufficient values are given to fill all the placeholders in the picture line then the remaining placeholders are undefined and padded out with spaces. Conversely if too many values are supplied then the excess ones are discarded. This behavior changes if the picture line contains ~~ however, as shown below.
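This filling and discarding behavior can be seen directly with formline, the function that underlies write (a minimal sketch; the picture lines and values here are our own invention):

```perl
use strict;
use warnings;

# demonstrate how formline pads missing values and discards excess ones
$^A = "";
formline "@<< @<<\n", "one", "two", "three";   # 'three' is simply discarded
my $excess = $^A;

$^A = "";
{
    no warnings 'uninitialized';               # the missing value is undef
    formline "@<< @<<\n", "one";               # second field padded with spaces
}
my $missing = $^A;
```

Here $excess contains "one two\n", while $missing contains "one" followed only by padding spaces.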
If we generate a format using conventional quoted strings rather than the here-document syntax, we must take special care not to interpolate the data lines. This is made more awkward because, in order for the format to compile, we need to use \n to create newlines at the end of each line of the format, including the data lines, and these do need to be interpolated. Separating the format out onto separate lines is probably the best approach, though as this example shows, even then it can be a little hard to follow:

# define page width and output filehandle
$page_width = 80;
$output = "STDOUT_TOP";

# construct a format statement from concatenated strings
$format_st = "format $output =\n".
    'Page @<<<'. "\n".
    '$%'. "\n".
    ('-'x$page_width). "\n".
    ".\n";               # don't forget the trailing '.'

# define the format - note the data line is single-quoted and
# not interpolated, to preserve '$%'
eval $format_st;
Note that continuation placeholders (which are defined by a leading caret) need to be able to modify the original string supplied in order to truncate the start. For this reason an assignable value such as a scalar variable, array element or hash value must be used with these fields.
Suppressing Redundant Lines

format and write support two special picture strings that alter the behavior of the placeholders in the same picture line, both of which apply when the placeholders are all continuation (caret) placeholders. The first is a single tilde, or ~, character. When this occurs anywhere in a picture line containing caret placeholders, the line is suppressed if there is no value to plug into the placeholder. For example, we can modify the quoting format we gave earlier to suppress the extra lines if the message is too short to fill them:

format QUOTE_MESSAGE =
> ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~
$message
^<<<<<<<<<<<<<<<<<<<<<<<<<...~
$message
.
In this example the bottom three picture lines have a ~ suffix, so they will only be used if $message contains sufficient text to fill them after it has been broken up according to the break characters in $:. When the format is written, the tildes are replaced with spaces. Since they are at the end of the line in this case, we will not see them, which is why conventionally they are placed here. If we have spaces elsewhere in the picture line we can replace one of them with the tilde and avoid the trailing space.
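The suppression can be observed directly with formline, which implements the same picture semantics as write (a small sketch; the field width is arbitrary):

```perl
use strict;
use warnings;

my $message = "";                     # nothing left to fill the field
$^A = "";
formline "^<<<<<<<<<<~\n", $message;  # all fields blank: line is suppressed
my $suppressed = $^A;

$message = "hello";
$^A = "";
formline "^<<<<<<<<<<~\n", $message;  # line is used, the ~ becomes a space
my $used = $^A;
```

$suppressed remains empty, while $used starts with 'hello'.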
We modify the last picture line to indicate that the message may have been truncated because we know that it will only be used if the message fills out all the previous lines. In this case we have replaced the last three < characters with dots. The ~ character can be thought of as a zero-or-one modifier for the picture line, in much the same way that ? works in regular expressions. The line will be used if Perl needs it, but it can also be ignored if necessary.
Autorepeating Pattern Lines

If two adjacent tildes appear in a pattern line, write will automatically repeat the line while there is still input. If ~ can be likened to the ? zero-or-one metacharacter of regular expressions, ~~ can be likened to *, zero-or-more. For instance, to format text into a paragraph of a set width but an unknown number of lines we can use a format like this:

format STDOUT =
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$text
.
Calling write with this format will take the contents of $text and reformat it into a column thirty characters wide, repeating the pattern line as many times as necessary until the contents of $text are exhausted. Anything else in the pattern line is also repeated, so we can create a more flexible version of the quoting pattern we gave earlier that handles a message of any size:

format QUOTE =
>~~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$message
.
Like ~, the ~~ itself is converted into a space when it is output. It also does not matter where it appears, so in this case we have put it between the > quote mark and the text, to suppress the extra space on the end of the line it would otherwise create. Note that ~~ only makes sense when used with a continuation placeholder, since it relies on the continuation to truncate the text. Indeed, if we try to use it with a normal @ placeholder Perl will return a syntax error since this would effectively be an infinite loop that repeats the first line. Since write cannot generate infinite quantities of text, Perl prevents us from trying.
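The repetition can again be demonstrated with formline, which honors ~~ in its picture (a sketch with an arbitrary eight-character field):

```perl
use strict;
use warnings;

my $text = "one two three four";
$^A = "";
formline "^<<<<<<<~~\n", $text;   # repeat the line until $text is exhausted
my @lines = split /\n/, $^A;      # one element per repeated picture line
```

With this input the text wraps onto three lines, and $text itself is consumed by the continuation field as it goes.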
Page Control

Perl's reporting system uses several special variables to keep track of line and page numbering. We can use these variables in the output to produce things like line and page numbers. We can also set them to control how pages are produced. There are four variables of particular interest:

Variable    Corresponds to:
$=          The page length
$%          The page number
$-          The number of lines remaining
$^L         The formfeed string
$= (or $FORMAT_LINES_PER_PAGE with use English) holds the page length, and by default is set to 60 lines. To change the page length we can assign a new value to $=, for example:

$= = 80;    # set page length to 80 lines

Or more legibly:

use English;
$FORMAT_LINES_PER_PAGE = 80;
If we want to generate reports without pages we can set $= to a very large number. Alternatively, we can redefine $^L to an empty string and either avoid defining a 'top-of-page' format or subsequently redefine it to nothing.

$% (or $FORMAT_PAGE_NUMBER with use English) holds the number of the current page. It starts at 1 and is incremented by one every time a new page is started, which in turn happens whenever write runs out of room on the current page. We can change the page number explicitly by modifying $%, for example:

$% = 1;    # reset page count to 1
$- (or $FORMAT_LINES_LEFT with use English) holds the number of lines remaining on the current page. Whenever write generates output it decrements this value by the number of lines in the format. If there are insufficient lines left (the size of the output is greater than the number of lines left) then $- is set to zero, the value of $% is incremented by one, and a new page is started, beginning with the value of $^L and followed immediately by the top-of-page format, if one is defined. We can force a new page immediately on the next write by setting $- to zero:

$- = 0;    # force a new page on the next 'write'
Finally, $^L (or $FORMAT_FORMFEED with use English) is output by write whenever a new page is started. Unlike the top-of-page format it is not issued before the first page, but it is issued before the top-of-page format for all subsequent pages. By default it is set to a formfeed character, \f, but can be set to nothing or a longer string if required. See 'Creating Footers' below for a creative use of $^L.

As an example of using the page control variables, here is a short program that paginates its input file, adding the name of the file and a page number to the top of each page. It also illustrates creating a format dynamically with eval, so we can define not only the height of the resulting pages but their width as well:

#!/usr/bin/perl
# paginate.pl
use warnings;
use strict;
no strict 'refs';

use Getopt::Long;

# get parameters from the user
my $height = 60;    # length of page
my $width = 80;     # width of page
my $quote = "";     # optional quote prefix
GetOptions(
    'height|size|length:i' => \$height,
    'width:i'              => \$width,
    'quote:s'              => \$quote,
);

die "Must specify input file" unless @ARGV;

# get the input text into one line, for continuation
undef $/;
my $text = <>;

# set the page length
$= = $height;

# if we're quoting, take that into account
$width -= length($quote);

# define the main page format - a single autorepeating continuation field
my $main_format = "format STDOUT = \n".
    '^'.$quote.('<' x ($width-1))."~~\n".
    '$text'. "\n".
    ".\n";
eval $main_format;

# define the top of page format
my $page_format = "format STDOUT_TOP = \n".
    '@'.('<' x ($width/2-6)). ' page @<<<'. "\n".
    '$ARGV,$%'. "\n".
    '-'x$width. "\n".
    ".\n";
eval $page_format;

# write out the result
write;
To use this program we can feed it an input file and one or more options to control the output, courtesy of the Getopt::Long module, for example: > perl paginate.pl input.pl -w 50 -h 80
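The same page-control variables can be watched in miniature without any external files. The following self-contained sketch (the format names PG and PG_TOP and the in-memory filehandle are our own choices) writes five one-line records with a three-line page, so each page holds a header plus two records:

```perl
use strict;
use warnings;

our $item;                # package variable so the formats can see it

format PG_TOP =
---- page @< ----
$%
.

format PG =
@<<<<<<<<<
$item
.

my $buf = '';
open(my $fh, '>', \$buf) or die "open: $!";
my $old = select($fh);    # page control variables apply to the selected handle
$= = 3;                   # tiny page: 3 lines per page
$~ = 'PG';                # body format for this filehandle
$^ = 'PG_TOP';            # top-of-page format for this filehandle
for $item (qw(alpha beta gamma delta epsilon)) { write }
my $pages = $%;           # read the page count while $fh is still selected
select($old);
close $fh;
```

Five records on three-line pages produce three pages, and since $^L is only issued before the second and subsequent pages, $buf ends up containing exactly two formfeeds.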
Creating Footers

Footers are not supported as a concept by the formatting system; there is no 'bottom-of-page' format. However, with a little effort we can improvise our own footers. The direct and obvious way is to keep an eye on $-, and issue the footer when we get close to the bottom of the page. For instance, if the footer is smaller in lines than the output of the main format, we can use something like the following, assuming that we know the size of the output and that it is consistent:

print "\nPage $%\n" if $- < $size_of_format;
This is all we need to do, since the next attempt to write will not have sufficient space to fit and will automatically trigger a new page. If we want to make sure that we start a new page on the next write, we can set $- to 0 to force it:

print("\nPage $%\n"), $- = 0 if $- < $size_of_format;
A more elegant and subtle way of creating a footer is to redefine $^L. This is a lot simpler to arrange, but suffers in terms of flexibility since the footer is fixed once it is defined, so page numbering is not possible unless we redefine the footer on each new page. For example, if we want to put a two-line footer on the bottom of sixty-line pages, we can do so by putting the footer into $^L (suffixed with the original formfeed) and then reducing the page length by the size of the footer, in this case to fifty-eight lines:

# define a footer
$footer = ('-'x80). "\nEnd of Page\n";

# redefine the format formfeed to be the footer plus a formfeed
$^L = $footer. "\f";

# reduce page length from default 60 to 58 lines
# if we wanted to be creative we could count the instances of '\n' instead
$= -= 2;
Now every page will automatically get a footer without any tracking or examination of the line count. The only work we have to do now is to add the footer to the last page, since the formatting will not do that for us. That is easily achieved by outputting newline characters up to the page length and then printing the footer ourselves. The number of lines left to fill is already held by $-, so this turns out to be trivial:

print "\n" x $-;    # fill out the rest of the page (to 58 lines)
print $footer;      # print the final footer
As we mentioned earlier, arranging for a changing footer such as a page number is slightly trickier, but it can be done by remembering and checking the value of $- after each write:

$lines = $-;
write;
redefine_footer() if $- > $lines;
This will work for a lot of cases, but will not always work if we are using ~~, since it may cause write to generate more lines than the page has left before we get a chance to check it. Like many things in Perl, which approach is preferable is often down to the particular task at hand.
Combining Reports and Regular Output

It is perfectly possible to print both formatted output, such as that generated by write, and unformatted output, such as that generated by print, on the same filehandle. We have two possible approaches: mixing write and print, or using the formline function to generate formatted text that can then be output with print later.
Mixing 'write' and 'print'

write and print can be freely mixed together, sending both formatted and unformatted output to the same filehandle. However, print knows nothing about the special formatting variables such as $=, $-, and $% that track pagination and trigger the top-of-page format. Consequently, pages will be of uneven length unless we take care of tracking line counts ourselves.
For instance, if we print out a three-line record after each write, we can keep write up-to-date on what is happening by decrementing $-, the number of lines left:

write;
foreach (@extra_lines) {
    print $_, "\n";
    --$-;    # decrement $-
}
Unfortunately this does not take into account that $- might become negative if there is not enough room left on the current page. Due to the complexities of managing a mixture of write and print, it is often simpler either to use formline, or to create a special format that is simply designed to print out the information we were using print for. Here's an example that defines several different formats and switches between them to handle different contingencies in outputting records with varying content and structure:

#!/usr/bin/perl
# multiformat.pl
use warnings;
use strict;
# set the lines remaining on the default page size
$- = 60;

# define some simple records to demonstrate the technique
my @records = (
    { main    => 'This is the first record, with three extra lines of data',
      extra   => ['Extra 1', 'Extra 2', 'Extra 3'],
      comment => 'Each part of the record is printed out using a different format' },
    { main    => 'This is the second record, which has only one extra line',
      extra   => ['An extra line'] },
    { main    => 'The third record has no extra data at all',
      comment => 'So we switch to a different format and print a special message instead' },
    { main    => 'This is the fourth record, with three more extra lines',
      extra   => ['Extra 4', 'Extra 5', 'Extra 6'] },
);
# define main format for main body of record
format MAIN =
@#: ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$_+1, $records[$_]{main}
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$records[$_]{main}
.

# define a format for displaying extra data
format EXTRA =
[ @<<<<<<<<<<<<<<<<<<<<<<<<<<<< ]
$_
.

# define a format for no extra data
format NO_EXTRA =
< No extra data for this record >
.
# define a format for displaying the comment
format COMMENT =
Comment: ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$records[$_]{comment}
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$records[$_]{comment}
.

# iterate through the list of hashes, using extra format when required
foreach (0..$#records) {
    # write out the main format
    $~ = 'MAIN';
    write;
    if (exists $records[$_]{extra}) {
        # change format
        $~ = 'EXTRA';
        # write out the extra lines
        foreach (@{$records[$_]{extra}}) {
            write;
        }
    } else {
        # change to the no data message format
        $~ = 'NO_EXTRA';
        write;
    }
    $~ = 'COMMENT', write if exists $records[$_]{comment};
}
A less flexible but simpler solution to the same problem might use the special @* placeholder to output the extra lines; see earlier for a description of what this placeholder does and its limitations.
Generating Report Text with 'formline'

The formline function is a lower-level interface to the same formatting system that write uses, and is in fact the internal function that write uses to perform its task. formline generates text from a single picture line and a list of values, and places the result into the special variable $^A. For example, this is how we could create a formatted string containing the current time using formline:

($sec, $min, $hour) = localtime;
formline '@#/@#/@#', $hour, $min, $sec;
$time = $^A;
print "The time is: $time \n";
Of course, in this case it is probably easier to use sprintf, but we can also use formline to create text from more complex patterns. For instance, to format a chunk of text into an array of text lines we could use formline like this:

$text = get_text();    # get a chunk of text from somewhere

my @lines;
while ($text) {
    $^A = '';          # clear the accumulator before formatting each line
    formline '^<<<<<<<<<<<<<<<<<<<', $text;
    push @lines, $^A;
}
This code takes the contents of $text and turns it into an array of lines word-wrapped to fit inside a twenty-character column. formline is only designed to handle single lines, so it ignores newlines in the supplied pattern and treats the whole of the picture text as one line for the purposes of processing. This means that we cannot feed formline a complete format definition and expect it to produce the correct result in $^A.

Strangely, there is no simple way to generate text from write, other than by redirecting filehandles, since write sends its results to a filehandle. However, we can produce a version of write that returns its result instead, in the same way that sprintf returns a string instead of printing it like printf:

sub swrite ($@) {
    my $picture = shift;
    formline($picture, @_);
    return $^A;
}
This function is a friendly version of formline, but it is not a direct replacement for write, since it only operates on a single picture line and expects a conventional list of values as an argument. However, for generating text for use in print and other code, it is a lot more convenient than either write or formline.
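One refinement worth making: because formline appends to $^A rather than replacing it, a reusable swrite should clear the accumulator first (a sketch; the picture and values are arbitrary):

```perl
use strict;
use warnings;

sub swrite {
    my $picture = shift;
    $^A = "";               # clear the accumulator so calls do not concatenate
    formline($picture, @_);
    return $^A;
}

my $line = swrite("@<<<<< @>>>>\n", "left", "right");
# $line is now "left   right\n" - left-justified, then right-justified
```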
Summary

This chapter dealt with text processing in some depth. To begin with, we looked at text-processing modules, including Text::Tabs and Text::Abbrev. Following this we covered parsing, with particular reference to the following topics:
- Parsing space-separated text.
- Parsing arbitrarily delimited text.
- Batch parsing multiple lines.
- Parsing a single line.
A section on customized wrapping followed, discussing, among other things, the following topics:
- Formatting width.
- Long words.
- Break text.
- Debugging.
Then we looked at tokenizing text, and the Soundex algorithm and module. After this we covered pod, with emphasis on the following topics:
- pod paragraphs.
- pod tools and utilities.
- Programming pod.
The final section in this chapter dealt with reports. We looked at the format datatype, formats and filehandles, format structure (including justification), and page control. To top it all off we briefly talked about combining reports and regular output, and how to generate report text with formline.
Object-oriented Perl

Objects are, in a nutshell, a way to hide complexity behind an opaque value which holds not only data, but all the code necessary to access, manipulate, and store it. All objects belong to an object class, and the class defines what kind of object they are. The code that implements the object's features also belongs to the class, and the objects, sometimes called object instances, are simply values that belong to a given class. They 'know' what kind of object they are, and therefore which class the subroutines that can be used through them come from.

In Perl, an object class is just a package, and an object instance is just a reference that knows its class and points to the data that defines the state of that particular instance. Perl was not originally an object-oriented language; only from version 5 did it acquire the necessary features (symbolic references and packages) to implement objects. As a result, Perl's object-oriented features are relatively basic, and not compulsory. Perl takes a belt-and-braces approach to object-oriented programming, espousing no particular object-oriented doctrine (of which there are many), but permitting a broad range of different object-oriented styles.

Many object-oriented languages take a much stricter line. Being strict is the entire point for some languages. Java, for instance, requires that everything be an object, even the main application. Other languages have very precise views about what kind of object model they support, how multiple inheritance works, how public and private variables and methods are defined, how objects are created, initialized, and destroyed, and so on. Perl does not have any particular perspective, which makes it both extremely flexible and highly disconcerting to programmers used to a different object-oriented style.
Because Perl does not dictate how object-oriented programming should be done, it can leave programmers who expect a more rigorous framework confused and aimless, which is one reason why Perl sometimes has a bad reputation for object-oriented programming. However, during the course of this chapter we hope to show that by learning the basics of how Perl implements objects, a programmer can wield Perl in a highly effective way to implement object-oriented programs.
Chapter 19
In this chapter we introduce object-oriented programming from the Perl perspective. We then go on to using objects (which need not imply an object-oriented program), and then tackle the meat of the chapter – writing object classes, including constructors and destructors, properties and attributes, and single and multiple inheritance. We also take a look at a uniquely Perlish use of objects – mimicking a standard data type by tieing it to an object-oriented class. The DBM modules are a well-known example, but there are many other interesting uses for tied objects too.
Introducing Objects

Object-oriented programming is an entirely different method of implementing libraries and applications from the traditional or functional approach. Object orientation allows us to clearly define the interface to our library, and thereby allows us to abstract the actual implementation so it can be easily reused and adapted through inheritance. Although we can do the same thing in functional implementations, it is a lot harder and more prone to error. Object orientation helps us to rethink the way we design programs so that we gain these advantages more easily. In order to appreciate how Perl implements and provides for object-oriented programming, therefore, a basic grasp of object-oriented concepts is necessary.
Object Concepts

Since this is not a treatise on object orientation, we will not dwell on the fundamentals of object orientation in detail. Indeed, one of the advantages of Perl's approach is that we do not need to pay nearly so much attention to them as we often do in other languages; Perl's hands-on approach means that we can strip away a lot of the jargon that object orientation often brings with it. However, several concepts are key to any kind of object-oriented programming, so here is a short discussion of the most important ones, along with Perl's perspective on them:
Classes

An object class provides the implementation of an object. It consists of class methods, which are routines that perform functions for the class as a whole. It also consists of object methods, routines that perform functions for individual objects (or object instances). It may also contain package variables, or in object-oriented terminology, class attributes. The details of the class are hidden behind the interface provided by these methods, in the same way that a regular functional module hides its details. All object classes contain at least one important class method: a constructor that generates new object instances. In addition, they may have a destructor, for tidying up after objects that are destroyed.

Perl implements object classes with packages. In fact, a package is an object class by another name. This basic equivalence is the basis for much of Perl's simple and obvious approach to objects in general. A class method is just a subroutine that takes a package name as its first argument, and an object method is a subroutine that takes an object name as its first argument. Perl automatically handles the passing of this first argument when we use the arrow (->) operator.
Objects

Objects are individual instances of an object class, consisting of an opaque value representing the state of the object but abstracting the details. Because the object implicitly knows what class it belongs to, we can call methods defined in the object class through the object itself, in order to affect the object's state. Objects may contain, within themselves, different individual values called object attributes (or occasionally instance attributes).
Perl implements objects through references; the object's state is held by whatever it is that the reference points to, which is up to the object's class. The reference is told what class it belongs to with the bless function, which marks the references as belonging to a particular class. Since a class is a package, method calls on the object (using ->) are translated by Perl into subroutine calls in the package. Perl passes the object as the first argument so the subroutine knows which object to operate on. An object may have properties and attributes representing its state. The storage of these is up to the actual data type used to store this information; typically it is a hash variable and the attributes are simply keys in the hash. Of course, the point of object orientation is that the user of an object does not need to know anything about this.
Inheritance, Multiple Inheritance, and Abstraction

One important concept of object-oriented programming, and the place where objects score significant gains over functional programming, is inheritance. An object class may inherit methods and class attributes from parent classes, in order to provide some or all of its functionality; a technique also known as subclassing. This allows an object class to implement only those features that differentiate it from a more general parent, without having to worry about implementing the features contained in its parent. Inheritance encourages code reuse, allowing us to use tried and tested objects to implement our core features rather than reinventing the wheel for each new task. This is an important goal of any programming environment, and one of the principal motivations behind using object-oriented programming.

Multiple inheritance occurs when a subclass inherits from more than one parent class. This is a contentious issue, since it can lead to different results depending on how parent classes are handled when two classes both support a method that a subclass needs. Accordingly, not all object-oriented languages allow or support it. Dynamic inheritance occurs when an object class is able to change the parent or parents from which it inherits. It also occurs when a new subclass is created on the fly during the course of execution. Again, not all languages allow or support this.

An important element of inheritance is that the subclass does not need to know how the parent class implements its features, only how to use them to implement its own variation – the interface. This gives us abstraction, an important aspect of object-oriented programming that allows for easy reuse of code; the parent class should be able to change its implementation without subclasses noticing. Inheritance in most object-oriented languages happens through some sort of declaration in the class.
In Perl, inheritance is supported through a special array that defines 'is a' relationships between object classes. Logically enough, it is called @ISA, and it defines what kind of parent class a given subclass is. If anything is in the @ISA array of a package, then the object class defined by that package 'is a' derived class of it. Perl allows for multiple inheritance by allowing an object class to include more than one parent class name in its @ISA array. When a method is not found in the package of an object, its parents are scanned in order of their place in the array until the method is located. If a parent also has an @ISA array, it is searched too. Multiple inheritance is not always a good thing, and Perl's approach to it has problems, but it makes up for it by being blindingly simple to understand. Inheritance in Perl can also be dynamic, since @ISA has all the properties of a regular Perl array variable, so it can be modified during the course of a program's execution to add new parent classes, remove existing ones, entirely replace the parent class(es), or reorder them.
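As a concrete sketch of @ISA at work (the Animal and Dog classes here are invented for illustration), a subclass need only override the method that differs and inherits the rest:

```perl
use strict;
use warnings;

package Animal;
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub speak { my $self = shift; return ref($self) . " says " . $self->sound }
sub sound { return "..." }

package Dog;
our @ISA = ('Animal');           # Dog 'is a' Animal
sub sound { return "Woof" }      # override only the differing behavior

package main;
my $dog  = Dog->new(name => 'Rex');
my $said = $dog->speak;          # speak is found in Animal, sound in Dog
```

Since speak is not defined in Dog, Perl searches @ISA and finds it in Animal; the call to sound from within speak still resolves to Dog's version.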
Public and Private Methods and Data

Both object classes and object instances may have public data, private data, and methods. Public data and methods make up the defined interface for the object class and the objects it implements that external code may use. Private data and methods are intended for use only by the object class and its objects themselves (such as supporting methods and private state information). Making parts of an object class private is also known as encapsulation, though that is not an exclusively object-oriented concept. Good object-oriented design suggests that all data should be encapsulated (accessed by methods, as opposed to being accessed directly). In other words, there should be no such thing as public data in a class or object.

Perl does not have any formal definition of public and private data; it operates an open policy whereby all data and methods are visible to the using package. There is no 'private' declaration, though my can be used to declare file-scoped variables, which are effectively private. The using package is expected to abide by the intended and documented interface and not abuse the fact that it can circumvent it if it chooses. If we really want to we can enforce various types of privacy (for example with a closure), but only by implementing it by hand in the object class. Strangely, by not having an explicit policy on privacy, Perl is a lot simpler than many object-oriented languages that do, because dubious concepts like friend classes and selective privacy simply do not exist.
Polymorphism

Another concept that is often associated with objects is polymorphism. This is the ability of many different object classes to respond to the same request, but in different ways. In essence, this means that we can call an object method on an object whose class we do not know precisely, and get some form of valid response. The class determines the actual response of the object, but we do not need to know which class the object is contained in, in order to call the method. Inheritance provides a very easy way to create polymorphic classes. By inheriting and overriding methods from a single parent class, many subclasses can behave the same way to the user. Because they inherit a common set of methods, we can know with surety that the parent interface will work for all its subclasses.

In Perl, polymorphism is simply a case of defining two or more methods (subroutines), in different classes (packages), with the same name and handling the same arguments. A method may then be called on an object within any of the classes without knowing in which class the object actually is. In some cases we might want to use a method which may or may not exist; either we can attempt the call with -> inside an eval, or use the special isa and can methods supported by all objects in order to determine what an object is and isn't capable of. These methods are provided by the UNIVERSAL class, from which all objects implicitly inherit.
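A short sketch of this duck-typing style (the classes here are hypothetical): can returns a code reference when the method exists, so it doubles as a boolean test before the call:

```perl
use strict;
use warnings;

package Duck;
sub new   { return bless {}, shift }
sub quack { return "Quack" }

package Rock;
sub new { return bless {}, shift }   # no quack method at all

package main;
my @things = (Duck->new, Rock->new);
my @noises = map { $_->can('quack') ? $_->quack : '(silent)' } @things;
my $result = "@noises";              # "Quack (silent)"
```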
Overloading

Overloading is the ability of an object class to substitute for existing functionality supplied by a parent class or the language itself. There are two types of overloading: method overloading and operator overloading. Method overloading is simple in concept. It occurs whenever a subclass implements a method with the same name as a parent's method. An attempt to call that method on the subclass will be satisfied by the subclass, and the parent class will never see it. Its method is said to have been overloaded. In the context of multiple inheritance some languages also support parameter overloading, where the correct method can be selected by examining the arguments passed to the method call, and comparing them to the arguments accepted by the corresponding method in each parent class.
Operator overloading is more interesting. It occurs when an object class implements special methods for the handling of operators defined by the core language. When the language sees that an operator, for which an object class supplies a method, is used adjacent to an object, it replaces the regular use of the operator with the version supplied by the class. For instance, this allows us to 'add' objects together using +, even though objects cannot be added. The object class supplies a meaning for the operator, and returns a new object reflecting the operation. Perl supports both kinds of overloading. Method overloading is simply a case of defining a subroutine with the same name as the subroutine to be overloaded in the parent. The subclass can still access the parent's method if it wishes, by prefixing the package name with the special SUPER:: prefix. There is no such thing as parameter overloading in Perl, since its parameter passing mechanism (the @_ array) does not lend itself to that kind of examination. However, a method can select a parent class at runtime by analyzing the arguments passed to it. Operator overloading is also supported through the overload pragmatic module. With this module we can implement an object replacement for any of Perl's built-in operators, including all the arithmetic, logical, assignment, and dereferencing operators.
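A minimal sketch of operator overloading with the overload pragma (the Vector class is invented for illustration); overloaded handlers receive the two operands plus a swap flag, which addition can safely ignore:

```perl
use strict;
use warnings;

package Vector;
use overload
    '+'  => \&add,          # called when a Vector appears beside +
    '""' => \&to_string;    # called when a Vector is used as a string

sub new {
    my ($class, $x, $y) = @_;
    return bless { x => $x, y => $y }, $class;
}

sub add {
    my ($p, $q) = @_;       # third argument (the swap flag) is irrelevant here
    return Vector->new($p->{x} + $q->{x}, $p->{y} + $q->{y});
}

sub to_string { my $self = shift; return "($self->{x}, $self->{y})" }

package main;
my $sum = Vector->new(1, 2) + Vector->new(3, 4);   # 'addition' of objects
my $str = "$sum";                                  # "(4, 6)"
```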
Adaptability (also called Casting or Conversion)

Objects may sometimes be reclassified and assigned to a different class. For instance, a subclass will often use a parent class to create an object, then adjust its properties for its own needs before reclassifying the object as an instance of itself rather than its parent. Objects can also be reclassified en route through a section of code; for example, an object representing an error may be reclassified into a particular kind of error, or reclassified into a new class representing an error that has already been handled. In Perl, objects can be switched into a new class at any time, even into a class that does not exist. We can bless a reference into any class simply by naming the class. If we also create and fill an @ISA array inside this class, then it can inherit from a parent class too, enabling us to create a functional subclass on the fly.
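The reclassification described above can be sketched as follows; the Error and Error::Handled class names are invented for this example:

#!/usr/bin/perl
# rebless.pl - reclassifying an object at run time
use warnings;
use strict;

package Error;
sub new { my $class = shift; return bless {msg => shift}, $class; }

package main;

my $error = Error->new("disk full");
print ref $error, "\n";            # prints 'Error'

# create a functional subclass on the fly by filling its @ISA array,
# then rebless the existing object into it
@Error::Handled::ISA = ('Error');
bless $error, 'Error::Handled';

print ref $error, "\n";            # prints 'Error::Handled'
print "still an Error\n" if $error->isa('Error');

Note that Error::Handled has no package declaration of its own anywhere; assigning to its fully qualified @ISA array is enough to make it a working subclass.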
Programming with Objects

Although Perl supports objects, it does not require that we use them exclusively; it is possible and feasible to use objects from otherwise entirely functional applications. Using objects is therefore not inextricably bound up with writing them. So, before delving into implementation, we will take a brief look at using objects from the outsider's perspective, with a few observations on what Perl does behind the scenes.
Creating Objects

All object classes contain at least one method known in general object-oriented programming circles as a constructor – a class method that creates brand new object instances based on the arguments passed to it. In many object-oriented languages (C++ and Java being prime examples) object creation is performed by a keyword called new. Perl allows us to give a constructor any name, since it is just a subroutine. For example:

$object = My::Object::Class->new(@args);
Chapter 19
In deference to other languages that provide a new keyword, Perl also allows us to invert this call and place the new before the package name: $object = new My::Object::Class(@args);
This statement is functionally identical to the one above, but bears a stronger resemblance to traditional constructors in other languages. However, since Perl does not give new any special meaning, a constructor method may have any name, and take any arguments to initialize itself. We can therefore give our constructor a more meaningful name. We can have multiple constructors too if we like; it is all the same to Perl: $object = old My::Object::Class(@args); $object = create_from_file My::Object::Class($filename); $object = empty My::Object::Class ();
Using Objects

The principal mechanism for accessing and manipulating objects is the -> operator. In a non-object-oriented context this is the dereferencing operator, and we use it on an unblessed reference to access the data it points to. For example, to access a hash by reference:

$value = $hashref->{'key'};
However, in object-oriented use, -> becomes a class access operator, providing the means to call class and object methods (depending on whether the left-hand side is a class name or an object) and access properties on those objects: $object_result = $object->method(@args); $class_result = Class::Name->classmethod(@args); $value = $object->{'property_name'};
Since an object is at heart a reference, these two uses are not as far apart as they might at first seem; the difference is that a blessed reference allows us to call methods because the reference is associated with a package. A regular reference is not associated with anything, and so cannot have anything called through it.
Accessing Properties

Since an object is just a blessed reference, we can access the underlying properties of the object by dereferencing it just like any other reference. For instance, if the object is implemented in terms of a hash we can access its properties like this:

$value = $object->{'property_name'};
Similarly, we can set a property or add a new one with: $object->{'property_name'} = $value;
We can also undef, delete, push, pop, shift, unshift, and generally manipulate the object's properties using conventional list and hash functions. If the underlying data type is different, say an array, or even a scalar, we can still manipulate it, using whatever processes are legal for that kind of value. However, this really has nothing to do with object orientation at all; it is just the normal non-object-oriented dereferencing operator. Perl uses it for object orientation so that we can think of dereferencing an object in terms of accessing an object's public data. In other words, it is a syntactic trick to help keep us thinking in object-oriented terms, even though we are not actually performing an object-oriented operation at heart.

One major disadvantage of accessing an object's data like this is that we break the interface defined by the object class. The class has no ability to restrain or control how we access the object's data if we access it directly; neither can it spot attempts to access invalid properties. Indeed, even the fact that we know what the underlying data type is breaks the rules of good object-oriented programming. The object should be able to use a hash, array, scalar, or even a typeglob, but we should not have to know. Therefore it is better to use methods defined by the class to access and store properties on its objects whenever possible. In an ideal world the class should even be able to change the underlying data representation and still work perfectly in existing code – the object should essentially be an abstract data type.
Calling Class Methods
A class method is a subroutine defined in the class, which operates on a class as a whole, rather than a specific object of that class. To call a class method, we use the -> operator on the package name of the class, which we give as a bare, unquoted term, just as we do for use: $result = My::Object::Class->classmethod(@args);
The new class method typically implemented by most classes is one such case of a class method, and the inverted syntax we used before will also work with any other class method, for example: $result = classmethod My::Object::Class(@args);
In general this syntax should only be used for constructors, where its ordering makes logical sense. The subroutine that implements the class method is called with the arguments supplied by us, plus the package name, which is passed first. In other words, this class method call and the following subroutine call are handled similarly for classes that do not inherit: # method call - correct object-oriented syntax My::Object::Class->method(@args); # subroutine call - does not handle inheritance My::Object::Class::method ('My::Object::Class', @args);
For classes that do inherit, Perl uses the @ISA array to search for a matching method if the class in which the method is looked for does not implement it. This is because the -> operator has an additional important property that differentiates method calls from subroutine calls: a subroutine call has no such magic associated with it.
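The difference can be demonstrated with a pair of invented classes; the method call is satisfied through @ISA, while the equivalent fully qualified subroutine call fails outright:

#!/usr/bin/perl
# isasearch.pl - method calls search @ISA, subroutine calls do not
use warnings;
use strict;

package Parent;
sub hello { return "hello from " . shift; }

package Child;
our @ISA = ('Parent');    # Child defines no 'hello' of its own

package main;

# the method call searches @ISA and finds Parent::hello
print Child->hello, "\n";          # prints 'hello from Child'

# the subroutine call has no inheritance magic - Child::hello
# does not exist, so this dies (trapped here by eval)
eval { Child::hello('Child') };
print "subroutine call failed\n" if $@;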
It might seem redundant that we pass the package name to a class method, since the method surely already knows what package it is in, and can find out by using __PACKAGE__ even if it did not know. Again, however, this is only true for classes that do not inherit. If a parent method is called because a subclass did not implement a class method (a result of using the -> operator), then the package name passed will not be that of the parent but that of the subclass.
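This behavior is easy to confirm with a small sketch (the Parent and Child class names are invented): the same inherited method reports a different first argument depending on which class it was called through, while __PACKAGE__ always names the defining package:

#!/usr/bin/perl
# whichclass.pl - an inherited class method receives the subclass name
use warnings;
use strict;

package Parent;
sub identify {
    my $class = shift;    # the class the method was called through
    return "called as $class, defined in " . __PACKAGE__;
}

package Child;
our @ISA = ('Parent');    # Child inherits 'identify' from Parent

package main;

print Parent->identify, "\n";   # 'called as Parent, defined in Parent'
print Child->identify, "\n";    # 'called as Child, defined in Parent'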
Calling Object Methods

An object method is a subroutine defined in the class that operates on a particular object instance. To call an object method we use the -> operator on an object of a class that supports that method:

$result = $object->method(@args);
The subroutine that implements the object method is called with the arguments we supply, preceded by the object itself (a blessed reference, and therefore a scalar). Therefore the following calls are again nearly the same:

# method call - correct object-oriented syntax
$object->method(@args);

# subroutine call - does not handle inheritance
My::Object::Class::method($object, @args);
Just as with class methods, if the package into which the object is blessed does not provide the named method, the -> operator causes Perl to search for it in any and all parent classes, as defined by the @ISA array; the method call will find a method in a parent class (provided it is present), but the subroutine call will not.
Nesting Method Calls

If the return value from a method (class or object) is another object, we can call a second method on it directly, without explicitly using a temporary variable to store the returned object. Such methods usually occur where there is a 'has-a' relationship between two object classes, and objects of the holding class contain instances of the held class as attributes. The result of this is that we can chain several method calls together:

print "The top card is ", $deck->card(0)->fullname;
This particular chain of method calls is from the Game::Deck example which we provide later in the chapter. It prints out the name of the playing card on the top of the deck of playing cards represented by the $deck object. Game::Deck supplies the card method, which returns a playing card object. In turn, the playing card object (Game::Card, not that we need to know the class) provides the fullname method. We will return to this subject again when we cover 'Has-a' versus 'Is-a' relationships.
Determining what an Object Is

An object is a blessed reference. Calling the ref function on an object returns not the actual data type of the object but the class into which it was blessed:

$class = ref $object;
If we really need to know the underlying data type (which in a well designed object-oriented application should be never, but it can be handy for debugging) we can use the reftype subroutine supplied by the attributes module; see 'References' in Chapter 5 for details. However, knowing what class an object belongs to does not always tell us what we want to know; for instance, we cannot easily use it to determine if an object belongs to a subclass of a given parent class, or even find out if it supports a particular method or not.
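The distinction between the blessed class and the real data type can be shown with a one-off class of our own devising; note that we use the reftype supplied by Scalar::Util here, which provides the same information as the attributes version and has been a core module since Perl 5.8:

#!/usr/bin/perl
# reftype.pl - ref reports the class, reftype the underlying type
use warnings;
use strict;
use Scalar::Util qw(reftype);

package My::Class;
sub new { return bless {}, shift; }

package main;

my $object = My::Class->new;
print ref $object, "\n";        # 'My::Class' - the blessed class
print reftype $object, "\n";    # 'HASH' - the real data type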
Determining Inherited Characteristics

The ref function will tell us the class of an object, but it cannot tell us anything more than this. Because determining the nature and abilities of an object is a common requirement, Perl provides the UNIVERSAL object class, which all objects automatically inherit from. UNIVERSAL is a small class, and contains only three methods for identifying the class, capabilities, and version of an object or object class. Because Perl likes to keep things simple, this class is actually implemented as a module in the standard library, and is not a built-in part of the language.
Determining an Object's Ancestry

The isa method, provided by UNIVERSAL to all objects, allows us to determine whether an object belongs to a class or a subclass of that class, either directly or through a long chain of inheritance. For example:

if ($object->isa("My::Object::Class")) {
    $class = ref $object;
    if ($class eq "My::Object::Class") {
        print "Object is of class My::Object::Class \n";
    } else {
        print "Object is a subclass of My::Object::Class \n";
    }
}
We can also use isa on a class name or string variable containing a class name:

$is_child = My::Object::Subclass->isa("My::Object::Class");
$is_child = $packagename->isa($otherpackagename);
Before writing class names into our code, however, we should consider the issue of code maintenance. Explicitly hard-coding class names is an obstacle to portability, and can trip up otherwise functional code if used in an unexpected context. If the class name is derived programmatically, it is more acceptable.
Determining an Object's Capabilities

Knowing an object's class and being able to identify its parents does not tell us whether or not it can support a particular method. For polymorphic object classes, where multiple classes provide versions of the same method, it is often more useful to know what an object can do rather than what its ancestry is. The UNIVERSAL class supplies the can method for this purpose:

if ($object->can('method')) {
    return $object->method(@args);
}
If the method is not found in either the object's class or any of its parents can returns undef. Otherwise, it returns a code reference to the method that was found: if ($methodref = $object->can('method')) { $object->$methodref(@args); }
We can also use can on an object class to determine whether it or any of its ancestors provides a given method:

foreach (@methods) {
    $can{$_} = My::Object::Class->can($_);
}
Again, the package name may also be given in a string variable: $can_do = $package->can($methodname);
Alternatively, we can simply try to call the method and see if it works or not, wrapping the call in an eval to prevent it from generating a fatal error: $result = eval {$object->method(@args)}; if ($@) { # error - method did not exist }
Determining an Object's Version

The final method provided by the UNIVERSAL class is VERSION, which looks for a package variable called $VERSION in the class on which it is called:

# version of a class
$version = $packagename->VERSION;

# version of an object's class
$version = $object->VERSION;

# test version
if ($packagename->VERSION < $required_version) {
    die "$packagename version less than $required_version";
}
In practice we usually don't need to call VERSION directly, because the use and require statements do it for us, provided we supply a numeric value rather than an import list after a package name:

# use class only if it is at least version 1
require My::Object::Class 1.00;
Note that use differs from require in that, apart from using an implicit BEGIN block, it imports from the package as well. However, an object-oriented class should rarely define anything for export (this breaks the interface and causes problems for inheritance), so there is usually no advantage to loading an object class with use rather than require.
However, we can in some cases use the version to alter behavior depending on another module's version, using an added method in more recent versions and resorting to a different approach in older versions. For example, here is a hypothetical object having a value assigned to one of its attributes. The old class did not provide a method for this attribute, so we access the attribute directly from the underlying hash. From version 1 onward all attributes are accessed by method:

if ($object->VERSION < 1.00) {
    # old version - set attribute directly
    $object->{'attribute'} = $value;
} else {
    # new version - use the provided method instead
    $object->attribute($value);
}
Note that we could have done the same thing with can, but this is slower since it tests the entire object hierarchy of parents looking for a method; it's also less clear, since without additional comments it is not obvious why we would be checking for the availability of the method in the first place. In these cases checking the version is the better approach.
Writing Object Classes

Writing an object class is no more difficult than writing a package; it just has slightly different rules. Indeed, an object class is just a package by a different name. Like packages, object classes can spread across more than one file but more often than not are implemented in a single module with the same name (after translation into a pathname) as the package that implements them. What makes object classes different from packages is that they tend to have specific features within them. The first and most obvious is that they have at least one constructor method. As well as this, all the subroutines take an object or a class name as a first parameter. The package may also optionally define a DESTROY method for cleaning up objects, analogous to an END block in ordinary packages. A final difference, and arguably one of the most crucial, is that object classes can inherit methods from one or more parent classes.

We are going to leave the bulk of this discussion to later in the chapter, but because inheritance and designing object classes for reuse are so fundamental to object-oriented programming, we will be introducing inheritance examples from time to time before we come to discuss it in full. Fortunately, inheritance in Perl is very easy to understand, at least in the simpler examples given here.
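To illustrate the DESTROY method mentioned above, here is a sketch with an invented Counted class that keeps a tally of live objects; Perl calls DESTROY automatically when the last reference to an object goes away:

#!/usr/bin/perl
# destroy.pl - DESTROY runs when the last reference disappears
use warnings;
use strict;

package Counted;

my $live = 0;    # class-wide count of live objects

sub new     { $live++; return bless {}, shift; }
sub DESTROY { $live--; }
sub count   { return $live; }

package main;

{
    my $object = Counted->new;
    print "live objects: ", Counted->count, "\n";   # prints 1
}   # $object goes out of scope here, so DESTROY is called

print "live objects: ", Counted->count, "\n";       # prints 0

Because Perl uses reference counting, destruction here is deterministic: the object is reclaimed the moment the enclosing block ends, not at some later garbage-collection pass.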
Constructors

The most important part of any object class is its constructor; a class method whose job it is to generate new instances of objects. Typically the main (or only) constructor of an object class is called new, so we can create new objects with any of the following statements:

Using traditional object-oriented syntax:

$object = new My::Object::Class;
$object = new My::Object::Class('initial', 'data', 'for', 'object');
Or, using class method call syntax: $object = My::Object::Class->new(); $object = My::Object::Class->new('initial', 'data', 'for', 'object');
This new method is just a subroutine that accepts a class name as its first parameter (supplied by the -> operator) and returns an object. At the heart of any constructor is the bless function. When given a single argument of a reference, bless bestows a new name upon it to mark it as belonging to the current package. Here is a fully functional (but limited, as we will see in a moment) constructor that illustrates it in action:

# Class.pm
package My::Object::Class;
use strict;

sub new {
    my $self = {}; # create a reference to a hash
    bless $self;   # mark reference as object of this class
    return $self;  # return it
}
The problem with this simple constructor is that it does not handle inheritance. With a single argument, bless puts the reference passed to it into the current package. However, this is a bad idea because the constructor may have been called by a subclass, in which case the class to be blessed into is the subclass, and not the class that the constructor is defined in. Consequently, the single argument form of bless is rarely, if ever, used, and we mention it only in passing. Correctly written object classes use the two argument version of bless to bless the new object into the class passed as the first argument. This enables inheritance to work correctly, as the object created is now the one asked for, which may be a subclass inheriting our constructor:

sub new {
    my $class = shift;
    my $self = {};
    bless $self, $class;
    return $self;
}
Or, equivalently but much more tersely: sub new { return bless {}, shift; }
Typically we want to be able to initialize an object when we create it, which we can do by passing arguments to the constructor. For example, here is a package that implements the bare essentials of a playing card class. It takes two additional parameters, a name, and a suit:

# card1.pm
package Game::Card1;
use strict;

sub new {
    my ($class, $name, $suit) = @_;
    my $self = bless {}, $class;
    $self->{'name'} = $name;
    $self->{'suit'} = $suit;
    return $self;
}
1;
The underlying representation of the object is a hash, so we can store attributes as hash keys. We could also check that we actually get passed a name and suit, but in this case we are going to handle the possibility that a card has no suit (a joker, for example), or even no name (in which case it is, logically, a blank card). A user of this object could now access the object's properties (in a non-object-oriented way) through the hash reference:

#!/usr/bin/perl
# ace.pl
use warnings;
use strict;
use Game::Card1;

my $card = new Game::Card1('Ace', 'Spades');
print $card->{'name'};      # produces 'Ace'
$card->{'suit'} = 'Hearts'; # change card to the Ace of Hearts
Just because we can access an object's properties like this does not mean we should. If we change the underlying data type of the object (as we are about to do) this code will break. A better way is to use accessor and mutator methods, which we cover in the appropriately titled section 'Accessors and Mutators' shortly.
Choosing a Different Underlying Data Type

Objects are implemented in terms of references, therefore we can choose any kind of reference as the basis for our object. The usual choice is a hash, as shown in the previous example, since this provides a simple way to store arbitrary data by key; it also fits well with the 'properties' or 'attributes' of objects, which in general object-oriented parlance are named values that can be set and retrieved on objects.
Using an Array

However, we can also choose other types that might suit our design better. For instance, we can use an array, as this constructor does:

# card2.pm
package Game::Card2;
use strict;
use Exporter;
our @ISA = qw(Exporter);
our @EXPORT = qw(NAME SUIT);

use constant NAME => 0;
use constant SUIT => 1;

sub new {
    my ($class, $name, $suit) = @_;
    my $self = bless [], $class;
    $self->[NAME] = $name;
    $self->[SUIT] = $suit;
    return $self;
}
1;
This object is functionally identical to the hash-based one (although, since it is at this point only a constructor, not a terribly useful one), but has a different internal representation. In this simple example, we want to allow users to access properties directly, so we export the constants to them when they use our object with use Exporter. Now the code to use the object looks like this (note that by accessing the object properties directly we are forced to change the code that uses the object. This is why methods are better, as we mentioned earlier):

#!/usr/bin/perl
# arrayuse.pl
use warnings;
use strict;
use Game::Card2;    # imports 'NAME' and 'SUIT'

my $card = new Game::Card2('Ace', 'Spades');
print $card->[NAME];         # produces 'Ace'
$card->[SUIT] = 'Hearts';    # change card to the Ace of Hearts
print " of ", $card->[SUIT]; # produces ' of Hearts'
The advantage of the array-based object is that arrays are a lot faster to access than hashes, so performance is improved. The disadvantage is that it is very hard to reliably derive a subclass from an array-based object class, because we need to know which indices are taken and which are safe to use. Though this can be done, it requires extra effort that outweighs the benefits of avoiding a hash. It also makes the implementation uglier, which is usually a sign that we are on the wrong track. For objects that we do not intend to use as parent classes, however, arrays can be an effective choice.
Using a Typeglob

Hashes and arrays are the most logical choice for object implementations because they allow the storage of multiple values within the object. However, we can use a scalar or typeglob reference if we wish. For instance, we can create an object based on a typeglob and use it to provide object-oriented methods for a filehandle. Indeed, this is exactly what the IO::Handle class (which is the basis of the object-oriented filehandle classes IO::File, IO::Dir, and IO::Socket) does. Here is the actual constructor used by IO::Handle, with additional comments:

sub new {
    # determine the passed class, by class method, object method,
    # or preset it to 'IO::Handle' if nothing was passed (subroutine)
    my $class = ref($_[0]) || $_[0] || "IO::Handle";
    # complain if additional arguments were passed
    @_ == 1 or croak "usage: new $class";
    # create an anonymous typeglob (from the 'Symbol' module)
    my $io = gensym;
    # bless it into the appropriate subclass of 'IO::Handle'
    bless $io, $class;
}
Just as we can dereference a blessed hash or array reference to access and set the underlying values contained within, Perl automatically dereferences the reference to a filehandle contained in a typeglob. So we can pass the objects returned from this handle to Perl's IO functions, and they will use them just as if they were regular filehandles. But because the filehandle is also an object, we can call methods on it, as this example of using the IO::File subclass demonstrates:

#!/usr/bin/perl
# output1.pl
use warnings;
use strict;
use IO::File;

my $object_fh = new IO::File ('> /tmp/io_file_demo');
$object_fh->print ("An OO print statement\n");
print $object_fh "Or we can use the object as a filehandle";
$object_fh->autoflush(1); # this is much nicer than 'selecting'
close $object_fh;
We have already seen the IO:: family of modules in use, most particularly in Chapter 12. Now we can see how and why these modules work as they do.
Using a Scalar

Limited though it might seem, we can also use a scalar to implement an object. For instance, here is a short but functional 'document' object constructor, which takes a filehandle as an optional argument:

# Document.pm
package Document;
use strict;

# scalar constructor
sub new {
    my $class = shift;
    my $self = \my $text; # reference to an (initially empty) scalar
    if (my $fh = shift) {
        local $/ = undef;
        $$self = <$fh>;
    }
    return bless $self, $class;
}
We can now go on to implement methods that operate on text, but hide the details behind the object. We can for example create methods to search the document through other methods that hide the details of regular expressions behind a friendlier object-oriented interface.
Using a Subroutine

Finally, we can also use a subroutine as our object implementation, blessing a reference to the subroutine to create an object. In order to do this we have to generate and return an anonymous subroutine on-the-fly in our constructor. This might seem like a lot of work, but it provides us with a way to completely hide the internal details of an object from prying eyes; because all properties are accessed via the subroutine, it can permit or deny whatever kinds of access it likes. We will see an example of this kind of object later in the chapter under 'Keeping Data Private'.
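As a foretaste of that discussion, here is a minimal sketch of a subroutine-based object (the Private class name and its accessor names are invented for this example). The property hash lives only inside the closure, so nothing outside the class can reach it directly:

#!/usr/bin/perl
# closure.pl - an object implemented as a blessed closure
use warnings;
use strict;

package Private;

sub new {
    my $class = shift;
    my %props = @_;          # data is captured by the closure below
    my $self = sub {
        my ($key, @value) = @_;
        $props{$key} = $value[0] if @value;
        return $props{$key};
    };
    return bless $self, $class;
}

# accessor and mutator both delegate to the closure
sub get { my ($self, $key) = @_; return $self->($key); }
sub set { my ($self, $key, $value) = @_; return $self->($key, $value); }

package main;

my $object = Private->new(name => 'Ace', suit => 'Spades');
print $object->get('name'), "\n";    # prints 'Ace'
$object->set('suit', 'Hearts');
print $object->get('suit'), "\n";    # prints 'Hearts'

There is no hash reference to dereference here; ref $object reports the class, but the underlying data type is CODE, and the only way in is through the subroutine itself.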
Methods

Methods, as we have already observed, are just subroutines that are designed to be called with the -> operator. There are two broad types:

❑ Class methods perform tasks related to the class as a whole, and are not tied to any specific object instance.

❑ Object methods perform a task for a particular object instance.
Although in concept these are fundamentally different ideas, Perl treats both types of method as just slightly different subroutines, which differ only in the way that they are called and in the way they process their arguments. With only minor adjustments to our code, we can also create methods that will operate in either capacity, and even as a subroutine too, if the design supports it.
Class Methods

A class method is a method that performs a function for the class as a whole. Constructors, which we have already seen examples of, are a common example, and frequently they are the only class methods an object class provides. Here is another, which sets a pair of global resources that apply to all objects of the class:

my $MAX_INSTANCES = 100;

sub set_max {
    my ($class, $max) = @_;
    $MAX_INSTANCES = $max;
}

sub get_max {
    return $MAX_INSTANCES;
}
We would call these class methods from our own code with:

My::Object::Class->set_max(1000);
print "Maximum instances: ", My::Object::Class->get_max();
Setting and returning class data like this is probably the second most common use for a class method after constructors. Only class-level operations can be performed by a class method, therefore all other functions will be performed by object methods. A special case of a class method that can set class data is the import method, which we dwelt on in Chapter 17. We will take another look at import methods when we come to discuss class data in more detail later on in the chapter.
Object Methods

An object method does work for a particular object, and receives an object as its first argument. Traditionally we give this object a name like $self or $this within the method, to indicate that this is the object for which the method was called. Like many aspects of object-oriented programming in Perl (and Perl programming in general) it is just a convention, but a good one to follow. Other languages are stricter; we automatically get a variable called self or sometimes this, and so don't have a choice about the name.
Here is a pair of object methods that provide a simple object-oriented way to get and set properties (also known as attributes, but either way they are values) on an object. In this case the object is implemented as a hash, so within the class this translates into setting and getting values from the hash:

# get a property - read only
sub get {
    my ($self, $property) = @_;
    return $self->{$property} if exists $self->{$property};
    return undef; # no such property!
}

# set a property - return the old value
sub set {
    my ($self, $property, $value) = @_;
    my $oldvalue = $self->{$property};
    $self->{$property} = $value;
    return $oldvalue; # may be undef
}
In practice the users of our objects could simply dereference them and get to the hash values directly. However, we do not want to encourage this since it bypasses our interface. So instead we provide some methods to do it for them. These methods are still very crude though, since they do not check whether a property is actually valid. We will look at some better ways of handling properties later.
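One simple improvement, sketched here with an invented variant of the playing card class (the %ok hash and the die messages are our own illustration), is to check requested property names against a list of legal ones so that typos are caught at the interface rather than silently creating new hash keys:

#!/usr/bin/perl
# validate.pl - accessors that reject invalid property names
use warnings;
use strict;

package Game::Card;

my %ok = map { $_ => 1 } qw(name suit);    # the only legal properties

sub new { my ($class, %args) = @_; return bless {%args}, $class; }

sub get {
    my ($self, $property) = @_;
    die "no such property '$property'" unless $ok{$property};
    return $self->{$property};
}

sub set {
    my ($self, $property, $value) = @_;
    die "no such property '$property'" unless $ok{$property};
    my $old = $self->{$property};
    $self->{$property} = $value;
    return $old;
}

package main;

my $card = Game::Card->new(name => 'Ace', suit => 'Spades');
print $card->get('suit'), "\n";    # prints 'Spades'
eval { $card->get('colour') };     # not a legal property
print "rejected\n" if $@;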
As a more practical example, here are a pair of search methods that belong to the Document object class we created a constructor for earlier, along with the constructor and the rest of the package, to make it a complete example. Note that the wordsearch method itself makes an object-oriented call to the search method to carry out the actual work:

# Document.pm
package Document;
use strict;

# scalar constructor
sub new {
    my $class = shift;
    my $self = \my $text; # reference to an (initially empty) scalar
    if (my $fh = shift) {
        local $/ = undef;
        $$self = <$fh>;
    }
    return bless $self, $class;
}

# search a document object
sub search {
    my ($self, $pattern) = @_;
    my @matches = $$self =~ /$pattern/sg;
    return @matches;
}
# search and return words
sub wordsearch {
    my ($self, $wordbit) = @_;
    my $pattern = '\b\w*'.$wordbit.'\w*\b';
    return $self->search($pattern);
}
1;
We can use this object class to perform simple searches on documents read in by the constructor, in an object-oriented style:

#!/usr/bin/perl
# search.pl
use warnings;
use strict;
use IO::File;
use Document;

my $fh = new IO::File('file.txt');
my $document = new Document($fh);

# find words containing e or t
print join(' ', $document->wordsearch('[et]'));
If the file file.txt contains this text:

This is a file of text to test the Document object on.
The program produces:

file text to test the Document object

This is not a very well developed object class; it does not allow us to create an object from anything other than a filehandle, and it needs more methods to make it truly useful. However, it is the beginning of a class we could use to abstract simple text operations on documents. Already it has removed a lot of the ugliness of regular expression code and abstracted it behind the class implementation, with relatively little effort.
Multiple Context Methods

Class methods expect a package name as their first argument, whereas object methods expect an object. Other than this, however, they are identical. Since we can determine an object's class by using ref, we can easily write a method that works as both a class and an object method. The most common kind of method we can develop with this approach is a class method that is adapted to work from objects as well, by extracting the class of the object from it using ref and using that instead. For example:

sub classmethod {
    my $self = shift;
    my $class = (ref $self) ? (ref $self) : $self;
    ...
}
Or, more tersely, using a logical ||:

sub classmethod {
    my $self = shift;
    my $class = ref $self || $self;
    ...
}
Or, even more tersely:

sub classmethod {
    my $class = (ref $_[0]) || $_[0];
    ...
}
The methods that use this trick the most are constructors that allow us to create a new object from an old one. This allows users to create new objects without even knowing exactly what they are – abstraction taken to the extreme. Here is a version of the Game::Card constructor that does this:

sub new {
    my ($class, $name, $suit) = @_;
    $class = (ref $class) || $class;
    my $self = bless {}, $class;
    $self->{'name'} = $name;
    $self->{'suit'} = $suit;
    return $self;
}
We can also create a subroutine that can be called as a plain subroutine, in addition to being called as a method. This takes a little more thought, since the first argument is whatever we pass to the subroutine when we use it as such. If the subroutine gets no arguments at all it knows it must have been called as a subroutine and not a method. For example, if a constructor takes no initialization data, we can do this:

# a constructor that may be called as a subroutine
sub new {
    my $class = (ref $_[0]) || $_[0] || __PACKAGE__;
    return bless {}, $class;
}
The first line of this subroutine translates as: 'if we were passed an object, use the class returned as its reference; otherwise, if we were passed anything at all, use that as the class; otherwise we must have been called as a subroutine, so use the name of the package we are in'. We can construct an object from this subroutine using any of the following means:

# as class method:
$object = My::Flexible::Constructor::Class->new;
$object = new My::Flexible::Constructor::Class;

# as object method:
$object = $existing_flexible_object->new;

# as subroutine (the parentheses make this a function call
# rather than a bareword):
$object = My::Flexible::Constructor::Class::new();
Chapter 19
Here is another version of the Game::Card constructor that also handles being called as a subroutine. Because it takes additional arguments we have to make some assumptions in order to work out what the first argument is. In this case we will assume that the class name, if supplied, will start with Game::. This is a limitation, but one we are willing to accept in this design:

sub new {
    my ($class, $name, $suit) = @_;
    $class = (ref $class) || $class;
    # check for the first argument and adjust for subroutine call
    unless ($class =~ /^Game::/) {
        ($class, $name, $suit) = (__PACKAGE__, $class, $name);
    }
    my $self = bless {}, $class;
    $self->{'name'} = $name;
    $self->{'suit'} = $suit;
    return $self;
}
Of course, whether or not this is actually worth doing depends on whether we actually expect a method to be called as a subroutine. If this is not part of the design of our object class we should probably avoid implementing it, just to discourage non-object-oriented usage.
Object Data

Object properties, also called object attributes, are values that are stored within the object. We do not necessarily know how they are stored, but we know what they are because the object class documentation will (or at least, should) tell us. If we know the object's underlying implementation we can access and set these values directly. The attribute becomes just an array element or hash value:

print $card->{'suit'};
$card->{'name'} = 'Queen';
This is very bad, however, for several reasons. First, since we are bypassing the object class by not calling a method to do this, the object will have no knowledge of what we are doing. Hence it cannot react or correct us if we do something unexpected, like add a season attribute. The design of this class is not supposed to include an attribute for season, but it has no way of knowing what we are doing.

Second, if the names of the attributes change, or the implementation alters in any way, we may find that our code breaks. For instance, if we alter the class to use an array rather than a hash as its underlying data type, all our code will instantly break.

Both problems are symptomatic of violating the principle of encapsulation, which dictates that the implementation of the object should be hidden behind the interface. They derive from the fact that we, as users of the class, are determining how the object is accessed, when we should really be using an interface provided for us by the class. In other words, we should be using object methods to both get and set the values of the object's attributes.

A related concept to both class and object data is the idea of private class and object data. Perl does not provide a strict mechanism for enforcing privacy, preferring that we respect the design of the object class and don't attempt to work around the provided interface. For cases where we do want to keep data private we can resort to several options, which we will cover a little later in the chapter.
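To make the second problem concrete, here is a small self-contained sketch (the two card classes here are our own invention for illustration, not part of the Game::Card sources) showing how code that pokes into the underlying hash breaks the moment the implementation switches to an array, while code using the accessor keeps working:

```perl
#!/usr/bin/perl
# encapsulation.pl - hypothetical illustration
use warnings;
use strict;

package Card::AsHash;
sub new  { my ($class, $name, $suit) = @_; bless {name => $name, suit => $suit}, $class }
sub suit { shift->{'suit'} }

package Card::AsArray;
# same interface, different implementation: [$name, $suit]
sub new  { my ($class, $name, $suit) = @_; bless [$name, $suit], $class }
sub suit { shift->[1] }

package main;
foreach my $class (qw(Card::AsHash Card::AsArray)) {
    my $card = $class->new('Queen', 'Hearts');
    print "$class: ", $card->suit, "\n";   # the accessor works for both
    # print $card->{'suit'};               # would die for Card::AsArray
    #                                      # ("Not a HASH reference")
}
```

Both iterations retrieve the suit through the method; the commented-out direct access would only survive the hash-based implementation.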
Accessors and Mutators

The methods that are provided to get and set object attributes are known in object-oriented circles as accessors and mutators. Despite this jargon, they are really just subroutines that set values in a hash, array, or whatever data type we used to implement the object. Here is an example of accessor and mutator methods for the suit attribute of the Game::Card class (we can just duplicate them for the name attribute), complete with prototypes:

# get passed card object and return suit attribute
sub get_suit ($) {
    return shift->{'suit'};
}

# set suit attribute on passed card object
sub set_suit ($$) {
    $_[0]->{'suit'} = $_[1];
}
Having separate accessor and mutator methods can be a little awkward, however, especially if we have a lot of attributes to deal with. An object with twenty possible attributes needs forty subroutine definitions to handle them. A popular alternative is to combine accessors and mutators into one method, using the number of arguments passed to determine what to do. For example: sub suit ($;$) { my ($self, $suit) = @_; if ($suit) { my $oldsuit = $self->{'suit'}; $self->{'suit'} = $suit; return $oldsuit; } return $self->{'suit'}; }
This accessor/mutator method gets the current value of the suit attribute if no value is passed, or sets it if one is. As a bonus, it also makes a note of and returns the old value of the attribute. We do not need to check whether the attribute exists because we know (because we wrote the class) that the constructor always sets the hash keys up, even if they have undefined values. This method does not allow us to unset the suit either, in this case intentionally. If we want to permit the suit to be unset we will have to check to see if we were passed a second argument of undef as opposed to no second argument at all. We can do that by replacing the line: if ($suit) {
With: if (scalar(@_)>1) {
Or, more tersely: if ($#_) {
Either replacement checks that at least two arguments were passed to the method, without checking what the second argument is, so undef can be passed as the new attribute value, as can an empty string. However, since the fullname method concatenates this attribute into a string, we probably do not want to allow undef, which makes this the best variation to use:

sub suit ($;$) {
    my ($self, $suit) = @_;
    if (defined $suit) {
        my $oldsuit = $self->{'suit'};
        $self->{'suit'} = $suit;
        return $oldsuit;
    }
    return $self->{'suit'};
}
This will allow an empty string to set the attribute, but not undef.
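The behavior of this defined-based variant can be summed up with a few example calls (a sketch; $card stands for any Game::Card object whose suit is currently 'Hearts'):

```perl
# assuming $card->suit currently returns 'Hearts':
$card->suit;            # get: returns 'Hearts'
$card->suit('Spades');  # set: returns the old value, 'Hearts'
$card->suit('');        # sets the suit to the empty string - allowed
$card->suit(undef);     # no change: undef fails the 'defined' test,
                        # so this simply returns the current suit
```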
Generic Accessors/Mutators

Combining accessors and mutators into one method is an improvement over separate accessor and mutator methods, at least in terms of the number of subroutines we define, but it comes at the cost of increased code complexity. We do not really want to have to repeat the same subroutine twenty times for each attribute we might want to set or get. However, since all the attributes are essentially just different cases of key-value pairs, we can write one generic accessor/mutator and make all the actual attribute methods wrappers for it, as this example demonstrates:

sub _property ($$;$) {
    my ($self, $attr, $value) = @_;
    if (defined $value) {
        my $oldv = $self->{$attr};
        $self->{$attr} = $value;
        return $oldv;
    }
    return $self->{$attr};
}

sub suit ($;$) {return shift->_property('suit', @_);}
sub name ($;$) {return shift->_property('name', @_);}
Now each new method we want to add requires just one line of code. Better still, the relationship between the name of the method and the name of the hash key has been reduced to one single instance. The setting and getting of attributes is also fully abstracted now. Subclasses that make use of a parent class containing this method can set and get their own properties without needing to know how or where they are stored. This is a very attractive benefit for a properly written object class.
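For instance, if we later decided that cards should also carry a numeric value attribute (a hypothetical addition, not part of the class as listed), supporting it would take exactly one wrapper:

```perl
sub value ($;$) {return shift->_property('value', @_);}

# used exactly like the other attribute methods:
$card->value(10);            # set the value
print $card->value, "\n";    # get it back
```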
The underscore at the start of _property is meant to imply that this subroutine is not meant for public consumption; it is a private method for the use of the class only. Perl does not enforce this, but by naming it this way and not including it in the user documentation, we make our intention clear. If a user of the class chooses to ignore the design and use it anyway, they cannot say they have not been warned.

Having a generic accessor/mutator method gives us great power to develop our object class. Anything we implement in this method will apply to all attributes supported by the object. As an example, here is another version that allows new attributes to be created, but only if a force flag is added. We also add a nonfatal flag to determine whether an attempt to set a non-existent attribute is fatal:

sub _property ($$;$$$) {
    my ($self, $attr, $value, $force, $nonfatal) = @_;
    if (defined $value) {
        if ($force || exists $self->{$attr}) {
            my $oldv = $self->{$attr};
            $self->{$attr} = $value;
            return $oldv;
        } else {
            croak "Attempt to set non-existent attribute '$attr'" unless $nonfatal;
            return undef;
        }
    }
    return (exists $self->{$attr})?$self->{$attr}:undef;
}
To handle fatal errors we have made use of croak, from the Carp module; we will need to add a use Carp; line to our object class for it to work. As we covered in Chapter 16, croak reports errors in the context of the caller rather than the place at which the error occurs. This is very useful in packages, and object classes are no exception. If we are using accessors and mutators for attribute access we should use them everywhere, including inside the object itself. Here is another version of the constructor for the Game::Card class, written to use the methods of the class rather than initializing them directly. We have also added a prototype at the same time: sub new ($;$$) { my ($class, $name, $suit) = @_; $class = (ref $class) || $class; my $self = bless {}, $class; $self->name($name); $self->suit($suit); return $self; }
This further protects the object class against alterations, this time of itself. However, it comes at the cost of an additional subroutine call per attribute access. For the constructor above, this is minor and acceptable, but if we are writing a method that uses a loop to access or modify attributes, we may want to compromise and access the attributes directly (only because we're inside the object class) for speed.
Class Data

Class data is associated with a class as a whole, rather than with an individual object, and is used to define properties that affect the class as a whole. This can include things like global constants that never change, or changing values like serial numbers or a maximum limit on the number of permitted objects. As an example, here is an object class that generates serial numbers for objects. It keeps a record of the next serial number in the start key of the global variable %conf, and increments it each time a new object is created:

# extract from Serial.pm
package Serial;
use strict;
use Carp;

our %conf = (
    'start' => 1,
    'increment' => 1,
);

sub new {
    my $class = (ref $_[0]) || $_[0];
    $conf{'start'} = $_[1] if defined $_[1];
    my $self = bless {}, $class;
    $self->{'serial'} = $conf{'start'};
    $conf{'start'} += $conf{'increment'};
    return $self;
}

sub serial {
    return shift->{'serial'};
}

1;
Having built this class we can use it to create new Serial objects, as this short script does. If we pass in a serial number, the count is reset to that point: #!/usr/bin/perl # serial1.pl use warnings; use strict; use Serial; my @serials; foreach (1..10) { push @serials, new Serial; } print $serials[4]->serial, "\n"; my $serial = new Serial(2001); print $serial->serial, "\n";
This program produces the output:

5
2001

This shows us that the fifth serial number (index 4, counting from zero) is 5, as we would expect if the loop generated successive serial numbers starting from 1, with an increment of 1. The second serial number has the value 2001 as we specifically requested. If we generated another serial number object it would have a serial number of 2002.

Although it might not seem terribly useful, this is actually a perfectly functional and usable object class because we can inherit from it. Any object can add serial number functionality to itself by inheriting from this object and making sure to call the new method of Serial from its own new method. We will see more about this later.

The class data of the Serial object class is declared with our, which means that it is package data and therefore accessible outside the class. We can configure the object class itself by altering these values. For example, to reset the serial number to 42 and the increment to 7:

$Serial::conf{'start'} = 42;
$Serial::conf{'increment'} = 7;
Now when we call new we will get serial numbers 42, 49, 56, and so on. However, just as we do not really want to allow users to control object data directly, neither do we want to allow users to set class data without supervision. The better approach is to implement a class method to set, and check, new values for class data. Here is a configure method that we can add to the Serial object class to handle this for us:

# extract from Serial.pm
sub configure {
    my $class = shift;
    while (@_) {
        my ($key, $value) = (shift, shift);
        if ($key =~ /start/) {
            $conf{'start'} = int($value);
        } elsif ($key =~ /increment/) {
            $value = int($value);
            croak "Invalid value '$value'" unless $value;
            $conf{'increment'} = $value;
        } else {
            croak "Invalid name '$key' in import list";
        }
    }
}
Better still, we can now change the definition of %conf to make it a lexically scoped variable with my: my %conf = ( 'start' => 1, 'increment' => 1, );
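With %conf lexical, reaching into the class from outside stops working, though not with an error; this sketch (assuming the my-based version of Serial.pm) shows why the class method becomes the only supported route:

```perl
# With 'our %conf' this worked; with 'my %conf' it merely creates a
# brand-new package variable %Serial::conf that the class never reads:
$Serial::conf{'start'} = 42;   # silently has no effect any more
# (still legal under 'use strict' because the name is fully qualified,
# which is exactly why this mistake is easy to miss)

# the supported route is the class method, which also validates:
Serial->configure(start => 42, increment => 7);
```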
Defined this way, the configuration hash is inaccessible to external users, and has become private class data – only the configure method can alter it, because only it is in the same lexical scope. We can now call this method to configure the class, as shown by this modified version of our script: #!/usr/bin/perl # serial_a.pl use warnings; use strict; use Serial; Serial->configure(start => 42, increment => 7); my @serials; foreach (1..10) { push @serials, new Serial; } print $serials[4]->serial, "\n"; my $serial = new Serial(2001); print $serial->serial, "\n";
The output from this version of the script is (as a little arithmetic would lead us to expect):

70
2001

This class method ensures that we only try to set the two configuration values the class actually supports, and also checks that we do not try to set an increment of zero (which would cause all objects to have the same serial number, of course); neither check would be possible if we simply reached in and changed the hash values directly.
Inheriting Class Data

In the class method shown above we ignore the class passed to us because we want to set the class data of this particular class, even if we are being called by a subclass. This means that different objects from different classes, all inheriting from Serial, will all have different and distinct serial numbers. However, if we wanted to set class data on a per-class basis we would alter the assignments in this method to something like:

${"$class\:\:conf"}{'start'} = int($value);
# or ${$class.'::conf'}..., # or ${"${class}::conf"}
This sets the value of $conf{'start'} as a package variable in whichever package the call to the configure method actually originated from; in this version of the class, each class that inherits from Serial would have its own configuration and its own sequence of serial numbers. If we have classes inheriting from each other, each with the same class data values, we need to pay attention to this kind of detail or we can easily end up setting class data in the wrong class. See 'Non-Object Classes' for an example of a class that works this way. An alternative approach that gets around these difficulties is to set class data via objects; we will also look at that in a moment.
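As a sketch of how a fully per-class configure might look (our own adaptation for illustration, not the version listed above, and without the validation checks), the method resolves the hash through the caller's class name each time:

```perl
# per-class configuration: each (sub)class gets its own %conf (sketch)
sub configure {
    my $class = shift;
    no strict 'refs';   # we construct the variable name at run time
    while (@_) {
        my ($key, $value) = (shift, shift);
        ${"${class}::conf"}{$key} = int($value);
    }
}

# these now maintain two independent serial number sequences:
Serial->configure(start => 1, increment => 1);
Serial::Invoice->configure(start => 1000, increment => 1);
```

Serial::Invoice here is a hypothetical subclass name, used only to show the separation.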
Setting Class Data through Import Methods

We mentioned the import method briefly earlier in the chapter, and discussed it at some length in Chapter 10. From the point of view of object-oriented programming, the import method is just another class method, with the unusual property that the use statement calls it. We can easily adapt our earlier example of the configure method to work as an import method too, simply by renaming it import. However, calling Serial->import to configure a variable is confusing, so instead we can just create an import method that calls the configure method, leaving it available under the old name:

sub import {
    shift->configure(@_);
}
We could also just alias it, if we like typeglobs (note the backslash; without it we would call configure and assign its return value, rather than aliasing the subroutine):

*import = \&configure;
Either way, we can now configure the Serial class with: use Serial qw(start => 42, increment => 7);
Everything that applies to the import method also applies to the unimport method, of course, for when we use no rather than use. This is a lot less common, but is ideal for controlling Boolean flags, as the next example illustrates.
Non-Object Classes
All the classes we have looked at so far have been object classes, containing at least one constructor. However, we can create classes that work entirely on class data. This might seem like a strange idea, but it makes sense when we also include the possibility of inheritance. Just as inheriting object methods from a parent class allows us to abstract details of an object's workings, we can inherit class methods to abstract class data manipulations.
The following class continues the theme of the previous section by implementing both import and unimport methods to create a set of Boolean flags stored as class data. It provides methods to get, set, delete, and create new variables, as well as a list method to determine which variables exist, or are set or unset. It is also an example of a class-data only package, and contains no constructor at all. Rather than being an object class in its own right, it is designed to be inherited by other object classes to provide them with class data manipulating methods. # Booleans.pm package Booleans; use strict; no strict 'refs'; use Carp; # establish set boolean vars sub import { my $class = shift; map {${"$class\:\:conf"}{$_} = 1} @_; } # establish unset boolean vars sub unimport { my $class = shift;
map {${"$class\:\:conf"}{$_} = 0} @_; } # private method -- does all the actual work for set, unset, and delete # only variables already established may be altered here. sub _set ($$$) { my $class = (ref $_[0]) || $_[0]; unless (exists ${"$class\:\:conf"}{$_[1]}) { croak "Boolean $_[1] not imported"; } if (defined $_[2]) { ${"$class\:\:conf"}{$_[1]} = $_[2]?1:0; } else { delete ${"$class\:\:conf"}{$_[1]}; } } # return variable value sub get ($$) { my $class = (ref $_[0]) || $_[0]; return ${"$class\:\:conf"}{$_[1]}; } # set a variable sub set ($$) { shift->_set(@_, 1); } # clear a variable sub unset ($$) { shift->_set(@_, 0); } # delete an existing variable sub delete ($$) { shift->_set(@_, undef); } # invent a new variable -- _set doesn't allow this sub create ($$$) { ${"$_[0]\:\:conf"}{$_[1]}=$_[2]?1:0; } # return a list of all set, all unset or all variables sub list ($;$) { my ($class,$set)=@_; if (defined $set) { # return list of set or unset vars my @vars; foreach (keys %{"$class\:\:conf"}) { push @vars,$_ unless ${"$class\:\:conf"}{$_} ^ $set; } return @vars; } else { # return list of all vars in set return keys %{"$_[0]\:\:conf"}; } } 1;
This non-object class is designed to be inherited by other classes, which can use the methods it supplies to set their own Boolean flags. This accounts for all the occurrences of things like: %{"$class\:\:conf"}
This is a symbolic reference that resolves to the %conf hash in the package that was first accessed through the method call, so every class that makes use of this class gets its own %conf hash, rather than sharing one in Booleans. Indeed, we do not even declare a %conf hash here since it is not necessary, both because we do not expect to use the module directly and because we always refer to the hash by its full package-qualified name, so no our or use vars declaration is necessary. This is an important point, because it means that a subclass can have a %conf hash just by subclassing from Booleans. It need not declare the hash itself. It is still free to do so, but the point of this class is that all Boolean variable access should be via the methods of this class, so it should never be necessary.

We can test out this class with a short script that does use the module directly, just to prove that it works:

#!/usr/bin/perl
# booleans.pl
use warnings;
use strict;
use Booleans qw(first second third);
no Booleans qw(fourth fifth);

print "At Start: \n";
foreach (Booleans->list) {
    print "\t$_ is\t:", Booleans->get($_), "\n";
}

Booleans->set('fifth');
Booleans->unset('first');
Booleans->create('ninth', 1);
Booleans->delete('fourth');

print "Now: \n";
foreach (Booleans->list) {
    print "\t$_ is\t:", Booleans->get($_), "\n";
}

print "By state: \n";
print "\tSet variables are: ", join(', ', Booleans->list(1)), "\n";
print "\tUnset variables are: ", join(', ', Booleans->list(0)), "\n";
The output of this program is:

At Start:
	first is	:1
	fifth is	:0
	fourth is	:0
	third is	:1
	second is	:1
Now:
	first is	:0
	fifth is	:1
	ninth is	:1
	third is	:1
	second is	:1
By state:
	Set variables are: fifth, ninth, third, second
	Unset variables are: first

Clearly, the most immediately useful improvement we could make to this class is to have list return the flags in the order of creation (if only for cosmetic purposes), but it serves to prove that the class, such as it is, performs as designed.
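One way to provide such an ordered listing (a sketch of our own, not part of the module as listed) is to record each flag's first appearance in a parallel per-class array and have list walk that array instead of the hash keys. These are drop-in replacements inside Booleans.pm, which already disables strict references:

```perl
# sketch: remember creation order alongside the flag values
sub create ($$$) {
    my $class = (ref $_[0]) || $_[0];
    push @{"$class\:\:order"}, $_[1]
        unless exists ${"$class\:\:conf"}{$_[1]};
    ${"$class\:\:conf"}{$_[1]} = $_[2] ? 1 : 0;
}

sub list ($;$) {
    my ($class, $set) = @_;
    my @vars = @{"$class\:\:order"};
    return @vars unless defined $set;
    # keep only flags whose value matches the requested state
    return grep {!(${"$class\:\:conf"}{$_} ^ $set)} @vars;
}
```

import, unimport, and delete would need the same bookkeeping to keep the order array in step; deleted flags in particular must be removed from it.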
Accessing Class Data via Objects

We sometimes want to allow users of our object class to find out the contents of our class data, and even in some cases set it. The lazy way to do this is simply to access the data directly, but as we have already noted this is not object-oriented and therefore bad practice. Better is to use a class method, as we did earlier. Both approaches suffer from the fact that we need to use the class name to access either the data or the method to manipulate it, which varies between a class and its subclasses.

An alternative approach, which avoids all this complexity, and also avoids constructing package variable names on the fly with expressions like $class\:\:variable, is to store references to the class data in the objects. We can then access and set the class data through the object using an accessor/mutator style method. In order to use this approach we need to add some private attributes to the object instance that contain references to the class data. The key point of this technique is that we establish a relationship between the object and its class data in the constructor at the moment the object is created. From this point on, so long as we use the object to access the class data, there is no ambiguity as to which class we are referring to, and we do not need to figure out package prefixes or object class names. The object 'knows' which class data applies to it.

Here is a modified version of the constructor for the Serial class that provides references for the class data of the class. The key lines are the ones containing the definitions for the _next and _incr attributes:

# part of xxx_serial2/Serial.pm
sub new {
    my $class = (ref $_[0]) || $_[0];
    $conf{'start'} = $_[1] if defined $_[1];
    my $self = bless {}, $class;
    $self->{'serial'} = $conf{'start'};
    $self->{'_next'} = \$conf{'start'};
    $self->{'_incr'} = \$conf{'increment'};
    $conf{'start'} += $conf{'increment'};
    return $self;
}
We create the two new attributes by assigning references to the hash values we want to provide access for. We give them underscored names, to indicate that these are intended for private use, not for public access. Finally, we provide object methods that use these new attributes. Here is an accessor/mutator object method that we can append to our Serial class that performs this duty for us:
sub next ($;$) { my ($self, $new) = @_; ${$self->{'_next'}} = $new if $new; return ${$self->{'_next'}}; }
Now, if we want to change the serial number for the next object created we can find any existing Serial object (or subclass thereof) and use the next method, as this short script illustrates: #!/usr/bin/perl # serial2.pl use warnings; use strict; use Serial; my $serial = new Serial; print "First serial number is ", $serial->serial, "\n"; $serial->next(10000); my $serial2 = new Serial; print "Second serial number is ", $serial2->serial, "\n";
When we run this script we get the expected output:

First serial number is 1
Second serial number is 10000

When we use the Serial class directly, the next method has exactly the same effect as configuring the start value through the configure class method. However, if we create a subclass for this class we can now choose whether to keep the references to Serial's class data, or replace them with references to class data for the subclass. If we change the references to point to class data for the new class, the next method will adapt automatically because it works through the reference. When we cover inheritance more fully we'll develop a subclass of Serial that makes use of this.
Debugging Object Classes

We frequently want to include some form of debugging support in our code, where messages are conditionally logged depending on the value of a debugging flag set elsewhere. In the case of object classes, we have two different places to put such a flag: as class data in the class itself for overall debugging of all objects in that class, or at a per-object level, using object properties. In the case of class-level debugging, we also have the choice of handling all classes with a single debugging flag, or giving each class its own debugging flag. This is more than just a debugging design issue; all data must fall into one of these three categories. Debugging support is thus a good general example of how to handle data in an object-oriented context.

It pays to consider inheritance here, because it is better to create a generic debugging module and use it in many places, rather than repeat it for each object class we create. So, while we have yet to cover that subject in detail we will introduce some inheritable debugging object classes too. It is never too early to start planning code reuse.
Class-Level Debugging

Class-level debugging revolves around a class data variable, which controls whether debug messages will be printed on a class-wide basis. The debug method itself, which prints out the messages, can work as either a class or object method, but the flag that enables or disables messages applies to all objects. Here is a simple example that provides two new methods for multi-level debugging support. Any method within the class can use them, and any user of the class can also use them to enable debugging for the class:

package My::Debuggable::Class;

my $debug_level = 0;

# accessor/mutator
sub debug_level {
    my ($self, $level) = @_;
    $debug_level = $level if defined $level;
    return $debug_level;
}

# debug method
sub debug {
    my ($self, $level) = (shift, shift);
    print STDERR @_, "\n" if $level <= $debug_level;
}
We can now write methods that use these debugging methods like so: sub a_method { my $self = shift; $self->debug(1, "This is a level one debug message"); ... }
Or, from elsewhere: My::Debuggable::Class->debug_level(2); My::Debuggable::Class->debug(1, "A debug message"); $my_debuggable_class_object->debug(2, "Another debug message");
We can also inherit the debug methods into a subclass to implement a generic debugging facility for that class: package Other::Class; use Debuggable; our @ISA = qw(Debuggable); __PACKAGE__->debug_level(1);
Or, from elsewhere: Other::Class->debug_level(1); $other_class_object->debug(1, "Debug message");
In this version, the class provides a global debugging switch; since all objects inherit the same debug flag, all objects would have their debugging messages enabled or disabled according to that debug flag. Alternatively, if we want to enable debugging on a per-class basis, we can do so by replacing the private $debug_level with a constructed package variable name, as this adjusted debugging class does:

# Debugpack.pm
package Class::Debuggable;
use strict;
no strict 'refs';

# accessor/mutator
sub debug_level {
    my ($self, $level) = @_;
    my $class = (ref $self) || $self;
    ${"$class\:\:debug_level"} = $level;
}

# debug method
sub debug {
    my ($self, $level) = (shift, shift);
    my $class = (ref $self) || $self;
    print STDERR @_, "\n" if $level <= ${"$class\:\:debug_level"};
}

1;
Note that because we are using symbolic references to compute the name of the class-specific debugging flag we disable strict references for this class only. Now when we enable or disable the debug level for a particular object class, only the objects of that class will notice: # set different debug levels for different classes Class::One->debug_level(2); Class::Two->debug_level(1); Class::Three->debug_level(0);
Object-Level Debugging

Debugging at the object level is essentially similar to the class level, but this time we are enabling debugging on particular objects only. To do that, we need to create a debug attribute associated with the object, which we can do in the constructor. The following debugging class is designed to be inherited, but we could plug its methods into an existing class with only a little modification. It implements an object-level debugging scheme, and is broadly similar to the earlier examples, except that it uses an object attribute instead of class data. The object attribute is stored as a hash key, so any object class that inherits from it must use a hash as its underlying representation, which is an important limitation we need to be aware of when using it (but if we plan to stick to hashes for our objects then we may consider this acceptable).
It also provides a constructor, for classes that want to inherit from it, and a separate initializer to create the debugging flag attribute independently for objects that already exist. This gives us maximum flexibility when using the class in combination with others; we will delve into this in more detail when we come to discuss writing inheritable classes later in the chapter.

# Object/Debuggable.pm
package Object::Debuggable;
use strict;
no strict 'refs';

# constructor - for new objects
sub new {
    my ($self, $level) = @_;
    my $class = (ref $self) || $self;
    $self = bless {}, $class;
    $self->initialize($level);
    return $self;
}

# initializer - for existing objects
# (default to 0 so the numeric comparison in 'debug' never sees undef)
sub initialize {
    my ($self, $level) = @_;
    $self->{'debug_level'} = defined $level ? $level : 0;
}

# accessor/mutator
sub debug_level {
    my ($self, $level) = @_;
    if (defined $level) {
        my $oldlevel = $self->{'debug_level'};
        $self->{'debug_level'} = $level;
        return $oldlevel;
    }
    return $self->{'debug_level'};
}

# debug method
sub debug {
    my ($self, $level) = (shift, shift);
    print STDERR @_, "\n" if $level <= $self->{'debug_level'};
}

1;
Implementing a Multiplex Debug Strategy

Having shown three different ways to implement debugging support in our object classes, it would be nice to be able to support all of them simultaneously. In fact, with a little thought, we can. Here is a jack-of-all-trades debugging class that implements a single global debugging flag that affects all classes and objects that inherit from it, a per-class debugging flag, and individual debugging flags for each object:

# Debuggable.pm
package Debuggable;
use strict;
no strict 'refs';
# global debugging flag -- all classes
my $global_level = 0;

# object constructor
sub new {
    my ($self, $level) = @_;
    my $class = (ref $self) || $self;
    $self = bless {}, $class;
    $self->initialize($level);
    return $self;
}

# initializer for existing objects
# (actually just a wrapper for debug_level)
sub initialize {
    my ($self, $level) = @_;
    $self->debug_level($level);
}

# get/set global debug level
sub global_debug_level {
    my ($self, $level) = @_;
    if (defined $level) {
        # set new level, return old one
        my $old = $global_level;
        $global_level = $level;
        return $old;
    }
    # return current global debug level
    return $global_level;
}

# get/set class debug level
sub class_debug_level {
    my ($self, $level) = @_;
    my $class = (ref $self) || $self;
    if (defined $level) {
        # set new level, return old one
        my $old = ${"$class\:\:class_debug"};
        ${"$class\:\:class_debug"} = $level;
        return $old;
    }
    # return current class debug level
    return ${"$class\:\:class_debug"};
}

# get/set object debug level
sub debug_level {
    my ($self, $level) = @_;
    # check to see if we were called as a class method
    unless (ref $self) {
        # '$self' is a class name
        return $self->class_debug_level($level);
    }
Chapter 19
    if (defined $level) {
        # set new level, return old one
        my $old = $self->{'debug_level'};
        $self->{'debug_level'} = $level;
        return $old;
    }
    # return current object debug level
    return $self->{'debug_level'};
}

sub debug {
    my ($self, $level) = (shift, shift);

    # if no message, set the (class or object) debug level itself
    return $self->debug_level($level) unless @_;

    # write out debug message if criteria allow:
    # object debug is allowed if object, class, or global flag allows
    # class debug is allowed if class or global flag allows
    my $class = (ref $self) || $self;
    print STDERR @_, "\n" if $level <= $global_level
        || (defined ${"$class\:\:class_debug"} and $level <= ${"$class\:\:class_debug"})
        || (ref($self) and $level <= $self->{'debug_level'});
}

1;
This class attempts to handle almost any kind of debugging we might want to perform. We can set debugging globally, at the class level, or on individual objects. If we do not have an object already we can use the constructor, otherwise we can just initialize debugging support with initialize, which operates as either a class or object method and sets the class-wide or object-specific flag appropriately. In fact, it just calls debug_level, which does exactly the same thing, but initialize sounds better in a constructor. The three accessor/mutator methods all work the same way and perform the same logic, applied to their particular debugging flag, global, class, or object, with the exception of the object-level accessor/mutator, which passes class-level requests to the class method if it is called with a package name rather than an object. The debug routine itself now checks all three flags to see if it should print a debug message; if any one of the three is higher than the level of the message, it is printed, otherwise it is ignored. Class-level debugging messages ignore the object-level debug flag, since in their case there is no object to consult. As a convenience, if we supply no message at all we can set and get the debug level instead. Since we simply pass the request to debug_level, and debug_level in turn passes class method calls to class_debug_level, we can in fact handle all of the module's functionality except the global flag through debug. To demonstrate all this in action, here is a short script, complete with its own embedded test object class (which does nothing but inherit from Debuggable) that tests out the features of the module: #!/usr/bin/perl # debugged.pl use warnings; use strict;
# a test object class { package Debugged; use Debuggable; our @ISA = qw(Debuggable); sub new {return bless {}, shift;} } # create a test object from the test class my $object = new Debugged; # defined above, so no 'use' needed # set debug levels globally, at class level, and on the object Debugged->global_debug_level(0); Debugged->class_debug_level(2); $object->debug_level(1); # print class and object level debug messages Debugged->debug(1, "A class debug message"); $object->debug(1, "A debug message"); # find current debug levels with _level methods print "Class debug level: ", Debugged->class_debug_level, "\n"; print "Object debug level: ", $object->debug_level, "\n"; # find current debug levels with no-argument 'debug' print "Class debug level: ", Debugged->debug, "\n"; print "Object debug level: ", $object->debug, "\n";
# switch off class and object debug with 1-argument 'debug'
Debugged->debug(0);
$object->debug(0);

The output from this program is:

A class debug message
A debug message
Class debug level: 2
Object debug level: 1
Class debug level: 2
Object debug level: 1
Inheriting from an object class, and then using the subclass as a substitute for the parent class, is a good test of inheritance. If the module is written correctly, everything the parent class does should work identically for the subclass. However, to appreciate this properly we need to discuss inheritance in more detail.
Inheritance and Subclassing Inheritance is the cornerstone of code reuse in object-oriented programming. By inheriting the properties and methods of one class into another, we can avoid having to reimplement code a second time, while at the same time retaining abstraction in the existing implementation. A well designed object class can allow classes that inherit from it, also known as subclasses, to carry out their own tasks using the parent class as a foundation without knowing more than the bare minimum of how the class is actually implemented. If a subclass can carry out all its work in terms of methods supplied by the parent, it need not know anything at all. Inheritance is thus a powerful tool for abstraction, which in turn allows us to write more robust and more reusable object classes. There is no need to reinvent the wheel, particularly if the wheel is already working and is well tested.
Perl's inheritance mechanism is, yet again, both matter-of-fact and extremely powerful. It consists of two parts: the special package variable @ISA, which defines inheritance simply by naming the class or classes from which an object class inherits methods and/or class data, and the -> operator, which searches the packages named by @ISA whenever the immediate class does not provide a method.

There are many different forms of inheritance in the object-oriented world, but all of them revolve around the basic principles of abstraction and propagation of method calls from subclasses to their parents. Some, but not all, languages permit multiple inheritance, where a subclass can inherit from more than one parent. Similarly, some but not all languages permit dynamic inheritance (a class's parents can be changed by the class at will). Perl permits both of these mechanisms through the @ISA array; adding more than one package to @ISA is multiple inheritance, and modifying the contents of @ISA programmatically is dynamic inheritance. It is this kind of simple implementation of complex concepts that makes Perl such an endearing language.

One limitation (or strength, depending on how we look at it) of Perl's approach is that subclasses always know who their parents are, but parents have no idea who their children are. There is no way for a parent class to determine what classes may be using it, nor any mechanism, short of returning undef from a constructor, to bar inheritance. Perl's attitude is that classes ought to be inherited, and that knowledge of subclasses is neither necessary nor desirable for good object-oriented programming.

In addition to whatever classes they may explicitly inherit from, all classes inherit automatically from the UNIVERSAL class, which, as we mentioned nearer the start of the chapter, provides the can, isa, and VERSION methods for all objects. We can place our own methods in this special package if we wish, to give all objects new capabilities.
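To make the UNIVERSAL methods concrete, here is a minimal sketch; the Animal and Dog classes are invented purely for illustration:

```perl
#!/usr/bin/perl
# universal.pl
use warnings;
use strict;

{
    package Animal;
    our $VERSION = '1.00';
    sub new   { return bless {}, shift; }
    sub speak { print "...\n"; }
}

{
    package Dog;
    our @ISA = qw(Animal);
}

my $dog = Dog->new;   # 'new' is found in Animal, blesses into Dog
print "Dog isa Animal\n" if $dog->isa('Animal');    # true via inheritance
print "Dog can speak\n"  if $dog->can('speak');     # 'can' returns a code reference
print "Animal is version ", Animal->VERSION, "\n";  # reads $Animal::VERSION
```

All three method calls here resolve through UNIVERSAL, even though neither Animal nor Dog defines them.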
Inheriting from a Parent Class The basis of all inheritance is the @ISA array, a special package variable, which, if defined, describes the package or packages from which an object class will inherit methods. To inherit from a parent class we need only specify its package name in the @ISA array. For example: package Sub::Class; use Parent::Class; our @ISA = qw(Parent::Class);
@ISA works hand in glove with the -> operator, which actually enacts the process of searching for inherited methods. Any method call for a method that is not implemented by this class will automatically be forwarded up to the parent class. If it also does not implement the method then its parent class(es) are searched until Perl runs out of ancestors or a matching method is found.
Overriding and Calling Overridden Methods We frequently want to implement our own version of a method supplied by a parent, but in doing so, we also often want to take advantage of the parent's method, to carry out tasks related to the parent's implementation of our object. We do not necessarily need to know what these are (that is the point of abstraction), but we would still like to have them done for us. Since by replacing (or in object-oriented terms, overriding) a method in a parent class, we suppress the call to the parent, we need to call it explicitly ourselves in the replacement method if we want to make use of it. For instance, if we are in a package which is a subclass of Parent::Class, we might write: sub a_method { $self = shift;
$self->Parent::Class::a_method(@_); ... do our own thing ... }
The effect of this is to call the method a_method in the parent class, but with the package or object specified by $self: that of our own class rather than that of our parent's. From the parent's perspective there is not (or should not be) any difference. Our class is a subclass of it and therefore has all the same properties, except those which it overrides. Many objects only have one parent class, but still name the parent class explicitly, as in the example above. This is an irritation – there is only one parent, and so we should not have to mention it. Fortunately, we can make use of the special package prefix SUPER::, which allows us to refer to our parent class without explicitly naming it: sub a_method { $self = shift; $self->SUPER::a_method(@_); ... do our own thing ... }
Note that this is different from writing the incorrect: SUPER->a_method(@_);
In this case, the call to a_method gets the package that SUPER resolved to as its first argument – this is not the object making the call. The SUPER:: package does not actually exist, but evaluates at run time to all of the packages (since we may have more than one) from which this package inherits. Perl carries out another search for the method, starting from each of the packages in @ISA in turn, until it either finds the method, or it runs out of object classes to search. If we only inherit from one parent class then SUPER:: is almost (but not quite) the same as using $ISA[0] as the package name: $self->$ISA[0]::a_method(@_);
The difference is subtle, but important. All objects implicitly inherit from the UNIVERSAL object, but only SUPER:: takes this into account when searching for methods. $ISA[0]:: does not.
Overriding and Calling Overridden Constructors Constructors are an important case of overriding and inheriting from a parent class. Here we frequently want to have a parent class create our object for us, but we want it to create an object of our class, and not its own; to achieve that, we must call the constructor in the parent class with our own class name as the first argument: sub new { $self = shift; $class = (ref $self) || $self; $self = $class->SUPER::new(@_); ... do our own thing ... return $self; }
If the parent class is well behaved and allows inheritance to take place properly, it should return an object blessed into whatever package we sent it. We do not even know what that package is ourselves, because it was passed to us. It might be our own class, but it might equally be a subclass of ourselves. By avoiding naming a package explicitly, we handle all these possibilities without having to overtly worry about any of them.
Writing Inheritable Classes

In order for an object class to be inheritable, we need to implement it so that it does not contain or refer to any non-inheritable elements. Although this takes a little extra work, it pays dividends because we create object classes that can be reused and customized by implementing subclasses that add to and override the methods of the original. In addition, the very act of writing an object class with inheritance in mind helps us develop it more robustly, and exposes weaknesses in the design that might otherwise pass unnoticed until much later. The three golden rules for writing inheritable objects are:

❑ Do not refer directly to class data
❑ Always write constructors to use the passed class name
❑ Never, ever, export anything

A fourth less formal but handy rule-of-thumb is:

❑ Work through the passed object

This is a more general form of the second rule, but directed towards object methods rather than constructors. Class methods are off the hook a little here. Since we do not usually expect a class method to be inherited (they work on a class-wide basis), they tend to be more class-specific. Having said that, it is better if we can write our class methods to be inheritable too. Finally, a fifth rule-of-thumb is:

❑ Call parent constructors first
Following this rule ensures that resources are properly allocated by the parent class (or classes) before we do our own initialization. Of course we might need to do some initialization to call a parent constructor, but as a general rule-of-thumb it is a good one to stick to.
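Pulled together, these rules suggest a constructor pattern along the following lines. This is a sketch rather than a definitive implementation, and the class name is invented for illustration:

```perl
# Inheritable.pm
package Inheritable;

use strict;

sub new {
    my $self = shift;
    # use the passed class name, so subclasses get objects
    # blessed into their own package, not ours
    my $class = (ref $self) || $self;
    $self = bless {}, $class;
    # a class with parents would call their constructors here,
    # before doing its own setup
    $self->initialize(@_);
    return $self;
}

sub initialize {
    my ($self, %args) = @_;
    # work through the passed object, never through package variables
    $self->{$_} = $args{$_} foreach keys %args;
    return $self;
}

1;
```

Because construction and initialization are separated, a subclass can either inherit new wholesale or call initialize on an object created by some other parent.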
More on Class Data through Objects

As we saw earlier, if we do not want to refer to the same class data in every subclass, we need to take special steps involving symbolic references (though we could also use the qualify_to_ref subroutine provided by the Symbol module). The problem with either approach is that we are stuck with it: we cannot choose, in further subclasses, whether we want to access the original parent's class data or our own version. Grandchildren get the grandparent's data or their own, depending on how the parent decides to implement its class data accessors and mutators. The solution is to allow the class data to be accessed indirectly, through the objects that we build, which we do by building references to the class data as properties of the objects of the class. We have already seen this in action earlier in the chapter when we altered the constructor and accessor/mutator methods of the Serial module to work with the class's data through object properties. Here is the constructor of the last version of the Serial class again, as a reminder:
sub new { my $class = (ref $_[0]) || $_[0]; $conf{'start'} = $_[1] if defined $_[1]; my $self = bless {}, $class; $self->{'serial'} = $conf{'start'}; $self->{'_next'} = \$conf{'start'}; $self->{'_incr'} = \$conf{'increment'}; $conf{'start'} += $conf{'increment'}; return $self; }
The essential point of this design, from an inheritance point of view, is that any object classes that inherit this constructor may override it with their own constructor. This can choose to either leave the references defined in the object alone, or replace them with new references. Here is a complete subclass that does just this, as well as adding a new read-only attribute, at the time of creation: # MySerial.pm package MySerial; use strict; use Serial; our @ISA = qw(Serial); my $next = 1; my $plus = 1; sub new { my $class = shift; # call Serial::new my $self = $class->SUPER::new(@_); # override parent serial with our own $self->{'serial'} = $next; # replace class data references $self->{'_next'} = \$next; $self->{'_incr'} = \$plus; # add a creation time $self->{'time'} = time; return $self; } sub time { return shift->{'time'}; } 1;
To test out this new subclass we can use a modified version of our last test script: #!/usr/bin/perl # myserial.pl use warnings; use strict; use MySerial; my $serial = new MySerial; print "Serial number ", $serial->serial, " created at ", scalar(localtime $serial->time), "\n"; $serial->next(10000); sleep(1); my $serial2 = new MySerial; print "Serial number ", $serial2->serial," created at ", scalar(localtime $serial2->time), "\n";
The output of this script should look like (depending on the time we run this script, of course) the following: Serial number 1 created at Mon Jan 1 12:11:17 2001 Serial number 10000 created at Mon Jan 1 12:11:18 2001 What makes this subclass interesting is that it can continue to use the methods provided by the parent class, which use the object properties to access the class data. We do not even need to replace the class data with the same kinds of variables – in the original it was a hash of two key-value pairs. In the subclass it is two discrete scalar variables. By altering these properties, the subclass effectively reprograms the parent's methods to work with the class data that the subclass wants to use. We have moved the choice of which class data is accessed from the parent to the subclass. There is one remaining flaw in the subclass constructor – it sets the attributes of the parent object class directly. In order to enable the subclass to set its class data without referring directly to the hash we should provide a method to set the class data, and a method to set the serial number. Here are two methods that when appended to the Serial class will do the job for us: # private method to set location class data sub _set_config { my ($self, $nextref, $incrref) = @_; $self->{'_next'} = $nextref; $self->{'_incr'} = $incrref; } # private method to set serial number sub _set _serial { $self->{'serial'} = shift; }
We have avoided mentioning a method to set the serial number up to now because it is not a feature we want to allow publicly. Therefore we have given it a leading underscore and called it _set_serial to emphasize that it is for the use of subclasses only. The method to set the class data is similarly only for the use of subclasses, so we have called it _set_config. The subclass constructor can now be modified to use these methods, resulting in this new, correctly object-oriented, constructor:
# part of MySerial.pm
sub new {
    my $class = shift;

    # call Serial::new
    my $self = $class->SUPER::new(@_);

    # override parent serial with our own
    $self->_set_serial($next);

    # replace class data references
    $self->_set_config(\$next, \$plus);

    # add a creation time
    $self->{'time'} = time;

    return $self;
}
Not only is this more correct, it is simpler to understand too. It is not quite perfect though – the time attribute assumes that the Serial class uses a hash as its underlying representation. We may be happy to live with that, but if we wanted to fix that too we could make use of the private _property method we created earlier in 'Generic Accessors/Mutators' for the easy creation of, and access to, attributes. By adding this method to the Serial class we could then replace the line setting the time attribute: $self->{'time'} = time;
to: $self->_property('time', time);
Exports Avoiding exports from object classes almost goes without saying. If we export any variable or subroutine from an object class we allow it to be accessed outside the method calling scheme, breaking the design. Avoiding exports is of course very easy; we just do not do it. In particular we do not use the Exporter. There is one exception to this rule, however, which comes about when we design an object class to be used both with objects, and with a functional interface, as the CGI module does. Here we generally use a default object created inside the package as global data, and use it whenever an explicit object is not passed to us. To do that requires writing the required methods so they can also be called as functions. For example: package My::FunctionalObject; use strict; use Exporter; our @ISA = qw(Exporter); our @EXPORT_OK = qw(method_or_function); my $default_object = new My::FunctionalObject; sub new { ... }
sub method_or_function { my $self = (ref $_[0])?shift:$default_object; return $self->do_something_else (@_); }
This hypothetical method relies on the fact that, if called as a function, its first argument will not be a reference. It uses this fact to test whether it had been called as a function or method, and sets $self to be either the passed object or the default object accordingly. So long as we design all our methods/subroutines so that the first argument passed in the argument list can never be a reference, this technique will work very well.
Private Methods

A private method is one that can only be called from the class in which it is defined, or alternatively, a method that can only be called from a subclass. Both concepts are implemented as specific features in many object-oriented languages. Perl, typically, does not provide any formal mechanism for defining either type of method. However, its pragmatic approach makes both easy to implement, simply by checking the class name within the method. The catch is that although easy to do, this kind of privacy is enforced at run time, rather than at compile time. Pragmatism is good, but it has its drawbacks too. To make a method private to its own class we can check the package name of the caller and refuse to execute unless it's our own, as returned by the __PACKAGE__ token. Here is the _property method we developed for the Game::Card class from earlier in the chapter, adapted to enforce privacy rather than just implying it with a leading underscore (note that croak requires 'use Carp'):

# _property method private to methods of this class
sub _property ($$;$) {
    my ($self, $attr, $value) = @_;
    my $class = ref $self;
    croak "Attempt to call private method" if $class ne __PACKAGE__;
    if ($value) {
        my $oldv = $self->{$attr};
        $self->{$attr} = $value;
        return $oldv;
    }
    return $self->{$attr};
}
To restrict access to subclasses and ourselves only, we can make use of the isa method. This returns true if a class is a subclass of the named class, so we can use it to create a subclass-private method. This version of _property takes that approach, which makes it much more suitable for an inheritable parent class:

# _property method private to this class and its subclasses
sub _property ($$;$) {
    my ($self, $attr, $value) = @_;
    my $class = ref $self;
    croak "Attempt to call private method" unless $class->isa(__PACKAGE__);
    if ($value) {
        my $oldv = $self->{$attr};
        $self->{$attr} = $value;
        return $oldv;
    }
    return $self->{$attr};
}
Extending and Redefining Objects Perl's objects are dynamic in more than one way. We can change their ancestry by manipulating the @ISA array, though this is not a common, and arguably not an advisable thing to do. More interestingly and somewhat more usefully, we can call bless on an object a second time to alter the package to which it belongs. We do not need to do this for extending an object class, because we should be able to pass the new class as the first argument to the constructor of a parent class through inheritance, but we can also use bless to create specialized subclasses, to indicate particular situations.
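Re-blessing an existing object into a more specific class looks like this sketch; the Shape classes are invented for illustration:

```perl
#!/usr/bin/perl
# rebless.pl
use warnings;
use strict;

{
    package Shape;
    sub new { return bless { sides => $_[1] }, $_[0]; }
}

{
    package Shape::Triangle;
    our @ISA = qw(Shape);
}

my $shape = Shape->new(3);
print ref $shape, "\n";    # 'Shape'

# bless the existing object a second time to specialize it
bless $shape, 'Shape::Triangle' if $shape->{sides} == 3;
print ref $shape, "\n";    # 'Shape::Triangle'
```

The object's underlying data is untouched; only the package it belongs to, and therefore the methods it dispatches to, changes.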
Extending a Parent Class

Properly designed parent classes should be able to create objects in any requested subclass on demand, by virtue of the two-argument form of bless. For example, say we want to extend the DBI module to batch up do requests and then execute them all at once. We could just place our methods directly into the DBI class to ensure that a database handle object (of class DBI) can use them. This is a bad idea, however, because we might break the parent class if it happens to define a method with the same name. This is, again, a case of defining the class interface from outside the class, which is an object-oriented no-no. Instead, the DBI module can bless a returned handle into our own class rather than the DBI class:

# Extended/DBI.pm
package Extended::DBI;

use strict;
use DBI;

our @ISA = qw(DBI);

my @cache;

sub do_later {
    my ($dbh, $statement) = @_;
    push @cache, $statement;
    if ($#cache == 3) {
        # four statements cached - execute them and empty the cache
        map { $dbh->do($_) } @cache;
        @cache = ();
    }
}

sub do_now {
    my $dbh = shift;
    my @results = map { $dbh->do($_) } @cache;
    @cache = ();
    return @results;
}

1;
In our code: use Extended::DBI; my $dbh = Extended::DBI->connect($dsn, $user, $password); $dbh->do_later($statement); $dbh->do_later($another_statement); ... $dbh->do_now();
This works because while we call the DBI class method connect to create the database handle, we pass the name of the class into which we want the returned handle blessed. Inside connect, the new database handle is blessed into the class we asked for, Extended::DBI. The new handle works just like a regular DBI handle, but we can now cache and execute do statements with it too.
Constructing Subclasses On the Fly Another interesting case of extending a parent object is where the parent object does it itself in order to reflect the circumstances under which it was created. There is no particular reason why a constructor cannot create an object of a different class to the one it is in, or the one passed to it. If we define an @ISA array for the new class (which we can do simply by assigning it with a fully qualified package name), we can also have this new class inherit its methods from ourselves. This gives us the possibility of creating different versions of ourselves for different occasions, an approach that is often referred to as a 'factory object'. For example, a common approach taken by other object-oriented languages is to represent errors as objects (generically known as exception objects) rather than as raw numbers or some other basic data type, so errors can be handled in a more object-oriented style, without relying on an arbitrary system error value. Perl does not support such a concept since error objects are altogether too complicated for a pragmatic language like Perl. However, we can create an error object class that supports various different methods for handling errors, and then subclass this class on the fly when a new type of error occurs. By choosing to do it this way we need not create subclasses for every error type, and we automatically support any new errors that may arise without needing to change the code. The following example demonstrates an object class that translates system errors, as communicated by the special variable $! (or $ERRNO, if using the English module), into objects whose class is based on the common name for the error, ENOENT for error number 2, for example. 
Each object class is created on the fly if it does not exist, and an @ISA array is placed into it to subclass it from the parent class passed to the constructor (which is either ourselves or something that inherits from us; either way, the methods of our class can be called from the new subclass). The key parts of the module are the new constructor and the _error_from_errno subroutine: # ErrorClass.pm package ErrorClass; use strict; use Errno; no strict 'refs'; my %errcache; sub new { my ($class, $error) = @_; $class = (ref $class) || $class; $error = $error || $!;
# construct the subclass name my $subclass = $class. '::'. _error_from_errno($error); # cause subclass to inherit from us unless (@{$subclass. '::ISA'}) { @{$subclass. '::ISA'} = qw(ErrorClass); } # return reference to error, blessed into subclass return bless \$error, $subclass; } # return the integer value of the error sub number { my $self = shift; return int($$self); } # return the message of the error sub message { my $self = shift; return ''.$$self; }
AM FL Y
# accessor/mutator sub error { my ($self, $error) = @_; if (defined $error) { my $old = $$self; $$self = $error; return $old; } return $$self; } # subroutine to find the name of an error number sub _error_from_errno { my $errno = int(shift); # do we already know this one by name? if (defined $errcache{$errno}) { return $errcache{$errno}; }
TE
# otherwise, search for it in the export list my ($name, $number); foreach $name (@Errno::EXPORT_OK) { $number = Errno->$name; $errcache{$number} = $name; return $name if $errno == $number; } # unlikely, but just in case... return 'UNKNOWN'; } 1;
The class works by using the Errno package to find names for errors, and then scanning each in turn by calling the constant subroutine defined by Errno. When it finds a match, it creates a new subclass comprising the passed class name and the error name, and then blesses the error into that class before returning it. It also caches the error numbers and error names as it searches so it can avoid calling too many subroutines in future.
It is interesting to see just how easily a class can be created in code. The only thing that exists in the subclasses we create is a definition for the @ISA array – everything else is inherited. Despite this, they are still fully functional object classes, with methods that can be called, and, if we wish, they can even be subclassed. The ErrorClass object class is also an example of an object based on a scalar value, in this case, the value of $!. If no initial value is given, the constructor defaults to using the current value of $! as the basis for the object it is about to return. $! is an interesting value because it has both an integer and string definition, as we covered in Chapter 16. This object simply blesses that value, which retains its dual nature even when copied to a new variable, and allows the integer and string aspects of it to be retrieved by different accessors. Here is a short script that demonstrates how this class can be used: #!/usr/bin/perl # error.pl use warnings; use strict; use ErrorClass; # generate an error unless (open IN, "no.such.file") { my $error = new ErrorClass; print "Error object ", ref $error, "\n"; print "\thas number ", $error->number, "\n"; print "\thas message '", $error->message, "'\n"; print "It's not there! \n" if $error->isa("ErrorClass::ENOENT"); }
Running this script produces the output: Error object ErrorClass::ENOENT has number 2 has message 'No such file or directory' It's not there! Note that we will also get a warning from Perl because we have only used IN once – we never intended to use it anyway. The last line of this script demonstrates the real point of the class. Rather than checking numbers or using the Errno module to call a constant subroutine to determine the numeric value of the error, we convert the error into an object. We can now, in a properly object-oriented way, check the type of an error by looking to see what kind of error object it is.
Multiple Inheritance In object-oriented circles, the whole idea of multiple inheritance is fiercely debated, with many languages banning it outright. The idea of multiple inheritance (inheriting from several parents at once) seems simple enough, but while multiple inheritance is very powerful, it is also a very quick route to object-oriented chaos. Perl abstains from this debate – although it does allow it, it doesn't take great pains to support it either.
To inherit from more than one class is very simple indeed; we just place more than one class into the @ISA array: package My::Subclass; use Parent::Class::One; use Parent::Class::Two; our @ISA = qw(Parent::Class::One Parent::Class::Two);
When we come to call a method on My::Subclass, Perl will first check that the package actually implements it. If it does, there is no problem; the method is called and its value returned. If it does not, Perl checks for the method in the parent classes in @ISA, in the order in which they are defined. This means that Parent::Class::One is searched first, including all the classes that it in turn inherits from. If none of them satisfy the method call, Parent::Class::Two is searched. If both packages (or ancestors thereof) define the method then only the one found in Parent::Class::One is called, and the version in Parent::Class::Two is never seen. The manner of searching is therefore crucially affected by the order in which packages are named in @ISA. The following statement is at first sight the same as the one above, but in actuality differs in the order in which packages are searched, leading to potentially different results: our @ISA = qw(Parent::Class::Two Parent::Class::One);
Writing Classes for Multiple Inheritance Inheriting more than one class into our own is easy; we just add the relevant package names to our @ISA array. Writing a class so that it is suitable for multiple inheritance takes a little more thought. By definition, we can only return one object from a constructor, so having multiple constructors in different parent classes gives the subclass a dilemma; it cannot call a constructor in each parent class, since that will yield it several objects of which it can only return one. It also cannot just call one parent constructor, since that will prevent the other parent classes from doing their part in the construction of the inherited object. The same problem, but to a lesser degree, also affects other methods; do we call all the parent methods that apply, or just one? And in that case, how do we choose which one? In the case of constructors, the best way to implement the class so that it can easily be a co-parent with other classes is to separate the construction of the object from its initialization. We can then provide a separate method to initialize the object if another constructor has already created it. For example, here is the constructor for the Game::Card class we displayed at the start of the chapter: sub new { my ($class, $name, $suit) = @_; $class = (ref $class) || $class; my $self = bless {}, $class; $self->{'name'} = $name; $self->{'suit'} = $suit; return $self; }
Chapter 19
This constructor is fine for single inheritance, but it's no good for multiple inheritance, since the only way a subclass can initialize the name and suit properties is to create an object containing them. If it has already created an object, say with the Serial module, it cannot easily combine the two together. (Well, in fact, it can because both modules produce hash-based objects and combining hashes is easy. But that requires the subclass to mess with the underlying data of the object, which is bad for abstraction.) Instead, we split the constructor in two, like this: sub new { my $class = shift; $class = (ref $class) || $class; my $self = bless {}, $class; $self->initialize(@_); return $self; } sub initialize { my ($self, $name, $suit) = @_; $self->{'name'} = $name; $self->{'suit'} = $suit; }
With this done, we can now create a subclass that inherits from the Serial, Game::Card, and Debuggable object classes all at the same time, in order to create a serialized debuggable game card class: # Game/Card/Serial/Debuggable.pm package Game::Card::Serial::Debuggable; use strict; use Game::Card; use Serial; use Debuggable; our @ISA = qw(Serial Debuggable Game::Card); sub new { my $class = shift; $class = (ref $class) || $class; my $self = $class->Serial::new; $self->Debuggable::initialize(); $self->Game::Card::initialize(@_); return $self; }
We can test that this combined subclass works using the following script: #!/usr/bin/perl # dsgamecard.pl use warnings; use strict;
Object-oriented Perl
use Game::Card::Serial::Debuggable; my $card = new Game::Card::Serial::Debuggable('Ace', 'Spades'); print $card->fullname, " (", $card->serial, ") \n"; $card->debug(1); $card->debug(1, "A debug message on object #", $card->serial, "\n");
The output from this script should be: Ace of Spades (1) A debug message on object #1 Although this is a simplistic example, classes like this are genuinely useful, simply because they combine the features of multiple parent classes. Both the Serial and Debuggable object classes can be added to any other object class to provide serial numbers and debugging support. This is one of the more common uses of multiple inheritance, and a good example of how object-oriented programming encourages code reuse. In this example we used the Serial module to create the object. In practice, it does not matter which of the three parents actually creates the object (unless, of course, one of them does not have a separate initializer method). Analyzing the pattern here, we can see that we could in fact call initializer methods on all three parents, Serial::initialize, Debuggable::initialize, and Game::Card::initialize, and create the object ourselves: sub new { my $class = shift; $class = (ref $class) || $class; my $self = bless {}, $class; $self->Serial::initialize(); $self->Debuggable::initialize(); $self->Game::Card::initialize(@_); return $self; }
One problem with this scenario is how to deal with arguments if more than one parent initializer method needs arguments. In this case we did not need to worry about that because only the Game::Card class needs arguments to initialize it. Another is that, implicitly, all the object classes involved have to use a hash as their underlying data representation, since they all presume a hash. However, this is a problem for regular inheritance too, rather than being a specific issue for multiple inheritance. As a rule, alternative object implementations can be good for isolated classes, but they do not work well when inheritance enters the picture.
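One common way around the argument problem, sketched here with invented class names, is to pass named arguments as a hash and let each parent initializer pick out only the keys it recognizes:

```perl
#!/usr/bin/perl
# namedinit.pl - a sketch of named-argument initializers; class names invented
use warnings;
use strict;

package Sized;
sub initialize {
    my ($self, %args) = @_;
    $self->{'size'} = $args{'size'};
}

package Named;
sub initialize {
    my ($self, %args) = @_;
    $self->{'name'} = $args{'name'};
}

package Widget;
our @ISA = qw(Sized Named);

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, (ref $class) || $class;
    # each parent initializer receives the full argument hash
    # and extracts only the keys it cares about
    $self->Sized::initialize(%args);
    $self->Named::initialize(%args);
    return $self;
}

package main;
my $widget = Widget->new(name => 'gadget', size => 3);
print $widget->{'name'}, " has size ", $widget->{'size'}, "\n";
```

This costs a little flexibility (positional arguments are gone) but means we never have to decide which parent gets which slice of the argument list.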
Drawbacks with Multiple Inheritance Multiple inheritance can get very messy very quickly, which is one of the reasons why many languages disapprove of it. The normal inheritance hierarchy is rooted at the top and spreads out and down, each subclass adding to and overriding the methods of its parent. Multiple inheritance stands this arrangement on its head, with each subclass inheriting methods from more than one parent, which in turn may inherit methods from more than one grandparent.
This is problematic for two reasons. The first, and nastiest, is that we can end up with a cyclic dependency, where an ancestor inherits from one of its descendants. This is possible in single inheritance too, but it is easier to do accidentally with multiple inheritance. Fortunately, Perl can detect this and aborts if we try to create a cyclic dependency. The second is simply that objects with multiple inheritance can quickly become hard to understand and even harder to debug, because their behavior is no longer predictable: we do not know in advance which parent class will satisfy a given method call without studying all the parent methods, their ancestors, and the contents of the @ISA arrays of all the classes concerned. A particularly nasty but all too common example of object inheritance that is hard to predict is where two parent classes inherit from the same grandparent or, more generally, where two ancestors inherit from the same common ancestor. This is programmatic incest, so to speak. We end up with recombinant branches where a method in the common ancestor can potentially be reached by either of two different routes from the original subclass, entirely dependent on the order of parent classes in the @ISA array. There are arguments for designing applications that use object hierarchies like this, but in general recombinant inheritance is a sign that the design of our classes is overly complex and that there is probably a different organization of classes that would better suit our purposes. There is also a rather pragmatic reason why multiple inheritance can cause us problems: it prevents the effective use of prototypes. In many languages, methods can be selected not just by their name but also by the type and number of arguments that they accept. Perl does not support this model, so prototypes on inherited methods are ignored.
(Having said this, the Class::Multimethods module on CPAN does provide an argument-list based mechanism for searching for methods in parent classes, if that is what we want to do.) The bottom line for designing classes with multiple inheritance is therefore to keep them as simple as possible, and avoid complex inverted trees of parent classes if at all possible.
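The recombinant hazard is easy to demonstrate with a minimal sketch. The package names here are invented for illustration, and Perl's default depth-first method search is assumed (no mro pragma):

```perl
#!/usr/bin/perl
# diamond.pl - a sketch of recombinant (diamond) inheritance; names invented
use warnings;
use strict;

package Ancestor;
sub identify { return "Ancestor" }

package Left;
our @ISA = qw(Ancestor);
sub identify { return "Left" }   # overrides the ancestor's method

package Right;
our @ISA = qw(Ancestor);         # no override here

package Child;
our @ISA = qw(Right Left);

package main;
# Perl's default depth-first search tries Right, then Right's parent
# Ancestor, and finds 'identify' there - the override in Left is never
# reached, even though Left is also a parent of Child
print Child->identify, "\n";     # Ancestor
```

Swapping the order to `qw(Left Right)` would find the override in Left first, which is exactly the kind of order-dependent surprise described above.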
A 'UNIVERSAL' Constructor As we have already shown, all object classes implicitly inherit from the UNIVERSAL class and can therefore always make use of the can, isa, and VERSION methods that it provides. With a little thought, we can also add our own methods to UNIVERSAL to make them generic across all the classes in our application. This is not something to be done lightly, because it can cause unexpected problems by inadvertently satisfying a method call that was intended for a different parent class. Having said this, there are some methods so common among objects that it is worth creating a generic version of them. For instance, we introduced the concept of initializer methods when we talked about multiple inheritance. The constructor we created as a result was totally generic, so we could place it in UNIVERSAL to give all our object classes a new constructor that calls an initializer, and then define only an initializer rather than a constructor. To place anything into UNIVERSAL we only have to specify it as the package name. Here is an example module that adds a generic new constructor to all objects: # Universal/Constructor.pm package UNIVERSAL; use strict;
sub new { my $class = shift; $class = (ref $class) || $class; my $self = bless {}, $class; $self->initialize(@_); return $self; } sub initialize { } 1;
To use this constructor we now only have to use the module and provide an initializer method if we want to actually perform some initialization: package My::Class; use strict; use Universal::Constructor; sub initialize { my ($self, $value1, $value2) = @_; $self->{'attr1'} = $value1; $self->{'attr2'} = $value2; return $self; } 1;
Note that we don't need an @ISA definition for this class because it implicitly inherits from UNIVERSAL already. We could similarly, assuming we based all our objects on a hash, create universal accessors, universal mutators, or a universal accessor/mutator of the types we created before. If any object class wants its own constructor, it only has to override the new provided by our universal class, or just not use it at all. Whether putting methods into UNIVERSAL is a good idea is debatable. Certainly, overcomplex or task-specific methods are a bad idea, as is any kind of class data. If most of our object classes use the same methods then it can be a workable approach, but in the end it is only one line less than using a different package name and placing it in an @ISA definition. It is also possible that we want to avoid having another class satisfy a method provided by the UNIVERSAL package. It would be strange, but not impossible, for a parent class to provide a can method (a class that defines an AUTOLOAD method might want to do this to return a true result for methods that it supports but has not yet defined). If we need to avoid calling it, we can explicitly add UNIVERSAL to the start of the @ISA array for our object class (or at least in front of the class with the troublesome can method). For example: our @ISA = qw(UNIVERSAL Problem::Parent Other::Class);
Is-a Versus Has-a So far we have only talked about one kind of relationship between objects, the is-a relationship. However, not all object relationships can be expressed in terms of is-a. It may be true that a tire is-a wheel, but it does not make sense to say that a deck of playing cards is-a playing card. The other form of relationship is has-a; a deck of cards has-a collection of cards in it. Perl does not provide support for the has-a relationship inside the language in the same way that the @ISA array supports is-a relationships. However, it does not need to. Since an object is just a scalar value, one object has another just by storing it as an attribute. Using objects to manage groups of other objects is a very effective technique; such objects are sometimes called containers or 'container objects'. Extending our Game::Card example, here is a new class, Game::Deck, which manages a list of Game::Card objects. This new class allows us to create a new deck of cards of our chosen suits and names, and then deal cards from it. We can also replace them on the top or bottom, peek at any card in the deck, and print out the whole deck: # Game/Deck.pm package Game::Deck; use strict; use Game::Card; ### Constructor sub new { my ($class, $suits, $names, $cardclass) = @_; my $self = bless {}, $class; if ($suits) { # create cards according to specified arguments # these allow us to specify a single suit or name $suits = [$suits] unless ref $suits; $names = [$names] unless ref $names; # record the names and suits for later $self->{'suits'} = $suits; $self->{'names'} = $names; # generate a new set of cards my @cards; foreach my $suit (@$suits) { foreach my $name (@$names) { my $card = new Game::Card($name, $suit); bless $card, $cardclass if defined $cardclass; push @cards, $card; } } # add generated cards to deck $self->{'cards'} = \@cards; } else { # initialize an empty deck $self->{'cards'} = []; } return $self; }
### Cards, Suits and Names # return one or more cards from deck by position sub card { my ($self, @range) = @_; return @{$self->{'cards'}} [@range]; } sub suits { return @{shift->{'suits'}}; } sub names { return @{shift->{'names'}}; } ### Deal and Replace # shuffle cards randomly sub shuffle { my $self = shift; # create a hash of card indices and random numbers my %order = map {$_ => rand()} (0..$#{$self->{'cards'}}); # rebuild the deck using indices sorted by random number my @newdeck; foreach (sort {$order{$a} <=> $order{$b}} keys %order) { push @newdeck, $self->{'cards'}[$_]; } # replace the old order with the new one $self->{'cards'} = \@newdeck; } # deal cards from the top of the deck sub deal_from_top { my ($self, $qty) = @_; return splice @{$self->{'cards'}}, 0, $qty; } # deal cards from the bottom of the deck sub deal_from_bottom { my ($self, $qty) = @_; return reverse splice @{$self->{'cards'}}, -$qty; } # replace cards on the top of the deck sub replace_on_top { my ($self, @cards) = @_; unshift @{$self->{'cards'}}, @cards; } # replace cards on the bottom of the deck sub replace_on_bottom { my ($self, @cards) = @_;
push @{$self->{'cards'}}, reverse @cards; } ### Nomenclature # return string for specified cards sub fullnames { my ($self, @cards) = @_; my $text; foreach my $card (@cards) { $text .= $card->fullname. "\n"; } return $text; } # return string of whole deck ('fullname' for Deck class) sub fulldeck { my $self = shift; return $self->fullnames (@{$self->{'cards'}}); } # print out the whole deck sub print { my ($self, @range) = @_; if (@range) { print $self->fullnames (@{$self->{'cards'}}[@range]); } else { print $self->fulldeck; } } 1;
To use this class we first call the constructor with a list of suits and card names. We can then manipulate the deck according to our whims. Here is a short script that puts the class through its paces. Note how simple the construction of a standard 52-card deck is: #!/usr/bin/perl # gamedeck.pl use warnings; use strict; use Game::Deck; # create a standard deck of playing cards my $deck = new Game::Deck( ['Spades', 'Hearts', 'Diamonds', 'Clubs'], ['Ace', 2..10, 'Jack', 'Queen', 'King'], ); # spread it out, shuffle it, and spread it out again print "The unshuffled deck looks like this: \n"; $deck->print; $deck->shuffle; print "After shuffling it looks like this: \n"; $deck->print;
# peek at, deal, and replace a card print "Now for some card cutting... \n"; print "\tTop card is ", $deck->card(0)->fullname, "\n"; my $card = $deck->deal_from_top(1); print "\tDealt ", $card->fullname, "\n"; print "\tTop card is now ", $deck->card(0)->fullname, "\n"; $deck->replace_on_bottom($card); print "\tReplaced ", $card->fullname, " on bottom \n"; print "The deck now looks like this: \n"; $deck->print;
As we briefly mentioned earlier in the chapter, if the result of calling an object method is another object then we can chain method calls together and avoid storing the intermediate objects in temporary variables. We can see this happening in the above script in the expression: $deck->card(0)->fullname
Because the has-a relationship fits this model very well, the meaning of this expression is obvious just by inspection, give me the full name of the first card in the deck. The concept of the deck implemented by this object class extends readily to other kinds of deck, for example, a hand of cards is really just a small deck; it has an order with a top and bottom card, we can deal from it, and so on. So, to deal a hand of cards we just create a hand deck and then deal cards from the main deck, adding them to the hand. This extension to the test script shows this in action (we have dealt from the bottom just for variety):
# gamedeck.pl (continued) # deal a hand of cards - a hand is just a small deck my $hand = new Game::Deck; my @cards = $deck->deal_from_bottom(7); $hand->replace_on_top(@cards); print "Dealt seven cards from bottom and added to hand: \n"; $hand->print;
This example demonstrates how effective a has-a relationship can be when it fits the requirement for the design of a class.
Autoloading Methods
There is no intrinsic reason why we cannot use the autoloader to load methods on their first use, just as we can load subroutines on their first use in ordinary non-object-oriented packages; both the SelfLoader and AutoLoader modules, which we covered in Chapter 10, will work just as well with methods as they do with subroutines. We can also generate methods on the fly using the autoloader if we wish. Typically this is useful for accessor or mutator methods when we do not want to define them all at once (or indeed at all, unless we have to). Fortunately, Perl searches for methods in parent classes before it resorts to the autoloader; otherwise this technique would be impractical. As an example of how we can adapt a class to use autoloading, here is a generic accessor/mutator method and the first two of a long line of specific methods that use it, borrowed from an extended version of the Game::Card object class.
sub _property ($$;$) { my ($self, $attr, $value) = @_; if (defined $value) { my $oldv = $self->{$attr}; $self->{$attr} = $value; return $oldv; } return $self->{$attr}; } sub propone ($;$) {return shift->_property('propone', @_);} sub proptwo ($;$) {return shift->_property('proptwo', @_);} ...
Even though this is an efficient implementation, it becomes less than appealing if we have potentially very many attributes to store. Instead, we can define an AUTOLOAD method to do it for us. Here is a complete and inheritable object class that takes this approach, allowing us to set and get any attribute we please by calling an appropriately named method: # Autoloading.pm package Autoloading; use strict; sub new {return bless {}, shift} sub _property ($$;$) { my ($self, $attr, $value) = @_; if (defined $value) { my $oldv = $self->{$attr}; $self->{$attr} = $value; return $oldv; } return $self->{$attr}; } sub AUTOLOAD { our $AUTOLOAD; my $attr; $AUTOLOAD =~ /([^:]+)$/ and $attr = $1; # abort if this was a destructor call return if $attr eq 'DESTROY'; # otherwise, invent a method and call it eval "sub $attr {return shift->_property('$attr', \@_);}"; shift->$attr(@_); } 1;
To test out this class we can use the following script: #!/usr/bin/perl # autol.pl use warnings; use strict; use Autoloading; my $object = new Autoloading; $object->name('Styglian Enumerator'); $object->number('say 6'); print $object->name, " counts ", $object->number, "\n";
The output should be: Styglian Enumerator counts say 6 This class is, in essence, very similar to examples we covered in the discussion on autoloading in Chapter 10, only now we are in an object-oriented context. Because of that, we have to take an additional step and suppress the call to DESTROY that Perl makes when an object falls out of scope. Otherwise, if this class is inherited by a subclass that defines no DESTROY method of its own, we end up creating an attribute called DESTROY on the object (shortly before it is recycled). In this case it would not have done any harm, but it is wasteful. Another problem with this class is that it is not very selective; it allows any method to be defined, irrespective of whether the class or any subclass actually requires or intends to provide that method. To fix this, we need to add a list of allowed fields to the class and then check them in the autoloader. The following modified example does this, and also gives the object a reference to the list of allowed fields (created as a hash for easy lookup) so that the mechanism can be properly inherited: # Autoloading2.pm package Autoloading2; use strict; use Carp; my %attrs = map {$_ => 1} qw(name number rank); sub new { my $class = shift; my $self = bless {}, $class; $self->{'_attrs'} = \%attrs; return $self; } sub _property ($$;$) { my ($self, $attr, $value) = @_; if (defined $value) { my $oldv = $self->{$attr}; $self->{$attr} = $value; return $oldv; }
return $self->{$attr}; } sub AUTOLOAD { our $AUTOLOAD; my $attr; $AUTOLOAD=~/([^:]+)$/ and $attr=$1; # abort if this was a destructor call return if $attr eq 'DESTROY'; # otherwise, invent a method and call it my $self=shift; if ($self->{'_attrs'}{$attr}) { eval "sub $attr {return shift->_property('$attr',\@_);}"; $self->$attr(@_); } else { my $class=(ref $self) || $self; croak "Undefined method $class\:\:$attr called"; } } sub add_attrs { my $self=shift; map {$self->{'_attrs'}{$_}=1} @_; } 1;
This module presets the attributes name, number, and rank as allowed, so any attempt to set a different attribute, for example size, on this class (or a subclass) will be met with an appropriate error, directed to the line and file of the offending method call by the Carp module. Here is a script that tests it out: #!/usr/bin/perl # auto2.pl use warnings; use strict; use Autoloading2; my $object = new Autoloading2; $object->name('Styglian Enumerator'); $object->number('say 6'); $object->size('little'); # ERROR print $object->name, " counts ", $object->number, "\n"; print "It's a ", $object->size, " one.\n";
When we run this program we get: Undefined method Autoloading2::size called at auto2.pl line 12 Subclasses can create their own set of attributes simply by defining a new hash and assigning a reference to it to the special _attrs key. However, any subclass that also wishes to allow the methods permitted by the parent needs to combine its own attribute list with that defined by the parent. This is done most easily by calling the constructor in the parent, and then appending new fields to the hash attached to the _attrs hash key.
Since this is a generic requirement for all subclasses, the first thing we should do is add a method to the Autoloading2 class that performs this function for subclasses. Here is a method that does what we want, taking a list of attributes and adding them to the hash. Again, we have it work through the object attribute to preserve inheritance: # added to 'Autoloading2' class sub add_attrs { my $self = shift; map {$self->{'_attrs'}{$_} = 1} @_; }
Having added this method to the parent Autoloading class, we can now have subclasses use it to add their own attributes. Here is a subclass that uses it to add two additional attributes to those defined by the parent class: #!/usr/bin/perl # Autoloading::Subclass package Autoloading::Subclass; use warnings; use strict; use Autoloading; our @ISA = qw(Autoloading); my @attrs = qw(size location); sub new { my $class = shift; my $self = $class->SUPER::new(); $self->add_attrs(@attrs); return $self; } 1;
If we adjust our test script to use and create an object of this new class it now works as we intended: Styglian Enumerator counts say 6 It's a little one. Although in this example our main autoloading class defines its own attributes, by removing the hard-coded attributes defined in the Autoloading module, we could create a generic autoloading module suitable for adding to the @ISA array of any object wishing to make use of it.
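Such a generic module might look like the following sketch. The name GenericAutoload is invented, and for brevity this variant dispatches through AUTOLOAD on every call rather than eval-ing a new method the first time; subclasses (or callers) declare permitted attributes via the constructor or add_attrs:

```perl
# GenericAutoload.pm - hypothetical generic autoloading base class (sketch)
package GenericAutoload;
use strict;
use Carp;

sub new {
    my $class = shift;
    my $self = bless {}, (ref $class) || $class;
    $self->{'_attrs'} = {};    # start with no permitted attributes
    $self->add_attrs(@_);      # callers may pass an initial list of names
    return $self;
}

sub add_attrs {
    my $self = shift;
    $self->{'_attrs'}{$_} = 1 foreach @_;
}

sub _property {
    my ($self, $attr, $value) = @_;
    if (defined $value) {
        my $oldv = $self->{$attr};
        $self->{$attr} = $value;
        return $oldv;
    }
    return $self->{$attr};
}

sub AUTOLOAD {
    our $AUTOLOAD;
    my ($attr) = $AUTOLOAD =~ /([^:]+)$/;
    # abort if this was a destructor call
    return if $attr eq 'DESTROY';
    my $self = shift;
    croak "Undefined method $attr called"
        unless $self->{'_attrs'}{$attr};
    return $self->_property($attr, @_);
}
1;
```

A class wishing to use it only needs to add GenericAutoload to its @ISA array and declare its attribute names.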
Keeping Data Private In other (some might say proper) object-oriented languages, methods, class data, and object data may be declared private or public, and in some cases, with varying degrees of privacy (somewhere between absolute privacy and complete public access). Perl does not support any of these concepts directly, but we can provide for all of them with a little thought. We have already seen and covered private methods in the discussion on inheritance. Private class and object data are, however, another matter.
Private Class Data Keeping class data private is very simple indeed: we just declare our global variables with my to give them a lexical scope of the file that contains them, and to prevent them having entries in the symbol table. Since only code in the same file can access them, no other object class can touch class data declared this way (not even by declaring the same package name in a different file). If we want to keep class data private to only a few methods within a file, we can do that too, by adding braces to define a new lexical scope and then defining a lexically scoped variable and the methods that use it within that scope. To emphasize the fact that it is private to the class, we use underscores in the variable name. Here is a simple example: package My::Class; use strict; # create private data and methods to handle it { my $_class_private_data; sub _set_data {shift; $_class_private_data = shift;} sub _get_data {return $_class_private_data;} } # methods outside the scope of the private data must use the methods provided to access it
sub store_value { my ($self, $value) = @_; $self->_set_data($value); } sub retrieve_value { my $self = shift; return $self->_get_data(); } 1;
Strictly speaking this is not object-oriented programming at all, but a restriction of scope within a file using a block, a technique we have seen before. However, 'private class data' does not immediately translate into 'lexical variables and block scope' without a little thought.
Private Object Data Perl's general approach to privacy is that it should be respected, rather than enforced. While it is true that we can access the data underlying any blessed reference, the principles of object-oriented programming request that we do not, and if we do so, it is on our own head if things break later. However, if we really want to enforce the privacy of object data there are a few ways that we can accomplish it. One popular way is to implement the object not in terms of a reference to a basic data type like a hash or array, but as a reference to a subroutine, which alone has the ability to see the object's internal data.
This technique is known as a closure, and we introduced it in Chapter 7. Closures in object-oriented programming are no different except that each reference to a newly created closure is turned into an object. The following example demonstrates one way of implementing an object as a closure: # Closure.pm package Closure; use strict; use Carp; my @attrs = qw(size weight shape); sub new { my $class = shift; $class = (ref $class) || $class; my %attrs = map {$_ => 1} @attrs; my $object = sub { my ($attr, $value) = @_; unless (exists $attrs{$attr}) { croak "Attempt to ", (defined $value)?"set":"get", " invalid attribute '$attr'"; } if (defined $value) { my $oldv = $attrs{$attr}; $attrs{$attr} = $value; return $oldv; } return $attrs{$attr}; }; return bless $object, $class; } # generate attribute methods for each valid attribute foreach my $attr (@attrs) { eval "sub $attr {\$_[0]('$attr', \$_[1]);}"; } 1;
And here is a short script to show it in action: #!/usr/bin/perl # closure.pl use warnings; use strict; use Closure; my $object = new Closure; $object->size(10); $object->weight(1.4); $object->shape('pear');
print "Size:", $object->size, " Weight:", $object->weight, " Shape:", $object->shape, "\n"; print "Also size:", &$object('size'), "\n";
When run, the output of this script should be: Size:10 Weight:1.4 Shape:pear Also size:10 This object class uses a constructor to create and return a blessed code reference to a closure. The closure subroutine itself is just what it seems to be; a subroutine, not a method. However, it is defined inside the scope of the new method, so the lexical variables $class and %attrs created at the start of new are visible to it. At the end of new, both lexical variables fall out of scope, but %attrs is used inside the anonymous subroutine whose reference is held by $object, so it is not recycled by the garbage collector. Instead, the variable becomes persistent, surviving as long as the object created from the closure reference does, and becomes the internal storage for the object's attributes. Each time new is called, a fresh %attrs is created and used by a new closure subroutine, so each object is distinct and independent. We generate the accessor/mutator methods using an eval, and also demonstrate that, with a little ingenuity, we can avoid naming the attributes in more than one place. This code is evaluated when the module is compiled, so we do not lose any performance as a result of doing it this way. Another efficiency measure we could use, in order to improve the above example, is to reduce the size of the closure subroutine, since Perl holds a compiled copy of this code for every object we create. With a little inspection, we can see that by passing a reference to the %attrs hash to an external subroutine we can reduce the size of the closure and still retain its persistent data. In fact, we can reduce it to a single line: # Closure2.pm package Closure; use strict; use Carp; my @attrs = qw(size weight shape); sub new { my $class = shift; $class = (ref $class) || $class; my %attrs = map {$_ => 1} @attrs; my $object = sub { return _property_sub(\%attrs, @_); }; return bless $object, $class; }
sub _property_sub { my ($href, $attr, $value) = @_; unless (exists $href->{$attr}) { croak "Attempt to ", (defined $value)?"set":"get", " invalid attribute '$attr'"; } if (defined $value) { my $oldv = $href->{$attr}; $href->{$attr} = $value; return $oldv; } return $href->{$attr}; } # generate attribute methods for each valid attribute foreach my $attr (@attrs) { eval "sub $attr {\$_[0]('$attr', \$_[1]);}"; } 1;
The _property_sub subroutine in this example bears more than a little resemblance to the _property method we used in earlier examples. Here though, the first argument is an ordinary hash reference, and not the blessed hash reference of an object. We can test this version of the Closure class with the same script as before, and with the same results. However, note the last line of that test script: print "Also size:", &$object('size'), "\n";
As this shows, we can call the closure directly rather than via one of its officially defined accessor methods. This does no harm, but if we want to put a stop to it we can do so by checking the identity of the caller. There are several ways to do this, of which the fastest is probably to use the caller function to determine the package of the calling subroutine. If it is the same package as the closure's, then it must be an accessor; otherwise it is an illegal external access. Here is a modified version of the anonymous subroutine from the new constructor that performs this check: my $object = sub { croak "Attempt to bypass accessor '$_[0]'" if caller ne __PACKAGE__; return _property_sub(\%attrs, @_); };
Now if we try to access the size attribute directly we get: Attempt to bypass accessor 'size' at closure.pl line 10 Now we have an object class that cannot be used in any way other than the way we intended.
Destroying Objects Just as we can create objects with a constructor, we can destroy them with a destructor. For many objects this is not a step we need to take, since eliminating all references to an object will cause Perl to perform garbage collection and reclaim the memory it is using. This is also known more generally as resource deallocation. However, for some objects this is not enough; if they hold things like shared memory segments, filehandles, or other resources that are not directly contained within the object itself, then these also need to be freed and returned to the system. Fortunately, whenever Perl is given a blessed reference to destroy it checks for a destructor method in the object's class and, if one is defined, calls it. As an example, here is a destructor for a putative object that contains a local filehandle and a network socket: sub DESTROY { my $self = shift; # close a filehandle... close $self->{'filehandle'}; # shut down a network socket (2 = stop both reading and writing) shutdown $self->{'socket'}, 2; }
Of course, it is not too likely a real object will have both of these at once, but it demonstrates the point. Any system resource held by the object should be destroyed, freed up, or otherwise returned in the destructor. We can also use destructors for purely informational purposes. Here is a destructor for the Serial object class, which takes advantage of the fact that each object has a unique serial number to log a message to the effect that it is being destroyed: # destructor for 'Serial' class ... sub DESTROY { my $self = shift; print STDERR ref($self), " serial no ", $self->serial, " destroyed\n"; }
With this destructor added to the object class, we can now run a program like this to create a collection of serial objects: #!/usr/bin/perl # serialdestroy.pl use warnings; use strict; use Serial; my @serials; foreach (1..10) { push @serials, new Serial; } my $serial = new Serial(2001);
Object-oriented Perl
When the program ends Perl calls the destructor for each object before discarding the reference, resulting in the following output:

Serial serial no 10 destroyed
Serial serial no 9 destroyed
Serial serial no 8 destroyed
Serial serial no 7 destroyed
Serial serial no 6 destroyed
Serial serial no 5 destroyed
Serial serial no 4 destroyed
Serial serial no 3 destroyed
Serial serial no 2 destroyed
Serial serial no 1 destroyed
Serial serial no 2001 destroyed
Other than the fact that it is called automatically, a DESTROY method is no different from any other object method. It can also, from a functional programming point of view, be considered the object-oriented equivalent of an END block.
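To illustrate, here is a minimal self-contained sketch (the Tracked class and its id attribute are invented for this example) showing the destructor firing as soon as the object's only reference goes out of scope:

```perl
#!/usr/bin/perl
# destroydemo.pl - when does DESTROY fire? (hypothetical 'Tracked' class)
use warnings;
use strict;

package Tracked;

sub new {
    my ($class, $id) = @_;
    return bless {id => $id}, $class;
}

sub DESTROY {
    my $self = shift;
    print "Tracked object $self->{id} destroyed\n";
}

package main;

{
    my $object = Tracked->new(1);
    print "inside block\n";
}   # last reference gone: Perl calls DESTROY here, before 'after block'
print "after block\n";
```

Like an END block, the destructor runs at a well-defined point, but per object rather than per program.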
Destructors and Inheritance
We never normally need to call a destructor directly, since it is called automatically whenever the object passes out of scope and is garbage-collected. However, in the case of inherited classes we have a problem, since destructors are called the same way any other method is called: Perl searches for one and calls the first it finds in the hierarchy of classes from child to parent. If a subclass does not define a destructor then the destructor in the parent class will be called and all is well. However, if the subclass does, it overrides the parent's destructor. In order to make sure that all aspects of an object are properly destroyed, we need to take steps to call parent destructors whenever we do define a destructor of our own. For objects that inherit from only one parent we can do that by calling the method SUPER::DESTROY:
sub DESTROY {
    my $self = shift;
    # destroy our own resources (e.g. a subhash of values for this class):
    undef $self->{'our_own_hash_of_data'};
    # call parent's destructor
    $self->SUPER::DESTROY;
}
We should take care to destroy our own resources first. When writing constructors it is good practice to call the parent constructor before doing our own initialization; similarly, when destroying an object we should destroy our own resources first and then call the parent destructor (the reverse order). The logic behind this is simple: we may need to use the parent class to destroy our resources, and destroying the parts of the object it relies on first might prevent us from doing that. Alternatively, and more interestingly, we can re-bless the object into the class of its parent. This is analogous to peeling an onion, where each subclass is a layer. Once the subclass has destroyed the object's resources that pertain to it, what is left is, at least for the purposes of destruction, an object of the parent class:
sub DESTROY {
    my $self = shift;
    # destroy our own resources
    undef $self->{'our_own_hash_of_data'};
    bless $self, $ISA[0];
}
The parent class's name is given by the first element of the @ISA array – since we only have one parent, it must be element index zero. What actually happens here is that we intercept Perl's garbage collection mechanism with our DESTROY method, remove the resources we are interested in, and then toss the object back to the garbage collector by allowing the reference to go out of scope a second time. But since we re-blessed the object, Perl will now look for the DESTROY method starting at the parent class instead. Although elegant, this scheme does have one major flaw: it fails if any subclass uses multiple inheritance. In this case, re-blessing the object can cause considerable confusion when the object is passed on to destructors in different parents of the subclass. Both the examples in the following section would potentially fail if the first parent destructor re-blessed the object before the second saw it.
Destructors and Multiple Inheritance

In object classes that use multiple inheritance we have to get more involved, since SUPER:: will only call one parent destructor:

sub DESTROY {
    my $self = shift;
    # ...destroy our own resources...
    $self->First::Parent::Object::DESTROY;
    $self->Second::Parent::Object::DESTROY;
}
This is a little ugly, however, since it involves writing the names of the parent packages explicitly. If we change the contents of the @ISA array then this code will break. It also depends on us knowing that the parent classes actually have a DESTROY method. A better way is to iterate through the @ISA array and test each parent for a DESTROY method:

sub DESTROY {
    my $self = shift;
    foreach (@ISA) {
        if (my $destructor = $_->can('DESTROY')) {
            $self->$destructor;
        }
    }
}
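As a concrete illustration of this pattern (the class names are invented for this sketch), here are two parent classes and a child whose destructor walks @ISA so that both parents get to clean up:

```perl
#!/usr/bin/perl
# multidestroy.pl - sketch: calling every parent destructor via @ISA
use warnings;
use strict;

package First;
sub DESTROY { print "First parent cleaned up\n" }

package Second;
sub DESTROY { print "Second parent cleaned up\n" }

package Child;
our @ISA = ('First', 'Second');

sub new { return bless {}, shift }

sub DESTROY {
    my $self = shift;
    # ...destroy our own resources here, then let each parent clean up...
    foreach my $parent (@ISA) {
        if (my $destructor = $parent->can('DESTROY')) {
            $self->$destructor;
        }
    }
}

package main;
{
    my $object = Child->new;
}   # both parents' messages appear here
```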
Overloading Operators

So far we have looked at objects from the point of view of methods. We can get quite a long way like this, but sometimes it becomes desirable to treat objects as if they were conventional data types. For instance, if we have an object class whose instances can in some sense be added or concatenated together, we would like to be able to say:

$object1 += $object2;
Or, rather than presuming we have an add method for the purpose: $object1->add($object2);
Unfortunately, we cannot add objects or easily apply any other operator to them, because they are at heart just references, and obey the same rules that normal references do. Instead, what we want to do is redefine the + operator (and by inference, the += operator) so that it works correctly for particular object classes too.
Basic Overloading

This redefinition of standard operators for objects is known as operator overloading, and Perl supports it through the overload module. Presuming that we have an add method for a given object class, we can assign it to the + operator so that when an object of our class is added to anything, our method is called to perform the actual 'addition' instead of the default + operator supplied by Perl. Here is an example of an object class that handles addition and addition assignment operations:

# addition.pm
package My::Value::Class;
use strict;
use overload
    '+'  => \&add,
    '+=' => \&addassign;

sub new {
    my ($class, $value) = @_;
    $class = ref($class) || $class;   # allow $object->new as well as Class->new
    my $self = bless {}, $class;
    $self->value($value);
    return $self;
}

sub value {
    my ($self, $value) = @_;
    $self->{'value'} = $value if defined $value;
    return $self->{'value'};
}

sub add {
    my ($operand1, $operand2) = @_;
    my $result = $operand1->new;
    $result->value($operand1->value + $operand2->value);
    return $result;
}
sub addassign {
    my ($operand1, $operand2) = @_;
    $operand1->value($operand1->value + $operand2->value);
    return $operand1;   # the result is assigned back to the left operand
}

1;
The add and addassign methods handle the + and += operators respectively. Unfortunately, they only work when both operands are objects of type My::Value::Class. If one operand is not of the right type then we have to be more cautious. Perl automatically flips the operands around so that the first one passed is always of the object type for which the method has been called. Since binary operations take two operands, we have broken our usual habit of calling the object $self: $operand1 makes more sense in this case. Consequently, we only have to worry about the type of the second operand. In this example we can simply treat it as a numeric value and combine it with the value attribute of our object. To test whether the operand is of a type we can add as an object, we use ref and isa:

sub add {
    my ($operand1, $operand2) = @_;
    my $result = $operand1->new;
    if (ref $operand2 and $operand2->isa(ref $operand1)) {
        $result->value($operand1->value + $operand2->value);
    } else {
        $result->value($operand1->value + $operand2);
    }
    return $result;
}
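The effect of the type check can be seen with a small self-contained class (the Number class below is invented for this sketch, with the value held directly in the hash rather than through an accessor):

```perl
#!/usr/bin/perl
# addmixed.pl - sketch: handling object and non-object operands in '+'
use warnings;
use strict;

package Number;
use overload '+' => \&add;

sub new {
    my ($class, $value) = @_;
    return bless {value => $value}, ref($class) || $class;
}

sub add {
    my ($operand1, $operand2) = @_;
    # an object of our class contributes its value; anything else is
    # treated as a plain number
    my $other = (ref $operand2 and $operand2->isa(ref $operand1))
              ? $operand2->{value} : $operand2;
    return Number->new($operand1->{value} + $other);
}

package main;
my $x = Number->new(2);
my $sum = $x + Number->new(3);    # object + object
print $sum->{value}, "\n";        # 5
my $mixed = $x + 10;              # object + number
print $mixed->{value}, "\n";      # 12
```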
Determining the Operand Order and Operator Name

For some operations the order of the operands can be important, a good example being subtraction, and another being concatenation. In order to deal with this, operator methods are called with a third Boolean flag that indicates whether the operands were given in reverse order – or, to put it another way, whether the object for which this method is implemented came second rather than first. If set, this flag also tells us by implication that the first operand was not of the correct object class (and possibly not even an object). As an example, here is a concatenation method that uses this flag. It also checks the type of the other operand passed and returns a different type of result depending on the order; if a plain string was passed as the first argument, a plain string is returned, otherwise a new object is returned. To keep the example simple, we have presumed an object whose textual representation is in the attribute name:

sub concatenate {
    my ($operand1, $operand2, $reversed) = @_;
    if ($reversed) {
        # if the string came first, return a string
        return $operand2 . $operand1->{'name'};
    } else {
        # if the object came first, return an object
        my $result = $operand1->new();
        $result->{'name'} = $operand1->{'name'} . $operand2;
        return $result;
    }
}
Now all we have to do is overload the concatenation operator with this method: use overload '.' => \&concatenate;
Occasionally we might also want to know the name of the operator that triggered the method. This is passed as a fourth argument to all operator methods, though we do not usually need it. The obvious and probably most common application is for error messages (another is the nomethod operator method discussed later). For example, to reject operands unless they both belong to the right class (or a subclass) we can do this:

sub an_operator {
    my ($op1, $op2, $rev, $name) = @_;
    unless (ref $op2 and $op2->isa(__PACKAGE__)) {
        croak "Cannot use '$name' on non-", __PACKAGE__, " operands";
    }
}
Overloading Comparisons

An interesting operator to overload is the <=> numeric comparison operator. By providing a method for it we can allow sort to work with our objects. For example, dealing only with object-to-object comparison:

use overload '<=>' => \&compare;

sub compare {
    my ($operand1, $operand2) = @_;
    return $operand1->{'value'} <=> $operand2->{'value'};
}
With this method in place, we can now say (at least for objects of this class):

@result = sort { $a <=> $b } @objects;
and get a meaningful result. This is a more useful operator to overload than it might seem even from the above, because the overload module can deduce the correct results for all the other numeric comparison operations from it. Likewise, if we define an operator method for cmp, all the string comparison operations can be deduced from it. This deduction is called autogeneration by the overload module, and we cover it in more detail in a moment.
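Here is a complete sketch of a sortable class along these lines (the class name and value attribute are invented for the example):

```perl
#!/usr/bin/perl
# sortable.pl - sketch: an overloaded <=> makes objects sortable
use warnings;
use strict;

package Sortable;
use overload '<=>' => \&compare, fallback => 1;

sub new {
    my ($class, $value) = @_;
    return bless {value => $value}, $class;
}

sub compare {
    my ($operand1, $operand2) = @_;
    return $operand1->{value} <=> $operand2->{value};
}

package main;
my @objects = map { Sortable->new($_) } (3, 1, 2);
my @sorted  = sort { $a <=> $b } @objects;
print join(' ', map { $_->{value} } @sorted), "\n";   # 1 2 3
```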
Overloading Conversion Operations

Numeric and string conversion of objects are two more operations we may overload, even though we do not normally think of them as operators. Objects, being references at heart, do not have useful numeric or string values by default; by overloading these operations, we can create objects that can take part in numeric calculations or be printed out.
Overloading String Conversion String conversion happens whenever a Perl value is used in a string context, such as concatenation or as an argument to a print statement. Since objects ordinarily render themselves as a class name plus a memory address, which is generally less than meaningful, providing a string representation for them can be an extremely useful thing to do. String conversion is not an operator in the normal sense, so the string conversion operator is represented as a pair of double quotes. Here is how we can assign an operator method for string conversion: use overload '""' => \&render_to_string;
We now only have to create an operator method that produces a more sensible representation of the object. For example, this method returns the object's name, as defined by a name attribute, enclosed in double quotes:

sub render_to_string {
    my $self = shift;
    return '"' . $self->{'name'} . '"';
}
Alternatively, this more debugging-oriented renderer dumps out the contents of the hash being used to implement the object (for other representations we would need different code, of course). Note that we cannot interpolate $self directly here – that would invoke our own string conversion and recurse forever – so we use overload::StrVal to get the default class-plus-address representation:

sub render_to_string {
    my $self = shift;
    my $out = overload::StrVal($self) . ":\n";
    map { $out .= "\t$_ => $self->{$_} \n" } keys %{$self};
    return $out;
}
Finally, here is an object class that uses a converter to translate date values into different formats, depending on what format we require:

# DateString.pm
package DateString;
use strict;

# construct date object and values
sub new {
    my ($class, $time, $format) = @_;
    $class = ref $class || $class;
    $time = time() unless $time;
    my ($d, $m, $y) = (localtime($time))[3, 4, 5];
    my $self = bless {
        day    => $d,
        month  => $m+1,     # months 0-11 to 1-12
        year   => $y+1900,  # years since 1900 to full year
        format => $format || 'Universal'
    }, $class;
    return $self;
}

# only the format can be changed
sub format {
    my ($self, $format) = @_;
    if ($format) {
        $self->{'format'} = $format;
    }
    return $self->{'format'};
}

# the string conversion for this class
sub date_to_string {
    my $self = shift;
    my $format = shift || $self->{'format'};
    my $string;
    SWITCH: foreach ($format) {
        /^US/ and do {
            $string = sprintf "%02d/%02d/%4d",
                $self->{'month'}, $self->{'day'}, $self->{'year'};
            last;
        };
        /^GB/ and do {
            $string = sprintf "%02d/%02d/%4d",
                $self->{'day'}, $self->{'month'}, $self->{'year'};
            last;
        };
        # universal format
        $string = sprintf "%4d/%02d/%02d",
            $self->{'year'}, $self->{'month'}, $self->{'day'};
    }
    return $string . " ($format)";
}

# overload the string conversion operator to use the converter
use overload '""' => \&date_to_string;

1;
To show how this class operates, here is a short program that prints out the current date in Universal, GB, and US formats:

#!/usr/bin/perl
# datestring.pl
use warnings;
use strict;
use DateString;

my $date = new DateString(time);
print "$date \n";
$date->format('GB');
print "$date \n";
$date->format('US');
print "$date \n";
Here is what the script produces, assuming we run it on February 14th 2001:

2001/02/14 (Universal)
14/02/2001 (GB)
02/14/2001 (US)

We wrote these conversion methods in the style of conventional object methods, using shift and $self. This does not mean they do not receive four arguments like other overloaded operator methods, simply that we do not need or care about the extra ones; there is only one operand.
Overloading Numeric Conversion

Just as we can overload string conversion, so we can overload numeric conversion. Numeric conversion takes place whenever a data value is used in a numeric context (such as addition or multiplication), or as an argument to a function that takes a numeric argument. There are two numeric conversion operators, both of which are specially named since, like string conversion, they do not correspond to any actual operator. The standard numeric conversion is called 0+, since adding anything to zero is a common trick for forcing a value into a numeric context. Here is how we can assign an operator method for numeric conversion:

use overload '0+' => \&render_to_number;

sub render_to_number {
    my $self = shift;
    return 0 + $self->{'value'};
}
Similarly, here is a numeric converter for the date class we converted into a string earlier. The numeric value of a date is not obvious, so we will implement it as a 32-bit integer with two bytes for the year and one byte each for the month and day:

sub date_as_version {
    my $self = shift;
    my $year_h = int($self->{'year'} / 256);
    my $year_l = $self->{'year'} % 256;
    # pack the four bytes, then unpack them as a single 32-bit integer
    return unpack 'N', pack('C4', $year_h, $year_l, $self->{'month'}, $self->{'day'});
}
This conversion is not useful for display, but it will work just fine for numeric comparisons, so we can compare date objects using traditional operators like < and == rather than by overloading those operators to have a special meaning for dates:

die "Date is in the future" if $day > $today;   # date objects
The second numeric conversion handles translation into a Boolean value, which takes place in conditions – if, while, and until statements and the condition of the ?: operator. References are always True, since they always have values, so to test objects meaningfully we need to provide a Boolean conversion. The operator is called bool, and is used like this:

use overload 'bool' => \&true_or_false;
The meaning of truth or falsehood in an object is of course very much up to the object. The return value should be something Perl evaluates as True or False; for example, 1 for True and the empty string for False:

sub true_or_false {
    my $self = shift;
    return $self->{'value'} ? 1 : '';
}
Note that if we do not provide an overload method for Boolean conversion, the overload module will attempt to infer it from string conversion instead.
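Putting the pieces together, here is a small sketch (the Maybe class is invented for this example) that overloads both bool and string conversion, so that an 'empty' object tests false even though the underlying reference is true:

```perl
#!/usr/bin/perl
# booldemo.pl - sketch: overloading Boolean conversion
use warnings;
use strict;

package Maybe;
use overload
    'bool' => \&true_or_false,
    '""'   => sub { my $self = shift;
                    defined $self->{value} ? $self->{value} : '' };

sub new {
    my ($class, $value) = @_;
    return bless {value => $value}, $class;
}

sub true_or_false {
    my $self = shift;
    return $self->{value} ? 1 : '';
}

package main;
my $empty = Maybe->new(0);
my $full  = Maybe->new(42);
print "empty is false\n" unless $empty;   # a reference, yet it tests false
print "full is true\n" if $full;
```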
Falling Back to Unoverloaded Operations

Earlier in the chapter we introduced the ErrorClass object class, which encapsulated errno values inside an object. While we could compare the object class as a string or convert it into a number, we had to call methods in the object class to do so. It would be much nicer if we could have this dealt with for us. Here is an extension to the ErrorClass object class that does just that. For the sake of this example we will give it the filename ErrorClass/Overload.pm, though it actually adds overloaded methods to the base ErrorClass object class:

# ErrorClass/Overload.pm
package ErrorClass;
use strict;
use ErrorClass;
use overload (
    '""' => \&error_to_string,
    '0+' => \&error_to_number,
    fallback => 1,
);

sub error_to_string {
    my $class = (ref $_[0]) || $_[0];
    my $package = __PACKAGE__;
    $class =~ /$package\:\:(\w+)/ and return $1;
}

sub error_to_number {
    return shift->number;
}

1;
This extension package provides two new methods, error_to_string and error_to_number, which overload the string and numeric conversion operators. Because we do not want to actually change the names of the objects generated by the ErrorClass class, we add the methods directly to the ErrorClass package instead of subclassing as we would ordinarily do. As well as overloading the operators, we have set the fallback flag. This flag has a very useful function – it permits conversion methods to be called when no overloaded object method is available. Instead of complaining that there is no eq string comparison operation for ErrorClass objects, for example, Perl will 'fall back' to the ordinary string comparison, converting ErrorClass objects into strings in order to carry it out. Without this, we would be able to print out ErrorClass objects and convert them into integers with int, because these are direct uses of the conversion operators, but we would not be able to write a line like this: print "It's not there! \n" if $error eq 'ENOENT';
Without fallback, this operation would fail because the eq operator is not overloaded for ErrorClass objects. With fallback, the error is converted into a string and then eq compares it to the string ENOENT. In fact, this handles all the string comparison operators, including cmp. Similarly, the numeric conversion operation allows Perl to fall back to all the normal numeric comparison operators. Here is a modified version of the test script for the ErrorClass module that shows how we can use these new overloaded conversions to simplify our programming with ErrorClass objects:

#!/usr/bin/perl
# overload.pl
use warnings;
use strict;
use ErrorClass::Overload;

# generate an error
unless (open STDIN, "no.such.file") {
    my $error = new ErrorClass;
    print "Error object ", ref $error, "\n";
    print "\thas number ", $error->number, "\n";
    print "\thas message '", $error->message, "'\n";
    print "Text representation '", $error, "'\n";
    print "Numeric representation = ", int($error), "\n";
    print "It's not there! \n" if $error eq 'ENOENT';
}
Running this script should produce the output:

Error object ErrorClass::ENOENT
        has number 2
        has message 'No such file or directory'
Text representation 'ENOENT'
Numeric representation = 2
It's not there!
Overloading and Inheritance

Operator methods are automatically inherited by child classes, but the hard-reference syntax we have used so far does not allow us to specify that an inherited method should be used; it requires a reference to the actual method (be it in our own class or a parent). However, we can have Perl search our parent classes for the implementation of an overloaded operator by supplying the name of the method rather than a reference to it. Here is how we would overload the <=> operator with an inherited method:

use overload '<=>' => "compare";
The method name is essentially a symbolic reference, which is looked up using the -> operator internally. This may or may not be a good thing; it allows inheritance, but a hard reference has the advantage of producing a compile-time error if the method does not exist. Inheriting methods for operators rather than inheriting the overloaded operator assignments can also have a serious effect on performance if we are not careful. If autogeneration (below) is enabled, inheritance can cause Perl a lot of needless extra work as it tries to construct new methods from existing ones. Ordinarily, the search for methods is quick since we only need to see what references exist in the list of overloaded operators. But with inheritance, a search through the hierarchy for each method not implemented in the class is performed – for large inheritance trees that can cause a lot of work.
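The following sketch (class names invented) shows why the symbolic form matters: the subclass supplies its own compare, and the inherited overloaded <=> picks it up:

```perl
#!/usr/bin/perl
# inheritcompare.pl - sketch: overloading by method name allows overriding
use warnings;
use strict;

package Base;
use overload '<=>' => 'compare';   # method name, resolved at call time

sub new {
    my ($class, $value) = @_;
    return bless {value => $value}, $class;
}

sub compare { $_[0]{value} <=> $_[1]{value} }

package Derived;
our @ISA = ('Base');

# override: compare in reverse order, so sorting yields descending values
sub compare { $_[1]{value} <=> $_[0]{value} }

package main;
my @sorted = sort { $a <=> $b } map { Derived->new($_) } (2, 1, 3);
print join(' ', map { $_->{value} } @sorted), "\n";   # 3 2 1
```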
Autogenerated Operations
When we started out with the overload module we provided methods to handle both the + and += operators. However, we actually only needed the + operation overloaded, because the overload module is smart enough that it can infer operations that have not been overloaded from those that have been.
For instance, the correct behavior for the operation += can be inferred from the + operator, since $a += $b is just shorthand for $a = $a + $b, and Perl knows how to do that for our object class. Similarly, if Perl knows how to do subtraction, it also knows how to do a unary minus (since that is just 0-$a) and -=. The ++ and -- operators can likewise be inferred from + and -; they are just $a = $a + 1 and $a = $a - 1. The overload module can even infer abs from '-'. That said, we could often supply more efficient versions of these operators, but autogeneration means that we do not have to. Unfortunately, just because an operator can be deduced does not mean that it should be. For instance, we may have a date class that allows us to compute differences between dates expressed as strings, as in:
A binary subtraction makes sense in this context, but a unary minus does not. There is no such thing as a negative date. In order to distinguish it from the binary subtraction operator, the unary minus is called neg when used in the list supplied to the overload module. So, one way we can implement subtraction but forbid negation is: use overload '-' => \&subtract, neg => sub {croak "Can't negate that!"};
This is awkward if we need to prevent several autogenerated operators, not least because we have to work out what they are to disable them. Instead, to prevent the overload module autogenerating operator methods we can specify the fallback flag. The default value is undefined, which allows autogeneration to take place, but dies if no matching operator method could be found or autogenerated. The fallback flag is intimately connected to autogeneration; we only saw it in its most basic use before. If set to 1, Perl reverts to the standard operation rather than fail with an error, which is very useful for object classes that define only conversion operations like the ErrorClass::Overload extension we showed earlier. If fallback is set to 0, however, autogeneration is disabled completely but errors are still enabled (unfortunately there is no way to disable both errors and autogeneration at the same time). A better way to disable negation is like this: use overload '-' => \&subtract, fallback => 0;
We also have the option to supply a default method (for when no other method will suit), by specifying an operator method for the special nomethod keyword. This operates a little like an AUTOLOAD method does for regular methods, and is always called if a method can neither be found nor autogenerated. The fourth argument comes in very handy here, as this nomethod operator method illustrates:

sub no_operator_found {
    my $result;
    # deal with some operators here
    SWITCH: foreach ($_[3]) {
        /^<=>$/ and do {$result = 0; last}; # always numerically equal
        /^cmp$/ and do {$result = 0; last}; # always lexically equal
        # insert additional operations here
        # croak if the operator is not one we handle
        croak "Cannot $_[3] this";
    }
    return $result;
}

use overload
    '-' => \&subtract,
    '+' => \&add,
    nomethod => \&no_operator_found;
We can also use this with the fallback flag, in which case autogeneration is disabled. Only explicit addition and subtraction (plus numeric and string comparisons through the nomethod operation) are enabled:

use overload
    '-' => \&subtract,
    '+' => \&add,
    fallback => 0,
    nomethod => \&no_operator_found;
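Returning to autogeneration itself, it is easy to demonstrate: with only '+' overloaded in the invented class below, += still works because the overload module rewrites it as an addition plus an assignment:

```perl
#!/usr/bin/perl
# autogen.pl - sketch: '+=' autogenerated from an overloaded '+'
use warnings;
use strict;

package Counter;
use overload '+' => \&add;

sub new {
    my ($class, $value) = @_;
    return bless {value => $value}, $class;
}

sub add {
    my ($operand1, $operand2) = @_;
    my $other = ref $operand2 ? $operand2->{value} : $operand2;
    return Counter->new($operand1->{value} + $other);
}

package main;
my $counter = Counter->new(10);
$counter += 5;   # no '+=' method defined: Perl does $counter = $counter + 5
print $counter->{value}, "\n";   # 15
```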
Overloadable Operations

Not every operator provided by Perl can be overloaded. Conversely, some of the things we might normally consider functions can be overloaded. The table below summarizes the different categories of operators understood by the overload module and the operators within each category that it handles:

Operator Category    Operators
with_assign          + - * / % ** << >> x .
assign               += -= *= /= %= **= <<= >>= x= .=
num_comparison       < <= > >= == !=
3way_comparison      <=> cmp
str_comparison       lt le gt ge eq ne
binary               & | ^
unary                neg ! ~
mutators             ++ --
func                 atan2 cos sin exp abs log sqrt
conversion           bool "" 0+
For more details on these categories and the overload pragma, see perldoc overload.
Automating Object Class Development

There are several helper modules available for Perl which automate much of the process of creating new object classes. Two of the more useful third-party modules on CPAN are Class::MethodMaker, which can automatically generate object properties and methods, and Class::Multimethods, which gives Perl the ability to select among multiple method variants based on the type and number of the arguments passed (a feature present in several other object-oriented languages but absent from Perl itself). Perl does come with the Class::Struct module, however. This provides much more limited but considerably simpler object class generation features than modules like Class::MethodMaker. Several object classes provided by Perl are based on it, including all the hostent, netent, protoent, and servent modules, as well as stat and several others. All of these modules make good working examples of how Class::Struct is used. The module provides one subroutine, struct, whose name is taken from C's struct declaration, after which the module is patterned. To use it we supply a list of attributes and their data types, indicated by the appropriate Perl prefix: $ for scalars, @ for arrays, and % for hashes. In return it defines a constructor (new) and a complete set of accessor/mutator methods for each attribute we request. For example, this is how we can create a constructor and six accessor/mutator methods for an address class in one statement:

# Address.pm
package Address;
use strict;
use Class::Struct;
struct (
    name     => '$',
    address  => '@',
    postcode => '$',
    city     => '$',
    state    => '$',
    country  => '$',
);

1;
This object class creates objects with five scalar attributes and one array attribute whose value is stored as an array reference. When we use this module, struct is called and the class is fully defined. The constructor new, generated on the fly by struct, accepts initialization of the object with named arguments, so we can create and then print out the fields of an object created by this class using a script like the following:

#!/usr/bin/perl
# address.pl
use warnings;
use strict;
use Address;

my $address = new Address(
    name    => 'Me Myself',
    address => ['My House', '123 My Street'],
    city    => 'My Town',
);

print $address->name, " lives at: \n",
    "\t", join("\n\t", @{$address->address}), "\n",
    "in the city of ", $address->city, "\n";
This produces the output:

Me Myself lives at:
        My House
        123 My Street
in the city of My Town

Getting and setting attributes on these automatically generated objects is fairly self-evident. Each accessor/mutator follows the pattern of accessor/mutator methods we have seen before. Scalars are retrieved by passing no arguments, and set by passing one:

$name = $address->name;     # get an attribute
$address->name($name);      # set an attribute
Array and hash attributes return a reference to the whole array or hash if no arguments are passed; they return the value specified by the passed index or hash key if one argument is passed, and set the value specified if two arguments are passed:

$arrayref = $address->address;
$first_line = $address->address(0);
$address->address(0, $first_line);
$hashref = $address->hashattr;
$value = $address->hashattr('key');
$address->hashattr('key', $value);
The underlying object representation of this class is an array. If we want to be able to inherit from the class reliably we are better off using a hash, which we can do by passing struct the name of the class we want to create methods for, followed by the attributes we want to handle in a hash or array reference. If we pass an array reference, the class is based on an array; if we pass a hash reference, it is based on a hash, like so:

# AddressHash.pm
package Address;
use strict;
use Class::Struct;

struct Address => {
    name     => '$',
    address  => '@',
    postcode => '$',
    city     => '$',
    state    => '$',
    country  => '$',
};

1;
We can create and then print out the fields of the object created by this version of the class using the same script, producing the same output as before. In fact, we do not even need the package Address declaration at the top of this module, since struct creates subroutines in the package passed to it as its first argument anyway. We can also use objects as attributes by the simple expedient of naming an object class instead of a Perl data type prefix. Here is a modified Address class that replaces the address array attribute with a subclass called Address::Lines with explicit house and street attributes:

# AddressNest.pm
package Address;
use strict;
use Class::Struct;

struct 'Address::Lines' => [
    house  => '$',
    street => '$',
];

struct (
    name     => '$',
    address  => 'Address::Lines',
    postcode => '$',
    city     => '$',
    state    => '$',
    country  => '$',
);

1;
Since we can chain together object calls if the result of one method is another object, we can modify the test script to be:

#!/usr/bin/perl
# addressnest.pl
use warnings;
use strict;
use AddressNest;

my $address = new Address(
    name => 'Me Myself',
    city => 'My Town',
);
$address->address->house('My House');
$address->address->street('123 My Street');

print $address->name, " lives at: \n",
    "\t", $address->address->house, "\n",
    "\t", $address->address->street, "\n",
    "in the city of ", $address->city, "\n";
We can, if we choose, optionally prefix the attribute type with an asterisk, to make *$, *@, *%, or *Object::Class. When present, this causes the accessor for the attribute to return references rather than values. For example, if we used name => '*$' in the arguments to struct, we could do this:

$scalar_ref = $address->name;
$$scalar_ref = "My New Name";
$new_scalar = "A Different Name Again";
$address->name(\$new_scalar);
The same referential treatment is given to array, hash, and object attributes, for example with address => '*@':

$first_element = $address->address(0);
$$first_element = "My New House";
If we want to provide more precise control over attributes we can do so by redefining the accessor/mutator methods with explicit subroutines. Be aware, however, that Perl will warn about redefined subroutines if warnings are enabled. If we need to override a lot of methods, the benefits of using Class::Struct begin to weaken, and we are probably better off implementing the object class from scratch. As a final example, if we want to create a class that contains attributes of one type only (most probably scalars), we can create very short class modules, as this rather terse but still fully functional example illustrates:

# AddressMap.pm
use strict;
use Class::Struct;

struct Address => {
    map { $_ => '$' } qw(name house street city state country postcode)
};

1;
Object-oriented Perl
For completeness, here is the test script for this last class; note that it is very similar to our first example, which was mostly made up of scalars. Again it produces the same output as all the others:

#!/usr/bin/perl
# addressmap.pl
use warnings;
use strict;
use AddressMap;

my $address = new Address(
    name   => 'Me Myself',
    house  => 'My House',
    street => '123 My Street',
    city   => 'My Town',
);

print $address->name, " lives at:\n",
      "\t", $address->house, "\n",
      "\t", $address->street, "\n",
      "in the city of ", $address->city, "\n";
Ties and Tied Objects

One of the more intriguing parts of Perl's support for object-oriented programming is the tied object. Tied objects are somewhat at odds with the normal applications of object orientation. In most object-oriented tasks we take a functional, non-object-oriented problem and rephrase it in object-oriented terms. Ties go the other way, taking an object class and hiding it behind a simple non-object-oriented variable.

Tied objects allow us to replace the functionality of a standard data type with an object class that secretly handles the actual manipulations, so that access to the variable is automatically and transparently converted into method calls on the underlying object. The object can then deal with the operation as it sees fit. Perl allows us to tie any standard data type, including scalars, arrays, hashes, and filehandles. In each case the operations that the underlying object needs to support vary.

There are many possible uses for tie, from hashes that only allow certain keys to be stored (as used by the fields module), to the DBM family of modules, which use tied hash variables to represent DBM databases. The Perl standard library provides several tied classes, and there are many, many tied class implementations available from CPAN, so before implementing a tied class it is worth checking to see if it has already been done.

Tied object classes are a powerful way to use objects to provide enhanced features in places that are otherwise hard to reach or difficult to implement. The tie makes the object appear as a conventional variable. This allows us to replace ordinary variables with 'smart' ones, so any Perl code that works with the original data type will also work with the tied object, oblivious to the fact that we have replaced it with something else of our own design.
Using Tied Objects

Variables are bound to an underlying object with the tie function. The first argument to tie is the variable to be bound, and the second is the name of the object class which will provide the functionality of the tied variable. Further arguments are passed to the constructor that is, in turn, used to create the object used to implement this particular variable; which constructor is called is, of course, determined by the type of variable being tied. For example, to tie a scalar variable to a package called My::Scalar::Tie we could put:

my $scalar;
tie $scalar, 'My::Scalar::Tie', 'initial value';
The initial value argument is passed to the constructor; in this case it is expected to be used to set up the scalar in some way, but that is up to the constructor. Similarly, to tie a hash variable, this time to a DBM module:

tie %dbm, 'GDBM_File', $filename, $flags, $perm;

The GDBM_File module implements access to DBM files created with the gdbm libraries. It needs to know the name of the file to open, plus, optionally, open flags and permissions (in exactly the same way that the sysopen function does). Given this, the module creates a new database object that contains an open filehandle for the actual database handle as class data. If we tie a second DBM database to a second variable, a second object instance is created, containing a filehandle for the second database. Of course, we only see the hash variable; all the work of reading and writing the database is handled for us by the DBM module (in this case GDBM_File). The return value from tie is an object on success, or undef on failure:

$object = tie $scalar, 'My::Scalar::Tie', 'initial value';

We can store this object for later use, or alternatively (and more conveniently) we can get it from the tied variable using tied:

$object = tied $scalar;
Handling Errors from 'tie'

Many, though not all, tied object modules produce a fatal error if they cannot successfully carry out the tie (the database file does not exist, or we did not have permission to open it, for example). To catch this we therefore need to use eval. For example, this subroutine returns a reference to a tied hash on success, or undef on failure:

sub open_dbm {
    my $filename = shift;
    my %dbm;
    eval {tie %dbm, 'GDBM_File', $filename};
    if ($@) {
        print STDERR "Dang! Couldn't open $filename: $@";
        return undef;
    }
    return \%dbm;
}
This is not a property of tie, but rather a general programming point for handling any object constructor that can emit a fatal error, but it's worth mentioning here because it is easy to overlook.
Accessing Nested References

The special properties of a tied variable apply only to that variable; access to the internal values of a tied hash or array is triggered by our use of the tied variable itself. If we extract a reference from a tied hash or array, the returned value is likely to be a simple untied reference, and although attempts to manipulate it will work, we will really be handling a local copy of the data that the tied hash represents and not the data itself. The upshot of this is that we must always access elements through the tied variable, and not via a reference. In other words, the following is fine:

$tied{'key'}{'subkey'}{'subsubkey1'} = "value1";
$tied{'key'}{'subkey'}{'subsubkey2'} = "value2";
$tied{'key'}{'subkey'}{'subsubkey3'} = "value3";
But this is probably not:

$subhash = $tied{'key'}{'subkey'};
$subhash->{'subsubkey1'} = "value1";
$subhash->{'subsubkey2'} = "value2";
$subhash->{'subsubkey3'} = "value3";
Although this appears to work, the extraction of the hash reference $subhash actually caused the tied variable to generate a local copy of the data that the hash represents. Our assignments to it therefore update the local copy, but do not have any effect whatsoever on the actual data that the tied hash controls access to. Having said that we cannot use internal references, this is not absolutely the case. It is perfectly possible for the tied object class to return a newly tied hash that accesses the nested data we requested, in which case we can use the subreference with impunity, just like the main tied variable. However, this is a lot of effort to go to, and most tied object classes do not go to the lengths necessary to implement it.
Testing a Variable for 'tied'ness

The tied function returns the underlying object used to implement the features of the tied variable, or undef otherwise. The most frequent use for this is to test whether a tie succeeded. For example, here is another way to trap and test for a failed tie:

eval {tie %hash, 'My::Tied::Hash'};
handle_error($@) unless tied %hash;
We can also call methods on the underlying object class through tied. For example, wrapping tied and an object method call into one statement:

(tied %hash)->object_method(@args);
Just because we can call an underlying method does not mean we should, though. If the tied object class documentation provides additional support methods (and most tied classes of any complexity do), calling these is fine. But calling the methods that implement the tie functionality itself is a bad idea. The whole point of the tie is to abstract these methods behind ordinary accesses to the variable; sidestepping this is therefore breaking the interface design.
'Untie'ing Objects

Tied objects consume system resources just like any other object, or indeed variable. When we are finished with a tied object we should dispose of it so that the object's destructor method, if any, can be called. In the case of the DBM modules this flushes any remaining output and closes the database filehandle. In many instances, the only reference to the underlying object is the tie to the variable, so by undoing the tie between the variable and the object we cause Perl to invoke the garbage collector on the now unreferenced object. Appropriately enough, the function to do this is untie:

untie %hash;
untie always succeeds, unless the destructor emits a fatal error, in which case we need to trap it with eval. The fact it never returns a 'failed' result makes writing a test very easy:

handle_error($@) unless eval {untie %hash};

If we call untie on a variable that is not tied, nothing happens, but untie still succeeds. However, if we want to check explicitly we can do so, for example:

untie %hash if tied %hash;
Writing Tied Objects

Tied object classes work by providing methods with predefined names for each operation that Perl requires for the data type of the variable being tied. For instance, for tied scalars, an object class needs to define a constructor method called TIESCALAR, plus the following additional methods:

Method    Description
FETCH     An accessor method that returns a scalar value.
STORE     A mutator method that stores a passed scalar value.
DESTROY   Optionally, a destructor for the object.
Scalars are the simplest class to tie because we can essentially only do two things to a scalar – read it, which is handled by the FETCH method, and write it, which is handled by the STORE method. We can also tie arrays, hashes, and filehandles if we define the appropriate methods. Some methods are always mandatory; we always need a constructor for the class, with the correct name for the data type being tied. We also always need a FETCH method, to read from the object. Others are required only in certain circumstances. For instance, we do not need a STORE method if we want to create a read-only variable, but it is better to define one and place a croak in it, rather than have Perl stumble and emit a less helpful warning when an attempt to write to the object is made.
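The read-only case mentioned above can be sketched as follows. This is a hypothetical class of our own invention (the class and script names are made up for illustration); STORE does nothing but croak, so any attempt to assign to the tied variable becomes a trappable fatal error:

```perl
#!/usr/bin/perl
# roscalar.pl - a sketch of a read-only tied scalar
use warnings;
use strict;

package ReadOnly::Scalar;
use Carp;

sub TIESCALAR {
    my ($class, $value) = @_;
    # hold the fixed value in a blessed scalar reference
    return bless \$value, $class;
}

sub FETCH { ${$_[0]} }

# writing is forbidden, so produce a helpful fatal error
sub STORE { croak "Attempt to modify read-only value" }

package main;

tie my $constant, 'ReadOnly::Scalar', 42;
print "value is $constant\n";    # prints "value is 42"
eval { $constant = 43; 1 } or print "caught: $@";
```

Because STORE croaks rather than being left undefined, the error message names the real problem instead of leaving Perl to complain about a missing method.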
Standard Tie Object Classes

Creating the supporting methods for every required operation can be tedious, particularly for arrays, which require thirteen methods in a fully implemented object class. Fortunately, Perl removes a lot of the grunt-work of creating tied object classes, by providing template object classes in the standard library that contain default methods which we can inherit from, or override with our own implementations. The default methods mostly just croak when they are not implemented, but each module also provides a minimal, but functional, Std object class that implements each data type as an object of the same type blessed and returned as a reference.

Module        Data Type
Tie::Scalar   Tied scalars.
Tie::Array    Tied arrays.
Tie::Hash     Tied hashes.
Tie::Handle   Tied handles (file, directory, socket, ...).

Module           Data Type
Tie::StdScalar   Minimal tied scalar class.
Tie::StdArray    Minimal tied array class.
Tie::StdHash     Minimal tied hash class.
Tie::StdHandle   Minimal tied handle class.

In addition, we can make use of two enhanced tied hash classes:

Module            Description
Tie::RefHash      Tied hashes that allow references to be used as the hash keys, overcoming the usual restriction of hash keys to static string values.
Tie::SubstrHash   Tied hashes that permit only a fixed maximum length for both keys and values, and, in addition to this, limit the total number of entries allowed in the hash.
Tied Object Methods

The methods we need to define for each data type vary, but loosely fall into the categories of constructor/destructor, accessor/mutator, function implementations, and specialized operations that apply to specific data types. All tied object classes need a constructor, and may implement a destructor. All types except filehandles may also (and usually should) define an accessor and mutator. Arrays, hashes, and filehandles also need to define methods for Perl's built-in functions that operate on those data types, for instance pop, shift, and splice for arrays, delete and exists for hashes, and print, readline, and close for filehandles.
We are free to define our own methods in addition to the ones required by the tie, and can call them via the underlying object if we desire. We can also create additional class methods; the only one required (for all classes) is the constructor. The standard library modules define default versions of all these methods, but we may choose to implement a tied object class without them, or even want to override them with our own methods. The following summaries list the methods required by each data type, along with a typical use of the tied variable that will trigger the method, and some brief explanations:
Methods and Uses for Scalars

Here are the creator/destructor methods for scalars:

Method                  Use
TIESCALAR class, list   tie $scalar, Class::Name, @args;
DESTROY self            undef $scalar;

The accessor/mutator methods for scalars are as follows:

Method              Use
FETCH self          $value = $scalar;
STORE self, value   $scalar = $value;
Scalars are the simplest tied object class to implement. The constructor may take any arguments it likes, and the only other method that needs to handle an argument is STORE. Note that the constructor is the only method we call directly, so we cannot pass extra arguments to the other methods even if we wanted to; all information required must be conveyed by the object.
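To show these methods working together, here is a hypothetical tied scalar of our own design (the class name History::Scalar is invented for this sketch) that remembers every value previously assigned to it, conveying all its state through the object, as just described:

```perl
#!/usr/bin/perl
# historyscalar.pl - a sketch of a tied scalar that records past values
use warnings;
use strict;

package History::Scalar;

sub TIESCALAR {
    my ($class, $initial) = @_;
    # keep the current value plus a list of superseded values
    return bless {value => $initial, history => []}, $class;
}

sub FETCH {
    my $self = shift;
    return $self->{value};
}

sub STORE {
    my ($self, $value) = @_;
    push @{$self->{history}}, $self->{value};
    $self->{value} = $value;
}

package main;

tie my $scalar, 'History::Scalar', 'first';
$scalar = 'second';
$scalar = 'third';

print "current: $scalar\n";                           # prints "current: third"
print "history: @{ (tied $scalar)->{history} }\n";    # prints "history: first second"
```

Note how tied is used at the end to reach past the variable to the object and retrieve the history; this is exactly the technique described earlier for calling support methods on the underlying object.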
Methods and Uses for Arrays

Arrays require the same methods as scalars, but take an additional index argument for both FETCH and STORE. These are the creator/destructor methods for arrays:

Method                 Use
TIEARRAY class, list   tie @array, Class::Name, @list;
DESTROY self           undef @array;

The accessor/mutator methods for arrays are:

Method                     Use
FETCH self, index          $value = $array[$index];
STORE self, index, value   $array[$index] = $value;
We also need implementations for the push, pop, shift, unshift, and splice functions if we want to be able to use these functions on our tied arrays. The function implementation methods are as follows:

Method                              Use
PUSH self, list                     push @array, @list;
POP self                            $value = pop @array;
SHIFT self                          $value = shift @array;
UNSHIFT self, list                   unshift @array, @list;
SPLICE self, offset, length, list   splice @array, $offset, $length, @list;

Finally, we need methods to handle the extension or truncation of the array. Unique to arrays are the EXTEND, FETCHSIZE, and STORESIZE methods, which implement implicit and explicit alterations to the extent of the array; real arrays do this through $#array, as covered in Chapter 5. The extension/truncation methods for arrays are:

Method                 Use
CLEAR self             @array = ();
EXTEND self, size      $array[$size] = $value;
FETCHSIZE self         $size = $#array;
STORESIZE self, size   $#array = $size;
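The methods above can be exercised with a small hypothetical class of our own (Bounded::Array is an invented name) that stores its data in a real array inside the object and refuses to grow beyond a fixed number of elements:

```perl
#!/usr/bin/perl
# boundedarray.pl - a sketch of a size-limited tied array
use warnings;
use strict;

package Bounded::Array;
use Carp;

sub TIEARRAY {
    my ($class, $limit) = @_;
    return bless {limit => $limit, data => []}, $class;
}

sub FETCH     { $_[0]->{data}[$_[1]] }
sub FETCHSIZE { scalar @{$_[0]->{data}} }

sub STORE {
    my ($self, $index, $value) = @_;
    croak "Index $index beyond limit" if $index >= $self->{limit};
    $self->{data}[$index] = $value;
}

sub STORESIZE {
    my ($self, $size) = @_;
    croak "Array limited to $self->{limit} elements" if $size > $self->{limit};
    $#{$self->{data}} = $size - 1;
}

sub EXTEND { $_[0]->STORESIZE($_[1]) }
sub CLEAR  { @{$_[0]->{data}} = () }

sub PUSH {
    my $self = shift;
    # append each new element via STORE so the limit check applies
    $self->STORE($self->FETCHSIZE, $_) foreach @_;
}

package main;

tie my @array, 'Bounded::Array', 3;
push @array, 'one', 'two', 'three';
print "@array\n";    # prints "one two three"
eval { push @array, 'four'; 1 } or print "refused: $@";
```

Interpolating "@array" triggers FETCHSIZE and then FETCH for each index, while push is routed to our PUSH method, so the fourth element is rejected by the limit check in STORE.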
Methods and Uses for Hashes

Hashes are more complex than arrays, but ironically are easier to implement. These are the creator/destructor methods for hashes:

Method                Use
TIEHASH class, list   tie %hash, Class::Name, @list;
DESTROY self          undef %hash;

Like arrays, the FETCH and STORE methods take an additional argument, this time of a key name. Here are the accessor/mutator methods for hashes:

Method                   Use
FETCH self, key          $value = $hash{$key};
STORE self, key, value   $hash{$key} = $value;
The delete and exists functions are implemented through the DELETE and EXISTS methods. In addition, CLEAR is used for deleting all the elements from a hash. The defined function is implemented in terms of FETCH, so there is no DEFINED method. This is also why there is no EXISTS or DEFINED method for arrays or scalars, since both operations are equivalent to defined for anything except hashes. Here are the function implementation methods for hashes:

Method             Use
CLEAR self         %hash = ();
DELETE self, key   $done = delete $hash{$key};
EXISTS self, key   $exists = exists $hash{$key};

The FIRSTKEY and NEXTKEY methods need a little more explanation. Both methods are needed for the each keyword, which retrieves each key-value pair from a hash in turn. To do this, the class must define some form of iterator that stores the current position, so that the next key produces a meaningful result. This iterator should be reset when FIRSTKEY is called, and advanced when NEXTKEY is called. Both methods should return the appropriate key (Perl retrieves the corresponding value through FETCH). Finally, when there are no more keys to return, NEXTKEY should return undef. Here are the each iteration methods:

Method                  Use
FIRSTKEY self           ($key, $value) = each %hash;   # first time
NEXTKEY self, lastkey   ($key, $value) = each %hash;   # second and subsequent times
Methods and Uses for Filehandles

Filehandles are unique among tied variables because they define neither an accessor nor a mutator method. Instead, the constructor makes available some kind of resource, typically with open, pipe, socket, or some other method that generates a real filehandle. The creator/destructor methods for filehandles are the following:

Method                  Use
TIEHANDLE class, list   tie $fh, Class::Name, @args;
DESTROY self            undef $fh;

Whatever the creator method does, the CLOSE method needs to be defined to undo whatever it creates. The only other methods required are for the standard filehandle functions: print, printf, and syswrite for output and read, readline, and getc for input. If we are implementing a read-only or write-only filehandle we need only define the appropriate methods, of course.
The function implementation methods for filehandles are:

Method                               Use
READ self, scalar, length, offset    read $fh, $in, $size, $from
READLINE self                        $line = <$fh>
GETC self                            $char = getc $fh
WRITE self, scalar, length, offset   syswrite $fh, $out, $size, $from
PRINT self, list                     print $fh @args
PRINTF self, format, list            printf $fh $format, @values
CLOSE self                           close $fh
DESTROY self                         undef $fh
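As a sketch of these methods in action, here is a hypothetical tied filehandle class of our own (Stamped::Handle is an invented name) that decorates output by prefixing every print with a timestamp before passing it on to a real filehandle supplied to the constructor:

```perl
#!/usr/bin/perl
# stampedhandle.pl - a sketch of a tied filehandle that timestamps output
use warnings;
use strict;

package Stamped::Handle;

sub TIEHANDLE {
    my ($class, $real_fh) = @_;
    # remember the real filehandle we are decorating
    return bless {fh => $real_fh}, $class;
}

sub PRINT {
    my $self = shift;
    my $stamp = scalar localtime;
    print {$self->{fh}} "[$stamp] ", @_;
}

sub PRINTF {
    my ($self, $format) = (shift, shift);
    # implement printf in terms of PRINT plus sprintf
    $self->PRINT(sprintf $format, @_);
}

sub CLOSE { close $_[0]->{fh} }

package main;

tie *STAMPED, 'Stamped::Handle', \*STDOUT;
print STAMPED "tied filehandles at work\n";
printf STAMPED "%s falls through to %s\n", 'PRINTF', 'PRINT';
```

Each line printed through STAMPED appears on standard output with a leading timestamp; the tie intercepts print and printf and routes them to our PRINT and PRINTF methods.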
An Example Tied Hash Class

The tied hash class below demonstrates a basic use of the tied functionality provided by Perl by creating hashes that can have read, write, or delete access enabled or disabled. It defines methods for all the possible access types for a hash, and so does not need to use Tie::Hash. Despite this, it is very simple to understand. The design follows a common theme for a lot of tied classes, where the actual data is stored as an element of a hash that represents the object, with other elements holding flags or values that configure how the real data is handled. This is a general template that works well for all manner of 'smart' scalars, arrays, hashes, and filehandles.

# Permission/Hash.pm
package Permission::Hash;

use strict;
use Carp;

sub TIEHASH {
    my ($class, %cfg) = @_;
    my $self = bless {}, $class;
    $self->{'value'} = {};
    foreach ('read', 'write', 'delete') {
        $self->{$_} = (defined $cfg{$_}) ? $cfg{$_} : 1;
    }
    return $self;
}

sub FETCH {
    my ($self, $key) = @_;
    croak "Cannot read key '$key'" unless $self->{'read'};
    return $self->{'value'}{$key};
}

sub STORE {
    my ($self, $key, $value) = @_;
    croak "Cannot write key '$key'" unless $self->{'write'};
    $self->{'value'}{$key} = $value;
}

sub EXISTS {
    my ($self, $key) = @_;
    croak "Cannot read key '$key'" unless $self->{'read'};
    return exists $self->{'value'}{$key};
}

sub CLEAR {
    my $self = shift;
    croak "Cannot delete hash" unless $self->{'delete'};
    $self->{'value'} = {};
}

sub DELETE {
    my ($self, $key) = @_;
    croak "Cannot delete key '$key'" unless $self->{'delete'};
    return delete $self->{'value'}{$key};
}

sub FIRSTKEY {
    my $self = shift;
    my $dummy = keys %{$self->{'value'}};    # reset iterator
    return $self->NEXTKEY;
}

sub NEXTKEY {
    return each %{shift->{'value'}};
}

1;
Because we are creating a tied hash that controls access to a real hash, most of the methods are very simple. We are relaying the operation requested on the tied variable to the real variable inside. This class could be improved a lot, notably by adding proper accessor/mutator methods for the three flags. We could also add other permission types. A more interesting and complex example would be to set the flags on each key, rather than for the hash as a whole. Here is a short script that puts the Permission::Hash class through its paces:

#!/usr/bin/perl
# permhash.pl
use warnings;
use strict;
use Permission::Hash;

my %hash;
tie %hash, 'Permission::Hash', read => 1, write => 1, delete => 0;

$hash{'one'} = 1;
$hash{'two'} = 2;
$hash{'three'} = 3;

print "Try to delete a key... \n";
unless (eval {delete $hash{'three'}; 1}) {
    print $@;
    print "Let's try again... \n";
    (tied %hash)->{'delete'} = 1;
    delete $hash{'three'};
    print "It worked! \n";
    (tied %hash)->{'delete'} = 0;
}

print "Disable writing... \n";
(tied %hash)->{'write'} = 0;
unless (eval {$hash{'four'} = 4; 1}) {
    print $@;
}
(tied %hash)->{'write'} = 1;

print "Disable reading... \n";
(tied %hash)->{'read'} = 0;
unless (eval {defined $hash{'one'}; 1}) {
    print $@;
}
(tied %hash)->{'read'} = 1;
When run this script should produce output resembling the following:

Try to delete a key...
Cannot delete key 'three' at permhash.pl line 12
Let's try again...
It worked!
Disable writing...
Cannot write key 'four' at permhash.pl line 23
Disable reading...
Cannot read key 'one' at permhash.pl line 30

Many other variants on this design can be easily implemented. In the case of hashes alone, we can easily create case-insensitive hashes (apply lc to all passed keys), accumulative hashes (make each value an array and append new values to the end in STORE), or restrict the number of keys (count the keys and check whether the key already exists in STORE, before creating a new one), or the names of the keys that can be assigned (pass a list of acceptable keys to the constructor, then check that any key passed to STORE is in that list).
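For instance, the first of these variants, a case-insensitive hash, can be sketched by folding the key with lc in every method that receives one. This is a hypothetical class of our own (Fold::Hash is an invented name):

```perl
#!/usr/bin/perl
# foldhash.pl - a sketch of a case-insensitive tied hash
use warnings;
use strict;

package Fold::Hash;

# every key-taking method lowercases its key before touching the real hash
sub TIEHASH  { bless {}, shift }
sub FETCH    { $_[0]->{lc $_[1]} }
sub STORE    { $_[0]->{lc $_[1]} = $_[2] }
sub EXISTS   { exists $_[0]->{lc $_[1]} }
sub DELETE   { delete $_[0]->{lc $_[1]} }
sub CLEAR    { %{$_[0]} = () }
sub FIRSTKEY { my $reset = keys %{$_[0]}; each %{$_[0]} }
sub NEXTKEY  { each %{$_[0]} }

package main;

tie my %hash, 'Fold::Hash';
$hash{'Mixed Case'} = 'matches';
print "$hash{'MIXED CASE'}\n";    # prints "matches"
print exists $hash{'mixed case'} ? "exists\n" : "missing\n";    # prints "exists"
```

Since keys are folded on the way in and on the way out, any capitalization of the same key reaches the same entry in the underlying hash.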
An Example Class Using 'Tie::StdHash'

The standard Tie modules provide inheritable methods for all the required actions needed by each type of tied variable, but for the most part, these simply produce more informative error messages for methods we do not implement ourselves. However, each package comes with a Std object class that implements a minimal, but functional, tied object of the same class. Here is the implementation of Tie::StdHash defined inside the Tie::Hash module:

package Tie::StdHash;
@ISA = qw(Tie::Hash);

sub TIEHASH  {bless {}, $_[0]}
sub STORE    {$_[0]->{$_[1]} = $_[2]}
sub FETCH    {$_[0]->{$_[1]}}
sub FIRSTKEY {my $a = scalar keys %{$_[0]}; each %{$_[0]}}
sub NEXTKEY  {each %{$_[0]}}
sub EXISTS   {exists $_[0]->{$_[1]}}
sub DELETE   {delete $_[0]->{$_[1]}}
sub CLEAR    {%{$_[0]} = ()}

1;
We can use this object class to implement our own tied hash classes, as long as we are willing to accept the implementation of the object as a directly tied hash. As an example, here is a hash class that will limit either the total number of hash keys allowed, or restrict keys to one of a specific list provided when the object is initialized:

# Limit/Hash.pm
package Limit::Hash;

use strict;
use Carp;
use Tie::Hash;

our @ISA = qw(Tie::StdHash);

sub TIEHASH {
    my ($class, @keys) = @_;
    my $self = $class->SUPER::TIEHASH;

    croak "Must pass either limit or key list" if $#keys == -1;
    if ($#keys) {
        $self->{'_keys'} = {map {$_ => 1} @keys};
    } else {
        croak "'", $keys[0], "' is not a limit" unless int($keys[0]);
        $self->{'_limit'} = $keys[0]+1;    # add one for _limit
    }
    return $self;
}

sub FETCH {
    my ($self, $key) = @_;
    croak "Invalid key '$key'"
        if defined($self->{'_keys'})
        and (!$self->{'_keys'}{$key} or $key =~ /^_/);
    return $self->SUPER::FETCH($key);
}

sub STORE {
    my ($self, $key, $value) = @_;
    croak "Invalid key '$key'"
        if defined($self->{'_keys'})
        and (!$self->{'_keys'}{$key} or $key =~ /^_/);
    croak "Limit reached"
        if defined($self->{'_limit'})
        and $self->{'_limit'} <= scalar(keys %$self);
    $self->SUPER::STORE($key, $value);
}

1;
The constructor works by examining the passed arguments and either establishing a limit, or a list of valid keys, depending on what it was passed (on the assumption that restricting a hash to a single valid key is an unlikely requirement). If a limit was passed, we add one to it to allow for the _limit key itself, which also resides in the hash. If no arguments are passed at all, we complain. Otherwise, we set up two special keys in the hash to hold the configuration. The only other methods we override are FETCH and STORE. Each method checks that the key is valid, and prevents 'private' underscore-prefixed keys from being set. If the key passes the check, we pass the request to the FETCH and STORE methods of Tie::StdHash, our parent.
Technically we do not need to pass the method request up to the parent object from our versions of TIEHASH, FETCH, and STORE since they are obvious implementations. However, it is good practice to use parent methods where possible, so we do it anyway.

#!/usr/bin/perl
# limithash.pl
use warnings;
use strict;
use Limit::Hash;

tie my %hash, 'Limit::Hash', 'this', 'that', 'other';

$hash{'this'} = 'this is ok';
$hash{'that'} = 'as is this';
print $hash{'this'}, "\n";
$hash{'invalid-key'} = 'but this croaks';

When run, this script should produce:

this is ok
Invalid key 'invalid-key' at limithash.pl line 13
Summary

In this chapter, we were introduced to objects, and concepts relating to objects. We saw how to create objects, and use them by:

❑ Accessing properties
❑ Calling class methods
❑ Calling object methods
❑ Nesting method calls

We then learned how to determine what inherited characteristics an object possesses, for example, its ancestry, capabilities, and version. After this, we saw how to write object classes; specifically we looked at constructors and choosing an underlying data type. As well as this, we looked at class, object, and multiple-context methods. From here we discussed object and class data, which involved learning about accessors and mutators, along with inheriting and setting class data. Then we learned about class and object-level debugging, and implemented a multiplex debug strategy. A large section was devoted to inheritance and subclassing, and the topics covered included:

❑ Inheriting from a parent class
❑ Writing inheritable classes
❑ Extending and redefining objects
❑ Multiple inheritance
❑ The UNIVERSAL constructor
❑ Autoloading methods

We also saw how to keep class and object data private, and discussed the ins-and-outs of destroying objects. We also saw that it is possible to overload conversion operators as well as normal ones, and that we can 'fall back' to unoverloaded operators if need be. The final topics involved ties and tied objects, and in this section we saw how to use tied objects, including:

❑ Handling errors from tie
❑ Accessing nested references
❑ Testing a variable for tiedness
❑ Untieing objects

On top of this, we learned, among other things, how to write tied objects, and finished the chapter with an example of a tied hash class, and an example of a class using Tie::StdHash.
Inside Perl

In this chapter, we will look at how Perl actually works – the internals of the Perl interpreter. First, we will examine what happens when Perl is built, the configuration process and what we can learn about it. Next, we will go through the internal data types that Perl uses. This will help us when we are writing extensions to Perl. From there, we will get an overview of what goes on when Perl compiles and interprets a program. Finally, we will dive into the experimental world of the Perl compiler: what it is, what it does, and how we can write our own compiler tools with it. To get the most out of this chapter, it would be best for us to obtain a copy of the source code to Perl. Either of the two versions, stable or development, is fine, and they can both be obtained from our local CPAN mirror.

Analyzing the Perl Binary – 'Config.pm'

If Perl has been built on our computer, the configuration stage will have asked us a number of questions about how we wanted to build it. For instance, one question would have been along the lines of building Perl with, or without threading. The configuration process will also have poked around the system, determining its capabilities. This information is stored in a file named config.sh, which the installation process encapsulates in the module Config.pm. The idea behind this is that extensions to Perl can use this information when they are being built, but it also means that we as programmers can examine the capabilities of the current Perl, and determine whether or not we could take advantage of features such as threading provided by the Perl binary executing our code.
Chapter 20
'perl -V'

The most common use of the Config module is actually made by Perl itself: perl -V, which produces a little report on the Perl binary. It is actually implemented as the following program:

#!/usr/bin/perl
# config.pl
use warnings;
use strict;
use Config qw(myconfig config_vars);

print myconfig();

$" = "\n ";
my @env = map {"$_=\"$ENV{$_}\""} sort grep {/^PERL/} keys %ENV;
print " \%ENV:\n @env\n" if @env;
print " \@INC:\n @INC\n";
When this script is run we will get something resembling the following, depending on the specification of the system of course:

> perl config.pl
Summary of my perl5 (revision 5.0 version 7 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2.16, archname=i686-linux
    uname='linux deep-dark-truthful-mirror 2.4.0-test9 #1 sat oct 7 21:23:59 bst 2000 i686 unknown '
    config_args='-d -Dusedevel'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.2 20000220 (Debian GNU/Linux)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldb -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.1.94.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'
  @INC:
    lib
    /usr/local/lib/perl5/5.7.0/i686-linux
    /usr/local/lib/perl5/5.7.0
    /usr/local/lib/perl5/site_perl/5.7.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.7.0
    /usr/local/lib/perl5/site_perl
How It Works Most of the output is generated by the myconfig function in Config. It produces a list of the variables discovered by the Configure process when Perl was built. This is split up into four sections: Platform, Compiler, Linker and Libraries, and Dynamic Linking.
Platform The first section, platform, tells us a little about the computer Perl was being built on, as well as some of the choices we made at compile time. This particular machine is running Linux 2.4.0–test9, and the arguments –d –Dusedevel were passed to Configure during the question and answer section. (We will see what these arguments do when we come to looking at how Perl is built.) hint=recommended means that the configure program accepted the recommended hints for how a Linux system behaves. We built the POSIX module, and we have a struct sigaction in our C library. Next comes a series of choices about the various flavors of Perl we can compile: usethreads is turned off, meaning this version of Perl has no threading support. Perl has two types of threading support. See Chapters 1 and 22 for information regarding the old Perl 5.005 threads, which allow us to create and destroy threads in our Perl program, inside the Perl interpreter. This enables us to share data between threads, and lock variables and subroutines against being changed or entered by other threads. This is the use5.005threads option above. The other model, which came with version 5.6.0, is called interpreter threads or ithreads. In this model, instead of having two threads sharing an interpreter, the interpreter itself is cloned, and each clone runs its own portion of the program. This means that, for instance, we can simulate fork on systems such as Windows, by cloning the interpreter and having each interpreter perform separate tasks. Interpreter threads are only really production quality on Win32 – on all other systems they are still experimental. Allowing multiple interpreters inside the same binary is called multiplicity. The next two options refer to the IO subsystem. Perl can use an alternative input/output library called sfio (http://www.research.att.com/sw/tools/sfio) instead of the usual stdio if it is available. 
There is also a separate PerlIO layer being developed, which is specific to Perl. Next, there is support for files over 2GB if our operating system supports them, and support for the SOCKS firewall proxy, although the core does not use this yet. Finally, there is a series of 64-bit and long double options.
Compiler The compiler section tells us about the C environment. Looking at the output, we are informed of the compiler we used and the flags we passed to it, the version of GCC used to compile Perl, and the sizes of C's types and Perl's internal types. usemymalloc refers to the choice of Perl's supplied memory allocator rather than the default C one. The next section is not very interesting, but it tells us what libraries we used to link Perl.
817
Chapter 20
Linker and Libraries The only thing of particular note in this section is useshrplib, which allows us to build Perl as a shared library. This is useful if we have a large number of embedded applications, and it means we get to impress our friends by having a 10K Perl binary. By placing the Perl interpreter code in a separate library, Perl and other programs that embed a Perl interpreter can be made a lot smaller, since they can share the code instead of each having to contain their own copy.
Dynamic Linking When we use XS modules (for more information on XS, see Chapter 21), Perl needs to get at the object code provided by the XS. This object code is placed into a shared library, which Perl dynamically loads at run time when the module is used. The dynamic linking section determines how this is done. There are a number of models that different operating systems have for dynamic linking, and Perl has to select the correct one here. dlsrc is the file that contains the source code to the chosen implementation. dlsymun tells us whether or not we have to add underscores to dynamically loaded symbols. This is because some systems use different naming conventions for functions loaded at run time, and Perl has to cater to each different convention. The documentation for the Config module contains explanations for these and the other Configure variables accessible from the module. It gets this documentation from Porting/Glossary in the Perl source kit. What use is this? Well, for instance, we can tell if we have a threaded Perl or whether we have to use fork:

use Config;

if ($Config{usethreads} eq "define") {
    # we have threads
    require MyApp::Threaded;
} else {
    # make do with forking
    require MyApp::Fork;
}
Note that Config gives us a hash, %Config, which contains all the configuration variables.
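We can poke around %Config interactively, too. Here is a small sketch that prints a few commonly useful entries; the values will of course differ from machine to machine:

```perl
#!/usr/bin/perl
# Print a handful of values from %Config. Every key here is a standard
# Configure variable; the values shown depend on how this Perl was built.
use warnings;
use strict;
use Config;

foreach my $key (qw(archname osname usethreads ivsize)) {
    printf "%-10s => %s\n", $key,
        defined $Config{$key} ? $Config{$key} : '(undef)';
}
```

Running this on a typical Linux build prints something like archname => x86_64-linux, with usethreads empty or define depending on the build choices discussed above.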
Under the Hood Now it is time to really get to the deep material. Let us first look around the Perl source, before taking an overall look at the structure and workings of the Perl interpreter.
Around the Source Tree The Perl source is composed of around 2190 files in 186 directories. To be really familiar with the source, we need to know where we can expect a part of it to be found, so it is worth taking some time to look at the important sections of the source tree. There are also several informational files in the root of the tree:
Inside Perl
❑ Changes* – a very comprehensive list of every change that has been made to the Perl source since Perl 5.000
❑ Todo* – lists the changes that haven't been made yet – bugs to be fixed, ideas to try out, and so on
❑ MANIFEST – tells us what each file in the source tree does
❑ AUTHORS and MAINTAIN – tell us who is 'looking after' various parts of the source
❑ Copying and Artistic – the two licenses under which we receive Perl
Documentation
The bulk of the Perl documentation lives in the pod/ directory. Platform-specific notes can be found as README.* in the root of the source tree.
Core modules
The standard library of modules shipped with Perl is distributed around two directories: pure-Perl modules that require no additional treatment are placed in lib/, and the XS modules are each given their own subdirectory in the ext/ directory.
Regression tests
When a change to Perl is made, the Perl developers will run a series of tests to ensure that this has not introduced any new bugs or reopened old ones; Perl will also encourage us to run the tests when we build a new Perl on our system. These regression tests are found in the t/ directory.
Platform-specific code
Some platforms require a certain amount of special treatment. They do not provide some system calls that Perl needs, for instance, or there is some difficulty in getting them to use the standard build process. (See Building Perl.) These platforms have their own subdirectories: apollo/, beos/, cygwin/, djgpp/, epoc/, mint/, mpeix/, os2/, plan9/, qnx/, vmesa/, vms/, vos/, and win32/. Additionally, the hints/ subdirectory contains a series of shell scripts, which communicate platform-specific information to the build process.
Utilities
Perl comes with a number of utilities scattered around: perldoc, the pod translators, s2p, find2perl, a2p, and so on. (There is a full list, with descriptions, in the perlutil documentation of Perl 5.7 and above.) These are usually kept in utils/ and x2p/, although the pod translators have escaped to pod/.
Helper Files
The root directory of the source tree contains several program files that are used to assist the installation of Perl (installhtml, installman, installperl), some that help out during the build process (for instance, cflags, makedepend, and writemain), and some that are used to automate the generation of some of the source files. In this latter category, embed.pl is the most notable, as it generates all the function prototypes for the Perl source and creates the header files necessary for embedding Perl in other applications. It also extracts the API documentation embedded in the source code files.
Eagle-eyed readers may have noticed that we have left something out of that list – the core source to Perl itself! The files *.c and *.h in the root directory of the source tree make up the Perl binary, but we can also group them according to what they do:

Data Structures
A reasonable amount of the Perl source is devoted to managing the various data structures Perl requires; we will examine these structures in 'Internal Variable Types' later on. The files that manage these structures – av.c, av.h, cv.h, gv.c, gv.h, hv.c, hv.h, op.c, op.h, sv.c, and sv.h – also contain a wide range of helper functions, which make it considerably easier to manipulate them. See perlapi for a taste of some of the functions and what they do.
Parsing
The next major functional group in the Perl source code is the part that turns our Perl program into a machine-readable data structure. The files that take responsibility for this are toke.c and perly.y, the lexer and the parser.
PP Code
Once we have told Perl that we want to print 'hello world' and the parser has converted those instructions into a data structure, something actually has to implement the functionality. If we wonder where, for instance, the print statement is, we need to look at what is called the PP code. (PP stands for push-pop, for reasons that will become apparent later.) The PP code is split across four source files: pp_hot.c contains 'hot' code that is used very frequently; pp_sys.c contains operating-system-specific code, such as network functions or functions that deal with the system databases (getpwent and friends); pp_ctl.c takes care of control structures such as while, eval, and so on; pp.c implements everything else.
Miscellaneous
Finally, the remaining source files contain various utility functions to make the rest of the coding easier: utf8.c contains functions that manipulate data encoded in UTF8; malloc.c contains a memory management system; and util.c and handy.h contain some useful definitions for such things as string manipulation, locales, error messages, environment handling, and the like.
Building Perl Perl builds on a mind-boggling array of different platforms, and so has to undergo a very rigorous configuration process to determine the characteristics of the system it is being built on. There are two major systems for doing this kind of probing: the GNU project's autoconf is used by the vast majority of free software, but Perl uses an earlier and less common system called metaconfig.
'metaconfig' Rather than 'autoconf'? Porting/pumpkin.pod explains that both systems were equally useful, but gives three major reasons for choosing metaconfig: it can generate interactive configuration programs, so the user can override the defaults easily; autoconf, at the time, affected the licensing of software that used it; and metaconfig builds up its configuration programs from a collection of modular units. We can add our own units, and metaconfig will make sure that they are called in the right order. The program Configure in the root of the Perl source tree is a UNIX shell script that probes our system for various capabilities. The configuration on Windows is already done for us, and an NMAKE file can be found in the win32/ directory. On the vast majority of systems, we should be able to type ./Configure -d and then let Configure do its stuff. The -d option chooses sensible defaults instead of prompting us for answers. If we're using a development version of the Perl sources, we'll have to say ./Configure -Dusedevel -d to let Configure know that we are serious about it. Configure asks if we are sure we want to use a development version, and the default answer chosen by -d is 'no'; -Dusedevel overrides this answer. We may also want to add the -DDEBUGGING flag to turn on special debugging options, if we are planning on looking seriously at how Perl works. When we start running Configure, we should see something like this:

> ./Configure -d -Dusedevel
Sources for perl5 found in "/home/simon/patchbay/perl".

Beginning of configuration questions for perl5.

Checking echo to see how to suppress newlines...
...using -n.
The star should be here-->*

First make sure the kit is complete: Checking...

And eventually, after a few minutes, we should see this:

Creating config.sh...

If you'd like to make any changes to the config.sh file before I begin
to configure things, do it as a shell escape now (e.g. !vi config.sh).
Press return or use a shell escape to edit config.sh:

After pressing return, Configure creates the configuration files and fixes the dependencies for the source files. We then type make to begin the build process.
Perl builds itself in various stages. First, an interpreter called miniperl is built; this is just like the eventual Perl interpreter, but it does not have any of the XS modules – notably, DynaLoader – built into it. The DynaLoader module is special because it is responsible for coordinating the loading of all the other XS modules at run time; this is done through DLLs, shared libraries, or the local equivalent on our platform. Since we cannot load modules dynamically without DynaLoader, it must be built statically into Perl – if it were built as a DLL or shared library, what would load it? If there is no such dynamic loading system, all of the XS extensions must be linked statically into Perl. miniperl then generates the Config module from the configuration files generated by Configure, and processes the XS files for the extensions that we have chosen to build; when this is done, make returns to the process of building them. The XS extensions that are being linked in statically, such as DynaLoader, are linked to create the final Perl binary. Then the tools, such as the pod translators, perldoc, perlbug, perlcc, and so on, are generated; these must be created from templates to fill in the eventual path of the installed Perl binary. The sed-to-perl and awk-to-perl translators are created, and then the manual pages are processed. Once this is done, Perl is completely built and ready to be installed; the installperl program looks after installing the binary and the library files, and installman and installhtml install the documentation.
How Perl Works Perl is a byte-compiled language, and Perl is a byte-compiling interpreter. This means that Perl, unlike the shell, does not execute each line of our program as it reads it. Rather, it reads in the entire file, compiles it into an internal representation, and then executes the instructions. There are three major phases by which it does this: parsing, compiling, and interpreting.
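We can observe the separate compile and run phases from within Perl itself: a BEGIN block executes as soon as it has been compiled, before the rest of the program runs. A minimal demonstration:

```perl
#!/usr/bin/perl
# A BEGIN block runs during the compile phase, so 'compile time' is
# recorded first even though the block appears later in the file.
use warnings;
use strict;

our @phases;
push @phases, 'run time';
BEGIN { push @phases, 'compile time' }    # executes first, at compile time
print join(', ', @phases), "\n";          # compile time, run time
```

This ordering is exactly the parse-then-execute behavior described above: the whole file is compiled (running BEGIN blocks as they are encountered) before the run phase starts at the top.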
Parsing Strictly speaking, parsing is only a small part of what we are talking about here, but the word is casually used to mean the process of reading and 'understanding' our program file. First, Perl must process the command-line options and open the program file. It then shuttles extensively between two routines: yylex in toke.c and yyparse in perly.y. The job of yylex is to split up the input into meaningful parts (tokens) and determine what 'part of speech' each represents. toke.c is a notoriously fearsome piece of code, and it can sometimes be difficult to see how Perl is pulling out and identifying tokens; the lexer, yylex, is assisted by a sublexer (in the functions S_sublex_start, S_sublex_push, and S_sublex_done), which breaks apart double-quoted string constructions, and a number of scanning functions to find, for instance, the end of a string or a number. Once this is completed, Perl has to work out how these 'parts of speech' form valid 'sentences'. It does this by means of a grammar telling it how various tokens can be combined into 'clauses'. This is much the same as it is in English: say we have an adjective and a noun – 'pink giraffes'. We could call that a 'noun phrase'. So, here is one rule in our grammar:

adjective + noun => noun phrase
We could then say: adjective + noun phrase => noun phrase
This means that if we add another adjective – 'violent pink giraffes' – we have still got a noun phrase. If we now add the rules:

noun phrase + verb + noun phrase => sentence
noun => noun phrase
We could understand that 'violent pink giraffes eat honey' is a sentence. Here is a diagram of what we have just done:
sentence
├── NP
│   ├── Adj: violent
│   └── NP
│       ├── Adj: pink
│       └── N: giraffes
├── verb: eat
└── NP
    └── N: honey
We have completely parsed the sentence by combining the various components according to our grammar. Notice that the diagram is in the form of a tree; this is usually called a parse tree. It shows how we started with the language we are parsing and ended up at the highest level of our grammar. The actual English words sit at the bottom of the tree, and we call them terminal symbols, because they are at the very bottom of the tree. Everything else is a non-terminal symbol. We can write our grammar slightly differently:

s  : np v np ;
np : adj np | adj n | n ;

This is called 'Backus-Naur Form', or BNF; we have a target, a colon, and then several sequences of tokens, delimited by vertical bars, finished off by a semicolon. If we can see one of the sequences of things on the right-hand side of the colon, we can turn it into the thing on the left – this is known as a reduction.
The job of a parser is to completely reduce the input; if the input cannot be completely reduced, then a syntax error arises. Perl's parser is generated from the BNF grammar in perly.y; here is an (abridged) excerpt from it:

loop : label WHILE '(' expr ')' mblock cont
     | label UNTIL '(' expr ')' mblock cont
     | label FOR MY my_scalar '(' expr ')' mblock cont
     | ...

cont :
     | CONTINUE block
     ;
We can reduce any of the following into a loop:

❑ A label, the token WHILE, an open bracket, some expression, a close bracket, a block, and a continue block
❑ A label, the token UNTIL, an open bracket, some expression, a close bracket, a block, and a continue block
❑ A label, the tokens FOR and MY, a scalar, an open bracket, some expression, a close bracket, a block, and a continue block. (Or some other things we will not discuss here.)

And a continue block can be either:

❑ The token CONTINUE and a block
❑ Empty
We will notice that the things that we expect to see in the Perl code – the terminal symbols – are in upper case, whereas the things thatare purely constructs of the parser, like the noun phrases of our English example, are in lower case. Armed with this grammar, and a lexer, which can split the text into tokens and turn them into nonterminals if necessary, Perl can 'understand' our program. We can learn more about parsing and the yacc parser generator in the book Compilers: Principles, Techniques and Tools, ISBN 0-201100-88-6.
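To make the idea of reductions concrete, here is a toy recursive-descent parser for the little giraffe grammar above. This is purely an illustrative sketch – Perl's real parser is generated by yacc from perly.y, and works bottom-up rather than top-down like this:

```perl
#!/usr/bin/perl
# A toy recursive-descent recognizer for the grammar:
#   s  : np v np ;
#   np : adj np | adj n | n ;
use warnings;
use strict;

my %adj = map { $_ => 1 } qw(violent pink);
my %n   = map { $_ => 1 } qw(giraffes honey);
my %v   = map { $_ => 1 } qw(eat);

sub parse_np {
    my ($toks) = @_;
    if (@$toks and $adj{ $toks->[0] }) {
        my @save = @$toks;               # remember where we were
        shift @$toks;                    # consume the adjective
        return 1 if parse_np($toks);     # adj np (covers adj n too)
        @$toks = @save;                  # backtrack on failure
    }
    if (@$toks and $n{ $toks->[0] }) {   # bare noun
        shift @$toks;
        return 1;
    }
    return 0;
}

sub parse_s {
    my @toks = split ' ', shift;
    return parse_np(\@toks)              # np
        && @toks && $v{ shift @toks }    # v
        && parse_np(\@toks)              # np
        && !@toks ? 1 : 0;               # and nothing left over
}

print parse_s('violent pink giraffes eat honey')
    ? "sentence\n" : "syntax error\n";
```

'violent pink giraffes eat honey' is accepted, while something like 'pink eat honey' fails to reduce and is reported as a syntax error – the same fate a malformed Perl program meets in yyparse.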
Compiling Every time Perl performs a reduction, it generates a line of code, as determined by the grammar in perly.y. For instance, when Perl sees two terms connected by a plus sign, it performs the following reduction, and generates the following line of code:

term | term ADDOP term
    { $$ = newBINOP($2, 0, scalar($1), scalar($3)); }
Here, as before, we're turning the things on the right into the thing on the left. We take our term, an ADDOP, which is the terminal symbol for the addition operator, and another term, and we reduce those all into a term. Now each term, or indeed, each symbol carries around some information with it. We need to ensure that none of this information is lost when we perform a reduction. In the line of code in braces above, $1 is shorthand for the information carried around by the first thing on the right – that is, the first term. $2 is shorthand for the information carried around by the second thing on the right – that is, the ADDOP and so on. $$ is shorthand for the information that will be carried around by the thing on the left, after reduction.
newBINOP is a function that says 'create a new binary op'. An op (short for operation) is a data structure that represents a fundamental operation internal to Perl. It's the lowest-level thing that Perl can do, and every non-terminal symbol carries around one op. Why? Because every non-terminal symbol represents something that Perl has to do: fetching the value of a variable is an op; adding two things together is an op; performing a regular expression match is an op; and so on. There are some 351 ops in Perl 5. A binary op is an op with two operands, just like the addition operator in Perl-space – we add the thing on the left to the thing on the right. Hence, along with the op, we have to store a link to our operands; if, for instance, we are trying to compile $a + $b, our data structure must end up looking like this:
add is the type of binary op that we have created, and we must link this to the ops that fetch the values of $a and $b. So, to look back at our grammar:

term | term ADDOP term
    { $$ = newBINOP($2, 0, scalar($1), scalar($3)); }
We have two 'terms' coming in, both of which will carry around an op with them, and we are producing a term, which needs an op to carry around with it. We create a new binary op to represent the addition, by calling the function newBINOP with the following arguments: $2, as we know, stands for the second thing on the right, ADDOP; newBINOP creates a variety of different ops, so we need to tell it which particular op we want – we need add, rather than subtract or divide or anything else. The next value, zero, is just a flag to say 'nothing special about this op'. Next, we have our two binary operands, which will be the ops carried around by the two terms. We call scalar on them to make them turn on a flag to denote scalar context. As we reduce more and more, we connect more ops together: if we were to take the term we've just produced by compiling $a + $b and then use it as the left operand to ($a + $b) + $c, we would end up with an op looking like this:
Eventually, the whole program is turned into a data structure made up of ops linking to ops: an op tree. Complex programs can be constructed from hundreds of ops, all connected to a single root; even a program like this:

while (<>) {
    next unless /^#/;
    print;
    $oklines++;
}
print "TOTAL: $oklines\n";
turns into an op tree rooted at a single leave op, with dozens of ops beneath it: enter, nextstate, enterloop and leaveloop for the while loop, readline for <>, match for the pattern, print and pushmark, preinc for $oklines++, concat and const for building the TOTAL string, gvsv ops to fetch the variables, and so on.
We can examine the op tree of a Perl program using the B::Terse module described later, or with the -Dx option to Perl if we told Configure we wanted to build a debugging Perl.
Interpreting Once we have an op tree, what do we do with it? Having compiled the program into an op tree, the usual next stage is to execute the ops. To make this possible, while creating the op tree, Perl keeps track of the next op in the sequence to execute. So, running through the tree structure, there is an additional 'thread', like this:
add (executed 5th)
├── add (executed 3rd)
│   ├── fetch value of $a (executed 1st)
│   └── fetch value of $b (executed 2nd)
└── fetch value of $c (executed 4th)
Executing a Perl program is just a matter of following this thread through the op tree, doing whatever instruction is necessary at each point. In fact, the main code that executes a Perl program is deceptively simple: it is the function runops_standard in run.c, and if we were to translate it into Perl, it would look a bit like this:

PERL_ASYNC_CHECK() while $op = &{$op->pp_function};
Each op contains a function reference, which does the work and returns the next op in the thread. Why does the op return the next one? Don't we already know that? Well, we usually do, but for some ops, like the one that implements if, the choice of what to do next has to happen at run time. PERL_ASYNC_CHECK is a function that tests for various things, like signals, that can occur asynchronously between ops. The actual operations are implemented in PP code, the files pp*.c; we mentioned earlier that PP stands for push-pop, because the interpreter uses a stack to carry around data, and these functions spend a lot of time popping values off the stack or pushing values on. For instance, to execute $a = $b + $c the sequence of ops must look like this:

❑ Fetch $b and put it on the stack.
❑ Fetch $c and put it on the stack.
❑ Pop two values off the stack and add them, pushing the result.
❑ Fetch $a and put it on the stack.
❑ Pop a value and a variable off the stack and assign the value to the variable.
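The push-pop flavor of this sequence can be mimicked in a few lines of Perl. This is only a sketch of the idea – the real PP functions manipulate SVs on Perl's internal argument stack – but the shape of the computation is the same:

```perl
#!/usr/bin/perl
# A toy stack machine executing the op sequence for $a = $b + $c.
use warnings;
use strict;

my %vars = (b => 2, c => 3);
my @stack;

my @ops = (
    sub { push @stack, \$vars{b} },     # fetch $b
    sub { push @stack, \$vars{c} },     # fetch $c
    sub {                               # add: pop two, push the result
        my $y = pop @stack;
        my $x = pop @stack;
        push @stack, \($$x + $$y);
    },
    sub { push @stack, \$vars{a} },     # fetch $a
    sub {                               # assign: pop variable and value
        my $var = pop @stack;
        my $val = pop @stack;
        $$var = $$val;
    },
);

$_->() for @ops;                        # follow the 'thread' of ops
print "a = $vars{a}\n";
```

After the five 'ops' have run, $vars{a} holds 5, just as $a would after the real op sequence.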
We can watch the execution of a program with the -Dt flag if we configured Perl with the -DDEBUGGING option. We can also use -Ds and watch the contents of the stack. And that is, very roughly, how Perl works: it first reads in our program and 'understands' it; second, it converts it into a data structure called an op tree; and finally, it runs over that op tree executing the fundamental operations. There's one fly in the ointment: if we do an eval STRING, Perl cannot tell what the code to execute will be until run time. This means that the op that implements eval must call back to the parser to create a new op tree for the string and then execute that.
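We can also follow the execution thread from Perl itself, using the standard B module. The following sketch walks a subroutine's ops in execution order by starting at the op returned by START and following each op's next pointer, exactly as runops_standard does:

```perl
#!/usr/bin/perl
# Walk the op 'thread' of a compiled sub in execution order.
use warnings;
use strict;
use B ();

my $sum = sub { $_[0] + $_[1] };

my @names;
my $op = B::svref_2object($sum)->START;  # first op in execution order
while ($$op) {                           # a null op (address 0) ends the thread
    push @names, $op->name;
    $op = $op->next;
}
print "@names\n";
```

The printed list will vary between Perl versions, but an add op always appears in it somewhere, sitting between the ops that fetch the two arguments and the leavesub op that ends the thread.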
Internal Variable Types Internally, Perl has to use its own variable types. Why? Well, consider the scalar variable $a in the following code:

$a = "15x";
$a += 1;
$a /= 3;
Is it a string, an integer, or a floating-point number? It is obviously all three at different times, depending on what we want to do with it, and Perl has to be able to access all three different representations of it. Worse, there is no 'type' in C that can represent all of the values at once. So, to get around these problems, all of the different representations are lumped into a single structure in the underlying C implementation: a Scalar Variable, or SV.
PVs The simplest form of SV holds a structure representing a string value. Since we've already used the abbreviation SV, we have to call this a PV, a Pointer Value. We can use the standard Devel::Peek module to examine a simple SV (see the section 'Examining Raw Datatypes with Devel::Peek' later in the chapter for more detail on this module):

> perl -MDevel::Peek -e '$a = "A Simple Scalar"; Dump($a)'
SV = PV(0x813b564) at 0x8144ee4
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x81471a8 "A Simple Scalar"\0
  CUR = 15
  LEN = 16

What does this tell us? This SV is stored at memory location 0x8144ee4; the location can vary on different computers. The particular type of SV is a PV, which is itself a structure; that structure starts at location 0x813b564. Next comes some housekeeping information about the SV itself: its reference count (the REFCNT field) tells us how many references exist to this SV. As we know from our Perl-level knowledge of references, once this drops to zero, the memory used by the SV is available for reallocation. The flags tell us, in this case, that it's OK to use this SV as a string right now; the POK means that the PV is valid. (In case we are wondering, the pPOK means that Perl itself can use the PV. We shouldn't take advantage of this – the little p stands for 'private'.) The final three parts come from the PV structure itself: there's the pointer we talked about, which tells us that the string is located at 0x81471a8 in memory. Devel::Peek also prints out the string for us, to be extra helpful. Note that in C, but not in Perl, strings are terminated with \0 – character zero. Since C thinks that character zero is the end of a string, this causes problems when we want to have character zero in the middle of the string. For this reason, the next field, CUR, is the length of the string; this allows us to have a string like a\0b and still 'know' that it's three characters long and doesn't finish after the a.
The last field is LEN, the maximum length of the string that we have allocated memory for. Perl allocates more memory than it needs to, to allow room for expansion. If CUR gets too close to LEN, Perl will automatically reallocate a proportionally larger chunk of memory for us.
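The CUR bookkeeping is easy to confirm from Perl: a string can contain character zero, and length still reports the true length, because Perl records the length explicitly rather than relying on the C terminator:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $s = "a\0b";             # an embedded NUL, perfectly legal in Perl
print length($s), "\n";     # 3 - the length comes from CUR, not from \0
```

A C function handed this string's PV directly (via strlen, say) would think it was only one character long; the CUR field is what keeps Perl honest.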
IVs The second-simplest SV structure is one that contains the structures of a PV and an IV, an Integer Value. This structure is called a PVIV, and we can create one by performing string concatenation on an integer, like this:

> perl -MDevel::Peek -e '$a = 1; Dump($a); $a.="2"; Dump($a)'
SV = IV(0x8132fe4) at 0x8132214
  REFCNT = 1
  FLAGS = (IOK,pIOK)
  IV = 1
SV = PVIV(0x8128c30) at 0x8132204
  REFCNT = 1
  FLAGS = (POK,pPOK)
  IV = 1
  PV = 0x8133e38 "12"\0
  CUR = 2
  LEN = 3

Notice how our SV starts as a simple structure with an IV field, representing the integer value of the variable. This value is 1, and the flags tell us that the IV is fine to use. However, to use it as a string, we need a PV; rather than change its type to a PV, Perl changes it to a combination of IV and PV. Why? Well, if we had to change the structure of a variable every time we used it as an integer or a string, things would get very slow. Once we have upgraded the SV to a PVIV, we can very easily use it as a PV or an IV. Similarly, Perl never downgrades an SV to a less complex structure, nor does it change between equally complex structures. When Perl performs the string concatenation, it first converts the value to a PV – the C macro SvPV retrieves the PV of an SV, converting the current value to a PV and upgrading the SV if necessary. It then adds the 2 onto the end of the PV, automatically extending the memory allocated for it. Since the IV is now out of date, the IOK flag is unset and the POK flag is set to indicate that the string value is valid. On some systems, we can use unsigned (positive only) integers to get twice the range of the normal signed integers; these are implemented as a special type known as a UV.
NVs The third and final (for our purposes) scalar type is an NV (Numeric Value), a floating-point value. The PVNV type includes the structures of a PV, an IV, and an NV, and we can create one just like our previous example:

> perl -MDevel::Peek -e '$a = 1; Dump($a); $a.="2"; Dump($a); $a += 0.5; Dump($a)'
SV = IV(0x80fac44) at 0x8104630
  REFCNT = 1
  FLAGS = (IOK,pIOK,IsUV)
  UV = 1
SV = PVIV(0x80f06f8) at 0x8104630
  REFCNT = 1
  FLAGS = (POK,pPOK)
  IV = 1
  PV = 0x80f3e08 "12"\0
  CUR = 2
  LEN = 3
SV = PVNV(0x80f0d68) at 0x8104630
  REFCNT = 1
  FLAGS = (NOK,pNOK)
  IV = 1
  NV = 12.5
  PV = 0x80f3e08 "12"\0
  CUR = 2
  LEN = 3
We should be able to see that this is very similar to what happened when we used an IV as a string: Perl had to upgrade to a more complex format, convert the current value to the desired type (an NV in this case), and set the flags appropriately.
Arrays and Hashes We have seen how scalars are represented internally, but what about aggregates like arrays and hashes? These, too, are stored in special structures, although these are much more complex than the scalars. Arrays are, as we might be able to guess, a series of scalars stored in a C array; they are called an AV internally. Perl takes care of making sure that the array is automatically extended when required so that new elements can be accommodated. Hashes, or HVs, on the other hand, are stored by computing a special value for each key; this key is then used to reference a position in a hash table. For efficiency, the hash table is a combination of an array of linked lists: the key is first run through a hashing algorithm (roughly: for (split //, $string) { $hash = $hash * 33 + ord($_) } followed by $hash + $hash >> 5), so that, for example, "hello" hashes to 7942919. The hash value is then distributed across the buckets of the array – 7942919 & 7 selects bucket 7 of 8 – and the entry is stored on that bucket's chain of hash entries, alongside any other keys that landed in the same bucket.
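The bucket arithmetic is easy to play with from Perl. The following is only a toy version of the multiply-by-33 hash sketched above – the real algorithm lives in the PERL_HASH macro in hv.h and differs in detail – but it shows how a string is reduced to a bucket index:

```perl
#!/usr/bin/perl
# Toy multiply-by-33 string hash; masked to 32 bits to mimic C overflow.
use warnings;
use strict;

sub toy_hash {
    my ($string) = @_;
    my $hash = 0;
    for (split //, $string) {
        $hash = ($hash * 33 + ord $_) % 4294967296;
    }
    return ($hash + ($hash >> 5)) % 4294967296;
}

# Distribute across 8 buckets by masking with (buckets - 1).
my $buckets = 8;
printf "%s => bucket %d\n", 'hello', toy_hash('hello') & ($buckets - 1);
```

Keeping the bucket count a power of two means the 'distribute' step is a single AND; when a bucket's chain grows too long, Perl doubles the number of buckets and redistributes the entries.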
Thankfully, the interfaces to arrays and hashes are sufficiently well-defined by the Perl API that it's perfectly possible to get by without knowing exactly how Perl manipulates these structures.
Examining Raw Datatypes with 'Devel::Peek' The Devel::Peek module provides us with the ability to examine Perl datatypes at a low level. It is analogous to the Dumpvalue module, but returns the full and gory details of the underlying Perl implementation. This is primarily useful in XS programming, the subject of Chapter 21, where Perl and C are being bound together and we need to examine the arguments passed by Perl code to C library functions. For example, this is what Devel::Peek has to say about the literal number 6:

> perl -MDevel::Peek -e "Dump(6)"
SV = IV(0x80ffb48) at 0x80f6938
  REFCNT = 1
  FLAGS = (IOK,READONLY,pIOK,IsUV)
  UV = 6
Other platforms may add some items to FLAGS, but this is nothing to be concerned about. NT may add PADBUSY and PADTMP, for example. We also get a very similar result (with possibly varying memory address values) if we define a scalar variable and fill it with the value 6:

> perl -MDevel::Peek -e '$a=6; Dump($a)'
SV = IV(0x80ffb74) at 0x8109b9c
  REFCNT = 1
  FLAGS = (IOK,pIOK,IsUV)
  UV = 6

This is because Devel::Peek is concerned about values, not variables. It makes no difference if the 6 is literal or stored in a variable, except that Perl knows that the literal value cannot be assigned to and so is READONLY. Reading the output of Devel::Peek takes a little concentration but is not ultimately too hard, once the abbreviations are deciphered:

❑ SV means that this is a scalar value.
❑ IV means that it is an integer.
❑ REFCNT = 1 means that there is only one reference to this value (the count is used by Perl's garbage collection to clear away unused data).
❑ IOK and pIOK mean this scalar has a defined integer value (it would be POK for a string value, or ROK if the scalar was a reference).
❑ READONLY means that it may not be assigned to. Literal values have this set whereas variables do not.
❑ IsUV means that it is an unsigned integer and that its value is being stored in the unsigned integer slot UV rather than the IV slot, which indeed it is.
The UV slot is for unsigned integers, which can be twice as big as signed ones for any given size of integer (for example, 32 bit) since they do not use the top bit for a sign. Contrast this to -6, which uses the IV slot and doesn't have the IsUV flag set:

SV = IV(0x80ffb4c) at 0x80f6968
  REFCNT = 1
  FLAGS = (IOK,READONLY,pIOK)
  IV = -6

The Dump subroutine handles scalar values only, but if the value supplied to Dump happens to be an array or hash reference, then each element will be dumped out in turn. A second optional count argument may be supplied to limit the number of elements dumped. For list values (that is, lists, arrays, or hashes) we need to use DumpArray, which takes a count and a list of values to dump. Each of these values is of course scalar (even if it is a reference), but DumpArray will recurse into array and hash references:

> perl -MDevel::Peek -e '@a=(1,[2,sub {3}]); DumpArray(2, @a)'
This array has two elements, so we supply 2 as the first argument to DumpArray. We could of course also have supplied a literal list of two scalars, or an array with more elements (in which case only the first two would be dumped). The example above produces the following output, where the outer array of an IV (index no. 0) and an RV (reference value, index no. 1) can be clearly seen, with an inner array inside the RV of element 1 containing a PV (string value) with the value two and another RV. Since this one is a code reference, DumpArray cannot analyze it any further. At each stage the IOK, POK, or ROK (valid reference) flags are set to indicate that the scalar SV contains a valid value of that type:

Elt No. 0  0x811474c
SV = IV(0x80ffb74) at 0x811474c
  REFCNT = 1
  FLAGS = (IOK,pIOK,IsUV)
  UV = 1
Elt No. 1  0x8114758
SV = RV(0x810acbc) at 0x8114758
  REFCNT = 1
  FLAGS = (ROK)
  RV = 0x80f69a4
  SV = PVAV(0x81133b0) at 0x80f69a4
    REFCNT = 1
    FLAGS = ()
    IV = 0
    NV = 0
    ARRAY = 0x81030b0
    FILL = 1
    MAX = 1
    ARYLEN = 0x0
    FLAGS = (REAL)
    Elt No. 0
    SV = PV(0x80f6b74) at 0x80f67b8
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x80fa6d0 "two"\0
      CUR = 3
      LEN = 4
    Elt No. 1
    SV = RV(0x810acb4) at 0x80fdd24
      REFCNT = 1
      FLAGS = (ROK)
      RV = 0x81077d8

We mentioned earlier in the chapter how Perl can hold both a string and an integer value for the same variable. With Devel::Peek we can see how:

> perl -MDevel::Peek -e '$a="2701"; $a*1; Dump($a)'

Or on Windows:

> perl -mDevel::Peek -e "$a=q(2701); $a*1; Dump($a)"
This creates a variable containing a string value, then uses it (redundantly, since we do not use the result) in a numeric context, and then dumps it. This time the variable has both PV and NV (floating-point) values, and the corresponding POK and NOK flags set:

SV = PVNV(0x80f7648) at 0x8109b84
  REFCNT = 1
  FLAGS = (NOK,POK,pNOK,pPOK)
  IV = 0
  NV = 2701
  PV = 0x81022e0 "2701"\0
  CUR = 4
  LEN = 5

It is interesting to see that Perl actually produced a floating-point value here and not an integer, a window into Perl's inner processes. As a final example, if we reassign $a in the process of converting it, we can see that we get more than one value stored, but only one is legal:

> perl -MDevel::Peek -e '$a="2701"; $a=int($a); Dump($a)'

This produces:
SV = PVNV(0x80f7630) at 0x8109b8c
  REFCNT = 1
  FLAGS = (IOK,pIOK)
  IV = 2701
  NV = 2701
  PV = 0x81022e8 "2701"\0
  CUR = 4
  LEN = 5
The int subroutine is used to reassign $a an integer value. As we can see from the results of the dump, Perl actually converted the value from a string into a floating point NV value before assigning an integer IV to it. Because the assignment gives the variable a new value (even if it is equal to the string), only the IOK flag is now set.
These are the main features of Devel::Peek from an analysis point of view. Of course, in reality we would not be using it from the command line but to analyze values inside Perl code. Here we have just used it to satisfy our curiosity and understand the workings of Perl a little better. If we have a Perl interpreter built with DEBUGGING_MSTATS, we can also make use of the mstat subroutine to output details of memory usage. Unless we built Perl specially to do this, however, it is unlikely to be present, and so this feature is not usually available. Devel::Peek also contains advanced features to edit the reference counts on scalar values. This is not a recommended thing to do even in unusual circumstances, so we will not do more than mention that it is possible here. See perldoc Devel::Peek for more information if absolutely necessary.
The Perl Compiler The Perl Compiler suite is an oft-misunderstood piece of software. It allows us to perform various manipulations of the op tree of a Perl program, including converting it to C or bytecode. People expect that if they use it to compile their Perl to stand-alone executables, it will make their code magically run faster, when in fact usually the opposite occurs. Now that we know a little about how Perl works internally, we can determine why this is the case.
In the normal course of events, Perl parses our code, generates an op tree, and then executes it. When the compiler is used, Perl stops before executing the op tree and executes some other code instead, code provided by the compiler. The interface to the compiler is through the O module, which simply stops Perl after it has compiled our code, and then executes one of the "compiler backend" modules, which manipulate the op tree. There are several different compiler back-ends, all of which live in the 'B::' module hierarchy, and they perform different sorts of manipulations: some perform code analysis, while others convert the op tree to different forms, such as C or Java VM assembler.
The 'O' Module
How does the O module prevent Perl from executing our program? The answer is by using a CHECK block. As we learnt in Chapter 6, Perl has several special blocks that are automatically called at various points in our program's lifetime: BEGIN blocks are called during compilation, as soon as each one has itself been compiled; END blocks are called when our program finishes; INIT blocks are run just before execution; and CHECK blocks are run after compilation. The heart of the O module is its import subroutine:

sub import {
    my ($class, $backend, @options) = @_;
    eval "use B::$backend ()";
    if ($@) {
        croak "use of backend $backend failed: $@";
    }
    my $compilesub = &{"B::${backend}::compile"}(@options);
    if (ref($compilesub) eq "CODE") {
        minus_c;
        save_BEGINs;
        eval 'CHECK {&$compilesub()}';
    } else {
        die $compilesub;
    }
}
The 'B' Module
The strength of these compiler back-ends comes from the B module, which allows Perl to get at the C-level data structures that make up the op tree; now we can explore the tree from Perl code, examining both SV structures and OP structures. For instance, the function B::main_start returns an object representing the first op in the tree that Perl will execute. We can then call methods on this object to examine its data:

use B qw(main_start class);

CHECK {
    $x = main_start;
    print "The starting op is in class ", class($x),
          " and is of type: ", $x->ppaddr, "\n";
    $y = $x->next;
    print "The next op after that is in class ", class($y),
          " and is of type ", $y->ppaddr, "\n";
};
print "This is my program";
The class function tells us what type of object we have, and the ppaddr method tells us which part of the PP code this op will execute. Since the PP code is the part that actually implements the op, this method tells us what the op does. For instance:

The starting op is in class OP and is of type: PL_ppaddr[OP_ENTER]
The next op after that is in class COP and is of type PL_ppaddr[OP_NEXTSTATE]
This is my program

This tells us we have an ENTER op followed by a NEXTSTATE op. We could even set up a little loop to keep looking at the next op in the sequence:

use B qw(main_start class);

CHECK {
    $x = main_start;
    print "The starting op is in class ", class($x),
          " and is of type: ", $x->ppaddr, "\n";
    while ($x = $x->next and $x->can("ppaddr")) {
        print "The next op after that is in class ", class($x),
              " and is of type ", $x->ppaddr, "\n";
    }
};
print "This is my program";
This will list all the operations involved in the one-line program print "This is my program":

The starting op is in class OP and is of type: PL_ppaddr[OP_ENTER]
The next op after that is in class COP and is of type PL_ppaddr[OP_NEXTSTATE]
The next op after that is in class OP and is of type PL_ppaddr[OP_PUSHMARK]
The next op after that is in class SVOP and is of type PL_ppaddr[OP_CONST]
The next op after that is in class LISTOP and is of type PL_ppaddr[OP_PRINT]
The next op after that is in class LISTOP and is of type PL_ppaddr[OP_LEAVE]
This is my program

Since looking at each operation in turn is a particularly common thing to do when building compilers, the B module provides methods to 'walk' the op tree. walkoptree_slow starts at a given op and performs a depth-first traversal of the op tree, calling a method of our choice on each op, while walkoptree_exec does the same but works through the tree in execution order, using the next method to move through the tree, similar to our example programs above. To make these work, we must provide the method in each relevant class by defining the relevant subroutines:

use B qw(main_start class walkoptree_exec);

CHECK {
    walkoptree_exec(main_start, "test");
    sub B::OP::test {
        $x = shift;
        print "This op is in class ", class($x),
              " and is of type: ", $x->ppaddr, "\n";
    }
};
print "This is my program";
The 'B::' Family of Modules Now let us see how we can use the O module as a front end to some of the modules that use the B module. We have seen some of the modules in this family already, but now we will take a look at all of the B:: modules in the core and on CPAN.
'B::Terse'
The job of B::Terse is to walk the op tree of a program, printing out information about each op. In a sense, this is very similar to the programs we have just built ourselves. Let us see what happens if we run B::Terse on a very simple program:

> perl -MO=Terse -e '$a = $b + $c'
LISTOP (0x8178b90) leave
    OP (0x8178bb8) enter
    COP (0x8178b58) nextstate
    BINOP (0x8178b30) sassign
        BINOP (0x8178b08) add [1]
            UNOP (0x81789e8) null [15]
                SVOP (0x80fbed0) gvsv  GV (0x80fa098) *b
            UNOP (0x8178ae8) null [15]
                SVOP (0x8178a08) gvsv  GV (0x80f0070) *c
        UNOP (0x816b4b0) null [15]
            SVOP (0x816dd40) gvsv  GV (0x80fa02c) *a
-e syntax OK

This shows us a tree of the operations, giving the type, memory address, and name of each operator. Children of an op are indented from their parent: for instance, in this case, the ops enter, nextstate, and sassign are the children of the list operator leave, and the ops add and the final null are children of sassign. The information in square brackets is the contents of the targ field of the op; this is used both to show where the result of a calculation should be stored and, in the case of a null op, what the op used to be before it was optimized away: if we look up the 15th op in opcode.h, we can see that these ops used to be rv2sv, turning a reference into an SV. Again, just like the programs we wrote above, we can also walk over the tree in execution order by passing the exec parameter to the compiler:

> perl -MO=Terse,exec -e '$a = $b + $c'
OP (0x80fcf30) enter
COP (0x80fced0) nextstate
SVOP (0x80fc1d0) gvsv  GV (0x80fa094) *b
SVOP (0x80fcda0) gvsv  GV (0x80f0070) *c
BINOP (0x80fce80) add [1]
SVOP (0x816b980) gvsv  GV (0x80fa028) *a
BINOP (0x80fcea8) sassign
LISTOP (0x80fcf08) leave
-e syntax OK
Different numbers in the parentheses, or a different order from that shown above, may be returned, as this depends on the version of Perl. This provides us with much the same information, but re-ordered so that we can see how the interpreter will execute the code.
'B::Debug'
B::Terse provides us with minimal information about the ops; basically, just enough for us to understand what's going on. The B::Debug module, on the other hand, tells us everything possible about the ops in the op tree and the variables in the stashes. It is useful for hard-core Perl hackers trying to understand something about the internals, but it can be quite overwhelming at first sight:

> perl -MO=Debug -e '$a = $b + $c'
LISTOP (0x8183c30)
    op_next 0x0
    op_sibling 0x0
    op_ppaddr PL_ppaddr[OP_LEAVE]
    op_targ 0
    op_type 178
    op_seq 6437
    op_flags 13
    op_private 64
    op_first 0x8183c58
    op_last 0x81933c8
    op_children 3
OP (0x8183c58)
    op_next 0x8183bf8
    op_sibling 0x8183bf8
    op_ppaddr PL_ppaddr[OP_ENTER]
    op_targ 0
    op_type 177
    op_seq 6430
    op_flags 0
    op_private 0
...
-e syntax OK
'B::Deparse'
As its name implies, the B::Deparse module attempts to 'un-parse' a program. If parsing is going from Perl text to an op tree, deparsing must be going from an op tree back into Perl. This may not look too impressive at first sight:

> perl -MO=Deparse -e '$a = $b + $c'
$a = $b + $c;
-e syntax OK

It can be extremely useful for telling us how Perl sees certain constructions. For instance, we can show that && and if are almost equivalent internally:

> perl -MO=Deparse -e '($a == $b or $c) && print "hi";'
print 'hi' if $a == $b or $c;
-e syntax OK

We can also understand the strange magic of while(<>) and the -n and -p flags to Perl:

> perl -MO=Deparse -e 'while(<>){print}'
while (defined($_ = <ARGV>)) {
    print $_;
}
-e syntax OK
> perl -MO=Deparse -pe 1
LINE: while (defined($_ = <ARGV>)) {
    '???';
}
continue {
    print $_;
}
-e syntax OK

The '???' represents a useless use of a constant, in our case the 1, which was then optimized away. The most interesting use of this module is as a 'beautifier' for Perl code. In some cases, it can even help in converting obfuscated Perl to less-obfuscated Perl. Consider this little program, named strip.pl, which obviously does not follow good coding style:

($text=shift)||die "$0: missing argument!\n";for
(@ARGV){s-$text--g;print;if(length){print "\n"}}
B::Deparse converts it to this much more readable form:

> perl -MO=Deparse strip.pl
die "$0: missing argument!\n" unless $text = shift @ARGV;
foreach $_ (@ARGV) {
    s/$text//g;
    print $_;
    if (length $_) {
        print "\n";
    }
}

B::Deparse also has several options to control its output, for example, -p to add parentheses even where they are optional, or -l to include line numbers from the original script. These options are documented in the pod documentation included with the module.
'B::C'
One of the most sought-after Perl compiler backends is something that turns Perl code into C. In a sense, that's what B::C does, but only in a sense. There is not currently a translator from Perl to C, but there is a compiler. What this module does is write a C program that reconstructs the op tree. Why is this useful? If we then embed a Perl interpreter to run that program, we can create a stand-alone binary that can execute our Perl code. Of course, since we're using a built-in Perl interpreter, this is not necessarily going to be any faster than simply using Perl. In fact, it might well be slower and, because it contains an op tree and a Perl interpreter, we will end up with a binary that is bigger than our Perl binary itself. However, it is conceivably useful if we want to distribute programs to people who cannot or do not want to install Perl on their computers. (It is far more useful for everyone to install Perl, of course...) Instead of using perl -MO=C, the perlcc program acts as a front-end to both the B::C module and our C compiler; it was recently re-written, so there are two possible syntaxes:
If we're using 5.6.0, we can say:

> perlcc hello.pl

To create a binary called hello with 5.7.0, or 5.6.1 and above, the syntax and behavior is much more like the C compiler:

> perlcc -o hello hello.pl

Do not believe that the resulting C code would be readable, by the way; it truly does just create an op tree manually. Here is a fragment of the source generated from the famous 'Hello World' program:

static OP op_list[2] = {
    { (OP*)&cop_list[0], (OP*)&cop_list[0], NULL, 0, 177, 65535, 0x0, 0x0 },
    { (OP*)&svop_list[0], (OP*)&svop_list[0], NULL, 0, 3, 65535, 0x2, 0x0 },
};
static LISTOP listop_list[2] = {
    { 0, 0, NULL, 0, 178, 65535, 0xd, 0x40, &op_list[0], (OP*)&listop_list[1], 3 },
    { (OP*)&listop_list[0], 0, NULL, 0, 209, 65535, 0x5, 0x0, &op_list[1], (OP*)&svop_list[0], 1 },
};

And there are another 1035 lines of C code just like that. Adding the -S option to perlcc will leave the C code available for inspection after the compiler has finished.
'B::CC'
To attempt to bridge the gap between this and 'real' C, there is the highly experimental 'optimized' C compiler backend, B::CC. This does very much the same thing, but instead of creating the op tree, it sets up the environment for the interpreter and manually executes each PP operation in turn, by setting up the arguments on the stack and calling the relevant op. For instance, the main function from 'Hello, World' will contain some code a little like this:

lab_8135840:
    PL_op = &op_list[1];
    DOOP(PL_ppaddr[OP_ENTER]);
    TAINT_NOT;
    sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
    FREETMPS;
lab_82e7ef0:
    PUSHMARK(sp);
    EXTEND(sp, 1);
    PUSHs((SV*)&sv_list[3]);
    PL_op = (OP*)&listop_list[0];
    DOOP(PL_ppaddr[OP_PRINT]);
lab_81bca58:
    PL_op = (OP*)&listop_list[1];
    DOOP(PL_ppaddr[OP_LEAVE]);
    PUTBACK;
    return PL_op;
We should be able to see that the code after the first label picks up the first op, puts it into the PL_op variable and calls OP_ENTER. After the next label, the argument stack is extended and an SV is grabbed from the list of pre-constructed SVs. (This will be the PV that says 'Hello, world') This is put on the stack, the first list operator is loaded (that will be print), and OP_PRINT is called. Finally, the next list operator, leave is loaded and OP_LEAVE is called. This is, in a sense, emulating what is going on in the main loop of the Perl interpreter, and so could conceivably be faster than running the program through Perl. However, the B::C and B::CC modules are still very experimental; the ordinary B::C module is not guaranteed to work but very probably will, whereas the B::CC module is not guaranteed to fail but almost certainly will.
'B::Bytecode'
Our next module, B::Bytecode, turns the op tree into machine code for an imaginary processor. This is not dissimilar to the idea of Java bytecode, where Java code is compiled into machine code for an idealized machine. And, in fact, the B::JVM::Jasmin module discussed below can compile Perl to JVM bytecode. However, unlike Java, there are no software emulations of such a Perl 'virtual machine' (although see http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-04/msg00436.html for the beginnings of one). Bytecode is, to put it simply, the binary representation of a program compiled by Perl. Using the B::Bytecode module, this binary data is saved to a file. Then, using the ByteLoader module, it is read into memory and executed by Perl. Getting back to the source code of a byte-compiled script is not simple but, unlike the B::C case, not impossible. There is even a disassembler module, as we'll see below, but it does not produce Perl code. Someone with a very good knowledge of the Perl internals could understand from it what the program does, but not everybody. The execution speed is just the same as normal Perl; here too, only the time of reading the script file and parsing it is saved. In this case, however, there's the additional overhead of reading the bytecode itself, and the ByteLoader module too. So, an overall speed increase can be obtained only on very large programs, whose parsing time is high enough to justify the effort. We may ask, how does this magic take place? How does Perl run the binary data as if it were a normal script? The answer is that the ByteLoader module is pure magic: it installs a 'filter' that lets Perl understand a sequence of bytes in place of Perl code. In Chapter 21 we'll introduce filters and their usage. Using the bytecode compiler without the perlcc tool (if we're running an older version of Perl, for example) is fairly trivial.
The O compiler frontend module can be used for this:

> perl -MO=Bytecode hello.pl > hello.plc

To run the bytecode produced, we need to specify the ByteLoader module on the command line, because the file generated this way does not contain the two header lines that perlcc adds for us:

> perl -MByteLoader hello.plc
An op tree stored as Perl bytecode can be loaded back into Perl using the ByteLoader module; perlcc can also compile programs to bytecode by saying:

> perlcc -b -o hello.plc hello.pl

The output will be a Perl program, which begins something like:

#!/usr/bin/perl
use ByteLoader 0.04;

This is then followed by the binary bytecode output.
'B::Disassembler' The Perl bytecode produced can be 'disassembled' using the disassemble tool, placed in the B directory under the Perl lib path. The output of this program is a sort of 'Perl Assembler' (thus the name): a set of basic instructions that describe the low level working of the Perl interpreter. It is very unlikely that someone will find this Assembler-like language comfortable (that's the reason why high-level languages were invented, after all). However, if we already have a good knowledge of the Perl internals, the disassembled code can provide many useful insights about the way Perl works.
The disassemble script is just a front-end for the B::Disassembler module, which does all the work. To convert from bytecode to assembler, we can use this command line:

> perl disassemble hello.plc
The output, as we can expect if we have some experience with 'regular' Assembler, is very verbose. Our simple 'hello world' script generates 128 lines of output, for a single Perl instruction! Once the bytecode has been disassembled, it can also be regenerated using the opposite assemble program:
> perl disassemble hello.plc > hello.S
> perl assemble hello.S > hello.plc
This could be used, for example, to convert bytecode from one Perl version to another. The intermediate assembler should be understood by the two versions, while the binary format can change.
'B::Lint' Like the lint utility for C files, this module acts as a basic 'style checker', looking for various things which may indicate problems. We can select from a variety of checks, by passing a list of the names of the tests to O as follows: > perl -MO=Lint,context,undefined-subs program.pl
This will turn on the context and undefined-subs checks. The checks are:

context
    Warns whenever an array is used in a scalar context without the explicit scalar operator. For instance:

        $count = @array;

    will give a warning, but:

        $count = scalar @array;

    will not.

implicit-read
    Warns whenever a special variable is read from implicitly; for instance:

        if(/test/) {...}

    is an implicit match against $_.

implicit-write
    Warns whenever a special variable will be written to implicitly:

        s/one/two/

    both reads from and writes to $_, and thus causes a warning in both the above categories.

dollar-underscore
    Warns whenever $_ is used, explicitly or implicitly.

private-names
    Warns when a variable name is invoked which starts with an underscore, such as $_self; these variable names are, by convention, private to some module or package.

undefined-subs
    Warns when a subroutine is directly invoked but not defined; this will not catch subroutines indirectly invoked, such as through subroutine references, which may be empty.

regexp-variables
    The regular expression special variables $`, $', and $& slow our program down; once Perl notices one of them in our program, it must keep track of their value at all times. This check will warn if the regular expression special variables are used.

all
    Turns on all of the above checks.
For instance, given the following file (deliberately written without use strict, since its undeclared package variables would otherwise prevent it from compiling at all):

#!/usr/bin/perl
# B_Lint.pl
$count = @file = <>;
for (@file) {
    if (/^\d\d\d/ and $& > 300) {
        print
    } else {
        summarize($_);
    }
}
B::Lint reports:

> perl -MO=Lint,all test
Implicit use of $_ in foreach at test line 4
Implicit match on $_ at test line 5
Use of regexp variable $& at test line 5
Use of $_ at test line 5
Use of $_ at test line 8
Undefined subroutine summarize called at test line 8
test syntax OK
'B::Showlex'
The B::Showlex module is used for debugging lexical variables. For instance, when used on the following short file:

#!/usr/bin/perl
# B_Showlex.pl
use warnings;
use strict;

my $one = 1;
{
    my $two = "two";
    my $one = "one";
}

Again, we can use the O module to drive it:

> perl -MO=Showlex B_Showlex.pl

The result shown below may differ slightly depending on the system setup:

Pad of lexical names for comppadlist has 4 entries
0: SPECIAL #1 &PL_sv_undef
1: PVNV (0x80fa0c4) "$one"
2: PVNV (0x816b070) "$two"
3: PVNV (0x8195fa0) "$one"
Pad of lexical values for comppadlist has 4 entries
0: SPECIAL #1 &PL_sv_undef
1: NULL (0x80f0070)
2: NULL (0x80fa130)
3: NULL (0x816b0dc)

This tells us that this program has four lexical key-value pairs (remember that variables are, effectively, stored in a hash); the first is the special undefined value, undef. Then follow our three lexical variables, which have no values before our program is executed.
'B::Xref'
Generating cross-reference tables can be extremely useful to help understand a large program. The B::Xref module tells us where subroutines and lexical variables are defined and used. For instance, here is an extract of the output for a program called Freekai (http://simoncozens.org/software/freekai/), broken down into sections:
> perl -MO=Xref freekai
File freekai
Subroutine (definitions)
  Package UNIVERSAL
    &VERSION        s0
    &can            s0
    &isa            s0
  Package attributes
    &bootstrap      s0
  Package main
    &dotext         s11
First, Xref tells us about the subroutine definitions; the subroutines UNIVERSAL::VERSION, UNIVERSAL::can, UNIVERSAL::isa, and attributes::bootstrap are special subroutines internal to Perl, so they have line number 0. On line 11, Freekai's one subroutine, dotext, is defined.

Subroutine (main)
  Package ?
    ?               &6
    text            &7, &8
  Package Text::ChaSen
    &getopt_argv    &5

This section talks about subroutine calls in the main body of the program; there are some calls that it cannot understand (these are actually handlers to HTML::Parser), and the subroutine Text::ChaSen::getopt_argv is called on line 5.

  Package main
    $cset      4
    $p         6, 7, 8, 9
    $res       5
    &dotext    7
    *STDIN     &9, 9

Now we describe the global variables used in the main body of the program: $cset is used on line 4, the subroutine reference &dotext is used on line 7, and so on.

Subroutine dotext
  Package (lexical)
    $deinflected    i19, 21
    $kanji          i19, 20, 21, 21
    $pos            i19, 20, 21
    $yomi           i19, 21

We enter the subroutine dotext and look at the lexical variables used here: i19 says that these variables were all initialized on line 19, and the other numbers tell us which lines the variables were used on.

  Package Text::ChaSen
    &sparse_tostr    &16
We refer to the subroutine Text::ChaSen::sparse_tostr on line 16.

  Package main
    $_         12, 14, 15, 16, 19
    $cset      21
    @_         12
    @chars     16, 17, 18

And, finally, these are the global variables used in this subroutine. B::Xref is an excellent way of getting an overall map of a program, particularly one that spans multiple files and contains several modules. Here's a slightly more involved cross-reference report from the debug closure example debug.pl we have analyzed earlier:

> perl -MO=Xref debug.pl

Below is the output that this command produces:

File debug.pl
Subroutine (definitions)
  Package UNIVERSAL
    &VERSION        s0
    &can            s0
    &isa            s0
  Package attributes
    &bootstrap      s0
  Package main
    &debug          s27
    &debug_level    s12
Subroutine (main)
  Package (lexical)
    $debug_level    i6
  Package main
    %ENV            6
    &debug          &34, &35, &38, &39, &42, &45, &46
    &debug_level    &31, &42, &46
Subroutine debug
  Package (lexical)
    $level          i17, 20, 25, 25
  Package main
    &debug_level    &20, &25
    *STDERR         20
    @_              17, 20, 20
Subroutine debug_level
  Package (lexical)
    $debug_level    10, 11
  Package main
    @_              10
debug.pl syntax OK
Subroutine (definitions) details all the subroutines defined in each package found; note the debug and debug_level subroutines in package main. The numbers following indicate that these are subroutine definitions (prefix s) and are defined on lines 12 (debug_level) and 27 (debug), which is indeed where those subroutine definitions end. Similarly, we can see that in package main the debug_level subroutine is called at lines 31, 42, and 46, and within debug it is called at lines 20 and 25. We can also see that the scalar $debug_level is initialized (prefix i) on line 6, and is used only within the debug_level subroutine, on lines 10 and 11. This is a useful result, because it shows us that the variable is not being accessed from anywhere that it is not supposed to be. Similar analysis of other variables and subroutines can provide us with similar information, allowing us to track down and eliminate unwanted accesses between packages and subroutines, while highlighting areas where interfaces need to be tightened up, or visibility of variables and values reduced.
'B::Fathom'
Turning to the B::* modules on CPAN, rather than in the standard distribution, B::Fathom aims to provide a measure of readability. For instance, running B::Fathom on Rocco Caputo's Crossword server (http://poe.perl.org/poegrams/index.html#xws) produces the following output:

> perl -MO=Fathom xws.perl
237 tokens
78 expressions
28 statements
1 subroutine
readability is 4.69 (easier than the norm)

However, do not take its word as gospel: it judges the following code (due to Damian Conway) as 'very readable':

$;=$/;seek+DATA,!++$/,!$s;$_=<DATA>;$s&&print||$g&&do{$y=($x||=20)*($y||8);
sub i{sleep&f}sub'p{print$;x$=,join$;,$b=~/.{$x}/g}$j=$j;sub'f{pop}sub
n{substr($b,&f%$y,3)=~tr,O,O,}sub'g{$f=&f-1;($w,$w,substr($b,&f,1),O)
[n($f-$x)+ n($x+$f)-(substr($b,&f,1)eq+O)+n$f]||$w}$w="\40";$b=join'',@ARGV?<>:$_,$w
x$y;$b=~s).)$&=~/\w/?O:$w)ge;substr($b,$y)=q++;$g='$i=0;$i?$b:$c=$b;
substr+$c,$i,1,g$i;$g=~s?\d+?($&+1)%$y?e;$i-$y+1?eval$g:do{$i=-1;$b=$c;p;i
1}';sub'e{eval$g;&e}e}||eval||die+No.$;
'B::Graph'

The B::Graph module, by Stephen McCamant, is not part of the standard compiler suite, but is available as a separate download from CPAN. This backend produces a graphical representation of the op tree that the compiler sees. Output is in plain text by default, but this is not very appealing. With the -vcg and -dot options the module generates vector graphs that can be displayed using, respectively, the VCG (Visualization of Compiler Graphs) program (from http://www.cs.unisb.de/RW/users/sander/html/gsvcg1.html) and the GraphViz module with the Graphviz package (available from http://www.graphviz.org/). Both are free software tools, available for UNIX as well as Windows. A VCG graph can be generated with the following command:
848
Inside Perl
> perl -MO=Graph,-vcg -e '$a = $b + $c' > graph.vcg

Here is the graph produced by the module of our old friend, $a = $b + $c.
[Figure: the op tree for $a = $b + $c as drawn by B::Graph. From root to leaves: leave (LISTOP) with children enter (OP), nextstate (COP, line 1) and sassign (BINOP); sassign points to add (BINOP) and to a null (UNOP, was rv2sv) holding the gvsv (SVOP) for $a; add points to two more null (UNOP, was rv2sv) ops holding the gvsv (SVOP) nodes for $b and $c.]
Aside from being nice (plenty of colorful boxes connected by arrows; the colors can't be displayed in the diagram), the output can be very useful if we're going to seriously study the internal working of the Perl compiler. It can be seen as a graphical disassembler, a feature that few programming languages can offer.
'B::JVM::Jasmin'

Finally, one of the most ambitious modules is an attempt to have Perl output Java bytecode; it creates Java assembly code, which is assembled into bytecode with the Jasmin assembler, available from http://mrl.nyu.edu/meyer/jasmin. It provides a compiler, perljvm, which turns a Perl program into a compiled class file. Running:

> perljvm myprog.pl MyProgram

will create MyProgram.class. If this is of interest we should also look in the jpl/ directory of the Perl source tree; JPL (Java Perl Lingo) is a way of sharing code between Perl and Java, just as XS and embedding share code between Perl and C. B::JVM::Jasmin, on the other hand, makes Perl code run natively on the Java Virtual Machine.
Chapter 20
Writing a Perl Compiler Backend

Now that we have looked at the backends that are available to us, let's think about how we can write our own. We'll create something similar to B::Graph, but we will make it a lot simpler; let's call it B::Tree. We will use the GraphViz module just like B::Graph, because its interface is very simple: to add a node, we say:

$g->add_node({name => 'A', label => 'Text to appear on graph'});
and to connect two nodes, we say:

$g->add_edge({from => 'A', to => 'B'});
Let's start off slowly, by just having our backend visit each op and say hello. We need to remember two things: to be a proper backend usable by the O module, we need to define a compile subroutine which returns a subref which does the work; if we are using walkoptree_slow to visit each op, we need to define a method for each class we are interested in. Here is the skeleton of our module:

#!/usr/bin/perl
# ourmodule1.pm
package B::Tree;
use strict;
use B qw(main_root walkoptree_slow);

sub compile {
    return sub {walkoptree_slow(main_root, "visit")};
}

sub B::OP::visit {
    print "Hello from this node.\n";
}

1;
Now let's change it so that we add a GraphViz node at every op (note that we declare $g at file scope, so that it is visible under use strict to both compile and the visit methods):

#!/usr/bin/perl
# ourmodule2.pm
package B::Tree;
use strict;
use GraphViz;
use B qw(main_root walkoptree_slow);

my $g;    # the graph object, shared by compile() and the visit methods

sub compile {
    return sub {
        $g = new GraphViz;
        walkoptree_slow(main_root, "visit");
        print $g->as_dot;
    };
}

sub B::OP::visit {
    my $self = shift;
    $g->add_node({name => $$self, label => $self->name});
}

1;
This time, we load up the GraphViz module and start a new graph. At each op, we add a node to the graph. The node's name will be the address of its reference, because that way we can easily connect parent and child ops; ${$self->first}, for instance, will give the address of the child. For the label, we use the name method to tell us what type of op this is. Once we have walked the op tree, we print out the graph; at the moment, this will just be a series of unconnected circles, one for each op.

Now we need to think about how the ops are connected together, and that's where the difference between the op structures comes into play. The ops are connected up in four different ways: a LISTOP can have many children, a UNOP (unary operator) has one child, and a BINOP (binary operator) has two children; all other ops have no children. Let's first deal with the easiest case, the op with no children:

sub B::OP::visit {
    my $self = shift;
    $g->add_node({name => $$self, label => $self->name});
}
This is exactly what we had before; now, in the case of an UNOP, the child is pointed to by the first method. This means we can connect an UNOP up by adding an edge between $$unop and ${$unop->first}, like so:

sub B::UNOP::visit {
    my $self = shift;
    my $first = $self->first;
    $g->add_node({name => $$self, label => $self->name});
    $g->add_edge({from => $$self, to => $$first});
}
A BINOP is very similar, except that its other child is pointed to by the last method:

sub B::BINOP::visit {
    my $self = shift;
    my $first = $self->first;
    my $last = $self->last;
    $g->add_node({name => $$self, label => $self->name});
    $g->add_edge({from => $$self, to => $$first});
    $g->add_edge({from => $$self, to => $$last});
}
The trickiest one is the LISTOP. Here, the first child is pointed to by first and the last child by last. The other children are connected to each other by the sibling method. Hence, if we start at first, we can add each sibling until we get to last:

sub B::LISTOP::visit {
    my $self = shift;
    $g->add_node({name => $$self, label => $self->name});
    my $node = $self->first;
    while ($$node != ${$self->last}) {
        $g->add_edge({from => $$self, to => $$node});
        $node = $node->sibling;
    }
    $g->add_edge({from => $$self, to => $$node}); # finally add the last
}
And that's it! Putting the whole thing together (again declaring $g at file scope so the code compiles under use strict):

#!/usr/bin/perl
# ourmodule3.pm
package B::Tree;
use strict;
use GraphViz;
use B qw(main_root walkoptree_slow);

my $g;    # the graph object, shared by compile() and the visit methods

sub compile {
    return sub {
        $g = new GraphViz;
        walkoptree_slow(main_root, "visit");
        print $g->as_dot;
    };
}

sub B::LISTOP::visit {
    my $self = shift;
    $g->add_node({name => $$self, label => $self->name});
    my $node = $self->first;
    while ($$node != ${$self->last}) {
        $g->add_edge({from => $$self, to => $$node});
        $node = $node->sibling;
    }
    $g->add_edge({from => $$self, to => $$node});
}

sub B::BINOP::visit {
    my $self = shift;
    my $first = $self->first;
    my $last = $self->last;
    $g->add_node({name => $$self, label => $self->name});
    $g->add_edge({from => $$self, to => $$first});
    $g->add_edge({from => $$self, to => $$last});
}

sub B::UNOP::visit {
    my $self = shift;
    my $first = $self->first;
    $g->add_node({name => $$self, label => $self->name});
    $g->add_edge({from => $$self, to => $$first});
}

sub B::OP::visit {
    my $self = shift;
    $g->add_node({name => $$self, label => $self->name});
}

1;
We can now use this just like B::Graph:

> perl -I. -MO=Tree -e '$a = $b + $c' | dot -Tps > tree.ps
[Figure: the tree produced by B::Tree for $a = $b + $c - leave at the root, with children enter, nextstate and sassign; sassign branches into add and a null op over the gvsv for $a; add branches into two null ops over the gvsv nodes for $b and $c.]
And in fact, this was the module we used to make the tree diagrams earlier on in this chapter!
Actually, there was one complication that we did not explore here. Sometimes, unary operators are not really unary; ops that have been simplified to NULLs during the optimization stage will still have siblings which need to be connected to the graph. The full version of B::Tree is on CPAN and works around this.

Summary

In this chapter, we have seen a little bit of how Perl goes about executing a program: the parsing and the execution stages. We have looked at the internal data structures that represent Perl variables, such as SVs, and we have also had a look at the Perl compiler, which transforms the op tree of a program into various different formats: C code, Java assembler, and so on.
For more information on how Perl works internally, look at the perlguts and perlhack documentation files. perlguts is more of a reference on how to use the Perl API as well as how the Perl core works, while perlhack is a more general tutorial on how to get to grips with working on the Perl sources.
Integrating Perl with Other Programming Languages

In this chapter, we shall explore several ways to integrate Perl with other programming languages. The main theme will be, for reasons that will become evident, integrating with C in both directions.

Using C from Perl will show how to write C code and call it from a Perl program, or how to build a Perl wrapper to existing C code, using the XS interface (also known as XSUB, for eXternal SUBroutine) to write Perl extensions. We will also take a look at how to dynamically link to compiled libraries (on systems that support dynamic linking), and perform calls to their exported functions directly from Perl, without the need to compile anything.

Using Perl from C deals with the opposite topic: how to take advantage of Perl's functionalities in C programs. This technique, usually referred to as 'embedding Perl', consists of linking the Perl library with a C project and, using the provided include files and definitions, implementing a Perl interpreter in the application.

What we discuss here requires some degree of familiarity with the Perl internals, as seen in Chapter 20. Even if there are a number of well-established techniques, there are also a lot of pitfalls, undocumented features, and probably even bugs. Last, but not least, the whole knowledge base is subject to change as Perl evolves.

At the end of this chapter, we'll take a quick tour around integration with different languages: from the JPL (Java-Perl Lingo), to developing with the COM framework on Windows machines, to rather esoteric implementations and possibilities lying around in the labyrinthine world of Perl.
Chapter 21
Using C from Perl

Putting some C code at the disposal of Perl is something that one would do for a number of reasons, such as gaining access to otherwise unimplemented low-level system calls and libraries, interfacing with other programs or tools that expose C APIs, or simply speeding up a Perl script by rewriting time-consuming tasks in C. In fact, a large part of Perl's riches comes from the numerous extensions it has already. Modules are provided to interface the GD (to produce graphic files) and Tk (Tcl's GUI toolkit) libraries, to name a few. Even on specific platforms, there is usually a whole set of modules that puts Perl in a position to interact with every single part of that operating system (for example, on the Windows platform, we have modules to manage the registry, services, event logs, and much more).

All this is made possible by what is called the XS interface, a quite complex set of Perl scripts and modules that work together to ease the task of linking C code into Perl. Many major modules, like the ones we just mentioned, make use of this interface to have C code that can be directly seen by Perl as a set of subroutines. In this chapter, we are going to introduce the use of this interface in its most common form. We will see how to write XS code, and how it interfaces with Perl. For a more complete reference, and some advanced topics not covered here, we can refer to the following manual pages from the standard Perl documentation:

perlxs: The reference manual for the XS format.
perlxstut: An example-driven, lightweight tutorial.
perlguts: The official Perl internals user's guide: explains almost everything anyone needs to know to use Perl and C together.
perlapi: The official Perl internals API reference (available from version 5.6.0; on earlier versions this was part of perlguts).
perlcall: A lot of information about calling Perl from C, dealing with stack manipulation, creating callback code and other advanced techniques.
We also suggest looking at the Extension-Building Cookbook by Dean Roehrich, available on CPAN (search for distributions matching Cookbook on http://search.cpan.org), as well as the PerlGuts Illustrated paper by Gisle Aas, available at http://gisle.aas.no/perl/illguts/. Finally, probably the best and most complete reference material is the source code written by others. The POSIX extension, for example, available in the ext/POSIX directory of the Perl source tree, is a very good example of both basic and advanced techniques. We shouldn't be afraid of learning from existing extensions, as that is one of the benefits of open source after all. We just need to remember to be good citizens, and give the appropriate credits when reusing someone else's work.
An Overview of XS

The 'core' of the XS interface is a Perl script called xsubpp, which is part of the standard ExtUtils distribution. This script translates an XS file, which is basically a C file with some special markers here and there, to a regular C file that is compiled and linked as usual. The purpose of the XS format is to hide most of the required Perl C API calls from the programmer, so that he or she can concentrate on writing the C code. xsubpp will do its best to automate the interactions between C and Perl, simplifying all the routine jobs of declaring XSUBs and converting data back and forth, as we'll see later.

The process of building a Perl extension is handled by the ExtUtils::MakeMaker module (and the locally installed C compiler and make program, of course) that was introduced in Chapter 10. The resulting binary will usually be a shared object, known as a DSO (Dynamic Shared Object) on most UNIX platforms; on Windows it is known as a Dynamic Link Library (DLL). Once we have this library, with all its exported symbols, the job of linking to it and performing calls is done at run-time by the DynaLoader module, which is a special extension often built together with the Perl executable and statically linked to it. There is more detail on this module in Chapter 20.

If our operating system does not support dynamic linking, we can still use the XS interface. Only the build process will differ, because the extension has to be statically linked into the Perl executable (just as DynaLoader is). This means that, by building an extension, a completely new Perl executable will be built. Fortunately, the majority of modern operating systems do support dynamic linking.

The only other bit of glue that needs to be in the process is a Perl module (a regular PM file), which simply inherits from DynaLoader, and is instructed to call the bootstrap function responsible for doing the run-time link to the XS shared object.
'h2xs', or Where to Start

The XS interface was originally developed to provide transparent access to C libraries, which usually come with header files defining the API (function calls, constants, macros, and whatever else). There is, in fact, a dedicated Perl script, called h2xs, which handles this task. As the name implies, it is a C header file (.h) to XS translator. It is automatically installed in the bin directory of the Perl installation, so chances are that if the Perl executable is in the path, it can be called directly in the shell. This command creates stubs in the makefile to handle C integration.

Beyond its original intent, h2xs can, and definitely should, be used as a starting point to create an extension, even if there is no header to translate and we're writing code from scratch. The script has a number of options, which we'll see below, that make it very flexible. In particular, it can be used to build a skeleton module, with or even without the XS part (see Chapter 10 for an example of using it on plain Perl modules). This script, h2xs, will take over all the routine jobs of preparing a module, generating the Makefile.PL, the PM file, a MANIFEST, and everything that should be in every well-packaged distribution.
We can suppress the C aspects of the Makefile by adding the -X option (note the case sensitivity). The -n flag specifies the name for the extension. For example:

> h2xs -X -n Installable::Module

In addition to these flags, h2xs supports a large number of options to control various aspects of the generated files. For a pure Perl module, the most important of these are:

-h    Print out some usage information about the h2xs utility.
-A    Do not include the AutoLoader module (see Chapter 10). By default, h2xs generates a module source file including the AutoLoader. If present, the install makefile target automatically calls the AutoSplit module to perform a split on installation. If we do not want to use the AutoLoader, this option will suppress it.
-C    Do not create a Changes file. Instead, h2xs will add a HISTORY section to the pod documentation at the bottom of the module source file.
-P    Suppress generation of the pod documentation in the module source file.
-v    Specify a version number for the module other than the default of 0.01. For example, -v 0.20.
-X    Suppress the XS part of the makefile; pure Perl module distributions should use this. Modules that make use of C code should not.
By default, the created module file incorporates code to use both the Exporter and AutoLoader. As an example, to suppress the lines involving the AutoLoader, we can add the -A option as well:

> h2xs -XA -n Installable::Module

Similarly, to create a module with a version of 0.20 and no Changes file, but which does use the AutoLoader:

> h2xs -C -v 0.20 -n Installable::Module
Converting C Header Files

To start, let us consider the 'canonical' situation. Suppose we have a C header file, called xs_test.h, with some constants, and a pair of function prototypes, as shown below:

#define FIRST 1
#define SECOND 2
#define THIRD 3

int first_function(int number, void *databuffer);
void second_function(char *name, double a, double b);
In the simplest case, we can call the h2xs script with the name of the header file as the only argument:

> h2xs xs_test.h
This will create a directory called Xs_test (the name is taken from the header file), which contains everything that is needed to build the extension: the PM and XS files (Xs_test.pm and Xs_test.xs), a Makefile.PL, a simple test script, a nicely initialized Changes log, and finally the MANIFEST file:

Writing Xs_test/Xs_test.pm
Writing Xs_test/Xs_test.xs
Writing Xs_test/Makefile.PL
Writing Xs_test/test.pl
Writing Xs_test/Changes
Writing Xs_test/MANIFEST
Once built with the following commands, our extension will simply export all the constants found in the header file:

> perl Makefile.PL
> make
> make test
> su
Password:
# make install

This means that we can already write a script with it:

#!/usr/bin/perl
# xs_test.pl
use warnings;
use strict;
use Xs_test;

my @names = qw(none foo bar baz);
print $names[SECOND];
This will correctly print out bar. Not a big thing yet, but this procedure, applied to more meaningful headers, will provide a good starting point in implementing an existing API. By default, h2xs can only search the header file for constants (the #define directives above) and make them accessible from the generated extension. The code to access the defined functions must be provided by the programmer. However, if we have installed the C::Scan module, available from CPAN, we can add the -x switch (note the case sensitivity; using a capital X will suppress the XS part of the Makefile) to h2xs and it will recognize function definitions too, and put them in the generated XS file. For example, by running:

> h2xs -Ox xs_test.h

(we also add the -O switch to allow h2xs to overwrite the files we previously generated), some stub code will be added to Xs_test.xs. With this code, we are already able to call Xs_test::first_function from a Perl script, and access the C function defined by this name, as long as we have included the object file that contains the declared function. We will see later how to add external libraries to the build process.
It should be said that this procedure, by itself, is unlikely to be enough for real-world applications: the job done by h2xs and C::Scan can only generate useful interface code for very simple cases. Most of the time, if not always, some coding will be required to successfully implement a C API as a Perl extension. But the value of these tools should not be underestimated: it's much easier to add the missing pieces on a prepared template, than to start with a blank file and write everything by hand.
Starting from Scratch

The header file name supplied to h2xs is not mandatory: the tool can be used without any existing file to convert. In this case, we must add the -n flag we met earlier, which specifies the name for the extension. Let's see what happens when we call the script like this:

> h2xs -n XS::Test

This time, the directory structure XS/Test is created, and the same set of files seen above is generated:

Writing XS/Test/Test.pm
Writing XS/Test/Test.xs
Writing XS/Test/Makefile.PL
Writing XS/Test/test.pl
Writing XS/Test/Changes
Writing XS/Test/MANIFEST
The only difference is that our XS::Test extension provides nothing of its own. Still, the generated files are convenient templates with which to start writing our own code, and this is what we will use for the rest of the section as our workbench.
The XS File

Now that we have defined the starting point, let's first see what's in our XS file. We start with what was generated by our last example, a truly bare XS file (Test.xs), and explore its content bit by bit:

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
The first three lines are standard to every XS file: the included headers contain all the definitions needed to use Perl's internals and write XSUBs. Additional #include directives and definitions can be put here, but these three should always be left untouched. Next, we have two regular C functions:

static int
not_here(s)
char *s;
{
    croak("%s not implemented on this architecture", s);
    return -1;
}
static double
constant(name, arg)
char *name;
int arg;
{
    errno = 0;
    switch (*name) {
    }
    errno = EINVAL;
    return 0;

not_there:
    errno = ENOENT;
    return 0;
}
These are put here by default, and work together with the AUTOLOAD function placed in the PM file to give convenient access to C constants. Given the constant name as a string, the constant function returns its corresponding numerical value (or calls not_here to complain with a not implemented error, in case the constant is not defined). From this point on, we have what is properly called the XS code:

MODULE = XS::Test    PACKAGE = XS::Test

double
constant(name, arg)
    char * name
    int arg
In fact, everything before the MODULE keyword is just plain C, and it will be passed untouched to the compiler. xsubpp's work only begins here, and it concerns the functions defined in this section.
Declaring the 'MODULE' and 'PACKAGE'

The MODULE and PACKAGE directives tell XS, respectively, the name of the module for which the resulting loadable object is being generated, and the namespace, or Perl package, where the functions will be defined (in this case XS::Test for both). We should never need to change MODULE, but we can, at some point, change to another PACKAGE. This could happen, for example, when we're writing a module that implements different packages or subpackages:

# put some functions in XS::Test
MODULE = XS::Test    PACKAGE = XS::Test

# other functions in XS::Test::foo
MODULE = XS::Test    PACKAGE = XS::Test::foo

# in XS::Test::bar
MODULE = XS::Test    PACKAGE = XS::Test::bar
Note that PACKAGE can't be changed without declaring (or repeating) MODULE right before it. The first MODULE occurrence, as we've seen, marks the start of the XS section. After this, the file contains function definitions.
XS Functions

In its simplest form, an XS function is just a wrapper for the corresponding C function; that's exactly what the last four lines of our example do:

double
constant(name, arg)
    char * name
    int arg
This snippet defines the XSUB constant: to Perl, this means sub XS::Test::constant. That sub, in turn, calls the C function named constant (defined before in the same file), passing any input and output parameters to and from Perl as required. The XS definition is very similar to a C function definition, except for the missing semicolons after the parameter type declarations. Note, however, that the return value must be on the line before the function name, so it's not correct to squeeze the lines above to this:

/* ERROR: this is wrong */
double constant(name, arg)
    char * name
    int arg
Types must also be defined after the function name, as in good old pre-ANSI C, so this is incorrect:

/* ERROR: this is wrong too */
double constant(char *name, int arg)
A more elaborate XS function, however, will usually do more than map one-to-one with a C function. This can happen for a number of reasons, including handling complex C data types like structures, providing more 'perlish' syntactic sugar, or grouping a series of calls as one. Last, but not least, if we're writing C code from scratch, we need to provide some code for our functions.
Putting the 'CODE' in

In fact, an XS function can contain code, just like a C function can have a body, but this time the syntax is more particular. This is the most common form of an XS function:

int
first_function(number, databuffer)
    int number
    void * databuffer
  CODE:
    /* place C code here */
  OUTPUT:
    RETVAL
The CODE: identifier marks the start of the function, while OUTPUT: marks its end and returns the indicated value to Perl (RETVAL has a special meaning as we will see shortly).
Take care; xsubpp wants functions to be separated by a blank line, so if there is no output to give (for example, if the function is declared to return void), the blank line marks the end of the function.

void
first_function()
  CODE:
    /* place C code here */

void
second_function()
  CODE:
    /* place C code here */

There are no brackets to mark the end of the function body, so we have to pay attention to blank lines, which is something that may be confusing at first. Let's see a very simple C function, expressed in XS, which returns the greater one of the two numbers passed:

int
maximum(a, b)
    int a
    int b
  CODE:
    if(a > b) {
        RETVAL = a;
    } else {
        RETVAL = b;
    }
  OUTPUT:
    RETVAL

As mentioned earlier, there is a special variable lying around, which is RETVAL. This variable is automatically defined by xsubpp from the function definition's return type. In this case, RETVAL is declared to be an int. The last two lines are there to return the calculated variable to Perl.
Now that we have added these lines to our Test.xs file, we simply need to rebuild the extension, and our maximum function is already available as a Perl sub:

use XS::Test;
print XS::Test::maximum(10, 20);
This simple script will print out 20, as we would expect.
Complex Output, 'PPCODE' and the Stack

There are times when we want more than one return value (this is Perl, after all, and we shouldn't suffer from the limitations of C). To achieve this result, we should take the following steps:

❑ Use void for the return value. This way, XS will not provide an automatic RETVAL variable.

❑ Use the PPCODE keyword instead of CODE, to let XS know that we're going to handle return values on our own.

❑ Allocate the needed number of return values on the stack by calling:

    EXTEND(SP, number);

❑ For each value, use the XST_m* set of macros to actually put the value onto the stack. The macros are XST_mIV, XST_mNV and XST_mPV for the Perl integer, double and string datatypes respectively. The first argument to the macros is the stack position where we want our value to be placed (0 is the first position), and the second argument is the value itself.

❑ Finally, return the desired number of values by ending the function with:

    XSRETURN(number);
There are also different XSRETURN macros to return special values (e.g. XSRETURN_UNDEF, XSRETURN_EMPTY, XSRETURN_YES and others). In the next example, we will make a little example function using this technique: our zeroes function just returns an array composed of the given number of zeroes (equivalent to Perl's (0) x number). We do this by using a variable to loop from 0 to the given number, and for each value of the loop variable, we place the IV 0 at the corresponding stack position. Here we introduce another keyword, PREINIT, which should be used whenever the function needs to declare local C variables. This is necessary because xsubpp, as we'll see later, actually uses function calls for the XS arguments. If we place our definition of i after the PPCODE keyword, it will result in it being placed after some code has already been evaluated, which will upset the compiler.

void
zeroes(number)
    int number
  PREINIT:
    int i;
  PPCODE:
    EXTEND(SP, number);
    for(i = 0; i < number; i++) {
        XST_mIV(i, 0);
    }
    XSRETURN(number);
We can go even further, and inspect the context in which the function was called in Perl, by using the GIMME function (same as Perl's built-in wantarray). Based on this, we may decide to return different results. GIMME returns one of the following values: G_ARRAY when the function is called in list context, and G_SCALAR when it is called in scalar context. Let's also see how to cope with memory allocation in XS. Perl defines its own API for this, so we have safemalloc and safefree instead of C's standard malloc and free, as well as Zero instead of memzero. There are additional substitutes like Copy for memcpy, Move for memmove, and so on.

void
zeroes(number)
    int number
  PREINIT:
    char * string;
    int i;
  PPCODE:
    if(number <= 0) {XSRETURN_UNDEF;}
    if(GIMME == G_ARRAY) {
        EXTEND(SP, number);
        for(i = 0; i < number; i++) {
            XST_mIV(i, 0);
        }
        XSRETURN(number);
    } else {
        string = safemalloc(number + 1);
        Zero(string, number + 1, char);
        for(i = 0; i < number; i++) {
            string[i] = '0';
        }
        XST_mPV(0, string);
        safefree(string);
        XSRETURN(1);
    }
This version returns an array of zeroes when called in list context, and a string made up of zeroes when called in scalar context:

@z = XS::Test::zeroes(3);    # returns (0, 0, 0)
$z = XS::Test::zeroes(3);    # returns "000"
As a side note, it should always be kept in mind that we're working at a low level, so we don't have nice Perl error messages when something goes wrong, which is why we added the line:

if(number <= 0) {XSRETURN_UNDEF;}
This is to avoid allocating a huge amount of memory when the function is called with a negative number, which would make Perl blow up (probably with an Out of memory! error).
Complex Input, Default Values and Variable Input Lists

We have ways to deal with variable input too: again, the power of dealing with Perl's @_ input to a subroutine should not be left unaddressed. The simplest case is having optional arguments, for which we assume a default value when one is not supplied. To provide default values, we simply assign them to our arguments on the function prototype line. For example, this function behaves like Perl's built-in rand, which returns a random number between 0 and the given number, or between 0 and 1 when no arguments are given:

double
xrand(max=1)
    int max
  CODE:
    RETVAL = (double) rand() * max / RAND_MAX;
  OUTPUT:
    RETVAL
There can be more than one argument for which a default value is defined, but they should always be at the end of the argument list. For example, the following forms are wrong:

void
my_function(arg1=NULL, arg2)
    char * arg1
    char * arg2

void
my_function(arg1, arg2=NULL, arg3)
    char * arg1
    char * arg2
    char * arg3
While the following are correct:

void
my_function(arg1, arg2=NULL)
    char * arg1
    char * arg2

void
my_function(arg1, arg2=NULL, arg3=NULL)
    char * arg1
    char * arg2
    char * arg3
A more complex case occurs when the number of arguments is not known in advance. We can use the special identifier '...' (also known as an ellipsis). When it appears in the argument list of the function, any number of arguments can be passed to it. To actually work with the supplied arguments, we can follow this procedure:

❑ Use items to inspect the number of values in the input stack (items is an integer variable automatically defined by xsubpp).

❑ Access a specific value with the ST(n) macro, which returns the element n on the stack.

❑ Since ST returns a pointer to an SV, use the Sv* macros to convert the value to the corresponding C type (as before, we have SvIV, SvNV and SvPV for integers, doubles and strings respectively).
This simple function shows this at work. It accepts a variable list of values (a Perl array, for example) and returns the number of positive integers found in the list:

int
find_positive_numbers(...)
    PREINIT:
        int i;
    CODE:
        RETVAL = 0;
        for(i = 0; i < items; i++) {
            if(SvIV(ST(i)) > 0) {
                RETVAL++;
            }
        }
    OUTPUT:
        RETVAL
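The same counting logic can be written in plain C with stdarg, although C has no equivalent of the implicit items count, so the caller must supply it (count_positive is our own name for this sketch):

```c
#include <stdarg.h>

/* Count how many of the supplied integers are strictly positive;
 * mirrors the SvIV(ST(i)) > 0 test in the XSUB above. */
int count_positive(int count, ...) {
    va_list ap;
    int i, found = 0;

    va_start(ap, count);
    for (i = 0; i < count; i++) {
        if (va_arg(ap, int) > 0)
            found++;
    }
    va_end(ap);
    return found;
}
```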
The 'TYPEMAP'

So far, we've seen examples with rather plain variables involved, and Perl happily understood C variable types and vice versa. This is true for the most common basic datatypes, but not for the many cases when we need to define our own mappings. This is done by a special file called typemap, which xsubpp uses when converting from XS to C. In reality, even the work on simple datatypes is done by the default typemap file, which resides in the lib/ExtUtils directory and is always used by xsubpp. A typemap consists of three parts:

❑ The enumeration of the C types we use (sometimes introduced by an optional TYPEMAP identifier).

❑ An INPUT section, which is used when passing Perl variables to C.

❑ An OUTPUT section, which is used when returning C variables to Perl.
The example below shows the general syntax. We use c_type here as a placeholder for the C datatype, and T_TYPE as our self-defined tag for the datatype:

c_type    T_TYPE

INPUT
T_TYPE
    /* code to convert from Perl to C */

OUTPUT
T_TYPE
    /* code to convert from C to Perl */
As an example implementation, let's see how the default typemap deals with integers:

int    T_IV
...

INPUT
T_IV
    $var = ($type)SvIV($arg)
...

OUTPUT
T_IV
    sv_setiv($arg, (IV)$var);
The process involves using the SvIV macro to convert from a Perl value (an SV) to a C int, and the sv_setiv function to convert the other way around. There are three special identifiers (which look like Perl variables, but are just patterns for xsubpp to substitute with the actual C expressions) that can be used in the typemap code fragments:

$arg     The XS argument, which gets translated into the ST(n) macro.
$var     The corresponding C variable.
$type    The C type, as defined in the XS file.
To see how the mechanism really works, we refer back to the example maximum function from the section entitled 'Putting the 'CODE' in'. The typemap at work produced the following results, which we can see by peeking into the generated Test.c file. Our declarations of int a and int b have been translated to:

int a = (int)SvIV(ST(0));
int b = (int)SvIV(ST(1));

The return value for the function has become:

sv_setiv(ST(0), (IV)RETVAL);
A custom typemap must be built if we're using C types that are not included in the default one. For example, if our library defines a type, say, little_number, to represent an integer, it must be in the typemap if we want to use input/output values of this type. If we just have a different name for an already defined type, we can use the tags defined in the default typemap:

little_number    T_IV
This is enough to treat a little_number as if it was an int. A typemap definition can be much more complicated. For example, suppose we have this C structure:

typedef struct {
    int red;
    int green;
    int blue;
} my_rgb;

we may decide that we want a my_rgb variable passed to our extension as an array reference, holding values for the red, green, and blue components (as in [128, 128, 128]). We implement this in the typemap:

my_rgb    T_RGB

INPUT
T_RGB
    if(SvROK($arg) && SvTYPE(SvRV($arg)) == SVt_PVAV) {
        $var.red = (av_fetch((AV*)SvRV($arg), 0, 0) == NULL)
            ? 0 : SvIV(*(av_fetch((AV*)SvRV($arg), 0, 0)));
        $var.green = (av_fetch((AV*)SvRV($arg), 1, 0) == NULL)
            ? 0 : SvIV(*(av_fetch((AV*)SvRV($arg), 1, 0)));
        $var.blue = (av_fetch((AV*)SvRV($arg), 2, 0) == NULL)
            ? 0 : SvIV(*(av_fetch((AV*)SvRV($arg), 2, 0)));
    } else
        croak(\"$var is not a reference to an array!\");
The above code checks that the argument is a reference, and a reference to an array. If it is, it tries to fetch three values from the referenced array, and assigns them to the red, green, and blue components of the C structure. To be safe in case of failures, we first check the result of av_fetch and, in case it is NULL, we use zero for the corresponding component. Now we can write an XS function that accepts my_rgb as input:

bool
is_gray(color)
        my_rgb color
    CODE:
        RETVAL = (color.red == color.green && color.green == color.blue) ? 1 : 0;
    OUTPUT:
        RETVAL

and we can use this from Perl as:

XS::Test::is_gray([128, 128, 128]);
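Stripped of the XS and typemap glue, the underlying C logic is only a comparison of the three struct members, as this standalone sketch shows:

```c
typedef struct {
    int red;
    int green;
    int blue;
} my_rgb;

/* A color is a shade of gray when all three components are equal. */
int is_gray(my_rgb color) {
    return (color.red == color.green && color.green == color.blue) ? 1 : 0;
}
```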
Any other kind of value passed to our function will cause a fatal error (with the message color is not a reference to an array!). Dealing with the corresponding OUTPUT part requires a different approach. For example, we can use a PPCODE function, as seen above, to return an array of three values for the red, green, and blue components. A more functional implementation of a C structure can be achieved using the T_PTROBJ tag, defined in the standard typemap. It requires, however, some additional work, which is outlined below:

❑ We add to the typemap a line for a pointer to the structure, specifying the T_PTROBJ tag for it:

my_rgb *    T_PTROBJ

❑ When returning a variable, we need to be sure to allocate memory for it. This simple function initializes and returns a my_rgb structure:

my_rgb *
new_color(r, g, b)
        int r
        int g
        int b
    CODE:
        RETVAL = (my_rgb *) safemalloc(sizeof(my_rgb));
        RETVAL->red = r;
        RETVAL->green = g;
        RETVAL->blue = b;
    OUTPUT:
        RETVAL
❑ Now, we add another PACKAGE to the XS file, named after the structure, with the addition of a final Ptr:

MODULE = XS::Test    PACKAGE = my_rgbPtr

❑ After this line, we add functions to access the structure members. Here is the one for red, and we can do the same for the green and blue members:

int
red(color)
        my_rgb * color
    CODE:
        RETVAL = color->red;
    OUTPUT:
        RETVAL

❑ We should also add a DESTROY function, to free the allocated memory when the variable we have returned to Perl goes out of scope:

void
DESTROY(color)
        my_rgb * color
    CODE:
        safefree(color);
Now that we've done all this, we can use the new_color function, which returns a Perl object blessed into the my_rgbPtr class. To access the underlying C structure, we simply call the methods we've defined on it:

my $Black = XS::Test::new_color(0, 0, 0);
print "red component of Black is: ", $Black->red, "\n";
Et voila, $Black is now a real C structure implemented in Perl.
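In plain C, the lifecycle that the blessed my_rgbPtr object manages corresponds to an allocate/access/free triple; a minimal sketch, with malloc/free standing in for the safemalloc/safefree wrappers and our own function names:

```c
#include <stdlib.h>

typedef struct {
    int red;
    int green;
    int blue;
} my_rgb;

/* What new_color does underneath: allocate and fill the structure. */
my_rgb *new_color(int r, int g, int b) {
    my_rgb *c = malloc(sizeof(my_rgb));
    if (c != NULL) {
        c->red = r;
        c->green = g;
        c->blue = b;
    }
    return c;
}

/* What the red() accessor does: dereference the wrapped pointer. */
int get_red(const my_rgb *c) { return c->red; }

/* What DESTROY does when the Perl variable goes out of scope. */
void destroy_color(my_rgb *c) { free(c); }
```

The Perl layer's only extra job is to call destroy_color automatically when the object's last reference disappears.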
The 'Makefile'

The extension we've played with until now, as we have already mentioned, can be built using the standard procedure:

> perl Makefile.PL
> make
> make test
> su
Password:
# make install

The starting point of the build process is the Makefile.PL, which uses the ExtUtils::MakeMaker module to build a Makefile suitable for make. In most cases, the default file, built for us by h2xs, will do its job. But there are times when something needs to be added to the build process. The most common case is when the extension is going to use external C libraries, or particular switches need to be passed to the C compiler. If we look at the default Makefile.PL, we will see that it already has pointers for us to fill in the missing parts:
use ExtUtils::MakeMaker;

# see lib/ExtUtils/MakeMaker.pm for details of how to influence
# the contents of the Makefile that is written
WriteMakefile(
    'NAME'         => 'XS::Test',
    'VERSION_FROM' => 'Test.pm', # finds $VERSION
    'LIBS'         => [''],      # e.g., '-lm'
    'DEFINE'       => '',        # e.g., '-DHAVE_SOMETHING'
    'INC'          => '',        # e.g., '-I/usr/include/other'
);
The LIBS, DEFINE and INC parameters to WriteMakefile should satisfy most of our needs; they're even documented by the comments on the right. There are additional parameters that may sometimes be useful, like CCFLAGS to pass additional switches to the C compiler, or XSOPT to alter the behaviour of xsubpp. The ExtUtils::MakeMaker documentation lists all the options recognized by WriteMakefile. We should note that, normally, most of these should be left at their defaults, since changing them may result in our extension being unusable. If we still require more control over the generated Makefile, we can replace (or 'overload') some, or all, of the code that is behind WriteMakefile. To do that, we simply define a sub in a package called MY with the same name as a MakeMaker sub. The output of our sub will override the one produced by the module's sub. Usually, however, we just need to modify the output of the corresponding MakeMaker function. So, we first call the function from the SUPER package (which is ExtUtils::MakeMaker), then modify its result and return it. This example, which can be added to the Makefile.PL, overrides the constants sub, to move everything that resides in /usr to /tmp/usr:

package MY;
sub constants {
    my $inherited = shift->SUPER::constants(@_);
    $inherited =~ s|/usr/|/tmp/usr/|g;
    return $inherited;
}
This kind of tweak should be used with caution, and only when it is really needed. If we are having trouble producing a working Makefile, we ought to think about patching MakeMaker.pm and sharing our work with the perl5-porters mailing list for inclusion in the standard Perl distribution.
Dynamic Linking

If we already have a dynamic library (a shared object on UNIX platforms or a DLL on the Windows platform), we can use a different approach to access the code in there; one which does not require writing extension code and messing with compilers and linkers. Basically, we make use of the same mechanism Perl itself uses for XS extensions. The library is loaded at runtime, references to its exported symbols are created, and then the library code is executed, as if it were a Perl sub, through this reference.
From the programmer's point of view, this is an all-Perl solution and only requires some script coding. Under the hood, however, there's some C code, and really clever code indeed, because we don't know until runtime what the C prototype of the function we're going to call is. So, deep magic (usually resorting to assembler code) is involved to provide a generic, all-purpose link to anonymous C functions. There are currently three modules that have been written to address the issue. We'll give them a quick look, in descending order of completeness and portability.
Using the 'FFI' Module

The FFI module, by Paul Moore, uses the ffcall library, developed by Bruno Haible for the LISP language; a very powerful set of tools to interface foreign functions (functions defined in external libraries). The internals of the library are, obviously, quite complex. It uses a lot of assembler code to deal with the way data is passed to and from functions at a very, very low level. The Perl module, fortunately, hides all this from our eyes and allows us to link with a shared library in just a few lines of code. We only need to know where (in which library) a function is located, its name, and its C prototype. This prototype then needs to be converted to a 'signature', a short string defining the function's input and output parameter types. Next we will make a little example program that links to the standard C math library and calls the cos function. First of all, we use the FFI::Library module to link to a shared library. Once we have obtained an FFI::Library object, we can import functions with the function method. We add some conditions to make our script work on both UNIX and Windows platforms. Firstly, the name of the library, which is different on the two platforms (libm.so on UNIX and MSVCRT.DLL on Windows). Secondly, the signature for the function, because we have to use different calling conventions on UNIX and Windows. We'll explain more about signatures and calling conventions below:

use FFI;
use FFI::Library;

# dynamically link to the library
$libm = new FFI::Library( $^O eq 'MSWin32' ? 'msvcrt' : 'm' );

# import the cos function
$cos = $libm->function('cos', $^O eq 'MSWin32' ? 'sdd' : 'cdd' );

# $cos is a CODE reference
$result = $cos->(2.0);
print "cos(2.0) = $result\n";
The name of the library can be expressed either in an abbreviated form or with a full pathname. In the script above, we have used m for UNIX machines, which is equivalent to the -lm switch to the C compiler: it searches the system library path (for example, /lib, /usr/lib) and adds lib and .so before and after the name, respectively. The library, in fact, resides in /lib/libm.so, so just m is enough. If a library resides in a different location, or uses a different naming scheme, its full path and filename can be specified. On Windows too, the DLL extension is automatically added, and the library is searched for in the default system path. The DLL we specified might reside, for example, in C:\WINNT\SYSTEM32\MSVCRT.DLL, so saying msvcrt is enough. Here too, if a DLL resides in a directory not searched by the system, the full path can be specified. Signatures for the functions are made up of three distinct parts:

❑ The first letter is the calling convention: this is usually c for UNIX (for cdecl) and s for Windows (for stdcall). We should be aware that the s convention requires parameters to be passed in reverse order, so we need to adjust the signature as appropriate.

❑ The second letter is the return value of the function.

❑ From the third letter on, there are the arguments that the function expects (one letter for each argument).

The C prototype for the cos function is double cos(double x), so we have a return value of d (double), and one single parameter, also d (double). This makes up our signature as cdd (or sdd for Windows).
Here is another example with this C function:

char *strncat(char *dest, const char *src, size_t n);

On UNIX machines, this needs to be translated, for use by FFI, into:

$strncat = $libc->function('strncat', 'cpppI');
The cpppI signature tells us that the function uses c calling convention, returns a pointer (p), and accepts two pointers and an unsigned integer (I). On Windows, the correct signature is:
$strncat = $libc->function('strncat', 'spIpp');
The calling convention is s, the return value is a pointer as before, but the parameters are now in reverse order. The next example shows how to use a callback with FFI, to enumerate the window handles on a Windows machine. A callback is defined using the FFI::callback function, which takes a signature and a code reference containing the body of the callback. When passing the callback to a C function, we need to get its address using the addr method:

use FFI;
use FFI::Library;

my $user32 = new FFI::Library('user32');

my $enumwindows = $user32->function(
    'EnumWindows',
    'siLp' # note reverse order here!
);

my @windows;
my $cb = FFI::callback('siLL', sub {
    my($hwnd, $lparam) = @_;
    push(@windows, $hwnd);
    return 1;
});

$enumwindows->($cb->addr, 100);
print join(', ', @windows), "\n";
For more information, and additional tips, tricks, and type definitions, refer to the pod documentation included in the module. One final note for Windows users: building the module may require some little hacks to the included ffcall library, but Paul Moore also provides a binary distribution for Perl 5.6.0, which can be installed using ppm:

> ppm install --location=http://www.perl.com/CPAN/authors/id/P/PM/PMOORE FFI
Using the 'C::DynaLib' Module

The C::DynaLib module, written by John Tobey, is very similar to FFI. The major differences are that it does not use the ffcall library, and it is somewhat more limited: only a small number of callbacks can be defined in one go, and it has UNIX support only (it may be possible to build it on a Windows machine with some hacking, but this had not been done at the time of writing). The usage differs only slightly from what we've seen in the previous section. This is a rewrite of call_cos.pl using C::DynaLib:

use C::DynaLib;

# dynamically link to the library
$libm = new C::DynaLib('m');

# import the cos function
$cos = $libm->DeclareSub('cos', 'd', 'd');

# $cos is a CODE reference
$result = $cos->(2.0);
print "cos(2.0) = $result\n";
The syntax is different, but the principle is just the same. For more information consult the module's documentation.
Using the 'Win32::API' Module

The Win32::API module is written by Aldo Calpini. As its name implies, this module works on the Windows platform only. Furthermore, it uses assembler calls that are specific to the Microsoft Visual C++ compiler, so building it with a different environment may be difficult. Also, this module does not support callbacks, so only simple function calls can be made. Here is another rewrite of our call_cos.pl sample using Win32::API:

use Win32::API;

# import the cos function
$cos = new Win32::API('msvcrt', 'cos', 'D', 'D');

# $cos is an object, use the Call() method on it
$result = $cos->Call(2.0);
print "cos(2.0) = $result\n";
Again, the principles are just the same, and other information can be found by looking at the module's documentation.
Using Perl from C

Using Perl from C is a technique usually referred to as embedding Perl. This means taking the Perl kernel and embedding it into a C program. There are several different forms of embedding, depending on which functionality of Perl we want to include in our program. We can use, for example, Perl's arrays and hashes as C structures, or we can write Perl code snippets to create and modify them. We can use external script files that our application will interpret, or we can store the Perl code we want to be evaluated as strings inside our C program. In this section, we will see some basic examples of this technique at work. For a more in-depth coverage of the details involved in programming with Perl internals, refer to the same documents that were mentioned in the first section. In most regards, embedding and extending (writing XS extensions) are two sides of the same coin. Knowledge of Perl's internal structure and its C APIs is required for both tasks. More specific documentation can be found in the perlembed manual page, and by downloading the ExtUtils::Embed module from CPAN. The package contains additional information and examples, which are not included in the standard ExtUtils distribution.
First Steps

Writing a C program which makes use of Perl requires, in addition to writing proper C code, a particular build procedure to correctly generate an executable. The compiler must get the same defines, includes, and library paths that were used to build the Perl binary. We must also link the object with the Perl library, which contains the code for the Perl internals we're going to use. This means, of course, that if we're using this technique, we absolutely need to have a Perl binary that was built on the local machine. The build needs to find the required libraries and include files where Perl searches for them. In other words, the configuration saved in Config.pm (which was explained in Chapter 20) needs to be the actual configuration of the Perl build in use.
If our Perl was built on a different system and installed as a binary, we would probably be unable to write C programs with embedded Perl. So, the first requirement is to download the Perl source tree, and build it on the system we're going to program. Even when we have successfully built our own Perl, tracing back all the information our compiler needs would not be easy without a little help. Fortunately, we have it – there is a specific module, called ExtUtils::Embed, which deals with all the gory details of finding the correct parameters to satisfy the compiler and linker.
Building a Program (The Hard Way)

Just to better understand what is going on, before taking the easier route, we will take the long, hard one and try to deduce all the needed parameters from Perl itself, using the -V switch. The values shown here are just examples; they are dependent on a system's configuration (such as OS and C compiler) and may, of course, be different. First of all, let's find our compiler and linker:

> perl -V:cc
cc='gcc';
> perl -V:ld
ld='gcc';

This tells us that we're using the GNU C compiler, which is what we will use to build our Perl-embedded programs. Now, let's find the path to the include files and libraries:

> perl -V:archlib
archlib='/usr/local/perl-5.6/lib/5.6.0/i686-linux';

We'll need to add /CORE to this path, since the standard installation of Perl puts the files there. Now let's find the name of the Perl library which we'll link to, as well as any other library originally used to build Perl:

> perl -V:libperl
libperl='libperl.a';
> perl -V:libs
libs='-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt';

Finally, we need the flags that we must pass to the compiler and the linker:

> perl -V:ccflags
ccflags='-fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64';
> perl -V:ldflags
ldflags=' -L/usr/local/lib';

We put all this information together to obtain a command line, which we can use to build our program. Only the name of the source file (we have just called this program.c in the following command) needs to be added to this command, and it will correctly compile with embedded Perl:
> gcc program.c -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 \
    -I/usr/local/perl-5.6/lib/5.6.0/i686-linux/CORE \
    /usr/local/perl-5.6/lib/5.6.0/i686-linux/CORE/libperl.a \
    -L/usr/local/lib -lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt
Building a Program (The Easy Way)

What we've just seen is quite complex and somewhat boring. We don't want to remember all this four-line stuff when compiling a program, which is why we use ExtUtils::Embed to reduce everything to this single command:

> perl -MExtUtils::Embed -e ccopts -e ldopts

With the ccopts and ldopts functions, the module prints out more or less the same command line we obtained before (with the addition of the DynaLoader library, which is needed to use XS extensions from C):

-rdynamic -L/usr/local/lib /usr/local/perl-5.6/lib/5.6.0/i686-linux/auto/DynaLoader/DynaLoader.a -L/usr/local/perl-5.6/lib/5.6.0/i686-linux/CORE -lperl -lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/local/perl-5.6/lib/5.6.0/i686-linux/CORE

This one-liner is primarily intended to be used (in the UNIX shell) inside backticks, to have the output appended to the compilation command line like this:

> gcc -o program program.c `perl -MExtUtils::Embed -e ccopts -e ldopts`

This is the normal procedure used to compile a C program with embedded Perl. But, if even this command line seems too long, or the backticks trick does not work (for example, on Windows machines, where the standard shell has some severe limitations), there is another, possibly simpler, approach provided by ExtUtils::Embed. The distribution from CPAN comes with a script called genmake, which produces a Makefile that can be used to build the executable. The procedure is:

> perl genmake program.c
Writing Makefile for program
> make

genmake uses the ExtUtils::MakeMaker module, tweaking it from its original intent of producing Makefiles for Perl modules into producing one for a C program.
Implementing the Perl Interpreter

Now that we've prepared the ground, let's see what is needed to actually write the C program. To start, some header files need to be included:

#include <EXTERN.h>
#include <perl.h>
We have already seen these two files included in the first section. We can skip the third, XSUB.h, as long as we don't plan to use XS extensions from our code (we'll cover this later). Then, we need to implement an interpreter, which is an instance of Perl that is kept in a PerlInterpreter structure. We initialize it with the functions:

PerlInterpreter *my_perl;

my_perl = perl_alloc();
perl_construct(my_perl);
Once the my_perl variable has been allocated and constructed, it can be fed to other functions which actually do the Perl job:

perl_parse(my_perl, NULL, argc, argv, env);
perl_run(my_perl);
This really explains the double nature of Perl as a compiler and an interpreter. First, the code is parsed and internally compiled by perl_parse. Then, it is run by perl_run. The parameters to perl_parse are, in order:

❑ The pointer to the Perl interpreter.

❑ A pointer to initialized XSUBs. (See the 'Using Modules' section later in this chapter for this; we'll use NULL for now.)

❑ argc, argv, and env like the ones C's main function gets.

If the env parameter is NULL, the current environment is passed to the interpreter. After our run, we clean up with the following two functions:

perl_destruct(my_perl);
perl_free(my_perl);
One thing to note about the perl_parse function is that it is, by itself, an implementation of the Perl executable that we're familiar with. In fact, it accepts all the switches and options perl does, in argc and argv. By simply putting it in a main function and passing the relevant arguments, exactly the same functionality as the Perl binary is implemented. Of course, we can also provide our own arguments to it. This little program, called pversion.c, prints out the current version of Perl (as perl -v does):

#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    PerlInterpreter *my_perl;
    /* we pass a dummy argv[0] */
    char *switches[] = {"", "-v"};

    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 2, switches, (char**)NULL);
    perl_run(my_perl);
    perl_destruct(my_perl);
    perl_free(my_perl);
}
We can compile our program with one of the two procedures we've seen in the last section:

> gcc -o pversion pversion.c `perl -MExtUtils::Embed -e ccopts -e ldopts`

or by using the genmake script:

> perl genmake pversion.c
> make

Both procedures (remember, only the second option is available on Windows machines) generate an executable called pversion that will produce, for example, this output:

> pversion
This is perl, v5.6.0 built for i686-linux

Copyright 1987-2000, Larry Wall

Perl may be copied only under the terms of either the Artistic License, or the GNU General Public License, which may be found in the Perl 5.0 source kit.

Complete documentation for Perl, including FAQ lists, should be found on this system using 'man perl' or 'perldoc perl'. If you have access to the Internet, point your browser at http://www.perl.com/, the Perl Home Page.

If a filename is hard coded in the switches, Perl will execute the script by that name, just as the executable does. We can even put a -e switch followed by code there:

/* shorter version */
/* note we have to escape quotes and backslashes! */
char *switches[] = {"", "-e", "print \"$]\\n\";"};

/* later, we'll call perl_parse with 3 arguments this time */
perl_parse(my_perl, NULL, 3, switches, (char**)NULL);
Note that this calling sequence (perl_alloc, construct, parse, run, destruct, and free) is always needed, even if there is nothing to pass to perl_parse. We can't simply pass a blank argument list, because this corresponds to calling the Perl executable without arguments, so it will read code from standard input, which is probably not what we want. To have the interpreter ready to run code, we usually pass -e 0 on its command line, which results in perl_parse doing nothing. Here, we have a program named do_nothing.c, which shows this:

#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    PerlInterpreter *my_perl;
    /* 'void' command line */
    char *switches[] = {"", "-e", "0"};

    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 3, switches, (char**)NULL);
    perl_run(my_perl);

    /* my_perl is now ready to receive orders */

    perl_destruct(my_perl);
    perl_free(my_perl);
}
We'll use this skeleton for the next samples, whenever we need a bare implementation of the Perl interpreter.
Embedding Perl Code

We have many ways of embedding Perl code. As said above, the interpreter can execute the script contained in a file. This is a convenient approach, for example, when we want to change some portions of an application without the need to rebuild the executable. This can be seen as having Perl plug-ins added to a C application. The drawback to this is that the script, or scripts, need to be shipped together with the application, which is something we may not want to do. Being normal scripts, anybody can open them and mess with the source code, which could make our application unusable. We can decide that we want to properly 'embed' Perl code in our application instead of depending on external files. We have already seen that the -e switch can be used to pass code directly to the Perl interpreter, but there is also a more convenient, and efficient, way: the eval_pv function, which works like Perl's built-in eval. We take our last sample from the previous section and make it print Hello, world. This program is called phello.c:

#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    PerlInterpreter *my_perl;
    char *switches[] = {"", "-e", "0"};

    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 3, switches, (char**)NULL);
    perl_run(my_perl);

    eval_pv("print \"Hello, world\\n\";", 0);

    perl_destruct(my_perl);
    perl_free(my_perl);
}
eval_pv takes a string containing Perl code, and a flag called croak_on_errors (the zero above). If the flag is set, and the evaluated code produces an error, Perl will die with its own error message (and of course, the C program dies with it). If it is zero, it acts just like eval does, ignoring fatal errors and setting the $@ variable (which can be accessed in C under the name ERRSV). We show how error handling works with this code:

/* 'safe' eval */
eval_pv("a deliberate error", 0);
if(SvTRUE(ERRSV)) {
    printf("eval_pv produced an error: %s\n", SvPV(ERRSV, PL_na));
}

/* 'unsafe' eval */
eval_pv("a deliberate error", 1);

/* this is never reached */
printf("too late, we died...\n");
In the first part, we use the SvTRUE function to check if the global variable ERRSV is true. If it is, then the evaluated block produced an error (and we report the error message by getting the PV stored in ERRSV). In the second part, the error simply causes the program to end.
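The two modes mirror a familiar C error-handling split: report failure through a status code plus a readable message (the $@/ERRSV style), or terminate on the spot (the croak style). A purely illustrative sketch, with our own names and an ordinary division function standing in for the evaluated Perl code:

```c
#include <stdio.h>
#include <stdlib.h>

static const char *errsv = "";   /* stand-in for Perl's $@ / ERRSV */

/* 'Safe' style, like eval_pv(..., 0): the caller checks the status
 * and reads the message; the program survives the error. */
int safe_div(int a, int b, int *out) {
    if (b == 0) {
        errsv = "Illegal division by zero";
        return 0;
    }
    errsv = "";
    *out = a / b;
    return 1;
}

/* 'Croaking' style, like eval_pv(..., 1): the error is fatal. */
int croaking_div(int a, int b) {
    if (b == 0) {
        fprintf(stderr, "Illegal division by zero\n");
        exit(1);
    }
    return a / b;
}
```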
Getting Perl Values

The value of defined Perl variables can be obtained using the get_* family of functions. We have get_sv for scalars, get_av for arrays, and get_hv for hashes. This little program, called dump_inc.c, gets the @INC array using get_av, and dumps its contents using av_len to get the array length, and av_fetch to fetch values from it in a loop:

#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    PerlInterpreter *my_perl;
    AV* inc;
    SV** value;
    int i;
    char *switches[] = {"", "-e", "0"};

    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 3, switches, (char**)NULL);
    perl_run(my_perl);

    inc = get_av("INC", 0);
    if(NULL != inc) {
        for(i = 0; i <= av_len(inc); i++) {
            value = av_fetch(inc, i, 0);
            if(NULL != value) {
                printf("%s \n", SvPV(*value, PL_na));
            }
        }
    }

    perl_destruct(my_perl);
    perl_free(my_perl);
}
The combination of eval and get functions is a powerful means to integrate Perl functionalities into a C program. For example, this function performs a pattern substitution on a string: bool substitute(char *string[], char *from, char *to, char *flags) { char *default_flag = "g"; SV* instruction; SV* result;
Chapter 21
    if(flags == NULL) flags = default_flag;

    /* note we use an SV* to evaluate */
    instruction = newSVpvf(
        "($string = '%s') =~ s/%s/%s/%s",
        *string, from, to, flags
    );

    /* evaluate the code */
    eval_sv(instruction, 0);

    /* oops, something went wrong... */
    if(SvTRUE(ERRSV)) return 0;

    /* $string now contains the new string, let's get it */
    result = get_sv("string", 0);

    /* and give it back to C */
    *string = SvPV(result, PL_na);
    return 1;
}
The function can be used comfortably from C, as shown in foobar.c below:

#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    PerlInterpreter *my_perl;
    char *string = "foo is foo";
    char *switches[] = {"", "-e", "0"};

    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 3, switches, (char**)NULL);
    perl_run(my_perl);

    printf("string is '%s'\n", string);
    substitute(&string, "foo", "bar", NULL);
    printf("string is now '%s'\n", string);

    perl_destruct(my_perl);
    perl_free(my_perl);
}
The program performs the Perl instruction s/foo/bar/g on our string, and correctly outputs:

string is 'foo is foo'
string is now 'bar is bar'
Using Perl Subroutines

Calling a Perl subroutine instead of evaluating pieces of code requires some additional steps. Parameters can't simply be passed from C to a Perl subroutine and vice versa; we have to perform a series of calls to set up the stack, and to push and pop parameters to and from it.
This is the usual procedure to call a Perl subroutine from C, supposing our sub is called foo:

dSP;                    /* declare the stack */
int count, i;
char *temp;

ENTER;                  /* prepare the stack */
SAVETMPS;
PUSHMARK(SP);
XPUSHs(sv_2mortal(newSVpv("bar", 0)));  /* push input parameters */
XPUSHi(42);
PUTBACK;                /* finalize the stack */

count = call_pv("foo", 0);  /* call the sub 'foo' */
/* count holds the number of return values */

SPAGAIN;                /* query the stack */
for(i = 0; i < count; i++) {
    temp = POPp;        /* pop a string */
}
PUTBACK;                /* clean up */
FREETMPS;
LEAVE;
We have used call_pv to call a named subroutine, which could have been defined in an external Perl file, or built with an eval_pv statement. But if we don't want to pollute Perl's namespace, we can also use a reference to an anonymous subroutine. In the next example, we take the commify sub from the Perl FAQ (this sub formats a number with commas to separate thousands, for example 123456 will become 123,456) and store it in an SV* in our C program:

static SV* commify_sub;

void create_sub()
{
    commify_sub = eval_pv(
        "sub {local $_ = shift; \
        1 while s/^(-?\\d+)(\\d{3})/$1,$2/; \
        return $_;}", 0);
}
Now that we have obtained commify_sub, we can use call_sv to actually call the code reference stored in it:

int commify(char *result[], double number)
{
    dSP;
    int count;
    int success;

    ENTER;
    SAVETMPS;
    PUSHMARK(SP);
    XPUSHn(number);
    PUTBACK;

    count = call_sv(commify_sub, G_SCALAR);
    SPAGAIN;
    if(count != 1) {
        success = 0;
    } else {
        /* copy the result out of Perl's temporary storage,
           so the caller can keep (and later free) it */
        *result = strdup(POPp);
        success = 1;
    }
    PUTBACK;
    FREETMPS;
    LEAVE;
    return success;
}
Our commify function, created above, will call our globally stored commify_sub. This is now ready to be used from C:

#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    PerlInterpreter *my_perl;
    char *switches[] = {"", "-e", "0"};
    char *commified;

    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 3, switches, (char**)NULL);
    perl_run(my_perl);

    create_sub();

    if(commify(&commified, 1675002)) {
        printf("commified: %s\n", commified);
        free(commified);
    }

    if(commify(&commified, 39852.12)) {
        printf("commified: %s\n", commified);
        free(commified);
    }

    perl_destruct(my_perl);
    perl_free(my_perl);
}
This program will output, as expected, the result from the Perl commify sub:

commified: 1,675,002
commified: 39,852.12
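Before embedding it, the FAQ sub can be exercised directly in Perl; a quick sketch using the same body that is passed to eval_pv above:

```perl
# the commify sub from the Perl FAQ, stored in a code reference
my $commify = sub {
    local $_ = shift;
    1 while s/^(-?\d+)(\d{3})/$1,$2/;
    return $_;
};

print $commify->(1675002), "\n";   # 1,675,002
print $commify->(39852.12), "\n";  # 39,852.12
```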
Working with Perl Internals

Some parts of the Perl internals can also be accessed directly from C, without a line of Perl code. For example, Perl variables (scalars, arrays, and hashes) can be created and managed using Perl's C APIs.
This program, hash.c, shows how to build a hash using the newHV function: the hash is stored in an HV structure. We can then store a value in it using hv_store, and query the hash with hv_fetch. Full documentation for the C representation of Perl's variables can be found in the perlguts manual page.

#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    PerlInterpreter *my_perl;
    char *switches[] = {"", "-e", "0"};
    HV* hash;
    SV** value;

    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 3, switches, (char**)NULL);
    perl_run(my_perl);

    /* create the hash */
    hash = newHV();

    /* store the 'test' value... */
    hv_store(hash, "test", 4, newSViv(42), 0);

    /* ...and fetch it */
    value = hv_fetch(hash, "test", 4, 0);

    /* first check it is defined */
    if(NULL != value) {
        printf("hash{test}=%d\n", SvIV(*value));
    }

    perl_destruct(my_perl);
    perl_free(my_perl);
}
Instead of fetching a single key from the hash, we can iterate through its keys, as per Perl's each function, with this code:

SV* val;
char* key;
I32 keylen;

hv_iterinit(hash);
while ((val = hv_iternextsv(hash, &key, &keylen))) {
    printf("hash{%s}=%s\n", key, SvPV(val, PL_na));
}
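This is the C-level counterpart of the familiar Perl idiom:

```perl
# the same iteration written at the Perl level with each
my %hash = (test => 42);
while (my ($key, $val) = each %hash) {
    print "hash{$key}=$val\n";
}
```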
The next example, do_join.c, goes even further and shows how to call the equivalent of Perl's built-in join directly from C. This is not part of the officially exposed APIs, so this can be considered as a little hack. The prototypes for many of the Perl built-ins are contained in proto.h, which is in the source distribution.
#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    PerlInterpreter *my_perl;
    char *switches[] = {"", "-e", "0"};
    SV* joined;
    SV* stack[4];

    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 3, switches, (char**)NULL);
    perl_run(my_perl);

    /* place a dummy SV on top of the stack */
    stack[0] = &PL_sv_undef;

    /* the strings we want to join */
    stack[1] = newSVpv("one", 0);
    stack[2] = newSVpv("two", 0);
    stack[3] = newSVpv("three", 0);

    /* this will hold the result */
    joined = &PL_sv_undef;
    do_join(joined, newSVpv(", ", 0), stack, stack+3);

    printf("and the result is: '%s'\n", SvPV(joined, PL_na));

    perl_destruct(my_perl);
    perl_free(my_perl);
}
This program does the equivalent of Perl's join(', ', 'one', 'two', 'three'), producing this output:

and the result is: 'one, two, three'
Using Modules

Using a Perl module is not a problem, as long as the module is pure Perl. For example, to use the Config module, the -MConfig switch can be added to perl_parse, as shown in this code, use_config.c:

#include <EXTERN.h>
#include <perl.h>

int main(int argc, char** argv, char** env)
{
    static PerlInterpreter *my_perl;
    char *switches[] = {"", "-MConfig", "-e", "0"};
    HV* config;
    SV** osname;
    my_perl = perl_alloc();
    perl_construct(my_perl);
    perl_parse(my_perl, NULL, 4, switches, (char**)NULL);
    perl_run(my_perl);

    config = get_hv("Config", 0);
    osname = hv_fetch(config, "osname", 6, 0);
    if(NULL != osname) {
        printf("$Config{osname} = '%s'\n", SvPV(*osname, PL_na));
    }

    perl_destruct(my_perl);
    perl_free(my_perl);
}
The module could also have been accessed using this code instead: eval_pv("use Config", 0);
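For comparison, the lookup that use_config.c performs through get_hv and hv_fetch is a one-line hash access at the Perl level:

```perl
use Config;

# print the OS name Perl was built for, for example 'linux'
print "\$Config{osname} = '$Config{osname}'\n";
```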
The program will act just like Perl does: it will search for the Config.pm file in its @INC, read it, and evaluate it. This also means that it needs to find the file where it's supposed to be, so the program will not run on a machine without a proper Perl lib directory set up. As mentioned before, this works only as long as the module is made up of Perl and nothing else. If we try to use a module that uses the XS interface, it will fail miserably: we'll get an error message saying that our Perl can't do dynamic loading. To use a module that in turn loads a shared library, the DynaLoader module, which takes care of loading the library and providing hooks to its exported symbols, must have been previously initialized. To do this, we use the second parameter to perl_parse, which we've left as NULL until now. The definition for this parameter in the prototype is void (*xsinit)(void), a function which takes no arguments and returns none. This function usually just defines an XS subroutine to 'bootstrap' DynaLoader; the rest is done by the module itself:

void xs_init()
{
    char *file = __FILE__;
    newXS("DynaLoader::boot_DynaLoader", boot_DynaLoader, file);
}
We also need to provide, before this function, the prototype for the boot_DynaLoader function, defined in the module's library: void boot_DynaLoader (CV* cv);
Now xs_init can be passed to perl_parse, and we are able to use XS extensions. This sample program, named use_posix.c, calls a function contained in the POSIX module, which has an XS part:
#include "embedded_perl.h"

void boot_DynaLoader(CV* cv);

void xs_init()
{
    char *file = __FILE__;
    newXS("DynaLoader::boot_DynaLoader", boot_DynaLoader, file);
}

int main(int argc, char** argv, char** env)
{
    dSP;
    perl_prologue();

    printf("using POSIX...\n");
    eval_pv("use POSIX", 0);

    printf("calling POSIX::clock...\n");
    call_pv("POSIX::clock", G_SCALAR|G_NOARGS);
    SPAGAIN;
    printf("POSIX::clock returned %ld\n", POPi);

    perl_epilogue();
}
If the genmake script is used to do the build, a file called perlxsi.c is automatically generated and compiled together with the program. This file defines the xs_init function for us, so we only need to include its prototype before the main function: void xs_init();
The Java-Perl Lingo

The Java-Perl Lingo (or JPL for short), originally written by Larry Wall, provides interoperability between Perl and Java. It is composed of two distinct parts: one that allows Java classes and methods to be used from Perl, and another that allows Perl code to be 'embedded' into Java classes. To use this package, a JDK (Java Development Kit) must also be available on our system. This example shows the first part at work:

use JPL::Class 'java::lang::StringBuffer';
$sb = java::lang::StringBuffer->new__s("TEST");
print "Java string: ", $sb->toString____s(), "\n";
With the use JPL::Class directive, we import, in Java terms, the desired class. The class becomes a regular Perl package, exposing its methods as Perl methods. We only need to 'decorate' the names with the signature for input and output parameters. Details for this can be found in the JNI (Java Native Interface) documentation. The JPL delimiter for signatures is a double underscore, so we gather that the new method gets a string (s is a special case for the JNI Ljava/lang/String; signature) as its argument, while the toString method gets no argument and returns a string.
The next example shows how to define a Perl method inside a Java class. We simply define a Java method with the perl prefix, and put its body into double braces:

class foo {
    perl int add(int a, int b) {{
        return $a + $b;
    }}

    // ...rest of the code here
}
The file should be named foo.jpl, and it will be compiled by JPL into foo.class. The provided JPL tool performs all the necessary steps for us. The command for this is:

> jpl foo.jpl

In reality, the result will be composed of the Java class, a Perl file, and a dynamic library to provide glue between the two.
Perl and COM

ActiveState's Perl Dev Kit (or PDK for short) contains two tools that allow Perl to be used in a COM environment. The advantages of doing this, in terms of integration, are quite high on the Windows operating systems, where many programming languages and applications make use of the COM architecture. We will take just a quick look at what these tools provide; for further information, refer to the documentation included with the PDK.
PerlCOM

PerlCOM is, basically, a DLL that exposes a COM interface to a Perl interpreter. Using PerlCOM is very similar to having an embedded Perl. This time, however, it can be embedded in every programming language or application that supports COM objects, not only in C as we've seen so far. The next sample shows how to embed PerlCOM and evaluate a snippet of code inside Visual Basic:

Dim my_Perl
Set my_Perl = CreateObject("PerlCOM.Script")
my_Perl.EvalScript("print ""Hello, world!\n"" ")
The my_Perl object can also be used to access Perl functions and variables directly:

my_Perl.EvalScript("$foo = 'bar'")
MsgBox my_Perl.foo

my_Perl.EvalScript( _
    "sub hello {my($name) = @_; return 'Hello '.$name;}" _
)
MsgBox my_Perl.hello("Joe")
PerlCOM provides some additional methods for using Perl modules and datatypes, such as UsePackage and CreateHash. Here we have shown some samples using Visual Basic, but PerlCOM objects can be instantiated in a number of other environments, including Microsoft Visual C++ and J++, Delphi, VBA, and even Python.
PerlCtrl

PerlCtrl is another variation on the COM theme: it is a command line tool that takes a Perl script and turns it into an ActiveX component. This way, instead of implementing a full-blown Perl interpreter, we can implement a 'control' that exposes only the methods and properties we have defined in the appropriate type library. This requires a bit of coding in the Perl script itself, to properly define what should and should not be exposed. In a few words, the script should contain a %TypeLib hash (commented out from Perl's view with pod directives, but interpreted by the PerlCtrl program) with all the information about the COM interface we are going to expose. Here, we take the code from a sample control implementation, hello.ctrl, included with the PDK to illustrate what a PerlCtrl program looks like:

package Hello;

sub Hello {
    return "Hello, world!";
}

=pod

=begin PerlCtrl

%TypeLib = (
    PackageName => 'Hello',
    # DO NOT edit the next 3 lines.
    TypeLibGUID => '{E91B25C6-2B15-11D2-B466-0800365DA902}',
    ControlGUID => '{E91B25C7-2B15-11D2-B466-0800365DA902}',
    DispInterfaceIID => '{E91B25C8-2B15-11D2-B466-0800365DA902}',
    ControlName => 'Hello World Control',
    ControlVer => 1,
    ProgID => 'Hello.World',
    DefaultMethod => '',
    Methods => {
        'Hello' => {
            RetType => VT_BSTR,
            TotalParams => 0,
            NumOptionalParams => 0,
            ParamList => [ ]
        },
    }, # end of 'Methods'
    Properties => { }, # end of 'Properties'
); # end of %TypeLib

=end PerlCtrl

=cut
This script is simply passed to the PerlCtrl command line utility, which will generate a DLL from our code. This DLL is then registered on our system, and we can instantiate the COM object as we've seen before:

Dim my_Hello
Set my_Hello = CreateObject("Hello.World")
MsgBox my_Hello.Hello()
The same considerations made about implementing PerlCOM apply, of course, to PerlCtrl objects as well. The resulting object can be instantiated in every language or application that supports COM.
Miscellaneous Languages

So far, we have seen a number of ways to integrate other programming languages with Perl. On CPAN, however, there are also Perl modules that implement different languages, and modules that are designed to implement language parsers. The proverbial flexibility of Perl in text parsing and string handling makes it a perfect candidate for writing language interpreters. Some of them are purely academic exercises, and far from being fully functional interpreters, but nonetheless they may contain useful bits of code. CPAN offers little gems like Language::Basic by Amir Karger, Language::Prolog by Jack Shirazi, and the perl-lisp package by Gisle Aas, for example. These are all interesting efforts to implement at least subsets of other languages in Perl. Damian Conway has released a very interesting module called Lingua::Romana::Perligata. This module is an interpreter that uses the ancient Latin language as a 'dialect' in which to write Perl code. Agreed, it is unlikely that someone will actually write programs in this language, but the code that's behind it deserves a mention. The module has a very complex grammar and uses Latin inflections to distinguish, for example, hashes from scalars. It makes use of the Filter module (covered in the next section) and the Parse::Yapp module (by Francois Desarmenien). The latter is a grammar compiler, similar to the UNIX yacc tool: in a few words, it generates a parser based on user-defined rules and syntax. Another similar tool is the Parse::RecDescent module, also written by Damian Conway.
These modules are the building blocks of language design, and if anyone is serious about this topic, they definitely should learn to use them.
The Filter Module

Something that deserves a special mention is the Filter module, by Paul Marquess, which gives us the ability to turn Perl into something else. The module makes use of a very powerful feature of the Perl internals that we saw at work with the ByteLoader module in Chapter 20. The principle is very simple: a hook function is installed, and then it is called when data is read from the program source. This function can modify the data before it is passed to the Perl interpreter. This way, the source code of a program can be modified before it is parsed by Perl. This means, for example, that we can write our code in something that isn't Perl, and using an appropriate filter we can translate it back to Perl before it is interpreted. The Filter distribution, available from CPAN, comes with a module called Filter::Util::Call, which hides all the complexities of working with the Perl internals, and allows us to write a Perl filter in a few lines of code. We'll see how it works with a basic example: this filter reads its input and uses tr to shift the letters thirteen places in the alphabet, the simpleminded 'encoding' algorithm known on Usenet as rot13, and so the script is named rot13.pm:

package rot13;
use Filter::Util::Call;

sub import {
    filter_add(
        sub {
            my $status;
            tr/n-za-mN-ZA-M/a-zA-Z/
                if ($status = filter_read()) > 0;
            return $status;
        }
    );
}

1;
The import function, which is called automatically when the module is used, contains a filter_add call. It takes a code reference as its argument, which is the hook code for doing the real filtering. This filter routine takes a line from the file using the filter_read function, and uses the $status variable to check for the end of file. The filter works directly on $_, so we use tr implicitly on it. The modified $_ is then passed to the Perl interpreter. We can 'encode' a script with the rot13 algorithm using the following one-liner. The hello.pl sample script contains one line (print "Hello, world\n"):

> perl -pi.bak -e 'tr/a-zA-Z/n-za-mN-ZA-M/' hello.pl
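The same tr mapping can be wrapped in an ordinary function to show why a single filter serves for both encoding and decoding: rot13 applied twice is the identity. The helper name rot13 below is ours, not part of the Filter distribution:

```perl
# rot13 is self-inverse: applying it twice restores the original
sub rot13 {
    my $text = shift;
    $text =~ tr/a-zA-Z/n-za-mN-ZA-M/;
    return $text;
}

my $code = 'print "Hello, world\n"';
print rot13($code), "\n";          # cevag "Uryyb, jbeyq\a"
print rot13(rot13($code)), "\n";   # print "Hello, world\n"
```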
The script now contains what, at first glance, looks like funny characters and nothing more (cevag "Uryyb, jbeyq\a"). Of course, if we try to execute it, we'll get a syntax error. But it can still be executed using the filter package we've prepared, like this:

> perl -Mrot13 hello.pl
Hello, world

Although simple, this trick shows a very powerful technique. Of course, much more power can be obtained using an XS extension instead of plain Perl code. The Filter distribution contains examples of filters written in C, which may be used as a template for our own filters. Starting from version 5.6.0, the Perl documentation also includes a perlfilter manual page, which explains in more depth the concepts of source filtering and their usage.
Summary

Throughout this chapter, we have seen how Perl can interact with a number of other programming languages, in particular with C. We have looked at this interaction in two different directions: using C from Perl and using Perl from C. This is done by utilizing various tools, interfaces, and modules. As a summary, we have:

❑ Introduced the XS interface to extend Perl by making use of C.

❑ Seen how Perl can call functions residing in external libraries using dynamic loading.

❑ Seen how to embed the Perl interpreter in a C program.

❑ Reviewed integration with different languages like Java, and (on Windows) Visual Basic and other COM-aware environments.

❑ Given some pointers to useful tools when implementing language design in Perl.
Creating and Managing Processes

When a script is run, regardless of the language it is written in, there are certain features common to each execution. In this chapter, we shall be looking at some of these aspects more closely in order to gain a better understanding of processes and how they behave within a program. First we shall lay a solid foundation by understanding signals; even simple, single processes can benefit from handling them properly. Then we shall move on to the manipulation and behavior of processes, and to how they can be split up and utilized. Another important aspect of processes, much like the real-life analogy of parents and children, is the communication between them; in the later sections we expand on the use of pipes and other objects for passing data between processes. Finally we cover a major new topic that is, at present, experimental in Perl: threads. We can still use threads at this stage so long as we are careful, and threading will be built into future Perl releases, beginning with Perl 6.
Signals

Much like hardware interrupts, signals are exception events that happen independently of a process. They are even considered to be 'out of band', since they cannot be deferred for handling. Any signal sent is received by the process instantly, and must be processed immediately. Typically, our operating environment has defined default behaviors for most signals. A UNIX shell, for instance, maps some signals to key sequences, such as INT to Ctrl-C, and TSTP to Ctrl-Z. Other processes, as well as the OS kernel, have the ability to send signals to any designated process.
Chapter 22
Every process keeps an index of signals and their defined behaviors. Whenever a process receives a signal, it stops normal execution and executes the behavior defined for that signal. Depending on the signal, that behavior may take some action and return control back to the interrupted routine. Other behaviors may do some cleaning up and exit the program, while others can do nothing, effectively ignoring the signal. Regardless of the signal, Perl allows the programmer to redefine the behaviors of most signals as necessary.

The signal index is stored as the hash variable %SIG, with the key value being the signal name, and the value being either a reference to a subroutine to execute, or an appropriate return value. By assigning various values to signals in the hash, we can control how Perl and our process handle different kinds of signals. To elaborate a little on what kind of signals we have available, we can dump out a list of them by printing out the keys of the %SIG hash:

#!/usr/bin/perl
# signals.pl
use warnings;
use strict;

foreach (sort keys %SIG) {
    print "$_\n";
}
At the system level, each signal is denoted by an integer rather than by the name we know it by. As seen from the above, however, Perl does not index the signals in %SIG by their numeric values, but by name. If we need to cross-reference the numbers and names, we can discover them via the kill -l command (provided we're on a UNIX platform, of course) or through the position of the name in Perl's Config variable sig_name, which is a space-delimited string:

#!/usr/bin/perl
# siglist.pl
use warnings;
use strict;
use Config;

my @signals = split ' ', $Config{sig_name};
for (0..$#signals) {
    print "$_ $signals[$_] \n" unless $signals[$_] =~ /^NUM/;
}
This generates a list of all signals and their associated numbers, skipping over real-time signal numbers beginning NUM (in the range between RTMIN and RTMAX):

0 ZERO
1 HUP
2 INT
3 QUIT
4 ILL
5 TRAP
6 ABRT
7 BUS
8 FPE
9 KILL
10 USR1
...
Many of these signals are obscure, unlikely, or only occur if we have certain things enabled. The signal SIGPWR, for example, applies to power failure situations, and SIGIO occurs only with filehandles set to asynchronous mode using O_ASYNC. The most commonly used signals are as follows (please note that not all of the signals below work on non-UNIX platforms):

SIGHUP (HUP): Hangup. Controlling terminal or process has terminated (for example a modem, hence 'hang up'). Often redefined in daemons to tell them to re-read their configuration files and restart operations.

SIGINT (INT): Interrupt. Instructs the process to interrupt what it is doing and exit. This signal is trappable (which means we can redefine its handler) so that a process can perform any necessary clean up before doing so. Pressing Ctrl-C on the keyboard is often the manual way to send this signal.

SIGQUIT (QUIT): Quit. A higher priority signal to shut down which, depending on the system's configuration, may produce a core dump. This signal, too, is trappable.

SIGKILL (KILL): Kill. An explicit command to tell the process to immediately exit. This signal is not trappable (would we really want to take away the kernel's ability to terminate run-away processes?).

SIGUSR1 (USR1): User-defined signal. While this signal is never used by the kernel, the default behavior for any process receiving the signal is to exit.

SIGUSR2 (USR2): User-defined signal. A second signal available for the process to define. Like USR1, the default behavior is to exit.

SIGPIPE (PIPE): Broken pipe. A pipe that a process was either reading from or writing to has been closed by the other end.

SIGALRM (ALRM): Alarm. An alarm timer for this process has expired. Not supported on Microsoft platforms.

SIGTERM (TERM): Terminate. Like INT, this instructs the process to exit, and is trappable. This signal has a higher priority than INT, but lower than QUIT and KILL.

SIGCHLD (CHLD): Child exit. A child process has exited.

SIGFPE (FPE): Floating point exception.

SIGSEGV (SEGV): Invalid memory reference.

SIGCONT (CONT): Continue. Resume execution if stopped by a STOP signal.

SIGSTOP (STOP): Stop. The process is halted until it receives a CONT, at which point it resumes operation. This signal is not trappable.

SIGIO (IO): Asynchronous IO event. If a filehandle is set for asynchronous operation (O_ASYNC), this signal is raised whenever an event (for example, more input) occurs on it.

SIGWINCH (WINCH): Window changed. The window or terminal in which the console of a process is running has changed size. Chapter 15 discusses this in more detail.
For a complete list of signals, consult system documentation (> man 7 signal on UNIX), but be aware that even on UNIX, some signals are platform dependent. The same documentation should inform us of the default behavior of each signal. Regardless, by manipulating the contents of the %SIG hash we can override these defaults for most signal types, and install our own mechanisms for handling signals.
Signal Handling

A string or subroutine reference may be set as a value of the %SIG hash to control what happens when a given signal is received:

DEFAULT or undef: Performs the default behavior as determined by the system.

IGNORE: Instructs the process to take no action in response to the signal. Untrappable signals (such as KILL) are unaffected by this setting.

\&subreference: If a subroutine reference is set as the value of a signal, then it is called whenever that signal is received. We can then decide how to handle it ourselves, including ignoring it, raising a different signal, dying, and so on.

'subroutine': If the name of a subroutine is set as a string, then &main::subroutine is called when the signal is received.

For example, to ignore SIGPIPE signals, we would put:

$SIG{PIPE} = 'IGNORE';
Creating and Managing Processes
To restore the default handling for SIGINT signals: $SIG{INT} = 'DEFAULT';
To find out the current setting of SIGALRM (remembering that undef is equal to DEFAULT): $alarming = $SIG{ALRM};
To set a subroutine as a signal handler: $SIG{USR1} = \&usr1handler;
The last of these is of course the most interesting. Signal handlers, when called, receive the name of the signal that called them as an argument, so we can assign multiple signals to the same handler, or handle each signal individually. $SIG{HUP} = \&handler; $SIG{STOP} = \&handler; $SIG{USR1} = \&handler;
If we set a text string that is not DEFAULT or IGNORE then it is taken as a symbolic reference to a subroutine in the main (not the current) package. So be careful about spelling. Thus: $SIG{INT} = 'DEFLAT';
actually means: $SIG{INT} = \&main::DEFLAT;
This will silently fail unless we are using the -w flag with Perl, in which case it will merely complain on STDERR that the handler DEFLAT is undefined. Although this is a perfectly legal way to set a signal handler, the fact that it defaults to the main package can cause problems when handling signals inside packages. Note also that though this is a form of symbolic reference, it is not trapped by the use strict refs pragma. Conversely, if we try to set a signal that does not exist, Perl will complain with an error, for example: No such signal SIGFLARE at... A practical example for redefining trappable signals would be when our program creates temporary files. A well-behaved program should clean up after itself before exiting, even when unexpectedly interrupted. In the following example we will redefine the INT handler to remove a temporary PID file before exiting. This will keep the program from leaving PID files around when the user interrupts the program with a Ctrl-C: $SIG{INT} = sub { warn "received SIGINT, removing PID file and exiting.\n"; system("rm", "$ENV{HOME}/.program.pid"); exit 0; };
899
Chapter 22
The 'die' and 'warn' Handlers In addition to the standard signals, the %SIG hash also allows us to set handlers for Perl's error reporting system, specifically the warn and die functions together with derivatives of them like carp and croak (supplied by the Carp module). These are not true signal handlers but hooks into Perl's internals, which allow us to react to events occurring within Perl. The %SIG hash makes a convenient interface because, aside from some minor differences in behavior, it operates very similarly to signals. Neither the warn nor die hooks are present as keys in the %SIG hash by default – we have to add them: $SIG{__WARN__} = \&watchout; $SIG{__DIE__} = \&lastwillandtestament;
Both handlers may customize the error or warning before passing it on to a real warn or die (the action of the handler is suppressed within the handler itself, so calling warn or die a second time will do the real thing). A warning handler may choose to suppress the warning entirely; die handlers cannot, however, avert death, but only do a few things on their deathbed, so to speak. See Chapter 16 for more on die, warn, and registering handlers for them. The hook mechanism is not extensible; the __ prefix and suffix merely distinguish these special handlers from true signal handlers. They do not, however, allow us to create arbitrary signals and handlers.
Writing Signal Handlers Signal handlers are just subroutines. When they are called, Perl passes them a single parameter: the name of the signal. For example, here is a program containing a very simple signal handler that raises a warning whenever it receives an INT signal, but otherwise does nothing with it: #!/usr/bin/perl # inthandler1.pl use warnings; use strict; sub handler { my $sig = shift; print "Caught SIG$sig! \n"; } # register handler for SIGINT $SIG{INT} = \&handler; # kill time while (1) { sleep 1; }
Note that since the handler, for all intents and purposes, handled the exception, the program does not exit. If we still want the handler to exit after performing whatever other actions we need done, we must add the exit command to the end of our handler. Here is another handler that implements this scheme, with a private counter. It will catch the first two signals, but defer to the normal behavior on the third reception: #!/usr/bin/perl # inthandler2.pl use warnings; use strict;
Creating and Managing Processes
{ # define counter as closure variable my $interrupted = 0; sub handler { foreach ($interrupted) { $_ == 0 and warn("Once..."), $interrupted++, last; $_ == 1 and warn("Twice..."), $interrupted++, last; $_ == 2 and die ("Thrice!"); } } } # register handler for SIGINT $SIG{INT} = \&handler; # kill time while (1) { sleep 1; }
A few platforms (BSD systems, typically) cancel signal handlers once they have been called. On these systems (and if we want to be maximally portable) we have to reinstall the handler before we exit it, if we wish to call it again: sub handler { my $sig = shift; # reinstate handler $SIG{$sig} = \&handler; ... do stuff ... }
To prevent another signal from slipping in before the handler has been reinstated, we should reinstall the handler as the first thing the handler does. Since this does no harm even on platforms that do not need it, it is a good piece of defensive programming if we are worried about portability.
Avoid Complex Handlers The above handlers are good examples of signal handlers in that they do very little. By their nature, signals herald a critical event. In many cases, that condition means that executing complex code, and especially anything that causes Perl to allocate more memory to store a value, is a bad idea. For example, this signal counting handler is not a good idea because it allocates a new key and value for the hash each time a signal it has not previously seen arrives. my %sigcount; sub allocatinghandler { my $sig = shift; $sigcount{$sig}++; }
Chapter 22
This is fine though, as we have guaranteed that all the keys of the hash already exist and have defined values: my %sigcount = map { $_ => 0 } keys %SIG; sub nonallocatinghandler { my $sig = shift; $sigcount{$sig}++; }
Regardless, there is one rule of thumb that must be adhered to wherever possible: do the absolute minimum necessary in any signal handler. If we do not, and another signal comes in while we are doing some kind of extensive processing, our program may suffer random core dumps. Perl, like many low-level system calls, is not re-entrant at the lowest levels.
Un-interruptible Signal Handlers Unlike the warn and die hooks, real signal handlers do not suppress signals while they are running, so if we want to avoid being interrupted a second time while we are still handling a signal, we have to find a way to avoid further signals. One way to do this is simply to disable the handler for the duration of the handler: sub handler { my $sig = shift; $SIG{$sig} = 'IGNORE'; ... do something ... $SIG{$sig} = \&handler; }
A better way is to localize a fresh value for the signal's entry in %SIG using local. This has the same effect as reassigning it explicitly, but with the advantage that the old value is restored immediately on exiting the handler without our intervention (note that inside the braces of a hash subscript, shift would be taken as the literal string key 'shift', so we assign it to a variable first): sub handler { my $sig = shift; local $SIG{$sig} = 'IGNORE'; ... do something... }
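To see the restoration in action, here is a small self-contained sketch (it assumes a POSIX-style system where a process may deliver a signal to itself with kill): the handler localizes its own %SIG entry to 'IGNORE', and the original handler reference reappears as soon as the handler returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub handler {
    my $sig = shift;
    # further signals of this type are ignored only while we are in here
    local $SIG{$sig} = 'IGNORE';
    print "handling SIG$sig\n";
}   # 'local' restores the original handler automatically at this point

$SIG{INT} = \&handler;
kill INT => $$;    # deliver a SIGINT to ourselves to trigger the handler
print ref $SIG{INT} ? "handler restored\n" : "handler lost\n";
```

After the handler has run, $SIG{INT} once again holds the code reference we installed, with no explicit reassignment on our part.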
We can suppress signals in normal code using the same principles, for instance by temporarily reassigning a signal to a new handler or value. We can do that by either making a record of the old one, or using local to suppress it if we happen to be in a section of code that is scoped appropriately: $oldsig = $SIG{INT}; $SIG{INT} = 'IGNORE'; ... code we do not want interrupted ... $SIG{INT} = $oldsig;
As always, these techniques only work on trappable signals.
Aborting System Level Operations On many versions of UNIX, as well as a few other platforms, signals that occur during some system calls, and in particular during input and output operations, may cause the operation to restart on return from the signal handler. Frequently we would rather abort the whole operation at this point, since resuming is likely to be either pointless or plain wrong. Unfortunately the only way to abort the interrupted code from inside a signal handler is die or CORE::exit. Moreover, to be able to resume normal execution at a point of our choice rather than jump into a die handler or exit the program, we have to put the die (or rather, the context of the die) inside an eval, since that will exit the eval and resume execution beyond it. So the code we want to abort must all be inside an eval: sub handler { my $sig = shift; $SIG{$sig} = 'DEFAULT'; die; } $result = eval { $SIG{INT} = \&handler; ...read from a network connection... $SIG{INT} = 'DEFAULT'; 1; # return true on completion }; warn "Operation interrupted! \n" unless $result;
If the code in the eval completes successfully it returns 1 (because that is the last expression in the eval). If the handler is called, the die causes the eval to return undef. So we can tell from the return value of eval whether the handler was called, and therefore whether the code was interrupted. We can vary this theme a little if we want to return an actual result from the eval; so long as we do not need to return a value that is legitimately false, we can always use this technique. Note that even though the die itself is not in the eval, the context in which it is called is the eval's context, so it exits the eval, not the program as a whole. This approach also works for setting and canceling alarms to catch system calls that time out; see 'Alarms' later for more.
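Here is a minimal, runnable sketch of this pattern, using a self-delivered SIGUSR1 to stand in for a real interrupt arriving during a slow operation (SIGUSR1 is assumed to be available, so this is POSIX-specific):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $result = eval {
    local $SIG{USR1} = sub { die "interrupted\n" };
    kill USR1 => $$;    # simulate a signal arriving mid-operation
    sleep 10;           # the 'operation' we want to be able to abort
    1;                  # return true on completion
};
if ($result) {
    print "operation completed\n";
} else {
    print "operation aborted: $@";
}
```

The die inside the handler unwinds straight out of the eval, so execution resumes at the test of $result rather than terminating the program.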
Flexibly Installing Signal Handlers
We have already seen how to install a signal handler by hand; simply set the value of the relevant signal in the signal hash: $SIG{INT} = \&handler;
While fine for a single signal, this is cumbersome for handling many of them. One way to set up multiple signals is to assign a new hash to %SIG, for example: %SIG = (%SIG, INT => 'IGNORE', PIPE => \&handler, HUP => \&handler);
There is, however, a better way, with the sigtrap pragmatic module. This takes a list of signal actions and signals and assigns each signal the action that immediately preceded it. sigtrap provides two handlers of its own; a stack-trace handler, the default action, and die, which does what it implies. It also provides several keywords for common groups of signals, as well as a keyword for currently unassigned signals.
The default action is stack-trace, so the following three pragmas all have the same effect; normal-signals is the group comprised of the SIGINT, SIGHUP, SIGPIPE, and SIGTERM signals: use sigtrap qw(INT HUP PIPE TERM); use sigtrap qw(stack-trace INT HUP PIPE TERM); use sigtrap qw(stack-trace normal-signals);
Alternatively we can choose to set a die handler. Here are two examples of using die with the signals that sigtrap categorizes under error-signals: use sigtrap qw(die ABRT BUS EMT FPE ILL QUIT SEGV SYS TRAP); use sigtrap qw(die error-signals);
We can also supply our own handler, by prefixing it with the keyword handler: use sigtrap qw(handler myhandler ALRM HUP INT);
If we want to be sure the handler exists before we install it we can drop the qw and use a sub reference in a regular list: use sigtrap (handler => \&myhandler, qw(ALRM HUP INT));
We may assign different handlers to different signals all at the same time; each signal is assigned the handler before it in the list. The signals at the front are assigned stack-trace if the first item in the list is not die or a handler of our own devising: use sigtrap qw( stack-trace normal-signals ALRM USR2 die error-signals handler usrhandler USR1 USR2 die PWR handler inthandler INT HUP );
We can specify as many handlers and signals as we like. In addition, later assignments supplant earlier ones, so handler inthandler INT HUP replaces the assignment to stack-trace of these signals in the first line (in the guise of normal-signals). If we want to assign a handler to all signals only if they have not already been assigned or ignored, we can precede the signals we want to trap conditionally with the untrapped keyword. For example, to call the stack-trace handler for normal signals not already trapped: use sigtrap qw(stack-trace untrapped normal-signals);
The opposite of untrapped is any; this cancels the conditional assignment of untrapped: use sigtrap qw(stack-trace untrapped normal-signals any ALRM USR1 USR2);
If sigtrap is not passed any signal names at all, it defaults to a standard set that was trapped by previous incarnations of the module. This list is also defined as old-interface-signals. The following are therefore equivalent:
use sigtrap qw(stack-trace old-interface-signals); use sigtrap;
Sending Signals The traditional UNIX command (and C system call) for sending signals is kill, a curious piece of nomenclature that comes about from the fact that the default signal sent by the kill command was 15 (SIGTERM), which causes the program to exit. Despite this, we can send any signal using Perl's kill function. The kill function takes at least two arguments. The first is a signal number or a signal name given as a string. The second and following arguments are the IDs of processes to kill. The return value from kill is the number of processes to which a signal was delivered (which, since they may ignore the signal, is not necessarily the number of processes that acted on it): # tell kids to stop hogging the line kill 'INT', @mychildren; # a more pointed syntax kill INT => @mychildren, $grandpatoo; # commit suicide (die would be simpler) kill KILL => $$; # $$ is our own process ID kill (9, $$); # specified numerically # send our parent process a signal kill USR1 => getppid;
Sending a signal to a negative process ID will send it to the process group of that process (including any child processes it may have, and possibly the parent process that spawned it too, unless it used setpgrp). For example, this instruction will send an HUP signal to every other process in the same process group: kill HUP => -$$;
The HUP signal in particular is useful for a parent process to tell all its children to stop what they are doing and re-initialize themselves. The Apache web server does exactly this in forked mode (which is to say, on UNIX but not Windows) if we send the main process an HUP. Of course, the main process does not want to receive its own signal, so we would temporarily disable it: sub huphandler { local $SIG{HUP} = 'IGNORE'; kill HUP => -$$; }
The signal 0 (or ZERO) is special. It does not actually send a signal to the target process or processes at all, but simply checks that those process IDs exist. Since kill returns the number of processes to which a signal was successfully sent, for signal 0 it reports the number of processes that exist, which makes it a simple way to test if a process is running: warn "Child $child is dead!" unless kill 0, $child;
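Putting signal 0 to work, the following self-contained sketch forks a short-lived child, confirms that it is alive, then terminates and reaps it and confirms that it is gone (POSIX process semantics assumed):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pid = fork;
die "Failed to fork: $!\n" unless defined $pid;
unless ($pid) {
    sleep 10;    # child: just linger until told otherwise
    exit 0;
}
print "child $pid is alive\n" if kill 0, $pid;
kill TERM => $pid;    # ask the child to go away
waitpid $pid, 0;      # collect its exit status
print "child $pid is gone\n" unless kill 0, $pid;
```

Once the child has been reaped, the process ID no longer exists and the signal-0 probe reports zero processes.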
Alarms An alarm is a particularly useful kind of signal that is issued whenever an internal timer counts down to zero. We can set an alarm with the alarm function, which takes an integer number of seconds as an argument: # set an alarm for sixty seconds from now alarm 60;
If the supplied number is zero, the previous alarm, if any, is cancelled: # cancel alarm alarm 0;
Setting the alarm will cause the process to exit (as per the default behavior of the signal), unless we also set a handler for the signal: sub alarmhandler { print "Alarm at ", scalar(localtime), "\n"; } $SIG{ALRM} = \&alarmhandler;
Please note that specifying a time interval does not mean that the timer will raise a SIGALRM in exactly that interval. It means that some time after that interval, depending on the resolution of the system clock and on when our process is next scheduled to run, a signal will be sent.
A Simple Use of Alarms We can only ever have one alarm active at a time per process, so setting a new alarm value resets the timer on the existing alarm, with zero canceling it as noted above. Here is a program that demonstrates a simple use of alarms, to keep re-prompting the user to input a key: #!/usr/bin/perl # alarmkey.pl use strict; use warnings; use Term::ReadKey; # Make read blocking until a key is pressed, and turn on autoflushing (no # buffered IO) ReadMode 'cbreak'; $| = 1; sub alarmhandler { print "\nHurry up!: "; alarm 5; }
$SIG{ALRM} = \&alarmhandler; alarm 5; print "Hit a key: "; my $key = ReadKey 0; print "\n You typed '$key' \n"; # cancel alarm alarm 0; # reset readmode ReadMode 'restore';
We use the Term::ReadKey module to give us instant key-presses without returns, and set $| = 1 to make sure the Hurry up! prompt appears in a timely manner. In this example the alarm 0 is redundant because we are about to exit the program anyway, but in a larger application it would be necessary to stop the rest of the program from being interrupted by Hurry up! prompts every five seconds.
Using Alarms to Abort Hung Operations We can also use alarms to abort an operation that has hung or is taking too long to complete. This works very similarly to the eval-and-die example we gave earlier for aborting from a section of code rather than continuing it – this time an alarm generates the interrupt. Here is a code snippet that aborts from an attempt to gain an exclusive lock over the file associated with the filehandle HANDLE if the lock is not achieved within one minute: sub alarmhandler { die "Operation timed out!"; } sub interrupthandler { die "Interrupted!"; } $SIG{ALRM} = \&alarmhandler; $SIG{INT} = \&interrupthandler; $result = eval { # set a timer for aborting the lock alarm 60; # block waiting for lock flock HANDLE, LOCK_EX; # lock achieved! cancel the alarm alarm 0; ... read/write to HANDLE ... flock HANDLE, LOCK_UN; 1; # return true on completion }; warn $@ unless $result;
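The pattern generalizes into a small helper. This is our own sketch (the name with_timeout is invented here, not a library function): it runs an arbitrary block of code under an alarm and returns undef if the time limit expires. Mixing alarm with sleep is safe on most modern systems, though historically some sleep implementations were themselves built on alarm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub with_timeout {
    my ($seconds, $code) = @_;
    my $result = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;          # start the countdown
        my $value = $code->();
        alarm 0;                 # completed in time: cancel the alarm
        $value;
    };
    alarm 0;                     # make sure no alarm leaks out of here
    die $@ if $@ && $@ ne "timeout\n";    # re-throw genuine errors
    return $result;              # undef if we timed out
}

my $fast = with_timeout(5, sub { 42 });           # finishes immediately
my $slow = with_timeout(1, sub { sleep 3; 99 });  # overruns its limit
print defined $fast ? "fast: $fast\n" : "fast timed out\n";
print defined $slow ? "slow: $slow\n" : "slow timed out\n";
```

The first call returns its value as normal; the second is cut short by the alarm, so the helper reports a timeout instead of a result.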
The alternative approach to this is to turn the lock into a non-blocking lock, then check the return value, sleep for a short while, and then try again. The signal approach is more attractive than this solution in some ways, since it spends less time looping. It is, though, limited by the fact that we can no longer use an alarm for any other purpose until we have stopped using the alarm here for this purpose.
Starting New Processes Perl supports the creation of new processes through the fork function, which is an interface to the operating system's underlying fork function, if it provides one. Forking is a pre-UNIX concept, and is universal on all UNIX-like operating systems. Other platforms like Windows and Macintosh have chosen the threading model over the forking model, but Perl can still fork on these platforms; for those that do not support true forking, Perl emulates it closely. We can also choose to use the native thread model instead, should our Perl installation be built to support threads. See below for more information on threads. The fork function spawns a clone of the calling process, creating a child process (the original process is called the parent process). The cloning is 'complete' in that the child shares the same application code, variables, and filehandles as the parent. The typical method used to tell the difference between the parent and child is the return value from fork: for the child it is zero, and for the parent it is the process ID of the child if successful, undef if not. As a result, fork is always found in close proximity to an if statement, for example: if ($pid = fork) { print "Still in the parent process – we started a child with ", "process id $pid \n"; } else { print "This is the child process\n"; # terminate child and return success to the parent exit 0; }
What the above example does not do, though, is check the return value for a failure of the fork operation. If the fork failed entirely it would return undef and we should always check for this possibility (which might occur if the system has reached a limit on processes or does not have enough memory to create the new process). So, we can check it using defined: $pid = fork; die "Failed to fork: $! \n" unless defined $pid; if ($pid) { # parent ... } else { # child ... }
Replacing the Current Process with Another On a related topic, we can replace the current process with a completely different one by using the exec command. This is a rather dramatic function, as it replaces the current process in its entirety with the external command supplied as an argument; the replacement program even retains the process ID of the calling process: exec 'command', @arguments;
In general exec is used with fork to run an external command as a sub-process. We will show some examples of this in action a little later in the chapter. We do not need to check the return value from exec since if it succeeds we will not be there to return to. Consequently the only time that we will continue past an exec is if it fails: # hand over to the next act exec @command; # if we get here the exec must have failed die "Exec failed: $!\n";
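Putting fork and exec together, here is a minimal runnable sketch of the classic fork-and-exec sequence. To keep it self-contained it execs the perl interpreter itself (via $^X) as the 'external command':

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pid = fork;
die "Failed to fork: $!\n" unless defined $pid;
unless ($pid) {
    # child: replace ourselves with an external command (list form, no shell)
    exec $^X, '-e', 'exit 7';
    die "Exec failed: $!\n";    # only reached if the exec itself fails
}
waitpid $pid, 0;                # parent: collect the child's status
printf "child exited with code %d\n", $? >> 8;
```

The parent carries on past the fork, while the child is wholly replaced by the execed program; the exit code 7 reappears in the parent via $?.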
Several of Perl's built-in functions perform automatic fork-and-execs, including the system and backtick functions, and certain variants of open. We cover all of these later on in this chapter.
Process IDs The process ID of the current process can always be found from the special variable $$ ($PID or $PROCESS_ID with the English module): print "We are process $$\n";
A child process can find the process ID of the parent process with the getppid function (not implemented on Win32): $parent = getppid; print "Our parent is $parent \n";
This allows us to use kill to send a signal to the parent process: kill "HUP", $parent;
On Win32, there is one caveat to be aware of: handles (the Win32 equivalents of PIDs) are unsigned 32-bit integers, while Perl's 32-bit integers are signed. Due to this, PIDs are occasionally interpreted as negative integers. Any signal sent to a negative integer goes to the entire process group, not just a specific process; process groups are explained in more detail below.
Processes, Groups, and Daemons Whenever a parent process uses fork (either directly or implicitly) to create a child, the child inherits the process group of the parent. Process groups become significant when signals are sent to a group rather than a single process. If the parent is intended to be an independent process then it may have its own group ID (which is generally the same as its process ID), otherwise it will have inherited the process group from its own parent.
We can find the process group of a process by using the getpgrp function (like getppid, this is not implemented on Win32), which takes a process ID as an argument. To find our own process group, we can write: $pgid = getpgrp $$;
We can also supply any false value, including undef, to get the process group for the current process. This is generally a better idea because not all versions of getpgrp are the same, and the only value they have in common is 0. Since Perl maps directly to the underlying function, it means only 0 is truly portable: $pgid = getpgrp 0; $pgid = getpgrp;
In practice we usually do not want to find out the process group, we already know the child will have the same group as the parent. What we often do want to do is change the group, which we can do with setpgrp, which takes a process ID and a process group ID as arguments: setpgrp $pid, $pgid;
In practice the only process we want to change the group for is our own – if we are a child process and we want to divorce ourselves from the fate of our parent. Likewise, while we can in theory join any group, we generally want to create our own group. Process group IDs are just numbers, and usually have the same value as the process ID that owns the group; that is, there is usually one process whose process ID and process group ID are the same. Children of that process which have not changed their group have the same process group ID. To put ourselves in our own group, therefore, we simply need to give ourselves a process group ID that is equal to our process ID: setpgrp $$, $$;
Again, we can use a false value for both IDs to default to the current process ID, so we can also say: setpgrp 0, 0;
Or even just the following (though not quite as portable): setpgrp;
Both calls will put the child process into its own group. By doing this we isolate the child from the process group that the parent originally belonged to. Here is a quick example of a daemon that performs the role of a back seat driver (that is, it keeps shouting out directions when we're trying to concentrate): #!/usr/bin/perl # backseat.pl # 'use warnings' deliberately omitted - see below use strict;
setpgrp 0, 0; if (my $pid = fork) { # exit if parent print "Backseat daemon started with id $pid \n"; exit 0; } # child loops in background while (1) { alarm 60; foreach (int rand(3)) { $_ == 0 and print ("Go Left! \n"), last; $_ == 1 and print ("Go Right! \n"), last; $_ == 2 and print ("Wait, go back! \n"), last; } sleep rand(3)+1; }
Yes, this example did not use warnings. We intentionally left it out this time, since the otherwise valid syntax in the foreach loop would generate spurious warnings questioning our intent. Regardless, note that in this example we changed the process group of the parent, then forked the child before exiting. We could equally fork and then change the process group of the child, since it makes little difference. If we do not want to print a friendly message we can also simplify the fork statement to just: fork and exit; # isn't self-documenting code great?
Handling Children and Returning Exit Codes Keeping an eye on children is always a challenging task, and that is just as true for processes as in real life. We can get a child to stop if it is running away with itself by killing it (which may involve sending it a friendlier signal like TERM, so it has an opportunity to tidy up and exit, rather than simply killing it outright), but we still have to deal with the remnants regardless of who killed it or why it died. When a child exits, the operating system keeps a record of its exit value and retains the process in the process table. The process is stored until the exit value is recovered by the parent. We are expected to tidy up after ourselves and recover the return values of child processes when they exit. If we fail to do this, the dead children turn into zombies (no, really – this is all serious UNIX terminology, do not think that it all sounds like a Hammer horror flick) and hang around the process table, dead but not buried, until the parent itself exits. Using the ps command on a system where this is happening reveals entries marked zombie or defunct. Perl's built-in functions (other than fork) automatically deal with this issue for us, so if we create extra processes with open we do not have to clean up afterwards. For fork and the IPC::Open2 and IPC::Open3 modules (both of which we cover later) we have to do our own housekeeping.
Waiting for an Only Child If we, as the parent process, want to wait for a child to finish before continuing then there is no problem, we simply use the wait function to cease execution until the child exits: # fork a child and execute an external command in it exec @command unless fork; # wait for the child to exit $child_pid = wait;
The process ID of the child is returned by wait, if we need it.
Getting the Exit Status When wait returns it sets the exit code of the child in the special variable $?. This actually contains two eight-bit values, the exit code and the signal (if any) that caused the child to exit, combined into a sixteen-bit value. To get the actual exit code and signal number we therefore need to use: $exitsig = $? & 0xff; # signal is lower 8 bits $exitcode = $? >> 8; # exit code is upper 8 bits
If we import the wait symbols from the POSIX module we can also say things like: use POSIX qw(:sys_wait_h); $exitsig = WTERMSIG($?); $exitcode = WEXITSTATUS($?);
The POSIX module also contains a few other handy functions in the same vein; we list them all briefly at the end of the section. Only one of the exit code and the signal will be set, so for a successful exit (that is, an exit code of zero), $? will be zero too. This is a convenient Boolean value, of course, so we can test $? for truth to detect a failed process, whether it exited with a non-zero status or was aborted by a signal. Of course, since such codes are often left to the discretion of the developer to respect, we cannot always rely on that. In some cases, particularly if we wrote the external command, the returned code may be an errno value, which we can assign to $! for a textual description of the error. In the child process/external command: # exit with $! explicitly exit $! if $!; # 'die' automatically returns $! die;
In the parent process: wait; $exitcode = $? >> 8; if ($exitcode) { $! = $exitcode; die "Child aborted with error: $! \n"; }
If there are no child processes then wait returns immediately with a return value of -1. However, this is not generally useful since we should not be in the position of calling wait if we did not fork first, and if we did attempt to fork, we should be testing the return value of that operation. If we are handling more than one process we should use waitpid instead.
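As a concrete illustration of decoding $?, this self-contained sketch forks a child that terminates itself with an untrappable SIGKILL, so the parent sees a signal number (9 on POSIX systems) and a zero exit code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pid = fork;
die "Failed to fork: $!\n" unless defined $pid;
unless ($pid) {
    kill KILL => $$;    # child: terminate ourselves with SIGKILL
    sleep 10;           # never reached
}
wait;                   # parent: collect the child's status
my $exitsig  = $? & 0xff;    # signal is lower 8 bits
my $exitcode = $? >> 8;      # exit code is upper 8 bits
print "signal=$exitsig code=$exitcode\n";
```

Had the child called exit 7 instead, the parent would see a zero signal and an exit code of 7, exactly as described above.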
Handling Multiple Child Processes If we have more than one child process then wait is not always enough, because it will return when any child process exits. If we want to wait for a particular child then we need to use waitpid: $pid = fork; if ($pid) { waitpid $pid, 0; } else { ...child... }
Two arguments are taken by waitpid. The first is the process ID of the child to wait for. The second argument is a flag that affects the operation of waitpid. The most common flag to place here is WNOHANG, which tells waitpid not to wait if the child is still running but to return immediately (with a result of zero). This argument is one of several constants defined by the POSIX module, and we can import it either from the :sys_wait_h group or directly: use POSIX qw(:sys_wait_h);
Or:
use POSIX qw(WNOHANG);
We can use this to periodically check for a child's exit without being forced to wait for it. Note that with WNOHANG, waitpid returns the child's process ID once it has exited, zero while it is still running, and -1 if no such child exists: if (my $pid = fork) { my $waiting = 1; while (1) { if ($waiting and waitpid($pid, WNOHANG) > 0) { $waiting = 0; handle_return_value($? >> 8); ... last; } ... check on other children, or do other misc. tasks... } } else { ... child ... }
This works for a single child process, but if we have several children to tidy up after we have to collect all their process IDs into a list and check all of them. This is not convenient. Fortunately we can pass -1 to waitpid to make it behave like wait and return with the value of the first available dead child: # wait until any child exits waitpid -1, 0; # this is the non-blocking version waitpid -1, WNOHANG;
We do not necessarily want to keep checking for our children exiting, particularly if we do not care about their exit status and simply want to remove them from the process table because we are conscientious and responsible parents. What we need is a way to call waitpid when a child exits without having to check periodically in a loop. Fortunately we can install a signal handler for the SIGCHLD signal that allow us to do exactly this: sub waitforchild { waitpid -1, 0; } $SIG{CHLD} = \&waitforchild;
We do not need to specify WNOHANG here because we know, by the fact that the signal handler was called, that a child has exited. Unfortunately, if more than one child exits at once, we might only get one signal, so calling waitpid once is not enough. A truly efficient signal handler thus needs to keep calling waitpid until there are no exit codes left to collect, which brings us back to WNOHANG again: sub waitforchildren { my $pid; do { $pid = waitpid -1, WNOHANG; } while ($pid > 0); } $SIG{CHLD} = \&waitforchildren;
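Here is a complete sketch of such a reaping SIGCHLD handler in action: it forks three children that exit immediately, and the handler's waitpid loop collects all of them even if several exits arrive behind a single signal (POSIX semantics assumed):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

my $reaped = 0;
$SIG{CHLD} = sub {
    # loop: several children may have exited behind one signal
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $reaped++;
    }
};

for (1 .. 3) {
    my $pid = fork;
    die "Failed to fork: $!\n" unless defined $pid;
    exit 0 unless $pid;    # children exit straight away
}

sleep 1 while $reaped < 3;    # each SIGCHLD interrupts the sleep
print "reaped $reaped children\n";
```

Running ps while a program like this omits the handler would show the defunct entries described earlier; with the handler installed they disappear as fast as they appear.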
This is of course rather tedious, but it is necessary if we are to manage child processes responsibly. On some systems we can get away with simply ignoring SIGCHLD, and have the operating system remove dead children for us: $SIG{CHLD} = 'IGNORE';
Or, if we can let the child change its process group, we can let init reap children instead. This is not portable across all systems, though, so in general the above solutions are preferred.
POSIX Flags and Functions The POSIX module defines several symbols and functions other than WNOHANG for use with the wait and waitpid system calls. We can make all of them available by importing the :sys_wait_h tag: use POSIX qw(:sys_wait_h);
There are two flags. One is WNOHANG, which we have already seen. The other, WUNTRACED, also returns process IDs for children that are currently stopped (that is, have been sent a SIGSTOP) and have not been resumed. For example: $possibly_stopped_pid = waitpid -1, WNOHANG | WUNTRACED;
In addition, the following convenience functions are defined:

WEXITSTATUS: Extracts the exit status, if the process has exited. Equivalent to $? >> 8. For example: $exitcode = WEXITSTATUS($?); The exit code is zero if the process was terminated by a signal.

WTERMSIG: Extracts the number of the signal that terminated the process, if the process aborted on a signal. For example: $exitsignal = WTERMSIG($?); The signal number is zero if the process exited normally (even if with an error).

WIFEXITED: Checks that the process exited, as opposed to being aborted by a signal. The opposite of WIFSIGNALED. For example: if (WIFEXITED $?) { return WEXITSTATUS($?) } else { return -1 }

WIFSIGNALED: Checks that the process terminated on a signal, as opposed to exiting. The opposite of WIFEXITED. For example: if (WIFSIGNALED $?) { print "Aborted on signal ", WTERMSIG($?), "\n"; } elsif (WEXITSTATUS $?) { print "Exited with error ", WEXITSTATUS($?), "\n"; } else { # exit code was 0 print "Success! \n"; }

WSTOPSIG: Extracts the number of the signal that stopped the process, if we specified WUNTRACED and the returned process is stopped as opposed to terminated. For example: $stopsig = WSTOPSIG($?); This is usually SIGSTOP, but not necessarily.

WIFSTOPPED: If we specified WUNTRACED as a flag, this returns true if the returned process ID was for a stopped process: if (WIFSTOPPED $?) { print "Process stopped by signal ", WSTOPSIG($?), "\n"; } elsif (...) { ... }
Communicating Between Processes Communicating between different processes, or Inter-Process Communication (IPC), is a subject with many facets. Perl provides us with many possible ways to establish communications between processes, from simple unidirectional pipes, through bi-directional socket pairs, to fully-fledged control over an externally executed command. In this section we are going to discuss the various ways in which different processes can communicate with one another, and the drawbacks of each approach, paying particular attention to communications with an externally run command. The open function provides a simple and easy way to do this, but it can also be a dangerous one unless we trust the user completely or take the time to do some thorough screening. Instead we can turn to the forking version of open, or use the IPC::Open2 or IPC::Open3 library modules for a safer approach. Before covering the more advanced ways to establish communications between processes and external programs we will cover the obvious and simple methods like system and do. They may not be sophisticated, but sometimes they are all we need to get the job done.
Very Simple Solutions If we simply want to run an external program we do not necessarily have to adopt measures like pipes (covered in the next section) or child processes, we can just use the system function: system "command plus @arguments";
However, if system is passed a single argument it checks it to see if it contains shell-special characters (like spaces, or quotes), and if present, starts a shell as a sub-process to run the command. This can be dangerous if the details of the command are supplied by the user; we can end up executing arbitrary commands like rmdir.
A better approach is to pass the arguments individually, which causes Perl to execute the command directly, bypassing the shell: system 'command', 'plus', @arguments;
However, system only returns the exit code of the command, which is great if we want to know whether it succeeded, but fairly useless for reading its output. For that we can use backticks, or the equivalent qx quoting operator:

    # these two statements are identical. Note both interpolate too.
    $output = `command plus @arguments`;
    $output = qx/command plus @arguments/;
Unfortunately, both variants of the backtick operator create a shell, and this time there is no way to avoid it, since there is no way to pass a list of arguments. If we really want to avoid the shell we have to start getting creative using a combination of open, pipe, and fork. Fortunately Perl makes this a little simpler for us, as we will see shortly.

Not to be overlooked, we can also execute external programs with the do command, if they are written in Perl. This command is essentially an enhanced version of eval which reads its input from an external file rather than having it written directly into the code. Its intended purpose is for loading in library files (other than modules, for which use or require are preferred), which do not generate output. If the script happens to print things to standard output, that will work too, so long as we don't want to control the output.

    $return_value = do './local_script.pl';
The do command has a lot of disadvantages, however. Unlike eval it does not cache the code executed for later use, and it is strictly limited in what it can do. For more useful and constructive forms of interprocess communication we need to involve pipes.
Pipes
A pipe is, simply put, a pair of file handles joined together. In Perl we can create pipes with the pipe function, which takes two filehandle names as arguments. One filehandle is read-only, and the other is write-only. By default Perl buffers IO, which is a problem if we want the process at the other end to respond to each line as it is sent, rather than waiting for enough lines to come through to flush the buffer. To disable buffering on the write end we would write:

    pipe (READ, WRITE);
    select WRITE;
    $| = 1;
Or, using IO::Handle:

    use IO::Handle;

    pipe (READ, WRITE);
    WRITE->autoflush(1);
Chapter 22
We can also create pipes with the IO::Pipe module. This creates a 'raw' IO::Pipe object that we convert into a read-only or write-only IO::Handle by calling either the reader or writer method:

    use IO::Pipe;

    $pipe = new IO::Pipe;
    if (fork) {
        $pipe->reader;    # $pipe is now a read-only IO::Handle
    } else {
        $pipe->writer;    # $pipe is now a write-only IO::Handle
    }
The most common use for pipes is for IPC (Inter-Process Communication, mentioned earlier), which is why we bring them up here rather than in Chapter 23. As a quick example, and also as an illustration of pipes from the filehandle perspective, here is a program that passes a message back and forth between a parent and child process using two pipes and a call to fork:

    #!/usr/bin/perl
    # message.pl
    use warnings;
    use strict;

    pipe CREAD, PWRITE;    # parent->child
    pipe PREAD, CWRITE;    # child->parent

    my $message = "S";
    if (fork) {
        # parent - close child end of pipes
        close CREAD;
        close CWRITE;
        syswrite PWRITE, "$message \n";
        while (<PREAD>) {
            chomp;
            print "Parent got $_ \n";
            syswrite PWRITE, "P$_ \n";
            sleep 1;
        }
    } else {
        # child - close parent end of pipes
        close PREAD;
        close PWRITE;
        while (<CREAD>) {
            chomp;
            print "Child got $_ \n";
            syswrite CWRITE, "C$_ \n";
        }
    }
As this example shows, both processes have access to the filehandles of the pipes after the fork. Each process closes the two filehandles that it does not need and reads and writes from the other two. In order to ensure that buffering does not deadlock the processes waiting for each other's message, we use the system level syswrite function to do the actual writing, which also absolves us of the need to set the autoflush flag. When run, the output of this program looks like this:

    Child got : S
    Parent got: CS
    Child got : PCS
    Parent got: CPCS
    Child got : PCPCS
    Parent got: CPCPCS
    Child got : PCPCPCS
    Parent got: CPCPCPCS
    Child got : PCPCPCPCS
    Parent got: CPCPCPCPCS
    Child got : PCPCPCPCPCS
    Parent got: CPCPCPCPCPCS
    Child got : PCPCPCPCPCPCS
    Parent got: CPCPCPCPCPCPCS
    ...

Note that this program works because each process reads exactly one line of input and produces exactly one line of output, so they balance each other. A real application where one side or the other needs to read arbitrary quantities of data will have to spend more time worrying about deadlocks; it is all too easy to have each process waiting for input from the other.
Opening and Running External Processes
If the first or last character of a filename passed to open is a pipe (|) symbol, the remainder of the filename is taken to be an external command, and a unidirectional pipe connects the external process to our program. The position of the pipe symbol determines whether we feed the external process's input or read its output. If the pipe character occurs at the start of the filename, then the input of the external command is connected to the output of the filehandle, so printing to the filehandle will send data to the input of the command. For example, here is a standard way of sending an email on a UNIX system, by running the sendmail program and writing the content of the email to it through the pipe:

    # get email details from somewhere, e.g. HTML form via CGI.pm
    ($from_addr, $to_addr, $subject, $mail_body, $from_sig) = get_email();

    # open connection to 'sendmail' and print email to it
    open (MAIL, '|/usr/bin/sendmail -oi -t') || die "Eep! $! \n";
    print MAIL <<END_OF_MAIL;
    From: $from_addr
    To: $to_addr
    Subject: $subject

    $mail_body

    $from_sig
    END_OF_MAIL

    # close connection to sendmail
    close MAIL;
If, on the other hand, the pipe character occurs at the end, then the output of the external command is connected to the input of the filehandle, and reading from the filehandle will give us anything that the external command prints. For example, here is another UNIX-based example, using open to receive the results of a ps (list processes) command:

    #!/usr/bin/perl
    # pid1.pl
    use warnings;
    use strict;

    my $pid = open (PS, "ps aux|") || die "Couldn't execute 'ps': $! \n";
    print "Subprocess ID is: $pid \n";

    while (<PS>) {
        chomp;
        print "PS: $_ \n";
    }
    close PS;
The return value of open, when it is used to start an external process this way, is the process ID of the executed command. We can use this value with functions such as waitpid, which we covered earlier in this chapter. Note that it is not possible to open an external program for both reading and writing this way, since the pipe is unidirectional; attempting to pass a pipe in at both ends will only result in the external command's output being connected to nothing, which is unlikely to be what we want. One solution to this is to have the external command write to a temporary file and then read the temporary file to see what happened:

    if (open SORT, "|sort > /tmp/output$$") {
        ... print results line by line to SORT ...
        close SORT;

        open (RESULT, "/tmp/output$$");
        ... read sorted results ...
        close RESULT;
        unlink "/tmp/output$$";
    }
Another, better, solution is to use the IPC::Open2 or IPC::Open3 modules, which allow both read and write access to an external program. The IPC:: modules are also covered in this chapter, under 'Bi-directional Pipes to External Processes'.
Bi-directional Pipes
Standard pipes on UNIX are only one-way, but we can create a pair of connected filehandles that between them behave like a single bi-directional pipe.
Perl provides a number of functions to create and manage sockets, which are bi-directional networking endpoints represented in our applications as filehandles. They can be used to communicate between different processes and different systems across a network. The socketpair function stands apart from the other socket functions because it has no application in networking. Instead, it creates two sockets connected back to back, with the output of each connected to the input of the other. The result is something which looks and feels like a bi-directional pipe. This is unsupported on Win32 platforms.

Sockets have domains, types, and protocols associated with them. The domain in Perl can be either the Internet or UNIX, the type a streaming socket, datagram socket, or raw socket, and the protocol can be something like PF_INET or PF_UNIX (these constants actually stand for Protocol Families). This is the general form of socketpair being used to create a parent and child filehandle:

    socketpair PARENT, CHILD, $domain, $type, $protocol;
However, most of this is fairly irrelevant to socketpair; its sockets do not talk to network interfaces or the filesystem, they do not need to listen for connections, they cannot be bound to addresses, and they are not bothered about protocols, since they have no lower level protocol API to satisfy. Consequently, we generally use a UNIX domain socket, since it is more lightweight than an Internet domain socket. We make it streaming so we can use it like an ordinary filehandle, and don't bother with the protocol at all. The actual use of socketpair is thus almost always:

    use Socket;
    socketpair PARENT, CHILD, AF_UNIX, SOCK_STREAM, PF_UNSPEC;
The symbols AF_UNIX, SOCK_STREAM, and PF_UNSPEC are all defined by the Socket module, which provides support for Perl's socket functions. We cover it in a lot more detail in Chapter 23, when we come to real sockets. Here is a version of the message passing program we showed earlier using a pair of sockets rather than two pairs of pipe handles:

    #!/usr/bin/perl
    # socketpair.pl
    use warnings;
    use strict;
    use Socket;

    socketpair PARENT, CHILD, AF_UNIX, SOCK_STREAM, PF_UNSPEC;

    my $message = "S";
    if (fork) {
        syswrite PARENT, "$message \n";
        while (<PARENT>) {
            chomp;
            print "Parent got: $_ \n";
            syswrite PARENT, "P$_ \n";
            sleep 1;
        }
    } else {
        while (<CHILD>) {
            chomp;
            print "Child got : $_ \n";
            syswrite CHILD, "C$_ \n";
        }
    }
In fact we could also close the parent and child socket handles in the child and parent processes respectively, since we do not need them – we only need one end of each pipe. We can also use the socket in only one direction by using the shutdown function to close either the input or the output:

    shutdown CHILD, 1;     # make child read-only
    shutdown PARENT, 0;    # make parent write-only
The difference between close and shutdown is that close only affects the filehandle itself; the underlying socket is not affected unless the filehandle just closed was the only one pointing to it. shutdown, by contrast, acts on the socket itself, and hence all filehandles associated with it are affected. This program is a lot more advanced in the way that it manufactures the conduit between the two processes, because it uses Perl's socket support to create the filehandles, but the benefit is that it makes the application simpler to write and generates fewer redundant filehandles.
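To see the difference between close and shutdown in practice, here is a minimal sketch of our own (not from the text above). Because shutdown acts on the underlying socket, the child's read loop reaches end-of-file even though the child still holds its own inherited copy of the PARENT handle:

```perl
#!/usr/bin/perl
# shutdown_demo.pl - hypothetical example of shutdown delivering EOF
use warnings;
use strict;
use Socket;

socketpair(my $parent, my $child, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

my $reply;
defined(my $pid = fork) or die "fork: $!";
if ($pid) {
    # parent: send one line, then shut down the parent-to-child direction
    syswrite $parent, "hello\n";
    shutdown $parent, 1;      # the socket itself stops accepting writes
    $reply = <$parent>;       # the child-to-parent direction still works
    waitpid $pid, 0;
    print "parent got: $reply";
} else {
    # child: this read loop ends only because of the shutdown above
    my @lines = <$child>;
    syswrite $child, scalar(@lines) . " line(s) received\n";
    exit 0;
}
```

A plain close $parent in the parent would not produce EOF here, because the child's inherited copy of the handle keeps the underlying socket open.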
Avoiding Shells with the Forked Open
One problem with using open to start an external process is that if the external command contains spaces or other such characters that are significant to a shell, open will run a shell as a sub-process and pass on the command for it to interpret, so that any special characters can be correctly parsed. This is potentially insecure if the program is to be run by untrusted users, as is the case, for example, with a CGI script.

Functions such as exec and system allow us to separate the parameters of a command into separate scalar values and supply them as a list, avoiding the shell. Unfortunately open does not directly allow us to pass in a command as a list. Instead it allows us to use exec to actually run the command, by supplying open with the magic filenames |- or -|. This causes open to create a pipe and then fork to create a child process. The child's standard input or standard output (depending on whether |- or -| was used) is connected to the filehandle opened in the parent. If we then use exec to replace the child process with the external command, the standard input or output is inherited, connecting the external process directly to the filehandle created by open in the parent.

The return value from open in these cases is the process ID of the child process (in the parent) and zero (in the child), the same as the return value from fork, enabling us to tell which process we are now running as. Of course it is not obligatory to run an external command at this point, but this is by far the most common reason for using a forked open. It is also by far the most common reason for using exec, which replaces the current process with the supplied command.
Since exec allows the command to be split up into a list (that is, a list containing the command and arguments as separate elements), we can avoid the shell that open would create if we used it directly (or handed it a scalar string with the complete command and arguments in it). Here's an example of running the UNIX ps command via a forked open:
    #!/usr/bin/perl
    # pid2.pl
    use warnings;
    use strict;

    my $pid = open (PS, "-|");
    die "Couldn't fork: $! \n" unless defined $pid;

    if ($pid) {
        print "Subprocess ID is: $pid \n";
        while (<PS>) {
            chomp;
            print "PS: $_\n";
        }
        close PS;
    } else {
        exec 'ps', 'aef';    # no shells here
    }
Or, more tersely and without recording the process ID:

    #!/usr/bin/perl
    # pid3.pl
    use warnings;
    use strict;

    open (PS, "-|") || exec 'ps', 'aux';
    while (<PS>) {
        chomp;
        print "PS: $_ \n";
    }
    close PS;
Bi-directional Pipes to External Processes
Perl provides a pair of modules, IPC::Open2 and IPC::Open3, which provide access to bi-directional pipes. These modules have the added advantage that the subroutines they supply, open2 and open3, permit the external commands to be given as a list, again avoiding an external shell.
As we know from Chapter 12, all applications get three filehandles for free when they start, represented by STDIN, STDOUT, and STDERR. These are the three filehandles all applications use, and they are also the three filehandles with which we can talk and listen to any external process. This is what the piped open does, but only for one of the filehandles. However, because STDIN is read-only while STDOUT and STDERR are write-only, we can in theory create pipes for each of them, since we only need a unidirectional conduit for each handle. This is what the open2 and open3 subroutines provided by IPC::Open2 and IPC::Open3 do. The difference between them is that open2 creates pipes for standard input and output, whereas open3 deals with standard error too.

Using either module is very simple. In the old style of Perl IO programming we supply typeglobs (or typeglob references) for the filehandles to associate with the external command, followed by the command itself:

    use IPC::Open2;
    $pid = open2(*RD, *WR, @command_and_arguments);
Or:

    use IPC::Open3;
    $pid = open3(*WR, *RD, *ERR, @command_and_arguments);
Confusingly, the input and output filehandles of open2 are in a different order in open3. This is a great source of errors, so check carefully, or only ever use open3 to avoid getting them the wrong way around. Since typeglobs are considered somewhat quaint these days we can also pass in IO::Handle objects for much the same effect:

    use IPC::Open2;
    use IO::Handle;

    $in = new IO::Handle;
    $out = new IO::Handle;
    $pid = open2($in, $out, 'command', 'arg1', 'arg2');

    print $out "Hello there \n";
    $reply = <$in>;
In a similar vein, but without the explicit calls to IO::Handle's new method, we can have open2 or open3 create and return the filehandles for us if we pass in a scalar variable with an undefined value:

    use IPC::Open3;

    $pid = open3($out, $in, $error, 'command', 'arg1', 'arg2');
    print $out "Hello there \n";
    $reply = <$in>;
If we also use IO::Handle in this example, the returned filehandles will be easily manipulated with IO::Handle's methods. Both open2 and open3 perform a fork-and-exec behind the scenes, in the same way that a forked open does. The return value is the process ID of the child that actually executed (with exec) the external command. In the event that establishing the pipes to the external command fails, both subroutines raise a SIGPIPE signal, which we must catch or ignore (as covered earlier in the chapter). Neither subroutine is concerned with the well-being of the actual external command, however, so if the child aborts or exits normally we need to check for ourselves and clean up with waitpid:

    use POSIX qw(WNOHANG);

    $pid = open2($in, $out, 'command', @arguments);
    until (waitpid $pid, WNOHANG) {
        # do other stuff, and/or read/writes to $in & $out
    }
Note that this particular scheme assumes we are using non-blocking reads and writes, otherwise the waitpid may not get called. We can also check for a deceased external process by detecting the EOF condition on the input or error filehandles (or by returning nothing from a sysread), but we still need to use waitpid to clean up the child process.
If we do not want to do anything else in the meantime, perhaps because we have another child process busy doing things elsewhere, we can simply say:

    # wait for child process to finish
    waitpid $pid, 0;
If the filehandle we write to (the one feeding the external command's standard input) is prefixed with <&, the external command reads directly from the named filehandle rather than through a pipe, and the filehandle is closed in our own process. Similarly, if >& is given for the read or error filehandles, the command's output or error stream is sent directly to the named filehandle. Here's a script that illustrates this in action:

    # Usage: myscript.pl somecommand arg1 arg2 ...
    use IPC::Open3;

    print "Now entering $ARGV[0] \n";
    $pid = open3('<&STDIN', '>&STDOUT', '>&STDERR', @ARGV);
    waitpid $pid, 0;
    print "$ARGV[0] finished. \n";

Note that we hand our standard input to the writing handle and our standard output to the reading handle. That's because we are in the middle of this transaction; input comes in from standard input but then goes out to the external command, and vice versa for standard output. Of course this particular script is fairly redundant, but with a little embellishment it could be made useful, for example, logging uses of various commands, checking arguments, and processing tainted data (if we switch on the -T option). We also do not have to redirect all three filehandles, only the ones we don't want to deal with ourselves:

    sub open2likeopen3 {
        return open3(shift, shift, '>&STDERR', @_);
    }
This implements a subroutine functionally identical to open2 (unless we have redirected standard error already, that is) but with the arguments in the same order as open3.
Handling Bi-directional Communications
Some external commands communicate with us on a line-by-line basis. That is, whenever we send them something we can expect at least one line in response. In these cases we can alternate between writing and reading as we did with our pipe and socketpair examples earlier. Commands like ftp are a good example of this kind of application. However, many commands accept arbitrary input, and will not send any output until they have received all of it. The only way to tell such a command that it has all the input it needs is to close the filehandle, so we often need to do something like this:

    @instructions = <STDIN>;
    my ($in, $out);

    # send input and close filehandle
    $pid = open2($in, $out, 'command', @arguments);
    print $out @instructions;
    close $out;
    # receive result
    @result = <$in>;
    close $in;

    # clean up and use the result
    waitpid $pid, 0;
    print "Got: \n @result";
If we want to carry out an ongoing dialogue with an external process then an alternating write-read-write process is not always the best way to approach the problem. In order to avoid deadlocks we have to continually divert attention from one filehandle to the other. Adding standard error just makes things worse. Fortunately there are two simple solutions, depending on whether we can (or want to) fork or not. First, we can use the select function, or the vastly more convenient IO::Select module, to poll multiple input filehandles, including both the normal and error output of the external command plus our own standard input. Alternatively, we can fork child processes to handle each filehandle individually. We can even use threads, if we have a version of Perl that is built to use them. We go into more detail on threads later on.
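As a sketch of the first approach, the following hypothetical program (our own illustration) uses IO::Select to poll the output and error streams of a child started with open3. The child command is just an illustrative Perl one-liner, and real code would need to take more care about mixing buffered line reads with select:

```perl
#!/usr/bin/perl
# select_open3.pl - hypothetical sketch of polling open3 streams
use warnings;
use strict;
use IPC::Open3;
use IO::Select;
use Symbol qw(gensym);

my ($wtr, $rdr, $err);
$err = gensym;    # open3 needs a pre-created handle for standard error
my $pid = open3($wtr, $rdr, $err, $^X, '-e',
                'print STDOUT "to stdout\n"; print STDERR "to stderr\n"');
close $wtr;       # we have nothing to send to the child

my $sel = IO::Select->new($rdr, $err);
my %seen = (stdout => '', stderr => '');
while (my @ready = $sel->can_read) {
    foreach my $fh (@ready) {
        my $line = <$fh>;
        if (defined $line) {
            # record which stream the line arrived on
            $seen{$fh == $err ? 'stderr' : 'stdout'} .= $line;
        } else {
            $sel->remove($fh);    # EOF on this stream
        }
    }
}
waitpid $pid, 0;
print "stdout said: $seen{stdout}";
print "stderr said: $seen{stderr}";
```

The loop ends naturally once both streams have been removed at EOF, because can_read on an empty set returns an empty list.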
Sharing Data Between Processes
The problem with forked processes, as we have observed before and will observe again when we get to threads, is that they do not share any data between them. Consequently, to communicate with each other and share resources, processes need to either establish a channel for communication or find some common point in memory that can be seen by all the processes that need access to it.

In this section we are going to discuss the IPC facilities provided by UNIX systems and their near cousins, generically known as System V IPC, after the variant of UNIX on which it first appeared. There are three components to System V IPC: message queues, semaphores, and shared memory segments. While the first of these is now rarely used, because we can replicate some of its functionality far more easily with pipes, semaphores and shared memory are still very useful, as we will see later in this chapter.

System V IPC is now a fairly venerable part of UNIX, and is not generally portable to non-UNIX platforms that do not comply with POSIX. Some parts of it are also implemented through pipes and socket pairs for ease of implementation. It still has its uses, though, in particular because the objects that we can create and access with it are persistent in memory and can survive the death of the applications that use them. This allows an application to store all its mission-critical data in shared memory, and to pick up exactly where it left off if it is terminated and restarted.

Like most parts of Perl that interface to a lower level library, IPC requires a number of constants that define the various parameters required by its functions. For IPC these constants are defined by the IPC::SysV module, so almost any application using IPC includes the statement:

    use IPC::SysV;
Note that we do not actually have to be on a System V UNIX platform to use IPC, but we do have to be on a platform that has the required IPC support. In general Perl will not even have the modules installed if IPC is not available, and we will have to find alternative means to achieve our goal, some of which are illustrated during the course of the other sections in this chapter. Even on UNIX systems IPC is not always present. If it is, we can usually find out by executing the command:

    > ipcs

If IPC is available this will produce a report of all currently existing shared memory segments, semaphores, and message queues, usually in that order.

The common theme between IPC message queues, semaphores, and shared memory segments is that they all reside persistently in memory, and can be accessed by any process that knows the resource ID and has access privileges. This differentiates them from most other IPC strategies, which establish private lines of communication between processes that are not so easily accessed by other unrelated processes.

Strictly speaking, Perl's support for IPC is available in the language as the function calls msgctl, msgget, msgrcv, msgsnd, semctl, semget, semop, shmctl, shmget, shmread, and shmwrite. These are merely wrappers for the equivalent C calls, and hence are pretty low-level, though very well documented. However, since these functions are not very easy to use, the IPC:: family of modules includes support for object-oriented IPC access. In the interests of brevity, we will concentrate on these modules in the following sections.
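To give a flavour of the low-level calls before moving on to the modules, here is a brief sketch of our own using the raw shared memory functions; it assumes a platform with System V IPC support:

```perl
#!/usr/bin/perl
# lowlevel_shm.pl - hypothetical sketch of the raw shared memory calls
use warnings;
use strict;
use IPC::SysV qw(IPC_PRIVATE IPC_RMID S_IRWXU);

my $size = 64;
my $id = shmget(IPC_PRIVATE, $size, S_IRWXU);
defined $id or die "shmget failed: $!";

# write a string into the segment, then read it back
my $data = "hello from shared memory";
shmwrite($id, $data, 0, length $data) or die "shmwrite: $!";

my $buffer = '';
shmread($id, $buffer, 0, $size) or die "shmread: $!";
$buffer =~ s/\0+$//;    # strip the segment's trailing NUL padding

print "read back: $buffer\n";

# segments persist after the process exits, so always clean up
shmctl($id, IPC_RMID, 0) or die "shmctl: $!";
```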
IPC::SysV
The IPC::SysV module imports all of the specified SysV constants for use in our programs in conjunction with the other IPC modules. The following is a summary of the most widely used constants. To see the full set of what IPC::SysV can export, we can do a perldoc -m on the module, and consult development man pages for further information. Error constants are imported with the core module Errno.

    Constant      Purpose
    GETALL        Return an array of all the values in the semaphore set.
    GETNCNT       Return the number of processes waiting for an increase in value of
                  the specified semaphore in the set.
    GETPID        Return the PID of the last process that performed an operation on
                  the specified semaphore in the set.
    GETVAL        Return the value of the specified semaphore in the set.
    GETZCNT       Return the number of processes waiting for the specified semaphore
                  in the set to become zero.
    IPC_ALLOC     Currently allocated.
    IPC_CREAT     Create entry if key doesn't exist.
    IPC_EXCL      Fail if key exists.
    IPC_NOERROR   Truncate the message and remove it from the queue if it is longer
                  than the read buffer size (normally the message is kept and an
                  error returned) when applied to message queues.
    IPC_NOWAIT    Error if request must wait.
    IPC_PRIVATE   Private key.
    IPC_RMID      Remove resource.
    IPC_SET       Set resource options.
    IPC_STAT      Get resource options.
    IPC_W         Write or send permission.
    MSG_NOERROR   The same as IPC_NOERROR.
    SEM_UNDO      Specifies that the operation be rolled back when the calling
                  process exits.
    SETALL        Set the value of all the semaphores in the set to the specified
                  value.
    SETVAL        Set the value of the specified semaphore in the set to the
                  specified value.
    S_IRUSR       00400 – Owner read permission.
    S_IWUSR       00200 – Owner write permission.
    S_IRWXU       00700 – Owner read/write/execute permission.
    S_IRGRP       00040 – Group read permission.
    S_IWGRP       00020 – Group write permission.
    S_IRWXG       00070 – Group read/write/execute permission.
    S_IROTH       00004 – Other read permission.
    S_IWOTH       00002 – Other write permission.
    S_IRWXO       00007 – Other read/write/execute permission.
    ftok          Convert a pathname and a project ID into a key_t type SysV IPC
                  identifier.
Message Queues
At one time, message queues were the only effective way to communicate between processes. They provide a simple queue that processes may write messages into at one end and read out at the other. We can create two kinds of queue, private and public. Here is an example of creating a private queue:

    use IPC::SysV qw(IPC_PRIVATE IPC_CREAT S_IRWXU);
    use IPC::Msg;

    $queue = new IPC::Msg IPC_PRIVATE, S_IRWXU | IPC_CREAT;
The new constructor takes two arguments. The first is the queue's identifier, which for a private queue is IPC_PRIVATE. The second is the permissions of the queue, combined with IPC_CREAT if we wish to actually create the queue. Like files, queues have a mask of user, group, and other permissions. This allows us to create a queue that we can write to but that applications running under other user IDs can only read, or not access at all. For a private queue, full user permissions are the most obvious choice, and we specify these with S_IRWXU. This is defined by the Fcntl module, but IPC::SysV imports the symbol for us so we don't have to use Fcntl ourselves. We could also have said 0700 using octal notation. (Permissions are discussed in detail in Chapter 13.)

If the queue is private, only the process that created the queue and any forked children can access it. Alternatively, if it is public, any process that knows the queue's identifier can access it. For a queue to be useful, therefore, it needs to have a known identifier, or key, by which it can be found. The key is simply an integer, so we can create a public queue, which can be read and written by processes running under our user ID (that is, this application) and only written to by others, with something like the following:

    # create a queue for writing
    $queue = new IPC::Msg 10023, 0722 | IPC_CREAT;
Another process can now access this message queue with:

    $queue = new IPC::Msg 10023, 0200;
On return, $queue will be a valid message object for the queue, or undef if it did not exist. Assuming success, we can now read and write the queue to establish communications between the processes. If we had created a private queue (not specifying a SysV key) and we want to send the message queue's key to another process so we can establish communications, we can extract it with id:

    $queue = new IPC::Msg IPC_PRIVATE, 0722 | IPC_CREAT;
    $id = $queue->id;
To send a message we use the snd method:

    $queue->snd($type, $message, $flags);
The message is simply the data we want to send. The type is a positive long integer, which can be used by the rcv method to select different messages based on their type. The flags argument is optional, but can be set to IPC_NOWAIT to have the method return immediately if the message could not be sent (in this case $! will be set to EAGAIN; error code constants like this can be imported from the Errno module). To receive a message we use the rcv method:

    my $message;
    $queue->rcv($message, $length, $type, $flags);
The first argument is a scalar variable into which the message is read. The length defines the maximum length of the message to be received. If the message is larger than this length then rcv returns undef and sets $! to E2BIG. The type allows us to control which message we receive, and is an integer with the following possible meanings:
    Integer Value   Meaning
    0               Read the first message on the queue, regardless of type.
    > 0             Read the first message on the queue of the specified type. For
                    example, if the type is 2 then only messages of type 2 will be
                    read. If none are available, this will cause the process to block
                    until one is. However, see IPC_NOWAIT and MSG_EXCEPT below.
    < 0             Read the message with the lowest type less than or equal to the
                    absolute value of the specified type. For example, if the type is
                    -2 then the first message of type 1 will be returned; if no
                    messages of type 1 are present, the first message of type 2 will
                    be returned.
The flags may be one or more of:

    Flag          Action
    MSG_EXCEPT    Invert the sense of the type so that for zero or positive values it
                  retrieves the first message not of the specified type. For
                  instance, a type of 1 causes rcv to return the first message not of
                  type 1.
    MSG_NOERROR   Allow outsize messages to be received by truncating them to the
                  specified length, rather than returning E2BIG as an error.
    IPC_NOWAIT    Do not wait for a message of the requested type if none are
                  available, but return with $! set to EAGAIN.
Put together, these two functions allow us to set up a multi-level communications queue with differently typed messages for different purposes. The meaning of the type is entirely up to us; we can use it to send different messages to different child processes or threads within our application, or use it as a priority or Out-Of-Band channel marker, to name a few examples. The permissions of a queue can be changed with the set method, which takes a list of key-value pairs as parameters:

    $queue->set(
        uid    => $user_id,       # i.e., like 'chown' for files
        gid    => $group_id,      # i.e., like 'chgrp' for files
        mode   => $permissions,   # an octal value or S_ flags
        qbytes => $queue_size     # the queue's capacity
    );
We can also retrieve an IPC::Msg::stat object via the stat method, modify it, and then use it to update the queue's properties:

    $stat = $queue->stat;
    $stat->mode(0722);
    $queue->set($stat);
The object that the stat method returns is actually based on the Class::Struct class. It provides the following get/set methods (the * designates portions of the stat structure that cannot be directly manipulated):

    Method    Purpose
    uid       The effective UID the queue is running with.
    gid       The effective GID the queue is running with.
    cuid*     The UID the queue was started with.
    cgid*     The GID the queue was started with.
    mode      Permissions set on the queue.
    qnum*     Number of messages currently in the queue.
    qbytes    Size of the message queue.
    lspid*    PID of the last process that performed a send on the queue.
    lrpid*    PID of the last process that performed a receive on the queue.
    stime*    Time of the last send call performed on the queue.
    rtime*    Time of the last receive call performed on the queue.
    ctime*    Time of the last change performed on the stat data structure.
Note that changes made to the object returned by the stat method do not directly update the queue's attributes. For that to happen, we need to either pass the updated stat object or individual key-value pairs (using the keys listed above as stat methods) to the set method. Finally, we can destroy a queue, assuming we have permission to do so, by calling remove:

$queue->remove;
If we cannot remove the queue, the method returns undef, with $! set to indicate the reason, most likely EPERM. In keeping with good programming practices, it is important that the queue is removed, especially since it persists even after the application exits. However, it is equally important that the correct processes do so, and no process pulls the rug out from under the others.
Semaphores

IPC semaphores are a memory-resident set of one or more numeric flags (also known as semaphore sets) that can be read and written by different processes to indicate different states. Like message queues, they can be private or public, have an identifier by which they can be accessed, and carry a permissions mask controlling who can access them. Semaphores have two uses: firstly, as a set of shared values that can be read and written by different processes; secondly, to allow processes to block and wait for semaphores to change value, so that the execution of different processes can be stopped and started according to the value of a semaphore controlled by a different process.
Chapter 22
The module that implements access to semaphores is IPC::Semaphore, and we can use it like this:

use IPC::SysV qw(IPC_CREAT IPC_PRIVATE S_IRWXU);
use IPC::Semaphore;

$size = 4;
$sem = new IPC::Semaphore(IPC_PRIVATE, $size, IPC_CREAT | S_IRWXU);
This creates a private semaphore set with four semaphores and owner read, write, and execute permissions. To create a public semaphore set we provide a literal key value instead:

$sem = new IPC::Semaphore(10023, 4, 0722 | IPC_CREAT);
Other processes can now access the semaphore with:

$sem = new IPC::Semaphore(10023, 4, 0400);   # owner read-only, i.e., S_IRUSR
As with message queues, we can also retrieve the identifier of the semaphore set with the id method, should we not know it (assuming we have access to the semaphore object, that is):

$id = $sem->id;
Once we have access to the semaphore we can use and manipulate it in various ways, assuming we have permission to do so. A number of methods exist to help us do this:

getall    Return all values as a list. For example:

          @semvals = $sem->getall;

getval    Return the value of the specified semaphore. For example:

          # first semaphore is 0, so 4th is 3
          $sem4 = $sem->getval(3);

setall    Set all semaphore values. For example, to clear all semaphores:

          $sem->setall( (0) x 4 );

setval    Set the value of the specified semaphore. For example:

          # set value of 4th semaphore to 1
          $sem->setval(3, 1);

set       Set the user ID, group ID, or permissions of the semaphore. For example:

          $sem->set(
              uid  => $user_id,
              gid  => $group_id,
              mode => $permissions,
          );
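Pulling several of these methods together, here is a minimal runnable sketch (a private set is used so no key coordination is needed; the values are arbitrary):

```perl
use strict;
use warnings;
use IPC::SysV qw(IPC_PRIVATE IPC_CREAT S_IRWXU);
use IPC::Semaphore;

# create a private set of three semaphores, owner access only
my $sem = IPC::Semaphore->new(IPC_PRIVATE, 3, IPC_CREAT | S_IRWXU)
    or die "semget failed: $!";

$sem->setall(0, 0, 0);      # clear the whole set
$sem->setval(1, 5);         # set the second semaphore to 5
my @vals = $sem->getall;    # read them all back
print "@vals\n";

$sem->remove;               # semaphore sets persist, so remove when done
```

Running this prints the three values, showing that only the second semaphore was changed.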
Alternatively we can get, manipulate, and set a stat object as returned by the stat method, in the same manner as IPC::Msg objects:

stat      Generate an IPC::Semaphore::stat object we can manipulate and then
          use with set. For example:

          $semstat = $sem->stat;
          $semstat->mode(0722);
          $sem->set($semstat);

getpid    Return the process ID of the last process to perform a semop
          operation on the semaphore set.

getncnt   Return the number of processes that have executed a semop and are
          blocked waiting for the value of the specified semaphore to
          increase in value. For example:

          $ncnt = $sem->getncnt;

getzcnt   Return the number of processes that have executed a semop and are
          blocked waiting for the value of the specified semaphore to become
          zero.
The real power of semaphores is bound up in the op method, which performs one or more semaphore operations on a semaphore set. This is the mechanism by which processes can block and be unblocked by other processes. Each operation consists of three values: the semaphore number to operate on, an operation to perform, and a flag value. The operation is actually a value to increment or decrement by, and follows these rules:

❑ If the value is positive, the semaphore value is incremented by the supplied value. This always succeeds, and never blocks.

❑ If the supplied value is zero, and the semaphore value is zero, the operation succeeds. If the semaphore value is not zero then the operation blocks until the semaphore value becomes zero. This increases the value returned by getzcnt.

❑ If the value is negative, then the semaphore value is decremented by this value, unless this would take the value of the semaphore negative. In that case, the operation blocks until the semaphore becomes sufficiently positive to allow the decrement to happen. This increases the value returned by getncnt.

We can choose to operate on as many semaphores as we like. All operations must be able to complete before the operation as a whole can succeed. For example:

$sem->op(
    0, -1, 0,   # decrement semaphore 1
    1, -1, 0,   # decrement semaphore 2
    3,  0, 0,   # semaphore 3 must be zero
);
The rules for blocking on semaphores allow us to create applications that cooperate with each other; one application can control the execution of another by setting semaphore values. Applications can also coordinate over access to shared resources. This is a potentially large subject, so we will just give a simple illustrative example of how a semaphore can coordinate access to a common shared resource:

Application 1 creates a semaphore set with one semaphore, value 1, and creates a shared resource, for example a file, or an IPC shared memory segment. However, it decides to do a lot of initialization and so doesn't access the resource immediately.

Application 2 starts up, decrements the semaphore to 0, and accesses the shared resource.

Application 1 now tries to decrement the semaphore and access the resource. The semaphore is zero, so it cannot; it therefore blocks.

Application 2 finishes with the shared resource and increments the semaphore, an operation that always succeeds.

Application 1 can now decrement the semaphore since it is now 1, so the operation succeeds and no longer blocks.

Application 2 tries to access the resource a second time. First it tries to decrement the semaphore, but is unable to, and blocks.

Application 1 finishes and increments the semaphore.

Application 2 decrements the semaphore and accesses the resource.

...and so on. Although this sounds complex, in reality it is very simple. In code, each application simply accesses the semaphore, creating it if it is not present, and then adds two lines around all accesses to the resource to be protected:

sub access_resource {
    # decrement semaphore, blocking if it is already zero
    $sem->op(0, -1, 0);

    ... access resource ...

    # increment semaphore, allowing access by other processes
    $sem->op(0, 1, 0);
}
If we have more than one resource to control, we just create a semaphore set with more semaphores. The basis of this approach is that the applications agree to cooperate through the semaphore. Each one has the key for the semaphore (because it is given in a configuration file, for instance), and it becomes the sole basis for contact between them. Even though the resource being controlled has no direct connection to the semaphore, each application always honors the semaphore before accessing the resource. The semaphore becomes a gatekeeper, allowing only one application access at a time. If we do not want to block while waiting for a semaphore we can specify IPC_NOWAIT for the flag value. We can do this on a per-semaphore basis too, if we want, though this could be confusing. For example:
sub access_resource {
    # try to decrement the semaphore, but do not block if it is zero
    return undef unless $sem->op(0, -1, IPC_NOWAIT);

    ... access resource ...

    $sem->op(0, 1, 0);
}
We can also set the flag SEM_UNDO (if we import it from IPC::SysV first). This causes a semaphore operation to be automatically undone if the process exits, either deliberately or due to an error. This helps prevent an application that aborts while holding a lock on a resource from never releasing it again. For example:

$sem->op(0, -1, IPC_NOWAIT | SEM_UNDO);
die unless critical_subroutine();
As with message queues, care must be taken not to leave unused semaphore sets around after the last process exits. We will return to the subject of semaphores when we come to talk about threads, which have their own semaphore mechanism, inspired greatly by the original IPC implementation described above.
Shared Memory Segments

While message queues and semaphores are relatively low-level constructs made a little more accessible by the IPC::Msg and IPC::Semaphore modules, shared memory has an altogether more powerful support module in the form of IPC::Shareable. The key reason for this is that IPC::Shareable implements shared memory through a tie mechanism, so rather than reading and writing from a memory block we can simply attach a variable to it and use that. The tie takes four arguments: a variable (which may be a scalar, an array, or a hash), IPC::Shareable for the binding, and then an access key, followed optionally by a hash reference containing one or more key-value pairs. For example, the following code creates and ties a hash variable to a shared memory segment:

use IPC::Shareable;

my %local_hash;
tie %local_hash, 'IPC::Shareable', 'key', {create => 1};
$local_hash{'hashkey'} = 'value';
This creates a persistent shared memory object containing a hash variable which can be accessed by any application or process by tie-ing a hash variable to the access key for the shared memory segment (in this case key):

# in a process in an application far, far away...
my %other_hash;
tie %other_hash, 'IPC::Shareable', 'key';
$value = $other_hash{'hashkey'};
A key feature of shared memory is that, like message queues and semaphores, the shared memory segment exists independently of the application that created it. Even if all the users of the shared memory exit, it will continue to exist so long as it is not explicitly deleted (we can alter this behavior, though, as we will see in a moment).

Note that the key value is actually implemented as an integer, the same as for semaphores and message queues, so the string we pass is converted into an integer value by packing the first four characters into a 32-bit integer value. This means that only the first four characters of the key are used. As a simple example, baby and babyface are the same key to IPC::Shareable.

The tied variable can be of any type, including a scalar containing a reference, in which case whatever the reference points to is converted into a shared form. This includes nested data structures and objects, making shared memory ties potentially very powerful. However, each reference becomes a new shared memory object, so a complex structure can quickly exceed the system limit on shared memory segments. In practice we should only try to tie relatively small nested structures to avoid trouble.

The fourth argument can contain a number of different configuration options that determine how the shared memory segment is accessed:
Function
create
If true, create the key if it does not already exist. If the key does exist, then the tie succeeds and binds to the existing data, unless exclusive is also true. If create is false or not given then the tie will fail if the key is not present.
exclusive
Used in conjunction with create. If true, allows a new key to be created but does not allow an existing key to be tied to successfully.
mode
Determine the access permissions of the shared memory segment. The value is an integer, traditionally an octal number or a combination of flags like S_IRWXU | S_IRGRP.
destroy
If true, cause the shared memory segment to be destroyed automatically when this process exits (but not if it aborts on a signal). In general only the creating application should do this, or be able to do this (by setting the permissions appropriately on creation).
size
Define the size of the shared memory segment, in bytes. In general this defaults to an internally set maximum value, so we rarely need to use it.
key
If the tie is given three arguments, with the reference to the configuration options being the third, this value specifies the name of the shared memory segment: tie %hash, 'IPC::Shareable' {key => 'key', ...};
For example:

tie @array, 'IPC::Shareable', 'mysharedmem', {
    create    => 1,
    exclusive => 0,
    mode      => 0722,
    destroy   => 1,
};
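The four-character key truncation described above can be illustrated with a plain pack; note that string_to_key here is our own helper for demonstration, not part of IPC::Shareable's API:

```perl
use strict;
use warnings;

# hypothetical helper mimicking the described behavior: pad or truncate
# the string to four bytes, then treat those bytes as a 32-bit integer
sub string_to_key {
    my $s = substr(shift() . "\0" x 4, 0, 4);
    return unpack 'N', $s;
}

printf "baby     => %u\n", string_to_key('baby');
printf "babyface => %u\n", string_to_key('babyface');
# both lines print the same integer: only the first four characters count
```

Two applications intending to use distinct segments therefore need keys that differ within their first four characters.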
Aside from the destroy option, we can remove a shared memory segment by calling one of three methods implemented by the IPC::Shareable object that implements the tie (which may be retrieved with the tied function):

remove          Remove the shared memory segment, if we have permission to do so.

clean_up        Remove all shared memory segments created by this process.

clean_up_all    Remove all shared memory segments in existence for which this
                process has permission to do so.

For example:

# grab a handle to the tied object via the 'tied' command
$shmem = tied $shared_scalar;

# use the object handle to call the 'remove' method on it
print "Removed shared scalar" if $shmem->remove;
We can also lock variables using the IPC::Shareable object's shlock and shunlock methods. If the variable is already locked, the process attempting the lock will block until it becomes free. For example:

$shmem->shlock;
$shared_scalar = "new value";
$shmem->shunlock;
Behind the scenes this lock is implemented with IPC::Semaphore, so for a more flexible mechanism, use IPC::Semaphore objects directly.

As a lightweight alternative to IPC::Shareable, we can make use of the IPC::ShareLite module, naturally available from CPAN. This provides a simple store-fetch mechanism using object methods and does not provide a tied interface. However, it is faster than IPC::Shareable.
Threads

Threads are, very loosely speaking, the low-fat and lean version of forked processes. The trouble with fork is that it not only divides the thread of execution into two processes, it divides their code and data too. This means that where we had one group containing the Perl interpreter, %ENV hash, @ARGV array, and the set of loaded modules and subroutines, we now have two groups. In theory, this is wasteful of resources, very inconvenient, and unworkable for large numbers of processes. Additionally, since processes do not share data they must use constructs like pipes, shared memory segments, or signals to communicate with each other. In practice, most modern UNIX systems have become very intelligent about sharing executable segments behind the scenes, reducing the expense of a fork.
Regardless, threads attempt to solve all these problems. Like forked processes, each thread is a separate thread of execution. Also similar to forked processes, newly created threads are owned by the thread that created them, and they have unique identifiers, though these are thread IDs rather than process IDs. We can even wait for a thread to finish and collect its exit result, just like waitpid does for child processes. However, threads all run in the same process and share the same interpreter, code, and data; nothing is duplicated except the thread of execution. This makes them much more lightweight, so we can have very many of them, and we don't need to use any of the workarounds that forked processes need.

Thread support in Perl is still experimental. It was added in Perl 5.6, and work continues into Perl 5.7 and Perl 6. As shipped, most Perl interpreters do not support threads at all, but if we build Perl from scratch, as in Chapter 1, we can enable them. There are actually two types of thread: the current implementation provides for a threaded application but with only one interpreter dividing its attention between the threads. The newer implementation uses an interpreter that is itself threaded, greatly improving performance. However, this thread support is so new it isn't even available in Perl code yet. The likelihood is that full official thread support will arrive with Perl 6, but that doesn't mean we can't use it now – carefully – for some useful and interesting applications.
Checking for Thread Support

To find out if threads are available programmatically we can check for the usethreads key in the %Config hash:

BEGIN {
    use Config;
    die "Threadbare!\n" unless $Config{'usethreads'};
}
From the command line we can check if threads are present by trying to read the documentation; the Thread module will only be present if thread support is enabled:

> perldoc Thread

The Thread module is the basis of handling threads in Perl. It is an object-oriented module that represents threads as objects, which we can create using new and manipulate with methods. In addition it provides a number of functions that operate on a per-thread basis.
Creating Threads

Threads resemble forked processes in many ways, but in terms of creation they resemble, and indeed are, subroutines. The standard way of creating a thread, the new method, takes a subroutine reference and a list of arguments to pass to it. The subroutine is then executed by a new thread of execution, while the parent thread receives a thread object as the return value. The following code snippet illustrates how it works:

use Thread;

sub threadsub {
    $self = Thread->self;
    ...
}

$thread = new Thread \&threadsub, @args;
This creates a new thread, with threadsub as its entry point, while the main thread continues on. The alternative way to create a thread is with the async function, whose syntax is analogous to an anonymous subroutine. This function is not imported by default so we must name it in the import list to the Thread module:

use Thread qw(async);

$thread = async {
    $self = Thread->self;
    ...
};
Just like an anonymous subroutine, we end the async statement with a semicolon; the block may not need it, but the statement does. The choice of new or async depends on the nature of the thread we want to start; the two are identical in all respects apart from their syntax. If we only want to start one instance of a thread then async should be used, whereas new is better if we want to use the same subroutine for many different threads:

$thread1 = new Thread \&threadsub, $arg1;
$thread2 = new Thread \&threadsub, $arg2;
$thread3 = new Thread \&threadsub, $arg3;
Or, with a loop:

# start a new thread for each argument passed in @ARGV
my @threads;
foreach (@ARGV) {
    push @threads, new Thread \&threadsub, $_;
}
We can do this with a certain level of impunity because threads consume far fewer resources than forked processes.
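On recent Perls the experimental Thread module described here has been superseded by the threads module, which follows the same create-and-join pattern; a hedged sketch of the equivalent code (the worker subroutine and its arguments are invented for illustration):

```perl
use strict;
use warnings;
use threads;

# hypothetical worker: sums its arguments and returns the total
sub worker {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum;
}

# create a thread running worker(1, 2, 3), just as 'new Thread' would
my $thr = threads->create(\&worker, 1, 2, 3);
print "started thread ", $thr->tid, "\n";

my $result = $thr->join;    # blocks until worker returns
print "result: $result\n";
```

The tid, join, and detach methods shown throughout this section carry over to the threads module largely unchanged.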
Identifying Threads

Since we can start up many different threads all with the same subroutine as their entry point, it might seem tricky to tell them apart. However, this is not so. First, we can pass in different arguments when we start each thread to set them on different tasks. An example of this would be a filehandle, newly created by a server application, and this is exactly what an example of a threaded server in Chapter 23 does. Second, a thread can create a thread object to represent itself using the self class method:

$self = Thread->self;
With this thread object the thread can now call methods on itself, for example tid, which returns the underlying thread number:

$self_id = $self->tid;
It is possible to have more than one thread object containing the same thread ID, and this is actually common in some circumstances. We can check for equivalence by comparing the IDs, but we can do better by using the equal method:

print "Equal!\n" if $self->equal($thread);
Or, equivalently: print "Equal! \n" if $thread->equal($self);
Thread identities can be useful for all kinds of things, one of the most useful being thread-specific data.
Thread-specific Data

One of the drawbacks of forked processes is that they do not share any of their data, making communication difficult. Threads have the opposite problem; they all share the same data, so finding privacy is that much harder. Unlike some other languages, Perl does not have explicit support for thread-specific data, but we can improvise. If our thread all fits into one subroutine we can simply create a lexical variable with my and use that. If we do have subroutines to call and we want them to be able to access variables we create, our can be used. Alternatively, we can create a global hash keyed by thread ID and use the values for thread-specific data:

our %thread_data;

sub threadsub {
    my $id = Thread->self->tid;
    $thread_data{$id}{'Started'} = time;
    ...
    $thread_data{$id}{'Ended'} = time;
}

new Thread \&threadsub, $arg1;
new Thread \&threadsub, $arg2;
The advantage of using my or our is that the data is truly private, because it is lexically scoped to the enclosing subroutine. The advantage of the global hash is that threads can potentially look at each other's data in certain well-defined circumstances.
Thread Management

Perl keeps a list of every thread that has been created. We can get a copy of this list, as thread objects, with the Thread->list class method:

@threads = Thread->list;
One of these threads is our own thread. We can find out which by using the equal method:

$self = Thread->self;
foreach (@threads) {
    next if $self->equal($_);
    ...
}
Just because a thread is present does not mean that it is running, however. Perl keeps a record of the return value of every thread when it exits, and keeps a record of the thread for as long as that value remains unclaimed. This is similar to child processes that have not had waitpid called for them. The threaded equivalent of waitpid is the join method, which we call on the thread whose exit value we want to retrieve:

$return_result = $thread->join;
The join method will block until the thread on which join was called exits. If the thread aborts (for example by calling die) then the error will be propagated to the thread that called join. This means that it will itself die unless the join is protected by an eval:

$return_result = eval { $thread->join };
if ($@) {
    warn "Thread unraveled before completion\n";
}
As a convenience for this common construct, the Thread module also provides an eval method, which wraps join inside an eval for us:

$return_result = $thread->eval;
if ($@) {
    warn "Thread unraveled before completion\n";
}
It is bad form to ignore the return value of a thread, since it clutters up the thread list with dead threads. If we do not care about the return value then we can tell the thread that we do not want it to linger and preserve its return value by telling it to detach:

$thread->detach;
The catch to this is that if the thread dies, nobody will notice, unless a signal handler for the __DIE__ hook, which checks Thread->self for the dying thread, has been registered. To reiterate: if we join a moribund thread from the main thread without precautions we do have to worry about the application dying as a whole. As a slightly fuller and more complete (although admittedly not particularly useful) example, this short program starts up five threads, then joins each of them in turn before exiting:

#!/usr/bin/perl
# join.pl
use warnings;
use strict;
# check we have threads
BEGIN {
    use Config;
    die "Threadbare!\n" unless $Config{'usethreads'};
}

use Thread;

# define a subroutine for threads to execute
sub threadsub {
    my $self = Thread->self;
    print "Thread ", $self->tid, " started\n";
    sleep 10;
    print "Thread ", $self->tid, " ending\n";
}

# start up five threads, at one second intervals
my @threads;
foreach (1..5) {
    push @threads, new Thread \&threadsub;
    sleep 1;
}

# wait for each thread in turn to end
while (my $thread = shift @threads) {
    print "Waiting for thread ", $thread->tid, " to end...\n";
    $thread->join;
    print "Ended\n";
}

# exit
print "All threads done\n";
Typically, we care about the return value, and hence would always check it. However, in this case we are simply using join to avoid terminating the main thread prematurely.
Variable Locks

When multiple threads are all sharing data we sometimes have problems stopping them from treading on each other's toes. Using thread-specific data solves part of this problem, but it does not deal with sharing common resources amongst a pool of threads. The lock subroutine does handle this, however. It takes any variable as an argument and places a lock on it so that no other thread may lock it for as long as the lock persists, which is defined by its lexical scope. The distinction between lock and access is important; any thread can simply access the variable by not bothering to acquire the lock, so the lock is only good if all threads abide by it. As a short and incomplete example, this subroutine locks a global variable for the duration of its body:

our $global;

sub varlocksub {
    lock $global;
    ...
}
It is not necessary to unlock the variable; the end of the subroutine does it for us. Any lexical scope is acceptable, so we can also place locks inside the clauses of if statements, loops, map and grep blocks, and eval statements. We can also choose to lock arrays and hashes in their entirety if we lock the whole variable:

lock @global;
lock %global;
lock *global;
Alternatively, we can lock just an element, which gives us a form of record-based locking, if we define a record as an array element or hash value:

our %global;

sub varlocksub {
    $key = shift;
    lock $global{$key};
    ...
}
In this version, only one element of the hash is locked, so any other thread can enter the subroutine with a different key and the corresponding value at the same time. If a thread comes in with the same key, however, it will find the value under lock and key (so to speak) and will have to wait.
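Under the modern threads model the same idea is spelled threads::shared: variables must be explicitly marked :shared, after which lock works on them just as described. A hedged sketch (the counter and thread counts here are arbitrary):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $counter :shared = 0;

# each thread increments the shared counter under a lock; the lock is
# released automatically when execution leaves the enclosing block
sub bump {
    for (1 .. 1000) {
        lock $counter;
        $counter++;
    }
}

my @pool = map { threads->create(\&bump) } 1 .. 4;
$_->join for @pool;
print "counter = $counter\n";   # always 4000: no lost updates
```

Without the lock, concurrent increments could interleave and updates would be lost; with it, the final count is deterministic.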
Condition Variables, Semaphores, and Queues
Locked variables have more applications than simply controlling access. We can also use them as conditional blocks by having threads wait on a variable until it is signaled to proceed. In this mode the variable, termed a condition variable, takes on the role of a starting line; each thread lines up on the block (so to speak) until the starting pistol is fired by another thread. Depending on the type of signal we send, either a single thread is given the go-ahead to continue, or all threads are signaled.
Condition variables are a powerful tool for organizing threads, allowing us to control the flow of data through a threaded application and preventing threads from accessing shared data in an unsynchronized manner. They are also the basis for other kinds of thread interaction. We can use thread semaphores, provided by the Thread::Semaphore module, to signal between threads. We can also implement a queue of tasks between threads using the Thread::Queue module. Both these modules build upon the basic features of condition variables to provide their functionality but wrap them in a more convenient form. To get a feel for how each of these work we will implement a basic but functional threaded application, first using condition variables directly, then using semaphores, and finally using a queue.
Condition Variables

Continuing the analogy of the starting line, to 'line up' threads on a locked variable we use the cond_wait subroutine. This takes a locked variable, unlocks it, and then waits until it receives a signal from another thread. When it receives a signal, the thread resumes execution and relocks the variable. To have several threads all waiting on the same variable we need only have each thread lock and then cond_wait the variable in turn. Since the lock prevents more than one thread executing cond_wait at the same time, the process is automatically handled for us. The following code extract shows the basic technique applied to a pool of threads:
our $lockvar;   # lock variable – note it is not locked at this point

sub my_waiting_thread {
    # wait for signal
    {
        lock $lockvar;
        cond_wait $lockvar;
    }
    # ...the rest of the thread, where the work is done
}

for (1..10) {
    new Thread \&my_waiting_thread;
}
This code snippet shows ten threads, all of which use the same subroutine as their entry point. Each one locks and then waits on the variable $lockvar until it receives a signal. Since we don't want to retain the lock on the variable after we leave cond_wait, we place both lock and cond_wait inside their own block to limit the scope of the lock. This is important since other threads cannot enter the waiting state while we have a lock on the variable; sometimes that is what we want, but more often it isn't. Having established a pool of waiting threads, we need to send a signal to wake one of them up, which we do with cond_signal:

# wake up one thread
cond_signal $lockvar;
This will unlock one thread waiting on the condition variable. The thread that is restarted is essentially random; we cannot assume that the first thread to block will be the first to be unlocked again. This is appropriate when we have a pool of threads at our disposal, all of which perform the same basic function. Alternatively, we can unlock all threads at once by calling cond_broadcast thus:

# everybody up!
cond_broadcast $lockvar;
This sends a message to each thread waiting on that variable, which is appropriate for circumstances where a common resource is conditionally available and we want to stop or start all threads depending on whether the resource is available or not. It is important to realize, however, that if no threads are waiting on the variable, the signal is discarded; it is not kept until a thread is ready to respond to it. It is also important to realize that this has nothing (directly) to do with process signals, as handled by the %SIG array; writing a threaded application to handle process signals is a more complex task; see the Thread::Signal module for details. Note that the actual value of the lock variable is entirely irrelevant to this process, so we can use it for other things. For instance we can use it to pass a value to the thread at the moment that we signal it. The following short threaded application does just this, using a pool of service threads to handle lines of input passed to them by the main thread. While examining the application, pay close attention to the two condition variables that lay at the heart of the application:
❑ $pool – used by the main thread to signal that a new line is ready. It is waited on by all the service threads. Its value is programmed to hold the number of threads currently waiting, so the main thread knows whether it can send a signal or if it must wait for a service thread to become ready.

❑ $line – used by whichever thread is woken by the signal to $pool. Lets the main thread know that the line of input read by the main thread has been copied to the service thread and that a new line may now be read. The value is the text of the line that was read.
The two condition variables allow the main thread and the pool of service threads to cooperate with each other. This ensures that each line read by the main thread is passed to one service thread both quickly and safely:

#!/usr/bin/perl
# threadpool.pl
use warnings;
use strict;

use Thread qw(cond_signal cond_wait cond_broadcast yield);

my $threads = 3;   # number of service threads to create
my $pool = 0;      # child lock variable and pool counter, set to 0 here;
                   # service threads increment it when they are ready for input
my $line = "";     # parent lock variable and input line, set to "" here; we
                   # assign each new line of input to it, and set it to 'undef'
                   # when we are finished to tell service threads to quit

# a locked print subroutine – stops thread output mingling
sub thr_print : locked {
    print @_;
}

# create a pool of three service threads
foreach (1..$threads) {
    new Thread \&process_thing;
}

# main loop: read a line, wait for a service thread to become available,
# signal that a new line is ready, then wait for whichever thread picked
# up the line to signal us to continue
while ($line = <>) {
    chomp $line;
    thr_print "Main thread got '$line'\n";

    # do not signal until at least one thread is ready
    if ($pool == 0) {
        thr_print "Main thread has no service threads available, yielding\n";
        yield until $pool > 0;
    }
    thr_print "Main thread has $pool service threads available\n";

    # signal that a new line is ready
    {
        lock $pool;
        cond_signal $pool;
    }
    thr_print "Main thread sent signal, waiting to be signaled\n";
Chapter 22
    # wait for whichever thread wakes up to signal us
    {
        lock $line;
        cond_wait $line;
    }
    thr_print "Main thread received signal, reading next line\n";
}
thr_print "All lines processed, sending end signal\n";

# set the line to special value 'undef' to indicate end of input
$line = undef;
{
    lock $pool;
    # tell all threads to pick up this 'line' so they all quit
    cond_broadcast $pool;
}
thr_print "Main thread ended\n";
exit 0;

# the thread subroutine – block on lock variable until work arrives
sub process_thing {
    my $self = Thread->self;
    my $thread_line;
    thr_print "Thread ", $self->tid, " started\n";

    while (1) {
        # has the 'quit' signal been sent while we were busy?
        last unless (defined $line);

        # wait to be woken up
        thr_print "Thread ", $self->tid, " waiting\n";
        {
            lock $pool;
            $pool++;
            cond_wait $pool;   # all threads wait here for the signal
            $pool--;
        }

        # retrieve value to process
        thr_print "Thread ", $self->tid, " signaled\n";
        $thread_line = $line;

        # was this the 'quit' signal? Check the value sent
        last unless (defined $thread_line);

        # let main thread know we have got the value
        thr_print "Thread ", $self->tid, " retrieved data, signaling main\n";
        {
            lock $line;
            cond_signal $line;
        }

        # do private spurious things to it
        chomp($thread_line = uc $thread_line);
        thr_print "Thread ", $self->tid, " got '$thread_line'\n";
    }
    thr_print "Thread ", $self->tid, " ended\n";
}
Creating and Managing Processes
Once the basic idea of a condition variable is understood, the way in which this application works becomes clearer. However, a few aspects are still worth pointing out. In particular, the $pool variable is used by the main thread to ensure that it only sends a signal when there is a service thread waiting to receive it. To achieve this, each service thread increments $pool immediately before its cond_wait and decrements it immediately afterwards. By doing this we ensure that $pool accurately reflects the number of waiting service threads; if it is zero, the main thread uses yield to pass on execution until a service thread becomes available again.

The means by which the application terminates is also worth noting. Threads do not necessarily terminate just because the main thread does, so in order to exit a threaded application cleanly we need to make sure all the service threads terminate too. This is especially important if resources needed by some threads are being used by others; shutting down threads in the wrong order can lead to serious problems. In this application the main thread uses cond_signal to signal the $pool variable and wake up one service thread when a new line is available. Once all input has been read we need to shut down all the service threads, which means getting their full attention. To do that, we give $line the special value undef and then use cond_broadcast to signal all threads to pick up the new 'line' and exit when they see that it is undef. However, this alone is not enough, because a thread might be busy and not waiting. To deal with that possibility, the service thread subroutine also checks the value of $line at the top of the loop, just in case the thread missed the signal.

Finally, this application also illustrates the use of the locked subroutine attribute. The thr_print subroutine is a wrapper for the regular print function that only allows one thread to print at a time. This prevents the output of different threads from getting intermingled.
For simple tasks like this one, locked subroutines are an acceptable solution to an otherwise tricky problem that would require at least a lock variable. For longer tasks, locked subroutines can be a serious bottleneck, affecting the performance of a threaded application, so we should use them with care and never for anything likely to take appreciable time.
Semaphores

Although it works perfectly well, the above application is a little more complex than it needs to be. Most forms of threads (whatever language or platform they reside on) support the concept of semaphores, and Perl is no different. We covered IPC semaphores earlier, and thread semaphores are very similar. They are essentially numeric flags that take a value of zero or any positive number and obey the following simple rules:

❑ Only one thread at a time may manipulate the value of a semaphore in either direction.
❑ Any thread may increment a semaphore immediately.
❑ Any thread may decrement a semaphore immediately if the decrement will not take it below zero.
❑ If a thread attempts to decrement a semaphore below zero, it will block until another thread raises the semaphore high enough.
Perl provides thread semaphores through the Thread::Semaphore module, which implements semaphores in terms of condition variables – the code of Thread::Semaphore is actually quite short, as well as instructive. It provides one subroutine and two object methods:
new – create a new semaphore, for example:

$semaphore = new Thread::Semaphore;     # create semaphore, initial value 1
$semaphore2 = new Thread::Semaphore(0); # create semaphore, initial value 0
up – increment semaphore, for example:

$semaphore->up;    # increment semaphore by 1
$semaphore->up(5); # increment semaphore by 5
down – decrement semaphore, blocking if necessary:

$semaphore->down;    # decrement semaphore by 1
$semaphore->down(5); # decrement semaphore by 5
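The counting behavior is easy to see even without spawning any threads. This sketch assumes a reasonably recent Thread::Semaphore (version 2.00 or later, bundled with modern perls), which adds a non-blocking down_nb method alongside up and down:

```perl
#!/usr/bin/perl
# semcount.pl - demonstrate the semaphore counting rules; assumes a
# Thread::Semaphore new enough to provide the down_nb method
use warnings;
use strict;
use Thread::Semaphore;

# a semaphore modelling a resource with two free slots
my $slots = Thread::Semaphore->new(2);

print "claimed slot 1\n" if $slots->down_nb;    # succeeds, value 2 -> 1
print "claimed slot 2\n" if $slots->down_nb;    # succeeds, value 1 -> 0
print "no slot free\n" unless $slots->down_nb;  # would go below zero, fails

$slots->up;                                     # release one slot, 0 -> 1
print "claimed slot again\n" if $slots->down_nb;
```

In a threaded program a plain down would block at the third step instead of failing, which is exactly the behavior the rules above describe.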
Depending on our requirements we can use semaphores as binary stop/go toggles or allow them to range to larger values to indicate the availability of a resource. Here is an adapted form of our earlier threaded application, rewritten to replace the condition variables with semaphores:

#!/usr/bin/perl
# semaphore.pl
use warnings;
use strict;
use Thread qw(yield);
use Thread::Semaphore;

my $threads = 3;  # number of service threads to create
my $line = "";    # input line
my $main = new Thread::Semaphore;     # proceed semaphore, initial value 1
my $next = new Thread::Semaphore(0);  # new line semaphore, initial value 0

# a locked print subroutine – stops thread output mingling
sub thr_print : locked {
    print @_;
}

# create a pool of three service threads
foreach (1..$threads) {
    new Thread \&process_thing;
}

# main loop: read a line, raise 'next' semaphore to indicate a line is
# available, then wait for whichever thread lowered the 'next' semaphore
# to raise the 'main' semaphore, indicating we can continue
while ($line = <>) {
    chomp $line;
    thr_print "Main thread got '$line'\n";

    # notify service threads that a new line is ready
    $next->up;
    thr_print "Main thread set new line semaphore, waiting to proceed\n";
    # do not proceed until value has been retrieved by responding thread
    $main->down;
    thr_print "Main thread received instruction to proceed\n";
}
thr_print "All lines processed, sending end signal\n";

# set the line to special value 'undef' to indicate end of input
$line = undef;

# to terminate all threads, raise 'new line' semaphore to >= number of
# service threads: all service threads will decrement it and read the 'undef'
$next->up($threads);
thr_print "Main thread ended\n";
exit 0;

# the thread subroutine – block on the semaphore until work arrives
sub process_thing {
    my $self = Thread->self;
    my $thread_line;
    thr_print "Thread ", $self->tid, " started\n";

    while (1) {
        # try to decrement 'next' semaphore – winning thread gets the line
        thr_print "Thread ", $self->tid, " waiting\n";
        $next->down;

        # retrieve value to process
        thr_print "Thread ", $self->tid, " signalled\n";
        $thread_line = $line;

        # was this the 'quit' signal? Check the value sent
        last unless (defined $thread_line);

        # let main thread know we have got the value
        thr_print "Thread ", $self->tid, " retrieved data, signaling main\n";
        $main->up;

        # do private spurious things to it
        chomp($thread_line = uc $thread_line);
        thr_print "Thread ", $self->tid, " got '$thread_line'\n";
    }
    thr_print "Thread ", $self->tid, " ended\n";
}
The semaphore version of the application is simpler than the condition variable implementation, if only because we have hidden the details of all the cond_wait and cond_signal functions inside calls to up and down. Instead of signaling the pool of service threads via a condition variable, the main thread simply raises the next semaphore by one, giving it the value 1. Meanwhile, all the service threads are attempting to decrement this semaphore. One will succeed and receive the new line of input, and the others will fail, continuing to block until the semaphore is raised again. When it has copied the line to its own local variable, the thread raises the main semaphore to tell the main thread that it can proceed to read another line. The concept is recognizably the same as the previous example, but is easier to follow.
We have also taken advantage of the fact that semaphores can hold any positive value to terminate the application. When the main thread runs out of input it simply raises the 'next' semaphore to be equal to the number of service threads. At this point all the threads can decrement the semaphore, read the value of $line that we again set to undef, and quit. If a thread is still busy the semaphore will remain positive until it finishes and comes back to decrement it – we have no need to put in an extra check in case we missed a signal.
Queues

Many threaded applications, our example being a case in point, involve the transport of data between several different threads. In a complex application incoming data might travel through multiple threads, passed from one to the next before being passed out again: a bucket-chain model. We can create pools of threads at each stage along the chain in a similar way to the example application above, but this does not improve upon the mechanism that allows each thread to pass data to the next in line.

The two versions of the application that we have produced so far are limited by the fact that they only handle a single value at a time. Before the main thread can read another line, it has to dispose of the previous one. Even though we can process multiple lines with multiple service threads, the communication between the main thread and the service threads is not very efficient. If we were communicating between different processes we might use a pipe, which buffers output from one process until the other can read it. The same idea works for threads too, and takes the form of a queue.

Perl provides support for queues through the Thread::Queue module, which implements simple thread queues in a similar way to the semaphores created by Thread::Semaphore. Rather than a single numeric flag, however, the queue consists of a list to which values may be added at one end and removed from the other. At heart this is essentially no more than a push and shift operation. By using condition variables and locking internally, however, the module creates a queue to which values may be added, and from which they may be removed, safely in a threaded environment, following rules similar to those for semaphores:

❑ Only one thread may add or remove values in the queue at a time.
❑ Any thread may add values to a queue immediately.
❑ Any thread may remove values from a queue immediately if there are enough values available in the queue.
❑ If a thread attempts to remove more values than are available, it will block until another thread adds sufficient values to the queue.
The Thread::Queue module provides one subroutine and four methods to create and manage queues:

new – create a new queue, for example:

$queue = new Thread::Queue;
enqueue – add one or more values to a queue, for example:

$queue->enqueue($value);  # add a single value
$queue->enqueue(@values); # add several values
dequeue – remove a single value from a queue, blocking if necessary:

$value = $queue->dequeue; # remove a single value, block
dequeue_nb – remove a single value from a queue, without blocking (instead returning undef if nothing is available):

$value = $queue->dequeue_nb; # remove a single value, don't block
if (defined $value) {
    ...
}
pending – return the number of values currently in the queue, for example:

print "There are ", $queue->pending, " items in the queue\n";
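Because the queue is just a locked list underneath, its semantics can be sketched without spawning any threads; only dequeue's blocking needs a second thread to observe. A minimal sketch exercising the methods above:

```perl
#!/usr/bin/perl
# queuedemo.pl - exercise Thread::Queue's methods in a single thread
use warnings;
use strict;
use Thread::Queue;

my $queue = Thread::Queue->new;

$queue->enqueue('alpha');                 # add a single value
$queue->enqueue('beta', 'gamma');         # add several values
print $queue->pending, " items queued\n"; # prints "3 items queued"

my $first = $queue->dequeue;              # removes 'alpha' - FIFO order
print "got '$first'\n";

$queue->dequeue_nb for 1..2;              # drain the rest without blocking
my $empty = $queue->dequeue_nb;           # queue is empty, so returns undef
print "queue drained\n" unless defined $empty;
```

Note that values come back in the order they went in, which is what makes a queue a natural fit for the bucket-chain model described above.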
Using a queue we can rewrite our threaded application again to separate the main thread from the pool of service threads. Since the queue can take multiple values, the main thread no longer has to wait for each value it passes on to be picked up before it can continue. This simplifies both the code and the execution of the program. The queue has no limit, however, so we make sure not to read too much by checking the size of the queue, and yielding if it reaches a limit we choose. Here is a revised version of the same application, using a queue:

#!/usr/bin/perl
# queue.pl
use warnings;
use strict;
use Thread 'yield';
use Thread::Queue;
use Thread::Semaphore;

my $threads = 3;       # number of service threads to create
my $maxqueuesize = 5;  # maximum size of queue allowed

my $queue = new Thread::Queue;         # the queue
my $ready = new Thread::Semaphore(0);  # a 'start-gun' semaphore, initialized
                                       # to 0; each service thread raises it by 1

# a locked print subroutine – stops thread output mingling
sub thr_print : locked {
    print @_;
}

# create a pool of service threads
foreach (1..$threads) {
    new Thread \&process_thing, $ready, $queue;
}

# wait for all service threads to increment the semaphore
$ready->down($threads);
# main loop: read a line, queue it, read another, repeat until done;
# yield and wait if the queue gets too large
while (<>) {
    chomp;
    thr_print "Main thread got '$_'\n";

    # stall if we're getting too far ahead of the service threads
    yield while $queue->pending >= $maxqueuesize;

    # queue the new line
    $queue->enqueue($_);
}
thr_print "All lines processed, queuing end signals\n";

# to terminate all threads, send as many 'undef's as there are service threads
$queue->enqueue((undef) x $threads);
thr_print "Main thread ended\n";
exit 0;

# the thread subroutine – block on the queue until work arrives
sub process_thing {
    my ($ready, $queue) = @_;
    my $self = Thread->self;
    thr_print "Thread ", $self->tid, " started\n";
    $ready->up;  # indicate that we're ready to go

    while (1) {
        # wait for the queue to deliver an item
        thr_print "Thread ", $self->tid, " waiting\n";
        my $thread_line = $queue->dequeue;

        # was this the 'quit' signal? Check the value sent
        last unless (defined $thread_line);

        # do private spurious things to it
        chomp($thread_line = uc $thread_line);
        thr_print "Thread ", $self->tid, " got '$thread_line'\n";
    }
    thr_print "Thread ", $self->tid, " ended\n";
}
Since the service threads block if no values are waiting in the queue, this approach effectively handles the job of having service threads wait – we previously dealt with this using condition variables and semaphores. However, we don't need a return semaphore anymore because there is no longer any need for a service thread to signal the main thread that it can continue – the main thread is free to continue as soon as it has copied the new line into the queue. The means by which we terminate the program has also changed. Originally we set the line variable to undef and broadcast to all the waiting threads. We replaced that with a semaphore, which we raised high enough so that all service threads could decrement it. With a queue we use a variation on the semaphore approach, adding sufficient undef values to the queue so that all service threads can remove one and exit.
We have added one further refinement to this version of the application – a 'start-gun' semaphore. Simply put, this is a semaphore created with a value of zero, which each service thread increments by one as it starts. The main thread attempts to decrement the semaphore by a number equal to the number of service threads, so it will only start to read lines once all the service threads are running.

Why is this useful? Because threads have no priority of execution. In the previous examples the first service threads will start receiving and processing lines before later threads have even initialized. In a busy threaded application, the activity of these threads may mean that the service threads started last do so very slowly, and may possibly never get the time to initialize themselves properly. In order to make sure we have a full pool of threads at our disposal, we use this semaphore to hold back the main thread until the entire pool is assembled.

There are many other ways that threaded applications can be built, and with more depth than the examples we have shown here. Applications that also need to handle process signals should consider the Thread::Signal module, for instance. Perl threads are still experimental, so an in-depth discussion of them is perhaps inappropriate. However, more examples and a more detailed discussion of threaded programming models and how Perl handles threads can be found in the perlthrtut manual page by executing:

> perldoc perlthrtut
Thread Safety and Locked Code
Due to the fact that threads share data, modules and applications that were not built with them in mind can become confused if two threads try to execute the same code. This generally comes about when a module keeps data in global variables, which are visible to both threads and which are overwritten by each of them in turn. The result is threads that pollute each other's data. The best solution to this problem is to rewrite the code to allow multiple threads to execute it at once; code that does this is called thread safe. If this is too hard or time-consuming, we can use the :locked subroutine attribute to force a subroutine into single-threaded mode, so that only one thread can be inside the subroutine at any one time (other calling threads will block until the current call is completed):
sub singlethreadedsub : locked {
    print "One at a time! Wait your turn!\n";
}
This works well for functional applications, but for object-oriented programming locking a method can be overly restrictive; we only need to make sure that two threads handling the same object do not conflict. To do this we can add the :method attribute:

sub singlethreadedmethod : locked : method {
    my $self = shift;
    ...
}
We can also lock a subroutine in code by calling lock on it:

sub mysub {
    # %Config comes from the standard Config module ('use Config;')
    lock \&mysub if $Config{'usethreads'};
    ...
}
The main use for this is for applications that we want to work in both threaded and unthreaded environments and which contain subroutines that cannot have two threads executing them at once.
Summary

Over the course of this chapter we have covered the following:
954
❑ Signals and how to handle them.
❑ Using Perl's fork function to start new processes.
❑ Inter-Process Communication and the various ways in which different processes can communicate with one another.
❑ Sharing of data between processes, using message queues, semaphores, and shared memory segments.
❑ Thread support in Perl.
Networking with Perl

Network programming is a very large topic. In this chapter, we will attempt to provide a solid foundation in the basic concepts and principles of networking. We will cover the TCP/IP protocol and its close relatives, then move on to developing simple network clients and servers using Perl's standard libraries. We will also look at the functions Perl provides for determining network configurations, and the helper modules that make analyzing this information simpler.
An Introduction to Networks

Most Perl network programming involves establishing connections across, and communicating through, TCP/IP networks. In order to understand why network-programming interfaces work the way they do, an overview of networking, and of the TCP/IP protocol in particular, can be useful. In pursuit of this, what follows is a somewhat terse but reasonably informative overview of how networks function at a fundamental level, and how TCP/IP bridges the gap between basic network functionality and application services.
Protocol Layers

For any message to be communicated it must travel through some form of medium. This medium can range from air and space (used for satellite links and packet radio) to copper cabling and fiber optics. The medium of choice for the last twenty years has been copper cabling, due to the ubiquity of cost-effective equipment based on open standards like Ethernet. This supplies the physical part of the network, also known as the physical layer.
Chapter 23
On top of this is layered a series of protocols, each providing a specific function, the preceding layer acting as the foundation for the one above. The ISO worked on a seven-layer reference model for protocol standardization, to allow multiple vendors to communicate with each other. While it does not define any specific protocol or implementation, many modern protocols were designed against its model. The OSI (Open Systems Interconnection) reference model appears as follows:

OSI Model
---------
Application
Presentation
Session
Transport
Network
Data-link
Physical
The bottom layer, which we already mentioned, is the communication, or physical, medium. Using Ethernet as an example, this layer includes not only the copper cabling, but also the impedance and voltage levels used on it.

The second layer, known as the data-link layer, defines how that medium is to be used. For instance, our standard telephone systems are analog devices; the voice traffic is carried over analog carrier waves on the wire. Ethernet, like most computer network devices, prefers square waves. Because of this, this layer comprises the circuitry used to encode any transmissions in a form the physical layer can deliver, and to decode any incoming traffic. It also has to ensure successful transmission of traffic. If none of this makes any sense, don't despair, as this is typically the domain of electrical and electronic engineers, not programmers.

The network layer works very closely with the data-link layer. Not only must it ensure that it is providing the data-link layer with information it can encode (handling only digital traffic to digital networks, and so on), but it also provides the addressing scheme to be used on the network.

The transport layer provides handshaking and checks to ensure reliability of delivery. Certain protocols demand acknowledgment of every packet of information delivered to the recipient, and if certain portions aren't acknowledged, they are re-sent by this layer.

The session layer is responsible for establishing and maintaining sessions, as its name would suggest. Sessions are typically necessary for larger and multiple exchanges between specific services and/or users, where the order of information transmitted and received is of great importance.

The presentation layer is just that – a layer responsible for presenting a consistent interface or API to anything needing to establish and maintain connections.
Finally, the application layer specifies precisely how each type of service will communicate across these connections. HTTP, NNTP, and SMTP are examples of application protocols.
Frames, Packets, and Encapsulation

In order to put all of this into perspective, let's examine how one of the most prevalent network architectures compares to this reference model. Since we'll be reusing much of this information later in the chapter, we'll look at TCP/IP networks running over Ethernet:

OSI Model       TCP/IP over Ethernet
Application     \
Presentation    / Application/Terminal
Session         \
Transport       / TCP/UDP
Network         IP
Data-link       \
Physical        / Ethernet Topology
As you can see, the Ethernet standard itself (of which there are several variations and evolutions) encompasses both the physical and data-link layers. At the physical layer, it specifies the electrical signaling, clocking requirements, and connectors. At the data-link layer, Ethernet transceivers accomplish the encoding and decoding of traffic, and ensure delivery of traffic at both ends of the communication.

This last part is actually a bit more involved than one might guess, due to the nature of the physical layer. Since the medium is shared copper, only one host (or node) can transmit at a time. If more than one tries to send out a signal simultaneously, all signals are effectively scrambled by the mixed voltage levels. This event is called a collision. The Ethernet specification anticipates that this will happen (especially on extremely busy networks), and so provides a solution: when a collision is detected, each node waits for a random interval before re-transmitting its data.

In order to prevent particularly long communication sessions between hosts from hogging the line, Ethernet specifies that all communications be carried in units of data of a specific length, called frames. Each frame is delivered only when the wire is sensed as free of traffic. This allows multiple hosts to appear to be communicating simultaneously. In truth, they are cooperatively sending out their frames one at a time, allowing control of the wire to be passed among several hosts before sending out the succeeding frames.

On the receiving end, the specification requires that every node on the network have a unique hardware address (called a MAC address). Every frame is addressed to a specific address, and hence only the node with that address should actually listen to those frames. Each address is composed of two 24-bit segments: the first is the manufacturer's ID, assigned by the IEEE body; the last 24 bits are the unique equipment identifier.
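The 24/24 split of a MAC address is easy to demonstrate with a few lines of string handling. A small sketch (the helper name split_mac is our own, chosen for illustration):

```perl
#!/usr/bin/perl
# macsplit.pl - split a MAC address into its manufacturer (OUI) and
# unit-identifier halves
use warnings;
use strict;

sub split_mac {
    my $mac = shift;
    my @octets = split /:/, $mac;        # six octets, two hex digits each
    die "bad MAC '$mac'" unless @octets == 6;
    my $oui  = join ':', @octets[0..2];  # first 24 bits: manufacturer's ID
    my $unit = join ':', @octets[3..5];  # last 24 bits: unique unit number
    return ($oui, $unit);
}

my ($oui, $unit) = split_mac('3C:89:00:10:FA:0B');
print "manufacturer ID: $oui\n";   # prints "manufacturer ID: 3C:89:00"
print "unit identifier: $unit\n";  # prints "unit identifier: 10:FA:0B"
```

Looking up the manufacturer behind an OUI requires the registration data published by the IEEE, which is beyond this sketch.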
The exception, of course, is broadcast traffic, which is not addressed to a single node, since all nodes should be listening for it. This brings up a couple of caveats worth mentioning here. While in theory we should never have to worry about two pieces of off-the-shelf gear having the same MAC address, it can and does happen, especially with units that allow the address to be reprogrammed. This can cause some quite bizarre network problems that are extremely difficult to isolate if we're not expecting them.
Another caveat is that while each node should only read the frames addressed to it, it doesn't necessarily have to be that way. Low-level network monitoring software ignores this, and reads all traffic. This is called sniffing the wire, and there are legitimate reasons to do this. In the case mentioned above, realizing that two nodes have the same MAC address often requires that you sniff the wire.

As you can see, Ethernet is a carefully managed cooperative environment, but it is fallible. To learn more about Ethernet, or any of the competing standards, we can order publications from the IEEE (Institute of Electrical and Electronics Engineers) web site (http://www.ieee.org/), which governs the standards; or alternatively, from related sites like the Gigabit Ethernet Alliance (http://www.gigabitethernet.org/). http://www.ethernet.org/, while offering a few forums discussing Ethernet standards, is not affiliated with any of the official industry bodies. Lastly, we can always search on the Internet for IEEE 802.3, the CSMA/CD specification that defines the operation of Ethernet technologies.

For printed references, Macmillan Technical Publishing's Switched, Fast, and Gigabit Ethernet (ISBN 1578700736), written by Robert Breyer and Sean Riley, is one of the best publications available. It covers the whole gamut of Ethernet networking, including bridging, routing, QOS, and more.
The Internet Protocol

Referring once more to our comparison chart, we now move into the realm of TCP/IP. The name TCP/IP is actually a bit misleading, since it refers to two separate protocols (and more are used, though not explicitly mentioned). As can be seen, the network layer in the OSI model correlates directly to IP (the Internet Protocol), providing an addressing and routing scheme for the networks.

Why, if Ethernet provides addressing via MAC addresses, do we need yet another addressing scheme? The answer is: because of the segmented nature of large networks. The Ethernet MAC system only works if every node addressed is on the same wire (or bridge). If two segments, or even two separate networks, are linked together via a router or host, then a packet addressed from one host on one network to another on the other network would never reach its destination if it could only rely on the MAC address. IP provides the means for that packet to be delivered across the bridging router or host onto the other network, where it can finally be delivered via the MAC address. We'll illustrate this more clearly in a moment, but first let's get a better understanding of what IP is.

IP as we know it today is actually IPv4, a protocol that provides a 32-bit segmented addressing system. The protocol specification is detailed in RFC 791 (all RFC documents can be obtained from the RFC Editor's web site, http://www.rfc-editor.org/, which is responsible for assigning them numbers and publishing them). As an aside, we should notice that we're dealing with more than one governing body just to implement one model. The reason for this is simple: the IEEE, being electrical and electronics engineers, are responsible for the hardware portion of the implementation. Once we move into software, though, most open standards are suggested through RFCs, or 'Requests for Comments'.
Multiple bodies work closely with the RFC Editor's office, including IANA (the Internet Assigned Numbers Authority), the IETF (Internet Engineering Task Force), and others. There is an emerging new standard, called IPv6, which offers a larger address space and additional features, but we won't be covering it here since IPv6 networks are still fairly rare. You can read more about IPv6 in RFC 2460.
Back to IPv4 and its addressing scheme. Notice that we said that IPv4 provides a segmented 32-bit address space. This address space is similar to MAC addresses, in that we divide the 32 bits into two separate designators. Unlike MAC's manufacturer:unit scheme, the first segment refers to the network, and the second to the host. Also unlike MAC, the division doesn't have to be equally sized; in fact, it rarely is. While the distinction between the network and host can appear subtle, it's extremely important for routing traffic through disparate networks, as you will see. Regardless of this, an IP address is usually represented in dotted-quad form, as four octets (eight bits each) written in decimal, and incorporates both the network and the host segment: 209.165.161.91. In RFC 1700, the Assigned Numbers standard, five classes of networks are defined:

Class   Range     Bitmask of first octet
A       0-127     0xxxxxxx
B       128-191   10xxxxxx
C       192-223   110xxxxx
D       224-239   1110xxxx
E       240-255   1111xxxx
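Since the classes partition the range of the first octet, determining the class of an address takes only a few comparisons. A small sketch (the function name ip_class is our own):

```perl
#!/usr/bin/perl
# ipclass.pl - determine the classful network class from the first octet
use warnings;
use strict;

sub ip_class {
    my $first = (split /\./, shift)[0];  # first octet of the dotted quad
    return 'A' if $first < 128;          # 0xxxxxxx
    return 'B' if $first < 192;          # 10xxxxxx
    return 'C' if $first < 224;          # 110xxxxx
    return 'D' if $first < 240;          # 1110xxxx
    return 'E';                          # 1111xxxx
}

print "10.0.0.0 is class ",    ip_class('10.0.0.0'), "\n";     # class A
print "192.168.1.0 is class ", ip_class('192.168.1.0'), "\n";  # class C
print "224.0.0.1 is class ",   ip_class('224.0.0.1'), "\n";    # class D
```

This is exactly the check that early classful routers performed on the leading bits of every address they handled.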
The 10.0.0.0 network address, for instance, is a class A network, since the first octet falls in the class A range, and 192.168.1.0 is a class C network. In practice, classes A through C are reserved for unicast addresses (networks where nodes typically communicate with single end-to-end connections), class D is reserved for multicast (nodes typically talking simultaneously to groups of other nodes), and E is reserved for future use. We need only be concerned with A through C here. The class of each unicast network determines the maximum size of the network, since each class masks out an additional octet in each IP address, restricting the number of hosts:

Class   Netmask
A       255.0.0.0
B       255.255.0.0
C       255.255.255.0
For instance, a class A network, say 10.0.0.0, has a netmask of 255.0.0.0, which in binary would be represented like this:

00001010 00000000 00000000 00000000   Network address
11111111 00000000 00000000 00000000   Netmask

The network administrator knows that the netmask masks out (with 1s) all of the bits that cannot be used as host addresses. Hence, he or she can use any of the last three octets to address a specific host, allowing a potential 16,777,216 host addresses.
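The masking arithmetic is a one-liner in Perl once the dotted quads are packed into 32-bit integers. A sketch (the helper names ip_to_int and int_to_ip are our own):

```perl
#!/usr/bin/perl
# netmask.pl - recover the network address by ANDing an address with its
# netmask, and count the host addresses from the mask's zero bits
use warnings;
use strict;

# convert a dotted quad to a 32-bit integer, and back again
sub ip_to_int { unpack 'N', pack 'C4', split /\./, shift }
sub int_to_ip { join '.', unpack 'C4', pack 'N', shift }

my $addr = ip_to_int('10.1.2.3');
my $mask = ip_to_int('255.0.0.0');

# the bitwise AND keeps only the network bits of the address
print "network: ", int_to_ip($addr & $mask), "\n";  # network: 10.0.0.0

# the zero bits of the mask define the size of the host address space
my $bits = sprintf '%b', $mask;
my $ones = ($bits =~ tr/1//);
print "addresses: ", 2 ** (32 - $ones), "\n";       # addresses: 16777216
```

The same two-line AND is performed by every IP stack for every outgoing packet, to decide whether the destination is local or remote.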
An administrator assigned a class C network, say 192.168.1.0, can only use the last octet, and hence only address 256 hosts (actually, the number of usable hosts is always 2 less, since 0 is reserved for the network address, and the highest number, 255 in this example, is usually reserved for network-wide broadcasts):

11000000 10101000 00000001 00000000    Network address
11111111 11111111 11111111 00000000    Netmask

As you might have surmised, an independent authority allocates network addresses, which prevents conflicting addresses. There are also, however, smaller blocks within each class that are set aside for internal use by all organizations (and homes, if need be):

Class    Private IP block
A        10.0.0.0 - 10.255.255.255
B        172.16.0.0 - 172.31.255.255
C        192.168.0.0 - 192.168.255.255
</gr-replace>
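The private blocks above are easy to test for with a little bit arithmetic. The following sketch is our own illustration (the helper names are hypothetical), checking an address against each RFC 1918 range:

```perl
#!/usr/bin/perl
# is_private.pl - test whether a dotted-quad address falls in an RFC 1918 block
use warnings;
use strict;

# convert a dotted quad to a 32-bit integer in network byte order
sub ip2int { unpack 'N', pack 'C4', split /\./, shift }

sub is_private {
    my $ip = ip2int(shift);
    return ($ip & 0xFF000000) == ip2int('10.0.0.0')      # 10.0.0.0/8
        || ($ip & 0xFFF00000) == ip2int('172.16.0.0')    # 172.16.0.0/12
        || ($ip & 0xFFFF0000) == ip2int('192.168.0.0');  # 192.168.0.0/16
}

print "$_ => ", (is_private($_) ? 'private' : 'public'), "\n"
    for qw(10.1.2.3 172.20.0.1 192.168.1.7 209.165.161.91);
```

The first three test addresses report as private, the last as public.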
Because these are unregulated addresses, multiple organizations will often use the same blocks internally. This does not cause any conflicts, since these blocks are not routable on public networks (like the Internet). Any host on a company network using these blocks needs to either go through a proxy with a public address, or use some sort of network address translation (often referred to as NAT) in order to communicate with other hosts on public networks. Private IP assignments are covered in RFC 1918.

It is important to note that none of this prevents a network administrator from subdividing internal network blocks in any manner they see fit – dividing the 10.x.x.x class A address space into several class Cs, or into even smaller divisions (bit masks can be applied to the last octet as well). This is done quite frequently to relieve network congestion and localize traffic onto subnets. It creates, in essence, a classless (or CIDR, Classless InterDomain Routing) network, though it may still appear as one monolithic class network to the outside world.

IP uses these network addresses and masks to determine the proper routing and delivery for each bit of information being sent. Each connection is evaluated as needing either local or remote delivery. Local delivery is the easiest, since IP can delegate directly to the lower layers to deliver the data:
[Diagram: two hosts on the same network – 192.168.1.7 (MAC 3C:89:00:10:FA:0B) and 192.168.1.10 (MAC 5A:03:26:DE:02:A1)]
In this example, both hosts are on the same network, so IP requests that Ethernet deliver the data directly. A separate protocol (called ARP – Address Resolution Protocol, which is covered in RFC 826) is used for this. That transaction will go along these lines (imagine that the host above, 192.168.1.7, is sending a message to its neighbor, 192.168.1.10):
Networking with Perl
❑ IP receives a packet (in this case, packet refers to some kind of payload) for delivery.
❑ IP determines that local delivery can be done, since the destination address appears to be on the same network as its own.
❑ An ARP request is broadcast, asking 'Who has .10?'.
❑ 192.168.1.10 replies with its own MAC address.
❑ IP hands the packet over to the Ethernet layer for delivery to the MAC address in the ARP response.
❑ Ethernet delivers the packet directly.
This method will not work, however, when the network is segmented. In that case, some sort of bridge or router must exist that links the two networks, such as a host with two NICs, or a dedicated router. That bridge will have a local address on both networks, and the network will be configured to deliver all traffic that's not for local hosts through the bridge's local address.

Recall that we explained that Ethernet delivers all of its payloads in frames. IP delivers all of its payloads in packets, and both frames and packets have their own addressing schemes. The Ethernet layer is intentionally ignorant of what it is delivering and who the ultimate recipient is, since its only concern is local delivery. In order to preserve all of the extra addressing information that IP requires, it performs complete encapsulation of the IP packet. That packet, as you may have guessed, holds more than just raw data from the upper layers: it begins with a header that includes the IP addressing information, with the remainder taken up by the data itself. Ethernet receives that packet (which must be small enough to fit into an Ethernet frame's data segment), stores the entire packet in its frame's data segment, and delivers it to the local recipient's MAC address.
Accordingly, remote traffic is processed in two stages. Once IP determines that it is delivering to a remote host, it requests the local address of the router or gateway, has Ethernet deliver the packet to the gateway, and the gateway forwards it to the other network for local delivery:
[Diagram: two networks joined by a gateway. On the first network are 192.168.1.7 (3C:89:00:10:FA:0B) and 192.168.1.10 (5A:03:26:DE:02:A1); the gateway has interfaces 192.168.1.1 and 192.168.2.1; on the second network are 192.168.2.4 and 192.168.2.101.]
So, if 192.168.1.7 wanted to send a message to 192.168.2.4, the transaction would look like this:

❑ IP receives a packet for delivery.
❑ IP determines that local delivery cannot be done, since the destination address appears to be on a different network from its own, and so it needs to deliver through the local gateway (referred to as the default route in networking terms), which it has previously been told is 192.168.1.1.
❑ An ARP request is broadcast, asking 'Who has .1?'.
❑ 192.168.1.1 replies with its own MAC address.
❑ IP hands the packet over to the Ethernet layer for delivery to the MAC address in the ARP response.
❑ Ethernet delivers the packet to the gateway.
❑ The gateway examines the packet, and notices the IP address is meant for its other local network.
❑ The IP layer on the gateway's other local port receives the packet.
❑ IP determines that it can deliver the packet locally.
❑ An ARP request is broadcast, asking 'Who has .4?'.
❑ 192.168.2.4 replies with its own MAC address.
❑ IP hands the packet over to the Ethernet layer for delivery to the MAC address in the ARP response.
❑ Ethernet delivers the packet.
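The local-versus-gateway decision made early in each of these transactions boils down to comparing the masked network portions of the two addresses. Here is a small sketch of that test (our own illustration, using the addresses from the diagram):

```perl
#!/usr/bin/perl
# route.pl - decide local vs. gateway delivery, as IP does (illustrative values)
use warnings;
use strict;
use Socket qw(inet_aton);

my ($self, $mask, $gw) = ('192.168.1.7', '255.255.255.0', '192.168.1.1');

sub next_hop {
    my $dest = shift;
    # unpack all three addresses into 32-bit integers for bitwise comparison
    my ($d, $s, $m) = map { unpack 'N', inet_aton($_) } $dest, $self, $mask;
    # same masked network portion => deliver directly; otherwise via gateway
    return (($d & $m) == ($s & $m)) ? $dest : $gw;
}

print "$_ is delivered via ", next_hop($_), "\n"
    for '192.168.1.10', '192.168.2.4';
```

With these values, 192.168.1.10 is delivered directly, while 192.168.2.4 goes via the gateway at 192.168.1.1.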
IP is also responsible for another very important task: if the upper layers hand it large bundles of data (each individual bundle would be considered an IP datagram), it must break them apart into pieces small enough that each piece, together with its header information, can still be encapsulated in a single frame. When this happens, each packet sent out includes a sequence number in the header, allowing the receiving end's IP layer to reconstruct the original datagram from the individual packets.
UDP & TCP

By now, we should have a good idea how networks deliver and route traffic as needed, and we've covered the first three layers of the OSI model with illustrations from TCP/IP over Ethernet. We can also see that IP could equally route traffic from Ethernet to another link layer, such as token ring, or PPP over a modem connection. We're now ready to move on to the next two layers in the OSI model, transport and session. These two layers are combined and handled by TCP (Transmission Control Protocol) and UDP (User Datagram Protocol).

We'll cover UDP first, since it's the simpler of the two protocols. UDP provides a simple transport for any application that doesn't need guarantees of delivery or data integrity. It may not be apparent from that description what the benefits of UDP are, but it has a few. For one, UDP gains a great deal of performance, since less work needs to be done: there's a great deal of overhead involved in maintaining TCP's state and data integrity, whereas UDP applications can send and forget. Another benefit is the ability to broadcast (or multicast) data to groups of machines. TCP is a tightly controlled end-to-end protocol, permitting only one node to be addressed.
NTP (Network Time Protocol) is one application protocol that derives great advantage from using UDP over TCP. Using TCP, a time server would have to wait and allow individual machines to connect and poll the time, so redundant traffic is generated for every additional machine on the network, and a second or more could elapse between the time being sent and the time being received, leading to an incorrect time being recorded. UDP allows the server to multicast its time synchronization traffic in one burst and have all machines receive and respond accordingly, relieving a significant amount of traffic, which can be better used for other services. In this scenario, it's not critical that each machine be guaranteed delivery of the synchronization data, since a machine is unlikely to drift much, if at all, between broadcasts. As long as it receives some of the broadcasts occasionally, it will stay in sync, and bandwidth is preserved.

TCP, on the other hand, is much more expensive to maintain, but if you need to ensure that all the data arrives intact, there is no alternative. TCP provides stateful connections, with each exchange between nodes acknowledged and verified. If any portion of the data fails to come in, TCP will automatically request a retransmission, and the same occurs for any data that arrives but fails to generate the expected checksum.

Another important feature is TCP's use of internal addressing. Every application using a TCP connection is assigned a unique port number. Server applications will typically listen on a fixed port, to ensure that all clients know what IP port they need to connect to for reaching that service. TCP connections take this one step further, identifying each completed circuit by its combined local and remote IP:port pairs.
This allows the same server to serve multiple clients simultaneously, since the traffic is segregated according to the combined local and remote addresses and ports. It is important to note that while UDP has a concept of ports as well, being a connectionless protocol it builds no virtual circuits; it simply sends and receives on the port the application has registered with the UDP layer, and hence the same capabilities do not exist. Also note that port addressing for UDP and TCP exists independently, allowing you to run a UDP and a TCP application on the same port number simultaneously.

The port portion of the IP:port pair is only 16 bits, though that rarely presents a problem. Traditionally, ports 1-1023 are reserved as privileged ports, which means that no service should be running on these ports that hasn't been configured by the system administrator. Web servers, as an example, run on port 80, and mail services run on port 25. When a user application connects to such a service, it is assigned a random high-numbered port that is currently unused. Therefore, a connection to a web server will typically look like 209.165.161.91:80 <-> 12.18.192.3:4616.

These features make TCP the most widely used protocol on the Internet. Applications like web services (the HTTP protocol) would certainly not be as enjoyable if the content came in out of order and occasionally corrupted. The same goes for other application protocols like mail (SMTP) and news (NNTP). Both protocols are covered by their own RFCs: 768 for UDP, and 793 for TCP.
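The well-known port numbers mentioned above live in the system's services database, which Perl can query directly with the built-in getservbyname. A quick sketch (the exact results depend on the local /etc/services file):

```perl
#!/usr/bin/perl
# wellknown.pl - look up well-known TCP port numbers in the services database
use warnings;
use strict;

foreach my $service (qw(smtp http nntp)) {
    # in scalar context getservbyname returns just the port number
    my $port = getservbyname($service, 'tcp');
    printf "%-5s => %s\n", $service, defined $port ? $port : 'unknown';
}
```

On a typical system this reports smtp on 25, http on 80, and nntp on 119.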
ICMP

ICMP (Internet Control Message Protocol) is another protocol that should be mentioned for completeness. It's not a transport protocol in its own right, but a signaling protocol used by the other protocols. For instance, should a problem arise that appears to be blocking all datagrams from reaching the recipient, IP would use ICMP to report the problem to the sender. The popular ping command also generates ICMP messages (an 'Echo Request' message, in this case), to verify that the desired hosts are on the network and 'alive'.
ICMP can deliver a range of messages including Destination Unreachable, Time Exceeded, Redirect Error, Source Quench, Parameter Problem, and a wide variety of sub-messages. It can also provide some network probing with Echo/Timestamp/Address Mask Request and Replies. ICMP is more fully documented in RFC 792.
Other Protocols

It is important to realize that other protocols exist at various levels of the OSI reference model, such as IPX, AppleTalk, NetBEUI, and DECnet. Perl can even work with some of these protocols. We won't be covering them here, though, since a clear majority of Perl applications reside in the TCP/IP domain. Furthermore, the principles illustrated here will apply in part to the operation of other types of networks.
Presentation/Application Layers

The last part of our OSI model comparison to examine is the presentation and application layers. TCP/IP doesn't actually have any formal presentation services, but there are many that can optionally be used as needed. Terminal applications, for instance, use a character-based service called the Network Virtual Terminal. Many applications, however, simply send their data and commands directly through the transport layer for direct handling by the application. As mentioned earlier, SMTP, HTTP, and NNTP are examples of application protocols that work directly with the transport protocol. The only thing these protocols specify is how they are to use the connection presented by the lower layers. For HTTP, as an example, the protocol specifies that each connection consists of simple query and response pairs, and what syntax and commands are allowed in the queries.
Anonymous, Broadcast, and Loopback Addresses

Any address whose host segment (the host segment, remember, is the unmasked portion of any network address) consists entirely of ones or zeros is not normally treated as a specific network interface. By default, an address with a host segment of all zeros is an 'anonymous' address (since it is, in actuality, the network's address) that can be set as the sender in a TCP/IP packet, for example 192.168.0.0. Of course, a receiving system is entirely within its rights to ignore anonymous transmissions sent to it. This is not a valid address for the recipient, however.

Conversely, setting the host segment to all ones gives a broadcast to all of the systems that have IP addresses in that range, for example 192.168.255.255. Pinging this address would result in every host on that subnet answering the echo request (at least, it does with UNIX systems; Windows NT appears to ignore it).

In addition, most hosts define the special address 127.0.0.1 as a self-referential address, also known as the loopback address. The loopback address allows hosts to open network connections to themselves, which is actually more useful than it might seem; we can, for example, run a web browser on the same machine as a web server without sending packets onto the network. This address, which is usually assigned to the special host name localhost, is also useful for testing networking applications when no actual network is available.
Networking with Perl

Sockets are Perl's interface into the networking layer, mapping the details of a network connection into something that looks and feels exactly like a filehandle. This analogy didn't start with Perl, of course, but came into popular usage with BSD UNIX. Perl merely continues that tradition, just as other languages do.
Sockets

There are two basic kinds of sockets: Internet domain and UNIX domain sockets. Internet domain sockets (hereafter referred to as INET sockets) are associated with an IP address, a port number, and a protocol, allowing you to establish network connections. UNIX domain sockets, on the other hand, appear as files in the local file system, but act as first-in-first-out channels used for communication between processes on the same machine.

Sockets have a type associated with them, which often determines the default protocol to be used. INET stream sockets use the TCP protocol, since TCP provides flow control, state, and session management. INET datagram sockets, on the other hand, naturally use UDP. Information flows in the same way that it does with a regular filehandle, with data read and written in bytes and characters.

Some socket implementations already provide support for the IPv6 protocol, or handle other protocols such as IPX, X25, or AppleTalk. Depending on the protocol used, different socket types may or may not be allowed. Other than stream and datagram sockets, a 'raw' socket type exists, which gives a direct interface to the IP protocol. Other supported socket types will be documented in the socket(2) manual page on a UNIX-like system.
'Socket.pm'

Since Perl's socket support is a wrapper for the native C libraries (on UNIX, anyway), it can support any type your system can. Non-UNIX platforms (such as Windows) may have varying or incomplete support for sockets, so check your Perl documentation to determine the extent of your support. At the very least, standard INET sockets should be supported on most platforms.

The socket functions are a very direct interface. One consequence of this is that the idiosyncrasies of the original C API poke through in a number of places, notably the large number of numeric values for the various arguments of the socket functions. In addition, the address information arguments required by functions like bind, connect, and send need to be in a packed sockaddr format that is acceptable to the underlying C library function. Fortunately, the Socket module provides constants for all of the socket arguments, so we won't have to memorize those numeric values. It also provides conversion utilities like inet_aton and inet_ntoa that convert string addresses into the packed form expected and returned by the C functions.

A summary of Perl's socket functions is given below, some directly supported by Perl, the rest provided by Socket.pm. Each is explained in more detail later in the chapter.
General:

socket – Create a new socket filehandle in the specified domain (Internet or UNIX) of the specified type (streaming, datagram, raw, etc.) and protocol (TCP, UDP, etc.). For example:

    socket(ISOCK, PF_INET, SOCK_STREAM, $proto);

shutdown – Close a socket and all filehandles that are associated with it. This differs from a simple close, which only closes the current process's given filehandle, not any copies held by child processes. In addition, a socket may be closed completely or converted to a read-only or write-only socket by shutting down one half of the connection. For example:

    shutdown(ISOCK, 2);

socketpair – Generate a pair of UNIX domain sockets linked back-to-back. This is a quick and easy way to create a full-duplex pipe (unlike the pipe function) and is covered in more detail in Chapter 22. For example:

    socketpair(RSOCK, WSOCK, AF_UNIX, SOCK_STREAM, PF_UNSPEC);
Servers:

bind – Bind a newly created socket to a specified port and address. For example:

    bind ISOCK, $packed_addr;

listen – (Stream sockets only) Set up a queue for receiving incoming network connection requests on a bound socket. For example:

    listen ISOCK, $qsize;

accept – (Stream sockets only) Accept and create a new communications socket on a bound and listened-to server socket. For example:

    accept CNNCT, ISOCK;
Clients:

connect – Connect a socket to a remote server at a given port and IP address; the server must be bound to and listening on the specified port and address. Note: while datagram sockets can't really connect, this is supported, and sets the default destination for messages sent from that socket. For example:

    connect ISOCK, $packed_addr;
Options:

getsockopt – Retrieve a configuration option from a socket. For example:

    $opt = getsockopt ISOCK, SOL_SOCKET, SO_DONTROUTE;

setsockopt – Set a configuration option on a socket. For example:

    setsockopt ISOCK, SOL_SOCKET, SO_REUSEADDR, 1;
IO:

send – Send a message. For UDP sockets this is the only way to send data, and an addressee must be supplied unless a default destination has already been set using the connect function. For TCP sockets no addressee is needed. For example:

    send ISOCK, $message, 0;

recv – Receive a message. For a UDP socket, this is the only way to receive data; on success it returns the sender's address (as a sockaddr structure). TCP sockets may also use recv. For example:

    $sender = recv ISOCK, $message, 1024, 0;
Conversion Functions:

inet_aton – aton = ASCII to network byte order. Convert a hostname (for example, www.myserver.com) or a textual representation of an IP address (for example, 209.165.161.91) into a four-byte packed value for use with INET sockets. If a host name is supplied, a name lookup, possibly involving a network request, is performed to find its IP address (this is a Perl extra, and not part of the C call). Used in conjunction with pack_sockaddr_in. For example:

    my $ip = inet_aton($hostname);

inet_ntoa – ntoa = network byte order to ASCII. Convert a four-byte packed value into a textual representation of an IP address, for example 209.165.161.91. Used in conjunction with unpack_sockaddr_in.

pack_sockaddr_in – Generate a sockaddr_in structure, suitable for use with the bind, connect, and send functions, from a port number and a packed four-byte IP address (which can be generated with inet_aton). For example:

    my $addr = pack_sockaddr_in($port, $ip);

unpack_sockaddr_in – Extract the port number and packed four-byte IP address from the supplied sockaddr_in structure:

    my ($port, $ip) = unpack_sockaddr_in($addr);

sockaddr_in – Call either unpack_sockaddr_in or pack_sockaddr_in depending on whether it is called in a list (unpack) or scalar (pack) context:

    my $addr = sockaddr_in($port, $ip);
    my ($port, $ip) = sockaddr_in($addr);

The dual nature of this function can lead to considerable confusion, so using it in only one direction, or using the pack and unpack versions explicitly, is recommended.

pack_sockaddr_un – Convert a pathname into a sockaddr_un structure for use with UNIX domain sockets:

    my $addr = pack_sockaddr_un($path);

unpack_sockaddr_un – Convert a sockaddr_un structure into a pathname:

    my $path = unpack_sockaddr_un($addr);

sockaddr_un – Call either unpack_sockaddr_un or pack_sockaddr_un depending on whether it is called in a list (unpack) or scalar (pack) context:

    my $addr = sockaddr_un($path);
    my ($path) = sockaddr_un($addr);

This function is even more confusing than sockaddr_in, since it only returns one value in either case. Using it for packing only, or using the pack and unpack versions explicitly, is definitely recommended.

In addition to the utility functions, Socket supplies four symbols for special IP addresses in a prepacked format suitable for passing to pack_sockaddr_in:
INADDR_ANY – Tells the socket that no specific address is requested for use; we just want to be able to talk directly to all local networks.

INADDR_BROADCAST – Uses the generic broadcast address of 255.255.255.255, which transmits a broadcast on all local networks.

INADDR_LOOPBACK – The loopback address for the local host, generally 127.0.0.1.

INADDR_NONE – An address meaning 'invalid IP address' in certain operations; usually 255.255.255.255. This is invalid for TCP, for example, since TCP does not permit broadcasts.
Opening Sockets

The procedure for opening sockets is the same for both UNIX and INET sockets; it differs only depending on whether you're opening a client or a server connection. Server sockets typically follow three steps: creating the socket, initializing a socket data structure, and binding the socket and the structure together:

socket(USOCK, PF_UNIX, SOCK_STREAM, 0) || die "socket error: $!\n";
$sock_struct = sockaddr_un('/tmp/server.sock');
bind(USOCK, $sock_struct) || die "bind error: $!\n";
Of course, you may also need to insert a setsockopt call to fully configure the socket appropriately, but the above three steps are needed as a minimum. Opening client connections is similar: create the socket, create the packed remote address, and connect to the server:

socket(ISOCK, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
    || die "socket error: $!\n";
$paddr = sockaddr_in($port, inet_aton($ip));
connect(ISOCK, $paddr) || die "connect error: $!\n";
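Putting the client steps together, a complete (if minimal) TCP client might look like the following sketch; the host, port, and line-based exchange are hypothetical, so adjust them for a real server:

```perl
#!/usr/bin/perl
# client.pl - a minimal TCP client sketch (host, port, and protocol assumed)
use warnings;
use strict;
use Socket;

my ($host, $port) = ('localhost', 7777);    # hypothetical echo-style server

socket(ISOCK, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
    or die "socket error: $!\n";
my $paddr = sockaddr_in($port, inet_aton($host));
connect(ISOCK, $paddr) or die "connect error: $!\n";

# unbuffer the socket filehandle before talking to the server
select((select(ISOCK), $| = 1)[0]);

print ISOCK "hello\n";                      # send a line
my $reply = <ISOCK>;                        # read one line back
print "server said: $reply" if defined $reply;
close ISOCK;
```

Since the socket behaves as a filehandle, the familiar print and readline operations work on it once the connection is established.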
Configuring Socket Options

The socket specification defines a number of options that may be set for sockets to alter their behavior. For direct manipulation of socket options, we can make use of Perl's built-in getsockopt and setsockopt functions, which provide low-level socket option handling capabilities. For the most part we rarely want to set socket options; however, there are a few that are worth dealing with from time to time. These options apply almost exclusively to INET sockets.

The getsockopt function takes three arguments: a socket filehandle, a protocol level, and the option to retrieve. The protocol level refers to the level at which the desired option operates in the network stack. TCP, for instance, is a higher-level protocol than IP, and so has a higher value. We needn't bother with the particular values, though, since constants are available. setsockopt takes the same three arguments, plus a value to set the option to. Leaving that argument out effectively unsets the option.
Options include:

SO_ACCEPTCONN – The socket is listening for connections.
SO_BROADCAST – The socket is enabled for broadcasts (used with UDP).
SO_DEBUG – Debugging mode enabled.
SO_LINGER – Sockets with data remaining to send continue to exist after the process that created them exits, until all data has been sent.
SO_DONTROUTE – The socket will not route. See also MSG_DONTROUTE in 'Reading from and Writing to Sockets' below.
SO_ERROR – (Read-only) The socket is in an error condition.
SO_KEEPALIVE – Prevent TCP/IP closing the connection due to inactivity by maintaining a periodic exchange of low-level messages between the local and remote sockets.
SO_DONTLINGER – Sockets disappear immediately, instead of continuing to exist until all data has been sent.
SO_OOBINLINE – Allow 'out of band' data to be read with a regular recv – see 'Reading from and Writing to Sockets' below.
SO_RCVBUF – The size of the receive buffer.
SO_RCVTIMEO – The timeout value for receive operations.
SO_REUSEADDR – Allow bound addresses to be reused immediately.
SO_SNDBUF – The size of the send buffer.
SO_SNDTIMEO – The timeout period for send operations.
SO_TYPE – The socket type (stream, datagram, etc.).
Of these, the most useful is SO_REUSEADDR, which allows a given port to be reused, even if another socket exists that's in a TIME_WAIT status. The common use for this is to allow an immediate server restart, even if the kernel hasn't finished its garbage collection and removed the last socket: setsockopt SERVER, SOL_SOCKET, SO_REUSEADDR, 1;
We can be a little more creative with the commas to provide an arguably more legible statement: setsockopt SERVER => SOL_SOCKET, SO_REUSEADDR => 1;
The SO_LINGER option is also useful, as is its opposite, SO_DONTLINGER. Usually the default, SO_DONTLINGER causes sockets to close immediately when a process exits, even if data remains in the buffer to be sent. Alternatively, you can specify SO_LINGER, which causes sockets to remain open as long as they have data still to send to a remote client. setsockopt SERVER, SOL_SOCKET, SO_DONTLINGER, 1;
SO_KEEPALIVE informs the kernel that TCP messages should be periodically sent to verify that a connection still exists on both ends. This feature can be handy for connections going through a NAT firewall, since that firewall has to maintain a table of connection translations, which are expired automatically after an interval with no activity. It also provides a periodic link test for those applications that keep predominantly idle connections, but need to be assured that a valid connection is maintained. setsockopt SERVER, SOL_SOCKET, SO_KEEPALIVE, 1;
SO_BROADCAST sets or tests a socket for the ability to broadcast, an option which is only applicable to UDP sockets. SO_TYPE is a read-only option that returns the socket type. SO_ERROR indicates that the socket is in an error condition, though $! is an alternative. The getsockopt function simply retrieves the current value of a socket option: my $reusable = getsockopt SERVER, SOL_SOCKET, SO_REUSEADDR;
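Values come back from getsockopt in the packed form used by the underlying C call, so they generally need an unpack before use. As a sketch (assuming, as is typical, that the option value unpacks as a native integer), here is how the receive buffer size might be inspected:

```perl
#!/usr/bin/perl
# bufsize.pl - inspect a socket's receive buffer (value comes back packed)
use warnings;
use strict;
use Socket;

socket(ISOCK, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
    or die "socket error: $!\n";

my $packed = getsockopt(ISOCK, SOL_SOCKET, SO_RCVBUF)
    or die "getsockopt error: $!\n";

# the kernel returns a packed native integer; unpack it for display
my $bytes = unpack 'i', $packed;
print "receive buffer is $bytes bytes\n";
close ISOCK;
```

The exact default buffer size is platform-dependent, so no particular number should be expected.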
For experienced developers, there are also some lower-level options. The expert network programmer can use getsockopt and setsockopt at lower levels in the protocol stack by supplying a different level as the protocol level argument. The Socket module does not define symbols for these levels, since they are nothing directly to do with sockets, so to get useful names we must include the sys/socket.ph definition file:

require 'sys/socket.ph';
This file defines symbols for all the system constants related to networking on the platform in question, including SOL_ symbols for lower-level protocols. The most interesting for us are:

SOL_IP       IP version 4 protocol
SOL_TCP      TCP protocol
SOL_UDP      UDP protocol
SOL_IPV6     IP version 6 protocol
SOL_RAW      'raw' protocol (SOCK_RAW)
SOL_IPX      IPX protocol
SOL_ATALK    AppleTalk protocol
SOL_PACKET   'packet' protocol (SOCK_SEQPACKET)
Each protocol defines its own set of options. The only one of immediate interest to us here is TCP_NODELAY, which disables buffering at the TCP protocol level, essential for sending things like typed characters in real time across network connections: setsockopt CONNECTION, SOL_TCP, TCP_NODELAY, 1;
Reading from and Writing to Sockets

Reading from and writing to sockets is handled by the recv and send functions, respectively. These are used with datagram sockets more often than with other types, but they are valid calls for all of them. As mentioned above, Perl merely wraps the C functions, so the syntax and behaviors are the same. While we'll be covering them in some detail here, you can get more information directly from the development manual pages (see recv(2) and send(2)).

The send function takes four arguments: a socket filehandle, a message, a numeric flags value (which in most cases can be 0), and an addressee in the form of a sockaddr_in or sockaddr_un structure:

send SOCKET, $message, $flags, $addressee;
Since stream sockets are connected, we can omit the last argument for that type. For other types, the addressee is a sockaddr structure built from either a host and port number, or a local pathname, depending on the socket domain:

my $inetaddr = sockaddr_in($port, inet_aton($hostorip));
my $unixaddr = sockaddr_un($pathname);
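As a concrete illustration of a send with an explicit addressee, here is a minimal fire-and-forget datagram sender along the lines described above (the address and port are illustrative, and no server needs to exist for the send to succeed):

```perl
#!/usr/bin/perl
# udpsend.pl - fire-and-forget UDP datagram sketch (address is illustrative)
use warnings;
use strict;
use Socket;

socket(USOCK, PF_INET, SOCK_DGRAM, getprotobyname('udp'))
    or die "socket error: $!\n";

# build the addressee structure for the (hypothetical) destination
my $addressee = sockaddr_in(7777, inet_aton('192.168.1.10'));

# no flags, explicit addressee - the UDP layer sends and forgets
send(USOCK, "ping", 0, $addressee) or die "send error: $!\n";
close USOCK;
```

Since UDP offers no delivery guarantee, a successful send only means the datagram left this host, not that it arrived.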
The flags argument is a little more interesting, and is combined from a number of flags with symbols (defined by the Socket module) that alter the form or way in which the message is sent:

MSG_OOB – Send an Out-Of-Band (that is, high priority) message.
MSG_DONTROUTE – Ignore the local network gateway configuration and attempt to send the message directly to the destination.
MSG_DONTWAIT – Do not wait for the message to be sent before returning. If the message would have blocked, the error EAGAIN (see the Errno module in Chapter 16) is returned, indicating that we must try again later.
MSG_NOSIGNAL – TCP sockets only. Do not raise a SIGPIPE signal when the other end breaks the connection. The error EPIPE is still returned.
For example:

use Errno qw(EAGAIN EPIPE);

my $result = send SOCKET, $message, MSG_DONTWAIT|MSG_NOSIGNAL, $addressee;
if (!defined $result and $! == EAGAIN) {
    # the send would have blocked; try again later
}

The recv function also takes four arguments, but instead of an addressee, it specifies a length for the data to be received. It resembles the read function in several ways; the arguments are a socket filehandle, as before, then a scalar variable into which the message is placed, an expected length, and a flags argument as before. The return value on success is the sockaddr structure for the sender:

my $message;           # scalar to store message in
my $length = 1024;     # expected maximum length
my $sender = recv SOCKET, $message, $length, $flags;
The length argument is in fact somewhat arbitrary, since the size of the message is determined when it is read and adjusted accordingly. However, the supplied length is used to pre-allocate storage for the received message, so a size that is equal to or larger than the message will allow Perl to perform slightly quicker. Conversely, configuring a too large message will waste time to start with. A reasonable value is therefore the best idea. The flags argument allows us to modify the way in which we receive messages: Receive Flag
Description
MSG_OOB
Receive an Out-Of-Band message rather than a message from the normal message queue, unless OOB messages are being put at the head of the queue of normal messages by setting the SO_OOBINLINE socket option (see 'Configuring Socket Options').
MSG_PEEK
Examine and return the data of the next message on the socket without removing it.
MSG_WAITALL
Wait until a message has been completely received before attempting to retrieve it.
MSG_ERRQUEUE
Retrieve a message from the error queue rather than the normal message queue.
MSG_NOSIGNAL
TCP sockets only. Do not raise a SIGPIPE signal when the other end breaks the connection. The error EPIPE is still returned.
For example:

my $message;
my $sender = recv SOCKET, $message, 1024, MSG_PEEK | MSG_WAITALL;
Note that depending on the socket implementation, the send and recv methods may accept more or different flags than those listed above. However, these are reasonably typical for most platforms, and certainly most UNIX platforms.
Closing Sockets Like all other in-memory variables, Perl performs reference counting on sockets, and will close them automatically, should all the existing references go out of scope. In keeping with good programming practice, we should explicitly release unneeded resources, and Perl provides a way to do so. In fact, since sockets are in most ways filehandles, you can use the same function you would on a filehandle, like this: close ISOCK;
There is a caveat to this rule, however. If a child has been forked that has a handle to the same socket, close only affects the process that called it. All other children with a handle to the socket will still have access to it, and the socket remains in use. A workaround for this situation exists in the shutdown function: shutdown(ISOCK, 2);
Chapter 23
This function closes the socket completely, even for any child processes with handles to it. The second argument to this function is what instructs shutdown to do this, and changing that value illuminates further capabilities of shutdown. Sockets can provide bi-directional communication, and shutdown will allow you to shut down either of the two channels independently. If we wish the socket to become a read-only socket, we can call shutdown with a second argument of 1. To make it write-only, we call it with a second argument of 0. The Socket module also defines constants for these values, which make them slightly friendlier: shutdown SOCKET, SHUT_RD; # shutdown read shutdown SOCKET, SHUT_WR; # shutdown write shutdown SOCKET, SHUT_RDWR; # shutdown completely
'IO::Socket.pm' Although the Socket module makes using the socket functions possible, it still leaves a lot to be desired. To make life simpler all round, the IO::Socket module and its children, IO::Socket::INET and IO::Socket::UNIX, provide a friendlier and simpler interface that automatically handles the intricacies of translating host names into IP addresses and thence into the sockaddr structure, as well as translating protocol names into numbers and even services into ports. All the socket functions are represented as methods; new sockets are created, bound, listened to, or connected with the new method, and all other functions become methods of the resulting filehandle object:

$socket->accept;
$udpsocket->send("Message in a Datagram", 0, $addressee);
$tcpsocket->print("Hello Network World!\n");
$socket->close;
The main distinctions from the native Perl functions are in the way they are created and configured.
Configuring Socket Options Options can be set on sockets at creation time using the new method, but the IO::Socket modules also support getsockopt and setsockopt methods. These are direct interfaces to the functions of the same name and behave identically to them, though the usage syntax differs:

$connection->setsockopt(SOL_SOCKET, SO_KEEPALIVE => 1);
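To round this out, here is a short sketch (not from the original text; the port number is an arbitrary illustration) of setting an option through the method interface and then reading it back with getsockopt:

```perl
#!/usr/bin/perl
# sockopt.pl - sketch: set a socket option, then read it back
use warnings;
use strict;
use IO::Socket;

my $socket = IO::Socket::INET->new(
    Proto     => 'tcp',
    LocalPort => 4444,    # arbitrary port for illustration
    Listen    => 5,
) or die "Unable to create socket: $!";

# switch on keepalive probes for the socket
$socket->setsockopt(SOL_SOCKET, SO_KEEPALIVE, 1)
    or die "setsockopt failed: $!";

# read the option back; depending on the IO::Socket version the result
# may be returned as a plain integer or as a packed integer
my $value = $socket->getsockopt(SOL_SOCKET, SO_KEEPALIVE);
$value = unpack 'i', $value if defined $value and $value =~ /\D/;
print "SO_KEEPALIVE is ", ($value ? 'enabled' : 'disabled'), "\n";
```

The same pattern works for any of the SOL_SOCKET options, such as SO_REUSEADDR or SO_OOBINLINE.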
The IO::Socket modules also allow us to configure aspects of the socket when we create it, though not with the same flexibility as getsockopt and setsockopt. The arguments differ for Internet and UNIX domain sockets, in accordance with their different behaviors.
The full range of options accepted by the new method of IO::Socket::INET can be seen below: INET Options
Description
LocalAddr
(Server only) The local address or hostname to serve, optionally followed by a colon and the port number or service name (which may additionally be followed by a default in parentheses). For example:
LocalHost
127.0.0.1 127.0.0.1:4444 127.0.0.1:myservice(4444) www.myserver.com www.myserver.com:80 www.myserver.com:http
LocalPort
(Server only) The local port number or service name, if not given in LocalAddr/LocalHost. For example: 80 http
Listen
(Server only) The size of the queue to establish on the bound server socket. The maximum possible size is defined on a platform basis by the constant SOMAXCONN. Default: 5
Multihomed
(Server only) Boolean value. Listen for connections on all locally available network interfaces. Default: 0
Reuse
(Server only) Boolean value. Set the SO_REUSEADDR option to free up the socket for immediate reuse. Default: 0
PeerHost
(Client only) The remote address or hostname to connect to, optionally followed by a colon and the port number or service name (which may additionally be followed by a default in parentheses). For example: 127.0.0.1 127.0.0.1:4444 127.0.0.1:myservice(4444) www.myserver.com www.myserver.com:80 www.myserver.com:http
PeerPort
(Client only) The remote port number or service name, if not given in PeerAddr/PeerHost. For example: 80 http
Proto
The protocol number, or more sensibly, name. For example: tcp udp icmp
The protocol is set implicitly from the socket type if that is SOCK_STREAM, SOCK_DGRAM, or SOCK_RAW. Default: tcp Timeout
Timeout value for socket operations, in seconds. Default: 0 (no timeout).
Type
The socket type. For example: SOCK_STREAM (for TCP) SOCK_DGRAM (for UDP) SOCK_RAW (for ICMP) SOCK_SEQPACKET
The socket type is set implicitly from the protocol if it is tcp, udp, or icmp. Default: SOCK_STREAM If only one argument is passed to the new method of IO::Socket::INET, it is assumed to be a client's peer address, including the port number or service name. my $server = new IO::Socket::INET($server_address_and_port);
The new method of IO::Socket does not have this shorthand: defining the socket domain is mandatory. The new method of the IO::Socket::UNIX module is considerably simpler. It takes only four arguments, three of which we have already seen: UNIX Option
Description
Local
(Server only) The pathname of the socket connection in the filesystem, for example: /tmp/myapp_socket
The pathname should be somewhere that the server has the ability to create a file. A previous file should be checked for and removed first if present. Listen
(Server only) Specify that this is a server socket, and set the size of the queue to establish on the bound server socket. The maximum possible size is defined on a platform basis by the constant SOMAXCONN. Note that unlike INET sockets, it is the setting of this parameter, not Local or Peer, that causes the module to bind and listen to the socket. It is not implied by Local. Default: 5
Peer
(Client only) The pathname of the socket connection in the filesystem, for example: /tmp/myapp_socket
Type
The socket type, for example: SOCK_STREAM SOCK_DGRAM SOCK_RAW SOCK_SEQPACKET
Default: SOCK_STREAM If using new via the parent IO::Socket module we must also supply a domain for the socket: my $socket = new IO::Socket(Domain => PF_INET, ...);
This may be from the PF_ family of constants, for example PF_INET or PF_UNIX (or indeed PF_APPLETALK, and so on). These are not to be confused with the AF_ constants (as supplied to gethostbyaddr or getnetbyaddr), which come from the address family of symbols. Now that we've covered all of the basics of using both the Socket and IO::Socket methodologies, we'll illustrate them with sample applications. We'll cover both client and server implementations, using both methodologies, and using both TCP and UDP protocols.
TCP INET Examples A server is simply an application that listens for and accepts incoming connections from client applications. Once the connection is established, the two ends may then carry out whatever service the server is there to perform – receive a file, send a web page, relay mail, and so on. In general, a server creates a TCP socket to listen for incoming connections, by first creating the socket with socket, and then assigning it to server duty with the bind function. The resulting socket is not connected, nor can it be used for communications, but it will receive notification of clients that connect to the address and port number to which it is bound. To prepare to receive connections, we use the listen function, which creates an internal queue in which incoming connections collect. We then use the accept function to pull a connection request from the queue and create a socket filehandle for it. With it, we can communicate with the remote system. Behind the scenes, TCP sends an acknowledgment to the client, which creates a similar socket to communicate with our application.
A Simple TCP Socket Server The following simple application demonstrates one way to create a very simple TCP server operating over an INET socket, using Perl's built-in socket functions:

#!/usr/bin/perl
# tcpinetserv.pl
use strict;
use warnings;
use Socket;
my $proto = getprotobyname('tcp');
my $port = 4444;

# Create 'sockaddr_in' structure to listen to the given port
# on any locally available IP address
my $servaddr = sockaddr_in($port, INADDR_ANY);

# Create a socket for listening on
socket SERVER, PF_INET, SOCK_STREAM, $proto
    or die "Unable to create socket: $!";

# bind the socket to the local port and address
bind SERVER, $servaddr or die "Unable to bind: $!";

# listen to the socket to allow it to receive connection requests
# allow up to 10 requests to queue up at once.
listen SERVER, 10;

# now accept connections
print "Server running on port $port...\n";
while (accept CONNECTION, SERVER) {
    select CONNECTION; $| = 1; select STDOUT;
    print "Client connected at ", scalar(localtime), "\n";
    print CONNECTION "You're connected to the server!\n";
    while (<CONNECTION>) {
        print "Client says: $_";
    }
    close CONNECTION;
    print "Client disconnected\n";
}
We use the Socket module to define the constants for the various socket functions – PF_INET means create an INET socket (we could also have said PF_UNIX for a UNIX domain socket or even PF_APPLETALK for a Macintosh AppleTalk socket). SOCK_STREAM means create a streaming socket, as opposed to SOCK_DGRAM, which would create a datagram socket. We also need to specify a protocol. In this case we want TCP, so we use the getprotobyname function to return the appropriate protocol number – we explore this and similar functions at the end of the chapter. Once the constants imported from Socket are understood, the socket function becomes more comprehensible; it essentially creates a filehandle that is associated with a raw, unassigned socket. To actually use it for something, we need to program it. First, we bind it to listen to port 4444 on any locally available IP address using the bind function. The bind function is one of the esoteric socket functions that requires a packed C-style data structure, so we use the sockaddr_in function to create something that is acceptable to bind from ordinary Perl values. The special INADDR_ANY symbol tells bind to listen to all local network interfaces; we could also specify an explicit IP address with inet_aton:

# bind to the loopback address
# (will only respond to local clients, not remote)
bind SERVER, sockaddr_in($port, inet_aton('127.0.0.1'));
The listen function enables the socket's queue to receive incoming network connections, and sets the queue size. It takes the socket as a first argument and a queue size as the second. In this case, we said 10 to allow up to ten connections to queue; after that, the socket will start rejecting connections. We could also use the special symbol (again defined by Socket) of SOMAXCONN to create a queue of the maximum size supported by the operating system: # create a queue of the maximum size allowed listen SERVER, SOMAXCONN;
Having gone through all this setup work, the actual server is almost prosaic – we simply use accept to generate sockets from the server socket. Each new socket represents an active connection to a client. In this simple example we simply print a hello message and then print out whatever the client sends until it disconnects. In order to make sure that anything we send to the client is written out in a timely manner, we set autoflush on the socket filehandle. This is an important step: IO buffering is enabled in Perl by default, so without it our response might never reach the client, because the socket would wait for enough data to fill the buffer before flushing it. Alternatively, we could have used the syswrite function, which explicitly bypasses any buffering.
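As a hedged illustration (not from the original text), the greeting could be sent with syswrite instead of autoflush; CONNECTION is assumed to be a connected socket as in tcpinetserv.pl:

```perl
# send the greeting with syswrite, bypassing Perl's IO buffering entirely;
# syswrite may write fewer bytes than asked for, so we loop until done
my $greeting = "You're connected to the server!\n";
my $written  = 0;
while ($written < length $greeting) {
    my $count = syswrite CONNECTION, $greeting,
                         length($greeting) - $written, $written;
    die "Unable to write: $!" unless defined $count;
    $written += $count;
}
```

The loop matters because syswrite, like the underlying write system call, is permitted to perform a partial write on a socket.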
A Simple TCP 'IO::Socket' Server The IO::Socket module simplifies all this somewhat by relieving us of all the low-level sockaddr_in manipulation and socket function calls. Here is the same server written using IO::Socket:

#!/usr/bin/perl
# tcpioinetserv.pl
use warnings;
use strict;
use IO::Socket;

my $port = 4444;
my $server = IO::Socket->new(
    Domain    => AF_INET,
    Proto     => 'tcp',
    LocalPort => $port,
    Listen    => SOMAXCONN,
);
print "Server running on port $port...\n"; while (my $connection = $server->accept) { print "Client connected at ", scalar(localtime), "\n"; print $connection "You're connected to the server!\n"; while (<$connection>) { print "Client says: $_"; } close $connection; print "Client disconnected\n"; }
This is altogether simpler, as well as being easier to read. Note that we didn't need to specify a socket type, since the IO::Socket module can work it out from the protocol – TCP means it must be a SOCK_STREAM. It also has the advantage that autoflushing is enabled by default on the newly created sockets, so we don't have to do it ourselves.
As a slightly simplified case, when we know we want an INET socket we can use the IO::Socket::INET module directly, obviating the need for the Domain argument:

my $server = new IO::Socket::INET(
    Proto     => 'tcp',
    LocalPort => $port,
    Listen    => SOMAXCONN,
);
The filehandles returned by either approach work almost identically to 'ordinary' filehandles for files and pipes, but we can distinguish them from other filehandles with the -S file test: print "It's a socket!" if -S CONNECTION;
Filehandles returned by the IO::Socket modules also support all the methods supported by the IO::Handle class, from which they inherit, so we can for example say: $socket->autoflush(0); $socket->printf("%04d %04f", $int1, $int2); $socket->close;
See Chapter 12 for more on the IO::Handle module and the other modules in the 'IO' family. Our example above lacks one primary attribute necessary for a true multi-user server – it can only handle one connection at a time. A true server would either fork or use threads to handle each connection, freeing up the main process to continue to accept concurrent connections. For more information on that aspect, please consult Chapter 22.
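As a sketch of that idea (an assumption-laden illustration rather than the chapter's own code; the port is arbitrary and error handling is minimal), a forking version of the accept loop might look like this:

```perl
#!/usr/bin/perl
# forkserv.pl - sketch of a forking TCP server accept loop
use warnings;
use strict;
use IO::Socket;
use POSIX qw(WNOHANG);

# reap exited children so they do not accumulate as zombies
$SIG{CHLD} = sub { 1 while waitpid(-1, WNOHANG) > 0 };

my $server = IO::Socket::INET->new(
    Proto     => 'tcp',
    LocalPort => 4444,        # arbitrary port, as in the chapter's examples
    Listen    => SOMAXCONN,
    Reuse     => 1,
) or die "Bind failed: $!";

while (my $connection = $server->accept) {
    my $pid = fork;
    die "Fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: serve this client, then exit
        $server->close;       # the child does not need the listening socket
        print $connection "You're connected to the server!\n";
        print "Client says: $_" while <$connection>;
        $connection->close;
        exit 0;
    }
    # parent: close its copy of the connection and accept the next client
    $connection->close;
}
```

Note that accept may return undef if it is interrupted by the SIGCHLD handler; a production server would check $! for EINTR and retry rather than fall out of the loop.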
A Simple TCP Socket Client While servers listen for and accept network connections, clients originate them. Implementing a client is similar to creating a server, although simpler. We start out by creating an unconnected stream socket of the correct domain and protocol with socket. We then connect the socket to the remote server with the connect function. If the connect does not fail, then this is all we have to do, apart from using the socket. Once we're done, we need to close the socket, in order to maintain our status as good denizens on the host. The following simple application demonstrates a very simple INET TCP client, using Perl's built-in socket functions:

#!/usr/bin/perl
# tcpinetclient.pl
use warnings;
use strict;
use Socket;

my $proto = getprotobyname('tcp');
my $host = inet_aton('localhost');
my $port = 4444;
# Create 'sockaddr_in' structure to connect to the given
# port on the IP address for the remote host
my $servaddr = sockaddr_in($port, $host);

# Create a socket for connecting on
socket SERVER, PF_INET, SOCK_STREAM, $proto
    or die "Unable to create socket: $!";

# connect the socket to the remote port and address
connect SERVER, $servaddr or die "Unable to connect: $!";

# enable autoflush
select SERVER; $| = 1; select STDOUT;

# communicate with the server
print "Client connected.\n";
print "Server says: ", scalar(<SERVER>);
print SERVER "Hello from the client!\n";
print SERVER "And goodbye!\n";
close SERVER;
The initial steps for creating the client are identical to those for the server. First, we create a socket of the right type and protocol. We then connect it to the server by feeding connect the socket filehandle and a sockaddr_in structure. However, in this case the address and port are those of the server, not the client (which doesn't have a port until the connection succeeds). The result of the connect is either an error, which returns undef and we catch with a die, or a connected socket. If the latter, we enable autoflush on the server connection to ensure data is written out in a timely manner, read a message from the server, send it a couple of messages, and close the connection. Clients may bind to a specific port and address if they wish to, just like servers, but may not listen on them. It is not common that we would want to create a bound client, but one possible use is to prevent two clients on the same machine from talking to the server at the same time. This would only be useful if SO_REUSEADDR is not set for the socket.
A Simple TCP 'IO::Socket' Client
Again, we can use the IO::Socket module to create the same client in a simpler fashion, avoiding the need to convert the hostname to a packed internet address with inet_aton and setting autoflush: #!/usr/bin/perl # tcpioinetclient.pl use warnings; use strict; use IO::Socket; my $host = 'localhost'; my $port = 4444; my $server = IO::Socket->new( Domain => PF_INET, Proto => 'tcp', PeerAddr => $host, PeerPort => $port, );
die "Connect failed: $!\n" unless $server; # communicate with the server print "Client connected.\n"; print "Server says: ", scalar(<$server>); print $server "Hello from the client!\n"; print $server "And goodbye!\n"; close $server;
As with the server, we can also eliminate the Domain argument by using the IO::Socket::INET module directly: my $server = new IO::Socket::INET( Proto => 'tcp', PeerAddr => $host, PeerPort => $port, );
In fact, if we use IO::Socket::INET we can omit the protocol too, and combine the address and port together. This gives us a single argument, which we can pass without a leading name: my $server = new IO::Socket::INET("$host:$port");
This one argument version only works for creating INET TCP clients, since all other uses of the IO::Socket modules require at least one other argument. Fortunately, creating an INET TCP client is a common enough requirement that it's still a useful shortcut.
Reading from and Writing to TCP Sockets Socket filehandles that represent the ends of a TCP connection are trivial to use for input and output. Since TCP streams data, we can read and write it in bytes and characters, just as we do with any normal filehandle: # send something to the remote socket print SOCKET "Send something to the other end\n"; # get something back my $message = <SOCKET>;
As we mentioned in the previous examples, Perl buffers IO by default, so a message may only reach the other end once enough data accumulates to fill the buffer and force a flush. For that reason, we need to either use the syswrite function, or set autoflushing to 1 after selecting our handle:

select SOCKET; $| = 1; select STDOUT;
The IO::Socket modules set autoflush automatically on the socket filehandles they create, so we don't have to worry about it when using them. If we wanted to send individual key presses in real time, we might also want to set the TCP_NODELAY option as an additional measure.
We can also use send and recv to send messages on TCP connections, but in general we don't usually want to – the whole point of a streaming socket is that it is not tied to discrete messages. These functions are more useful (indeed, necessary) on UDP sockets. On a TCP connection, their use is identical – except that we do not need to specify a destination or source address since TCP connections, being connection-oriented, supply this information implicitly.
Determining Local and Remote Addresses We occasionally might care about the details of either the local or the remote address of a connection. To find our own address we can use the Sys::Hostname module, which we cover at the end of the chapter. To find the remote address we can use the getpeername function. This function returns a packed sockaddr address structure, which we will need to unpack with unpack_sockaddr_in:

my ($port, $ip) = unpack_sockaddr_in(getpeername CONNECTION);
print "Connected to: ", inet_ntoa($ip), ", port $port\n";
We can use this information for a number of purposes, for example keeping a connection log, or rejecting network connections based on their IP address: # reject connections outside the local network inet_ntoa($ip) !~ /^192\.168\.1\./ and close CONNECTION;
A quirk of the socket API is that there is no actual 'reject' capability – the only way to reject a connection is to accept it and then close it, which is not elegant but gets the job done. This does open the possibilities of Denial of Service (DoS) attacks, though, and care needs to be taken to protect against such. The IO::Socket::INET module also provides methods to return the local and remote address of the socket: # find local connection details my $localaddr = $connection->sockaddr; my $localhost = $connection->sockhost; my $localport = $connection->sockport; # find remote connection details my $remoteaddr = $connection->peeraddr; my $remotehost = $connection->peerhost; my $remoteport = $connection->peerport;
As with other methods of the IO::Socket::INET module, unpacking the port number and IP address is done for us, as is automatic translation back into an IP address, so we can do IP-based connection management without the use of unpack_sockaddr_in or inet_ntoa:

# reject connections outside the local network
$remoteaddr =~ /^192\.168\.1\./ and close CONNECTION;
We can also reject by host name, which can be more convenient but is less dependable, because it relies on DNS lookups that may fail or be spoofed. The only way to verify that a host name is valid is to use gethostbyname on the value returned by peerhost and then check the returned IP address (or possibly addresses) to see if one matches the IP address returned by peeraddr. Comparing both forward and reverse name resolutions is a good security practice to employ. See the end of the chapter and the discussion on the host functions for more details.
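A sketch (not from the original text) of that forward-and-reverse check: resolve the name reported by peerhost back into addresses, and confirm that one of them matches the packed address from peeraddr. $connection is assumed to be an IO::Socket::INET handle:

```perl
use Socket qw(inet_ntoa);

# the name the reverse lookup gave us, and the actual packed peer IP
my $claimed_host = $connection->peerhost;
my $peer_ip      = $connection->peeraddr;

# forward-resolve the claimed name back into one or more packed addresses
my ($name, $aliases, $type, $len, @addrs) = gethostbyname($claimed_host);

# the name checks out only if one forward address matches the real peer
my $verified = grep { $_ eq $peer_ip } @addrs;
unless ($verified) {
    print "Host name $claimed_host does not resolve back to ",
          inet_ntoa($peer_ip), " - treating connection as suspect\n";
}
```

If peerhost could not perform a reverse lookup at all it simply returns the IP address as a string, in which case the forward lookup trivially matches and no name-based decision should be made.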
UDP INET Examples UDP servers and clients have many uses. The NTP protocol, for instance, relies on UDP to conserve bandwidth by broadcasting time synchronization information to multiple clients simultaneously; a client missing the occasional broadcast is not a critical issue. If it is imperative that your clients receive all of your broadcasts, then you should be using TCP.
A Simple UDP Socket Server Here is an example of a simple INET UDP server, written using Perl's built-in socket functions. It is similar to the TCP server, creating a server socket and binding it, but differs in that you don't listen to it (that call is restricted to stream and sequential packet sockets): #!/usr/bin/perl # udpinetserv.pl use warnings; use strict; use Socket; my $proto = getprotobyname('udp'); my $port = 4444; # Create 'sockaddr_in' structure to listen to the given port # on any locally available IP address my $servaddr = sockaddr_in($port, INADDR_ANY); # Create a socket for listening on socket SERVER, PF_INET, SOCK_DGRAM, $proto or die "Unable to create socket: $!"; # bind the socket to the local port and address bind SERVER, $servaddr or die "Unable to bind: $!"; # now receive and answer messages print "Server running on port $port...\n"; my $message; while (my $client = recv SERVER, $message, 1024, 0) { my ($port, $ip) = unpack_sockaddr_in($client); my $host = gethostbyaddr($ip, AF_INET); print "Client $host:$port sent '$message' at ", scalar(localtime), "\n"; send SERVER, "Message '$message' received", 0, $client; }
The key difference with this server is that because UDP is connectionless we do not generate new filehandles with accept and then communicate over them. Instead, we read and write messages directly using the send and recv functions. Each of those functions takes optional flags (which we did not use, hence the argument of 0 on both calls), and they are documented in your development man pages: > man 2 recv > man 2 send
Every message we receive comes with the sender's address attached. When we retrieve a message with recv we get the remote address of the client (as a sockaddr_in structure for INET servers or a sockaddr_un structure for UNIX domain ones), which we can feed directly to the send function to return a response.
A Simple UDP 'IO::Socket' Server Again we can use the IO::Socket modules to do the same thing in a clearer way:

#!/usr/bin/perl
# udpioinetserv.pl
use warnings;
use strict;
use IO::Socket;

my $port = 4444;
my $server = new IO::Socket(
    Domain    => PF_INET,
    LocalPort => $port,
    Proto     => 'udp',
);
die "Bind failed: $!\n" unless $server; print "Server running on port $port...\n"; my $message; while (my $client = $server->recv($message, 1024, 0)) { my ($port, $ip) = unpack_sockaddr_in($client); my $host = gethostbyaddr($ip, AF_INET); print "Client $host:$port sent '$message' at ", scalar(localtime), "\n"; $server->send("Message '$message' received", 0, $client); }
The send and recv methods of IO::Socket are simply direct interfaces to the functions of the same name, so the client address passed to the send method is a sockaddr_in (or sockaddr_un, in the UNIX domain) structure. We'll cover both send and recv in more depth shortly.
A Simple UDP Socket Client Whereas TCP clients create a dedicated connection to a single host with each socket, a UDP client can broadcast packets to any host it wishes to with the same socket. Consequently, we do not use connect but use send and recv immediately on the created socket: #!/usr/bin/perl #udpinetclient.pl use warnings; use strict; use Socket; my $proto = getprotobyname('udp'); my $host = inet_aton('localhost'); my $port = 4444;
# Create a socket for sending & receiving on socket CLIENT, PF_INET, SOCK_DGRAM, $proto or die "Unable to create socket: $!"; # Create 'sockaddr_in' structure to connect to the given # port on the IP address for the remote host my $servaddr = sockaddr_in($port, $host); # communicate with the server send CLIENT, "Hello from client", 0, $servaddr or die "Send: $!\n"; my $message; recv CLIENT, $message, 1024, 0; print "Response was: $message\n";
The UDP socket is free to communicate with any other UDP client or server on the same network. In fact, we can bind and listen to a UDP socket and use it as a client to send a message to another server as well.
A Simple UDP IO::Socket Client The IO::Socket version of this client is: #!/usr/bin/perl # udpioinetclient.pl use warnings; use strict; use IO::Socket; my $host = 'localhost'; my $port = 4444; my $client = new IO::Socket( Domain => PF_INET, Proto => 'udp', ); die "Unable to create socket: $!\n" unless $client; my $servaddr = sockaddr_in($port, inet_aton($host)); $client->send("Hello from client", 0, $servaddr) or die "Send: $!\n"; my $message; $client->recv($message, 1024, 0); print "Response was: $message\n";
We can also use IO::Socket::INET directly and omit the Domain argument: my $client = new IO::Socket::INET(Proto => 'udp');
Unfortunately, because the send and recv methods are direct interfaces to the send and recv functions, the recipient address is still a sockaddr_in structure and we have to do the same contortions to provide send with a suitable addressee that we do for the built-in function version. This aside, IO::Socket still makes writing UDP clients simpler, if not quite as simple as TCP clients.
Broadcasting One of the key benefits of UDP is that we can broadcast with it. In fact, broadcasting is trivially easy to do once we realize that it simply involves sending messages to the broadcast address of the network we want to communicate with. For the 192.168 network, the broadcast address is 192.168.255.255:

my $port = 4444;
my $broadcast = '192.168.255.255';
my $broadcastaddr = sockaddr_in($port, inet_aton($broadcast));
The special address 255.255.255.255 may therefore be used to broadcast to any local network. The pre-packed version of this address is given the symbolic name INADDR_BROADCAST by the Socket and IO::Socket modules:

$udpsocket->send("Hello Everybody", 0, sockaddr_in($port, INADDR_BROADCAST));
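One practical caveat worth noting (an addition, not from the original text): most systems refuse to send to a broadcast address unless the SO_BROADCAST option has been enabled on the socket first. A minimal sketch, with an arbitrary port:

```perl
#!/usr/bin/perl
# broadcast.pl - sketch: enable SO_BROADCAST before sending to a
# broadcast address; without it, send typically fails with EACCES
use warnings;
use strict;
use IO::Socket;

my $udpsocket = IO::Socket::INET->new(Proto => 'udp')
    or die "Unable to create socket: $!";

# permit this socket to send datagrams to broadcast addresses
$udpsocket->sockopt(SO_BROADCAST, 1)
    or die "Unable to enable broadcast: $!";

my $port = 4444;    # arbitrary port for illustration
my $broadcastaddr = sockaddr_in($port, INADDR_BROADCAST);
$udpsocket->send("Hello Everybody", 0, $broadcastaddr)
    or die "Send failed: $!";
```

The sockopt method here is the IO::Socket shorthand for setsockopt with a level of SOL_SOCKET.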
Standard UDP makes no provision for narrowing the hosts to which a broadcast is sent – everything covered by the broadcast address will get the message. An enhancement to UDP called Multicast UDP provides for selected broadcasting, subscribers, and other enhancements to the standard UDP protocol. However, it is more complex to configure and requires multicast-aware routers and gateways if broadcasts are to travel beyond hosts on the immediate local network. The IO::Socket::Multicast module found on CPAN provides a programming interface to this capability.
UNIX Sockets UNIX domain sockets are the second type of socket that we can create. Whereas INET sockets allow remote communications via the networking infrastructure of the operating system, UNIX domain sockets work through the filesystem, restricting communications to local processes. They behave (and are used) in the same manner as INET sockets, and are preferred for inter-process communication where no networking support is needed. The absence of network support makes this a lighter-weight implementation with higher performance. As the name implies, UNIX domain sockets may not be supported on non-UNIX systems.
A Simple UNIX Socket Server Writing an application to use a UNIX domain socket is almost identical to writing it to use an INET socket:

#!/usr/bin/perl
# unixserv.pl
use warnings;
use strict;
use Socket;

my $file = '/tmp/unixserv_socket';

# Create 'sockaddr_un' structure to listen on the local socket
my $servaddr = sockaddr_un($file);

# remove an existing socket file if present
unlink $file;
# Create a socket for listening on
socket SERVER, PF_UNIX, SOCK_STREAM, 0
    or die "Unable to create socket: $!";

# bind the socket to the local socket
bind SERVER, $servaddr or die "Unable to bind: $!";

# listen to the socket to allow it to receive connection requests
# allow up to 10 requests to queue up at once.
listen SERVER, 10;

# now accept connections
print "Server running on file $file...\n";
while (accept CONNECTION, SERVER) {
    select CONNECTION; $| = 1; select STDOUT;
    print "Client connected at ", scalar(localtime), "\n";
    print CONNECTION "You're connected to the server!\n";
    while (<CONNECTION>) {
        print "Client says: $_";
    }
    close CONNECTION;
    print "Client disconnected\n";
}
First, we create a socket, but now we make it a UNIX domain socket by specifying PF_UNIX for the socket domain. We also do not specify a protocol – unlike INET sockets, UNIX domain sockets do not care about the protocol ahead of time – so instead we give it a value of 0 for the protocol. Before we create the socket we clear away the filename created, if any, by the previous incarnation of the server, otherwise we may not be able to create the new socket. The address for bind still needs to be converted into a form it will accept, but since we are converting a UNIX domain address (a filename), we use sockaddr_un rather than sockaddr_in. Other than these changes, the server is identical.
A Simple UNIX 'IO::Socket' Server

The IO::Socket implementation of this server is similarly changed in only the barest details from the INET version. One important, but perhaps not obvious, difference is that an explicit Listen argument must be given to the new method for the socket to be bound and listened to. Without this, the server will generate an 'invalid argument' error from accept. This is not the case with INET sockets, though we can specify a Listen argument if we want.

#!/usr/bin/perl
# iounixserv.pl
use warnings;
use strict;
use IO::Socket;

my $file = '/tmp/unixserv_socket';

# remove previous socket file, if present
unlink $file;
Networking with Perl
my $server = IO::Socket->new(
    Domain => PF_UNIX,
    Type   => SOCK_STREAM,
    Local  => $file,
    Listen => 5,
);
die "Could not bind: $!\n" unless $server;

print "Server running on file $file...\n";
while (my $connection = $server->accept) {
    print $connection "You're connected to the server!\n";
    while (<$connection>) {
        print "Client says: $_";
    }
    close $connection;
}
A Simple UNIX Socket Client

The UNIX domain version of the client is also almost identical to its INET counterpart:

#!/usr/bin/perl
# unixclient.pl
use warnings;
use strict;
use Socket;

my $file = '/tmp/unixserv_socket';

# Create 'sockaddr_un' structure for the server's socket file
my $servaddr = sockaddr_un($file);

# Create a socket for connecting on
socket SERVER, PF_UNIX, SOCK_STREAM, 0 or die "Unable to create socket: $!";

# connect the socket to the server's socket file
connect SERVER, $servaddr or die "Unable to connect: $!";

# enable autoflush
select SERVER; $| = 1; select STDOUT;

# communicate with the server
print "Client connected.\n";
print "Server says: ", scalar(<SERVER>);
print SERVER "Hello from the client!\n";
print SERVER "And goodbye!\n";
close SERVER;
Again, we create a socket in the UNIX domain and specify a protocol of '0'. We provide connect with a UNIX domain address compiled by sockaddr_un and connect as before. Together, this server and client operate identically to their INET counterparts – only they do it without consuming network resources. The only drawback is that a remote client cannot connect to our UNIX domain-based server. Of course, in some cases, we might actually want that, for security purposes for example.
A Simple UNIX 'IO::Socket' Client

The IO::Socket version of this client is:

#!/usr/bin/perl
# iounixclnt.pl
use warnings;
use strict;
use IO::Socket;

my $file = '/tmp/unixserv_socket';

my $server = IO::Socket->new(
    Domain => PF_UNIX,
    Type   => SOCK_STREAM,
    Peer   => $file,
);
die "Connect failed: $!\n" unless $server;

# communicate with the server
print "Client connected.\n";
print "Server says: ", scalar(<$server>);
print $server "Hello from the client!\n";
print $server "And goodbye!\n";
close $server;
Determining Local and Remote Pathnames

When called on a UNIX domain socket, getpeername should return the name of the file being accessed by the other end of the connection. This would normally be the same as the file used for this end, but the presence of symbolic or hard links might give the client a different pathname for its connection than that of the server.

# get remote path
my $remotepath = unpack_sockaddr_un(getpeername SERVER);
The IO::Socket::UNIX module provides the hostpath and peerpath methods to return the name of the file used by the local and remote sockets respectively:

my $localpath  = $server->hostpath;
my $remotepath = $server->peerpath;
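As a minimal, self-contained sketch (the socket path is arbitrary), hostpath simply reports the path that a listening socket was bound to:

```perl
use strict;
use warnings;
use IO::Socket::UNIX;

my $path = '/tmp/demo_pathname.sock';    # hypothetical path
unlink $path;                            # clear any stale socket file

# bind a listening UNIX domain socket to the path
my $server = IO::Socket::UNIX->new(
    Local  => $path,
    Listen => 1,
) or die "Cannot bind: $!\n";

print "bound to ", $server->hostpath, "\n";    # reports the path above

unlink $path;    # clean up the socket file
```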
Multiplexed Servers

So far in this chapter, we have covered networking fundamentals by implementing simple single-client TCP/IP servers that are capable of handling only one connection at a time. For UDP this isn't too much of a limitation, since UDP works on a message-by-message basis. For TCP it is severely limiting, however, because while one client has the server's attention, connection requests from other clients must wait in the queue until the current client disconnects. If the number of requests exceeds the size of the queue specified by listen, connection requests will begin to bounce.
In order to solve this problem we need to find a way to multiplex the server's attention, so that it can handle multiple clients at once. Depending on our requirements, we have three options:

❑ Give the server the ability to monitor all client connections at once. This involves using the select function to provide one loop with the ability to respond to any one of several connections whenever any one of them becomes active. This kind of server is called a 'nonforking' or 'polling' server. Although it is not as elegant or efficient as a forking or threading server (code-wise), a polling server does have the advantage that it will work on any platform.

❑ Dedicate a separate process to each connection. This involves the fork function, and consists of forking a new child process to manage each connection. This kind of server is called a 'forking' server. However, fork may not be available on all platforms.

❑ Spawn a new thread to handle each connection. This is a similar idea to the forking server but is much more resource-efficient since threads are much more lightweight than separate processes. This kind of server is called a 'threaded' server. However, like the fork function, threads are not always available to us, though this situation is improving and threads are available on more platforms than fork. Perl has two different thread models, as we discussed in Chapter 22, and we may or may not have access to either of them.
We'll consider and give an example of each kind of server in the following section, illustrating the basic similarities and differences between them.
Polling Servers

If we want to write a single process server that can handle multiple connections we need to use the select function, or more conveniently, the IO::Select module. Both the function and the module allow us to monitor, or poll, a collection of filehandles for events, in particular readability. Using this information we can build a list of filehandles and watch for activity on all of them. One of the filehandles is the server socket, and if input arrives on it, we accept the new connection. Otherwise, we read from an existing client's filehandle, and do whatever processing we need to, based on what it sends. Developing a polling server is actually simple, once the basics of the select function are understood.

The 'select' Function

The select function has two completely different uses in Perl. The standard use takes one argument that changes the default output filehandle (STDOUT is the unstated default filehandle for functions like print, but we can change that). The select function also has a four-argument version, which takes three bitmasks (read, write, and exceptions) and an optional timeout value. We can use this version of select to continuously scan multiple filehandles simultaneously for a change in their condition, and react whenever one of them becomes available for reading or writing, or enters an error condition. The bitmasks represent the file numbers, also called file descriptors, of the filehandles that we are interested in communicating with; the first bit refers to file descriptor 0 (usually STDIN), the second to 1 (STDOUT), and the third to 2 (STDERR). Every new filehandle that we create contains inside it a low-level file descriptor, which we can extract with fileno. Using select effectively requires first extracting the file numbers from the filehandles we want to monitor, which we can do with fileno, then building bitmasks from them, which we can do with vec. The following code snippet creates a mask with bits set for standard input and a handle created with one of the IO:: modules, for example IO::Socket:
my @handles;
my $mask = 0;

push @handles, \*STDIN;
vec($mask, fileno(STDIN), 1) = 1;

push @handles, $handle;
vec($mask, fileno($handle), 1) = 1;
We can now feed this mask to select. The @handles array is going to be useful later, but we will disregard it for now. The select function works by writing new bitmasks, indicating the filehandles that are actually in the requested state, into the passed arguments; each bit, initially set to '1', will be turned off if the handle is not readable or writable and does not have an exception (meaning an unusual condition, such as out-of-band data), and left at '1' otherwise. To preserve our mask for later use, we copy it into new scalars for select to overwrite. If we don't want to know about writable file descriptors, which is likely, we can pass undef for that bitmask:

my ($read, $except) = ($mask, $mask);
while (my $got = select $read, undef, $except, undef) {
    $except ? handle_exception : handle_read;
    ($read, $except) = ($mask, $mask);
}

Here handle_read and handle_exception are subroutines, which we have yet to write. Note that the reassignment at the end of the loop must not use my, or we would merely create new lexical variables that immediately go out of scope. We can also write, more efficiently but marginally less legibly:

while (my $got = select $read = $mask, undef, $except = $mask, undef) {
    $except ? handle_exception : handle_read;
}

If the timeout argument is undef, select will wait forever for something to happen, at which point the number of interesting file descriptors is returned. If a timeout in seconds is specified instead, and nothing happens within that time, select returns with $got containing '0':

my ($read, $except);
while ('forever') {
    while (my $got = select $read = $mask, undef, $except = $mask, 60) {
        $except ? handle_exception : handle_read;
    }
    print "Nothing happened for an entire minute!\n";
}
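The mask-building and polling steps above can be seen end-to-end in a minimal sketch, using a pipe in place of a socket:

```perl
use strict;
use warnings;

# a pipe gives us a readable filehandle to poll
pipe(my $reader, my $writer) or die "pipe: $!";
syswrite $writer, "x";    # make the read end readable

# build a read bitmask for the pipe's read end
my $mask = 0;
vec($mask, fileno($reader), 1) = 1;

# poll with a zero timeout: returns immediately with the ready count
my $got = select(my $rout = $mask, undef, undef, 0);
print "$got handle(s) ready\n" if $got;

# the result mask tells us which handle was ready
print "reader is readable\n" if vec($rout, fileno($reader), 1);
```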
The only remaining aspect to deal with is how to handle the result returned by select. We can again use vec for this, as well as the @handles array we created earlier but have not until now used. In essence, we scan through the array of filehandles, checking for the bit that represents its file descriptor in the mask. If it is set, that file descriptor is up to something and we react to it:
sub handle_read {
    foreach (@handles) {
        vec($read, fileno($_), 1) and read_from_client($_);
    }
}
Both select and the IO::Select module are unbuffered operations. That is, they operate directly on file descriptors. This means that buffers are ignored and buffered IO functions such as print and read may produce inconsistent results; for example, there may be input in the filehandle buffer but not at the system level, so the file descriptor shows no activity when there is in fact data to be read. Therefore, rather than using print, read, and other buffered IO functions, we must use system level IO functions like sysread and syswrite, which write directly to the file descriptor and avoid the buffers. Note that simply setting '$| = 1' is not adequate – see Chapter 12 for more details. Also note that one advantage of forked and threaded servers is that they do not have this requirement to use system IO, though we may choose to do so anyway.
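A minimal sketch of the system-level calls in question, using a pipe:

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe: $!";

# syswrite bypasses the stdio buffer and writes straight to the descriptor
syswrite $writer, "hello" or die "syswrite: $!";

# sysread likewise reads directly from the descriptor; it returns the
# number of bytes read, and 0 on end-of-file
my $bytes = sysread $reader, my $buffer, 5;
print "read $bytes bytes: $buffer\n";
```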
The 'IO::Select' Module

Having described how the select function works, we don't actually need to use it, because the IO::Select module provides a far friendlier interface to the select function. Instead of building bitmasks, we just create an object and give it the filehandles to watch:

my $selector = new IO::Select(\*STDIN, $handle);
To scan for active filehandles we then use the can_read method. The return value is an array of the filehandles (not file descriptors) that are currently in the requested state:

my @handles = $selector->can_read;

We can then process each active handle in turn:

foreach (@handles) {
    read_from_client $_;
}
The counterpart to can_read, can_write, scans for writable filehandles. The has_exception method checks for filehandles that are in an exception state. To add a new handle, for example from an accept, we use add:

$selector->add($server->accept);

Finally, to remove a handle, we use remove:

$selector->remove($handle);
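These methods can be exercised with a pipe in a minimal, self-contained sketch:

```perl
use strict;
use warnings;
use IO::Select;

pipe(my $reader, my $writer) or die "pipe: $!";

# monitor the read end of the pipe
my $selector = IO::Select->new($reader);

syswrite $writer, "ping";    # make the read end readable

# can_read with a timeout (in seconds) returns the ready handles
my @ready = $selector->can_read(1);
print scalar(@ready), " handle(s) ready\n";

$selector->remove($reader);  # stop monitoring it again
```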
These are the primary methods supplied by IO::Select. There are other, less commonly used methods. For completeness, we will mention each of them briefly:

exists
Called with a filehandle, returns the filehandle if it is present in the select object, and undef if not. For example:

    $selector->add($fh) unless $selector->exists($fh);

handles
Returns an array of all the filehandles present in the select object:

    my @handles = $selector->handles;

count
Returns the number of filehandles present in the select object:

    my $handles = $selector->count;

bits
Returns a bitmask representing the file numbers of the registered filehandles, suitable for passing to the select function:

    my $bitmask = $selector->bits;

select
Class method. Performs a select in the style of the function but using select objects rather than bitmasks. Three array references are returned, each holding the handles that were readable, writable, and exceptional (so to speak) respectively. An optional fourth argument containing a timeout in seconds may also be supplied. For example:

    my ($readers, $writers, $excepts) = IO::Select->select(
        $read_selector, $write_selector, $except_selector,
        60    # timeout
    );
    foreach (@{$readers}) {
        ...
    }

This method is essentially similar to calling can_read, can_write, and has_exception on three different select objects simultaneously. It mimics the simultaneous polling of these three conditions with three different bitmasks that the select function performs. Note that the same select object can be passed for all three select object arguments. Armed with this interface, we can now build a simple multiplexing server based on the select function.
A Simple Polling Server

Here is an example of a simple INET polling TCP server. It uses IO::Select to monitor a collection of filehandles, and reacts whenever one of them becomes active. The first handle to be monitored is the server socket, and we check for it especially when an input event occurs. If the server socket is active, it means a new connection request has arrived, so we accept it. Otherwise, it means a client has sent us something, so we respond:
#!/usr/bin/perl
# ioselectserv.pl
use warnings;
use strict;
use IO::Socket;
use IO::Select;

my $serverport = 4444;

# create a socket to listen on
my $server = new IO::Socket(
    Domain    => PF_INET,
    Proto     => 'tcp',
    LocalPort => $serverport,
    Listen    => SOMAXCONN,
);
die "Cannot bind: $!\n" unless $server;

# create a 'select' object and add server fh to it
my $selector = new IO::Select($server);

# stop closing connections from aborting the server
$SIG{PIPE} = 'IGNORE';

# loop and handle connections
print "Multiplex server started on port $serverport...\n";
while (my @clients = $selector->can_read) {
    # input has arrived, find out which handle(s)
    foreach my $client (@clients) {
        if ($client == $server) {
            # it's a new connection, accept it
            my $newclient = $server->accept;
            syswrite $newclient, "You've reached the server\n";
            my $port = $newclient->peerport;
            my $name = $newclient->peerhost;
            print "New client $name:$port\n";
            $selector->add($newclient);
        } else {
            # it's a message from an existing client
            my $port = $client->peerport;
            my $name = $client->peerhost;
            my $message;
            if (sysread $client, $message, 1024) {
                print "Client $name:$port sent: $message";
                syswrite $client, "Message received OK\n";
            } else {
                $selector->remove($client);
                $client->shutdown(SHUT_RDWR);
                print "Client disconnected\n";    # port, name not defined
            }
        }
    }
}
This passable server handles new connections, carries out a simple exchange of messages with a client, and drops them when they disconnect. Writing to a connection that the client has already closed causes Perl to raise a SIGPIPE signal, which would abort our application if not trapped. In this case, we choose to ignore it and detect the closing of the connection by getting a read event on a filehandle with no data – sysread returns 0 bytes read. As we discussed in Chapter 12, this is the only reliable way to check for closed connections at the system level, as eof does not work on unbuffered connections.
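The end-of-file convention relied on here is easy to demonstrate in isolation; a minimal sketch with a pipe:

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe: $!";
close $writer;    # the 'remote end' goes away

# a read event with no data: sysread returns 0 rather than undef
my $bytes = sysread $reader, my $buffer, 1024;
print "connection closed\n" if defined $bytes and $bytes == 0;
```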
A Simple Forking Server

The alternative to using a single process to read from multiple connections is to use multiple processes, one per connection. The following server uses fork to generate an individual child process for each client connection:

#!/usr/bin/perl
# ioforkserv.pl
use warnings;
use strict;
use IO::Socket;

# Turn on autoflushing
$| = 1;

my $port = 4444;

my $server = IO::Socket->new(
    Domain    => PF_INET,
    Proto     => 'tcp',
    LocalPort => $port,
    Listen    => SOMAXCONN,
    Reuse     => 1,
);
die "Bind failed: $!\n" unless $server;

# tell OS to clean up dead children
$SIG{CHLD} = 'IGNORE';

print "Multiplex server running on port $port...\n";
while (my $connection = $server->accept) {
    my $name = $connection->peerhost;
    my $port = $connection->peerport;
    if (my $pid = fork) {
        close $connection;
        print "Forked child $pid for new client $name:$port\n";
        next;    # on to the next connection
    } else {
        # child process - handle connection
        print $connection "You're connected to the server!\n";
        while (<$connection>) {
            print "Client $name:$port says: $_";
            print $connection "Message received OK\n";
        }
        print "Client $name:$port disconnected\n";
        $connection->shutdown(SHUT_RDWR);
        exit;    # the child must not fall back into the accept loop
    }
}
When the server receives a connection, it accepts it and then forks a child process. It then closes its copy of the connection since it has no further use for it – the child contains its own copy, and can use that. Having no further interest in the connection, the main process returns to the accept call and waits for another connection.
Meanwhile, the child communicates with the remote system over the newly connected socket. When the readline operator returns a false value, an indication of the end-of-file condition, the child determines that the remote end has been closed, and shuts down the socket. Because each connection communicates with a separate child process, each connection can coexist with all the others. The main process, freed from the responsibility of handling all the communication duty, is free to accept more connections, allowing multiple clients to communicate with the server simultaneously.
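The division of labour described above can be sketched minimally with fork and a pipe, with no network involved (the message text is arbitrary):

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe: $!";

my $pid = fork;
die "fork: $!" unless defined $pid;

if ($pid == 0) {
    # child: plays the part of the per-connection handler
    close $reader;
    print $writer "hello from child $$\n";
    close $writer;    # closing flushes and signals end-of-file
    exit;
} else {
    # parent: free to carry on, then collect the result
    close $writer;
    my $message = <$reader>;
    print "parent received: $message";
    waitpid $pid, 0;
}
```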
A Simple Threaded Server

The alternative to forking processes is to use threads. Like forking, threads may or may not be available on our platform, so which we use may be as much a matter of circumstance as choice. Instructions on how to build a thread-capable Perl are included in Chapter 1. Assuming we do have threads, we can write a multiplexed server along very similar lines to the forking server. The threaded version is however simpler, as well as considerably less resource hungry. Below is a threaded version (designed for Perl's older threaded model) of the forking server we showed above:

#!/usr/bin/perl
# iothreadserv.pl
use warnings;
use strict;

BEGIN {
    use Config;
    die "No thread support!\n" unless $Config{'usethreads'};
}

use Thread;
use IO::Socket;

# Autoflushing on
$| = 1;

my $port = 4444;

my $server = IO::Socket->new(
    Domain    => PF_INET,
    Proto     => 'tcp',
    LocalPort => $port,
    Listen    => SOMAXCONN,
    Reuse     => 1,
);
die "Bind failed: $!\n" unless $server;

print "Multiplex server running on port $port...\n";
while (my $connection = $server->accept) {
    my $name = $connection->peerhost;
    my $port = $connection->peerport;

    my $thread = new Thread(\&connection, $connection, $name, $port);
    print "Created thread ", $thread->tid, " for new client $name:$port\n";
    $thread->detach;
}
# child thread - handle connection
sub connection {
    my ($connection, $name, $port) = @_;
    $connection->autoflush(1);
    print $connection "You're connected to the server!\n";
    while (<$connection>) {
        print "Client $name:$port says: $_";
        print $connection "Message received OK\n";
    }
    $connection->shutdown(SHUT_RDWR);
}
When the server receives a connection, it accepts it and spawns a new thread to handle it. We pass the connection filehandle to the new thread on start up. We don't need to worry about locking variables either, since each thread cares only about its own connection. Of course, in a real server environment, when performing more complex tasks such as database queries, we would probably need to worry about this, but this simple example can get away without locks. We also pass in the name of the host and the port number, though we could just as easily have determined these inside the thread. The main thread is only concerned with accepting new connections and spawning child threads to deal with them. It doesn't care about what happens to them after that (though again a more complex server probably would), so it uses the detach method to renounce all further interest in the child. Threads all run in the same process, so there are no process IDs to manage, no child processes to ignore, and no duplicate filehandles to close. In fact, all the threads have access to all the same filehandles, but since each thread only cares about its own particular filehandle, none of them treads on each other's toes. It's a very simple design and it works remarkably well. More information about threaded processes in Perl can be found in Chapter 22.
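As a minimal sketch of the spawn-and-collect pattern using the current threads module (rather than the older Thread interface shown above), assuming a thread-enabled Perl; the name and port arguments are arbitrary:

```perl
use strict;
use warnings;
use Config;

die "No thread support!\n" unless $Config{usethreads};
require threads;

# spawn a worker, passing it arguments just as we pass the connection above
my $thread = threads->create(sub {
    my ($name, $port) = @_;
    return "handled $name:$port";
}, 'localhost', 4444);

# join instead of detach here, so we can collect the result
my $result = $thread->join;
print "$result\n";
```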
Getting Network Information

All systems have network information configured locally (unless of course they are diskless workstations, in which case their configuration was loaded from the network and stored in memory). This information includes things like host names and IP addresses, network services, and so on. General information about addresses (both hostnames and resolved IPs), networks, services, and protocols can usually be accessed through a standard suite of operating system functions to which Perl provides a direct mapping. Alternatively, for the specific and common purpose of finding the name of our own system (which is more complex than it might at first appear), we can use the Sys::Hostname module.

Hostname queries in particular will also make use of any name resolution services available on the system, including DNS, NIS/NIS+, and files, if configured. This is because they use the underlying operating system calls to do the actual work rather than attempting to parse local configuration files. Not only does this make them more portable, it also makes them able to work on both local and remote hostnames.

A lot of the functions detailed here, particularly for hosts and networks, are wrapped and made easier to use by the utility functions inet_aton and inet_ntoa, covered under Sockets earlier in the chapter. The IO::Socket module further abstracts the details of service, protocol, and network configuration from the user; in these cases we may be able to avoid actually calling any of these functions directly.
System Network Files

All networked systems hold information about hosts, services, networks, and protocols. UNIX systems keep this information in a set of specially formatted files in the /etc directory, while Windows NT stores them under \WinNT\system32\drivers\etc:

    Hosts:      /etc/hosts
    Networks:   /etc/networks
    Services:   /etc/services
    Protocols:  /etc/protocols
To extract information from these files, UNIX defines a standard set of C library functions, analogues of which are available within Perl as built-in functions. Because Perl tries to be a platform-independent language, many of these functions also work on other systems such as Windows or VMS, though they don't always read the same files or formats as the UNIX equivalents. Perl provides no fewer than twenty built-in functions for making inquiries of the system network configuration. They are divided into four groups of five, for hosts, networks, services, and protocols respectively. Each group contains three functions for getting, resetting, and closing the related network information file, plus two functions for translating the name of the host, network, service, or protocol to and from its native value. The table below summarizes all twenty (functions marked with an * are not supported on Win32 platforms, and many applications will not run because of their use):

                Read          Reset         Close          From Name        To Name
    Hosts       gethostent    sethostent    endhostent*    gethostbyname    gethostbyaddr
    Networks    getnetent     setnetent     endnetent*     getnetbyname     getnetbyaddr
    Services    getservent    setservent    endservent*    getservbyname    getservbyport
    Protocols   getprotoent   setprotoent   endprotoent*   getprotobyname   getprotobynumber*
Perl's versions of these functions are smart and pay attention to the contexts in which they are called. If called in a scalar context, they return the name of the host, network, service, or protocol. For example:

# look up the protocol number for 'tcp' in scalar context
my $protocol = getprotobyname('tcp');

# find the service assigned to port 80 (it should be 'http'); note that
# the second argument is the protocol name, not its number
my $servicename = getservbyport(80, 'tcp');

The only exception to this is the by-name functions, which return the other logical value for the network information being queried – the IP address for hosts and networks, the port number for services, and the protocol number for protocols:

# find out what port this system has 'http' assigned to
my $httpd_port = getservbyname('http', 'tcp');
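A minimal sketch of these lookups round-tripping through the protocol database (the number printed depends on the system's /etc/protocols, though 'tcp' is 6 virtually everywhere):

```perl
use strict;
use warnings;

# name to number, then number back to name
my $number = getprotobyname('tcp');
my $name   = getprotobynumber($number);

print "tcp is protocol number $number\n";
print "round trip ok\n" if $name eq 'tcp';
```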
Called in a list context, each function returns a list of values that varies according to the type of information requested. We can process this list directly, for example:

#!/usr/bin/perl
# listserv.pl
use warnings;
use strict;

while (my ($name, $aliases, $port, $proto) = getservent) {
    print "$name\t$port/$proto\t$aliases\n";
}
Alternatively we can use one of the four Net:: family modules written for the purpose of making these values easier to handle. Each module overrides the original Perl built-in function with a new object-oriented version that returns an object, with methods for each value:

#!/usr/bin/perl
# listobjserv.pl
use warnings;
use strict;
use Net::servent;

while (my $service = getservent) {
    print 'Name    : ', $service->name, "\n";
    print 'Port No : ', $service->port, "\n";
    print 'Protocol: ', $service->proto, "\n";
    print 'Aliases : ', join(', ', @{$service->aliases}), "\n\n";
}
These four modules all accept an optional fields argument that imports a variable for each part of the returned list. For the Net::servent module, the variables are the subroutines prefixed with s_, so the alternative way of writing the above is:

#!/usr/bin/perl
# listfldserv.pl
use warnings;
use strict;
use Net::servent qw(:FIELDS);

while (my $service = getservent) {
    print 'Name    : ', $s_name, "\n";
    print 'Port No : ', $s_port, "\n";
    print 'Protocol: ', $s_proto, "\n";
    print 'Aliases : ', join(', ', @s_aliases), "\n\n";
}
Importing the variables still overrides the core functions, however. We can get them back by prefixing them with CORE::, as in:

($name, $aliases, $proto) = CORE::getprotobyname($name_in);
Alternatively, we can gain access to the object methods without the overrides by passing an empty import list:

#!/usr/bin/perl
# listcorserv.pl
use warnings;
use strict;
use Net::servent ();

while (my $service = Net::servent::getservent) {
    print 'Name    : ', $service->name, "\n";
    print 'Port No : ', $service->port, "\n";
    print 'Protocol: ', $service->proto, "\n";
    print 'Aliases : ', join(', ', @{$service->aliases}), "\n\n";
}
We will cover each of the four kinds of network information in a little more detail below.
Hosts

Host information consists of an IP address followed by a primary name and possibly one or more aliases. On a UNIX server, this is defined in the /etc/hosts file, of which a semi-typical example might be:

127.0.0.1       localhost localhost.localdomain myself
192.168.1.10    borahorzagobachul.culture.gal horza
192.168.1.102   sleeperservice.culture.gal sleeper
192.168.1.103   littlerascal.culture.gal rascal
192.168.1.1     hub.chiark.culture.gal hub chiark
Host information is also available via DNS, NIS, NIS+, files, and a number of other name resolution protocols; gethostent, gethostbyname, and gethostbyaddr will all use these automatically if they are configured on the system. Whatever the eventual origin of the information, each definition (or 'entry', hence gethostent) comprises an IP address, a primary name, and a list of aliases. The gethostent function retrieves one of these in turn from the local host configuration, starting from the first:

#!/usr/bin/perl
# listhost.pl
use warnings;
use strict;

while (my ($name, $aliases, $type, $length, @addrs) = gethostent) {
    print "$name\t[$type]\t$length\t$aliases\n";
}
Here the name and aliases are as listed above in the example hosts file. The type and length are the address type (2, or AF_INET, for Internet addresses, which is in fact the only supported type) and the address length (4 for IPv4 and 16 for IPv6). The list of addresses comprises all the IP addresses to which this entry applies. In a local host configuration file this should only be one element in length, so we could also get away with saying:

my ($name, $aliases, $type, $length, @addrs) = gethostent;
The format of the address or addresses is a packed string of four bytes (or octets) describing the IP address. To get a string representation we need to process it with unpack or use the Socket module's inet_ntoa subroutine, which we'll come back to in a moment or two. Since gethostent only operates on the local host configuration, this is okay. For network lookups as performed by gethostbyname or gethostbyaddr we may get more than one address. Alternatively, if gethostent is called in a scalar context, it just returns the name:

#!/usr/bin/perl
# listhostnames.pl
use warnings;
use strict;

my @hosts;
while (my $name = gethostent) {
    push @hosts, $name;
}
print "Hosts: @hosts\n";
gethostent resembles the opendir and readdir directory functions, except that the open happens automatically and there is no filehandle that we get to see. However, Perl does keep a filehandle open internally, so if we only want to pass partway through the defined hosts we should close that filehandle with:

endhostent;

If the resolution of the request involves a network lookup to a remote DNS or NIS server, endhostent will also close down the resulting network connection. We can also carry out the equivalent of rewinddir with the sethostent function:

sethostent 1;

If we are querying the local host file, this rewinds to the start of the file again. If we are making network queries, this tells the operating system to reuse an existing connection if one is already established. More specifically, it tells the operating system to open and maintain a TCP connection for queries rather than use a one-off UDP datagram for the query. In this case, we can shut down the TCP connection with:

sethostent 0;
Rather than scanning host names one by one, we can do direct translations between name and IP address with the gethostbyname and gethostbyaddr functions. Both these functions will cause the operating system to make queries of any local and remote name resolution services configured. In list context, both functions return a list of values like gethostent. In scalar context, they return the name (for gethostbyaddr) or address (for gethostbyname):

($name, $aliases, $type, $length, @addrs) = gethostbyname($name);
$name = gethostbyaddr($addr, $addrtype);
$addr = gethostbyname($name);
In the event that the host could not be found, both functions return undef (or an empty list, in list context) and set $? (not, incidentally, $!) to indicate the reason. We can use $? >> 8 to obtain a real exit code that we can assign to $!, however (see Chapter 16 for more on this and the Errno module too). The addrtype parameter of gethostbyaddr specifies the address format for the address supplied. There are two formats we are likely to meet, both of which are defined as symbolic constants by the Socket module:

    AF_INET     Internet address
    AF_INET6    IPv6 Internet address
In all cases, the addresses returned (and in the case of gethostbyaddr, the address we supply) are in packed form. This isn't (except by sheer chance) printable, and it isn't very convenient to pass as a parameter either. We can use pack and unpack to convert between 'proper' integers and addresses (the following example is for IPv4, or AF_INET, addresses):

$addr = pack('C4', @octets);
@octets = unpack('C4', $addr);
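A minimal sketch of this conversion, using one of the addresses from the example hosts file above:

```perl
use strict;
use warnings;

# pack four octets into the 4-byte binary form used by the host functions
my $addr = pack('C4', 192, 168, 1, 103);

# and unpack it again to recover the dotted-quad components
my @octets = unpack('C4', $addr);
print join('.', @octets), "\n";    # 192.168.1.103
```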
More conveniently, we can make use of the inet_aton and inet_ntoa functions from the Socket module to convert between both addresses and hostnames and IP strings of the form n1.n2.n3.n4. For example, this short script looks up and prints out the IP addresses for the hostnames given on the command line:

#!/usr/bin/perl
# hostlookup.pl
use warnings;
use strict;
use Socket qw(/inet/);

die "Give me a hostname!\n" unless @ARGV;

while (my $lookup = shift @ARGV) {
    my ($name, $aliases, $type, $length, @addrs) = gethostbyname($lookup);
    if ($name) {
        foreach (@addrs) {
            $_ = inet_ntoa($_);
        }
        if ($name eq $lookup) {
            print "$lookup has IP address: @addrs\n";
        } else {
            print "$lookup (real name $name) has IP address: @addrs\n";
        }
    } else {
        print "$lookup not found\n";
    }
}
We could use this script like this:

> perl hostlookup.pl localhost rascal hub chiark
Chapter 23
And we would get (assuming the host file example above and no network lookups):

localhost has IP address: 127.0.0.1
rascal (real name rascal.culture.gal) has IP address: 192.168.1.103
hub (real name hub.chiark.culture.gal) has IP address: 192.168.1.1
chiark (real name hub.chiark.culture.gal) has IP address: 192.168.1.1

Likewise, to look up a host by address (note that gethostbyaddr also needs the address type):

$localhostname = gethostbyaddr(inet_aton('127.0.0.1'), AF_INET);
Or to find the true name, using a hostname lookup:

$localhostname = gethostbyaddr(inet_aton('localhost'), AF_INET);
While we're examining this particular example, it is worth noting that for finding the hostname of our own system we are better off using the Sys::Hostname module, described later in the chapter. One word of warning – be aware that both gethostbyname and gethostbyaddr reset the pointer used by gethostent in the same way that sethostent does, so they cannot be used in a loop of gethostent calls. The object-oriented override module for the host query functions is Net::hostent. It provides both an object-oriented alternative and, on request, a collection of imported scalar variables that are set by each request. Strangely, however, it only provides an object-oriented interface for gethostbyname and gethostbyaddr, not gethostent, so to avoid resetting the pointer of gethostent we have to take a two-stage approach. Here is a short program to list all hosts in the object-oriented style:

#!/usr/bin/perl
# listobjhost.pl
use warnings;
use strict;
use Net::hostent;
use Socket qw(inet_ntoa);

my @hosts;
while (my $host = gethostent) {
    push @hosts, $host;
}

while (my $host = shift @hosts) {
    $host = gethostbyname($host);
    print 'Name     : ', $host->name, "\n";
    print 'Type     : ', $host->addrtype, "\n";
    print 'Length   : ', $host->length, " bytes\n";
    print 'Aliases  : ', join(', ', @{$host->aliases}), "\n";
    print 'Addresses: ', join(', ', map {inet_ntoa($_)} @{$host->addr_list}), "\n\n";
}
And again using variables:

#!/usr/bin/perl
# listfldhost.pl
use warnings;
use strict;
use Net::hostent qw(:FIELDS);
use Socket qw(inet_ntoa);

my @hosts;
while (my $host = CORE::gethostent) {
    push @hosts, $host;
}

while (my $host = shift @hosts) {
    $host = gethostbyname($host);
    print 'Name     : ', $h_name, "\n";
    print 'Type     : ', $h_addrtype, "\n";
    print 'Length   : ', $h_length, " bytes\n";
    print 'Aliases  : ', join(', ', @h_aliases), "\n";
    print 'Addresses: ', join(', ', map {inet_ntoa($_)} @h_addr_list), "\n\n";
}
If we only want to import some variables, we can pass them directly in the import list, but if we do so we must also import the overrides we want to use:

use Net::hostent qw($h_name @h_aliases @h_addr_list gethostent);
We use the CORE:: prefix here partly to remind ourselves that gethostent is not being overridden, and partly to protect ourselves in case a future version of the Net::hostent module extends to cover it. gethostbyname and gethostbyaddr also return objects. gethostbyaddr automatically makes use of the Socket module to handle conversions, so we can supply an IP address like 127.0.0.1 without needing to worry about whether it will be understood or not. In addition, the Net::hostent module supplies a shorthand subroutine, gethost, which calls either gethostbyname or gethostbyaddr, depending on whether its argument looks like an IP address or a hostname.
Networks

Networks are simply groups of IP addresses that have been assigned a network name. For example, for the host file example given earlier, 127 and 192.168.1 are potential groupings that may be given a network name:

loopback    127.0.0.0
localnet    192.168.1.0
Networks are therefore essentially very much like hosts; see below:

($name, $aliases, $type, $netaddr) = getnetent;
Or, in scalar context:

$netname = getnetent;
All these values have the same meanings as the corresponding values returned by gethostent and friends, only of course for networks. $netaddr is now an IP address for the network and, unlike a host address, which identifies one host, it refers to many:

use Socket qw(/inet/);
$netip = inet_ntoa($netaddr);
$netaddr = inet_aton('192.168');
The setnetent function resets the pointer in the network information file to the start or (in the case of NIS+ lookups) preserves an existing network connection if one happens to be active. Like sethostent it takes a Boolean flag as a parameter, and switches between TCP and UDP connections for network requests. For file-based access, using getnetbyname or getnetbyaddr also resets the pointer of getnetent in the local network file. The endnetent function closes the network information file or network connection. getnetbyname and getnetbyaddr work similarly to their 'host' counterparts, except for network addresses. Like gethostbyname, getnetbyname will do a remote lookup if the host network configuration is set to do that, although this is considerably more rare.

$netaddr = getnetbyname('mynetwork');
$netname = getnetbyaddr('192.168.1', AF_INET);   # AF_INET from 'Socket'
The object-oriented override module for the network query functions is Net::netent. Like Net::hostent, it does not override getnetent, only getnetbyaddr and getnetbyname. Since both these functions reset the pointer of getnetent, we have to collect names (or addresses) first, then subsequently run through each one with getnetbyname or getnetbyaddr. Here's a script that dumps out network information in the same vein as the object-oriented host script earlier:

#!/usr/bin/perl
# getobjnet.pl
use warnings;
use strict;
use Net::netent;
use Socket qw(inet_ntoa);

my @nets;
while (my $net = CORE::getnetent) {
    push @nets, $net;
}

while (my $net = shift @nets) {
    $net = getnetbyname($net);
    print 'Name     : ', $net->name, "\n";
    print 'Type     : ', $net->addrtype, "\n";
    print 'Aliases  : ', join(', ', @{$net->aliases}), "\n";
    print 'Addresses: ', $net->net, "\n\n";
}
Note that this script will happily return nothing at all if we don't have any configured networks, which is quite possible. The field-based version of this script is:

#!/usr/bin/perl
# getfldnet.pl
use warnings;
use strict;
use Net::netent qw(:FIELDS);
use Socket qw(inet_ntoa);

my @nets;
while (my $net = CORE::getnetent) {
    push @nets, $net;
}

while (my $net = shift @nets) {
    $net = getnetbyname($net);
    print 'Name     : ', $n_name, "\n";
    print 'Type     : ', $n_addrtype, "\n";
    print 'Aliases  : ', join(', ', @n_aliases), "\n";
    print 'Addresses: ', $n_net, "\n\n";
}
In addition, Net::netent defines the getnet subroutine. This attempts to produce the correct response from whatever is passed to it by calling either getnetbyname or getnetbyaddr depending on the argument. Like gethost, it automatically handles strings that look like IP addresses, so we can say:

$net = getnet('192.168.1');
print $net->name;   # or $n_name
Network Services

All networked hosts have the ability to service network connections from other hosts in order to satisfy various kinds of request; anything from a web page to an email transfer. These services are distinguished by the port number that they respond to; web service (HTTP) is on port 80, FTP is on port 21, SMTP (for email) on port 25, and so on. Rather than hard-code a port number into applications that listen for network connections, we can instead configure a service, assign it to a port number, and then have the application listen for connections on the service port number. On UNIX systems, this association of service to port number is done in the /etc/services file, a typical sample of which looks like this:

tcpmux      1/tcp                   # TCP port service multiplexer
ftp-data    20/tcp
ftp         21/tcp
telnet      23/tcp
smtp        25/tcp      mail
time        37/tcp      timserver
time        37/udp      timserver
nameserver  42/tcp      name        # IEN 116
whois       43/tcp      nicname
domain      53/tcp      nameserver  # name-domain server
domain      53/udp      nameserver
finger      79/tcp
www         80/tcp      http        # World Wide Web HTTP
www         80/udp                  # Hypertext Transfer Protocol
A real services file contains a lot more than this, but this is an indicative sample. We can also define our own services, though on a UNIX system they should be on at least port 1024 or higher, since many ports lower than this are allocated as standard ports for standard services. Furthermore, only the administrator has the privileges to bind to ports under 1024. For example:

myapp    4444/tcp    myserver    # My server's service port
Each entry consists of a primary service name, a port number and a network protocol, and an optional list of aliases. The getservent function retrieves one of these in turn from the local host configuration, starting from the first. For example, as we showed earlier in the general discussion:

#!/usr/bin/perl
# listserv.pl
use warnings;
use strict;

while (my ($name, $aliases, $port, $proto) = getservent) {
    print "$name\t$port/$proto\t$aliases\n";
}
Here name, aliases, and port are reasonably self-evident. The protocol is a numeric value describing the protocol number – 6 for TCP, 17 for UDP, for example. We can handle this number and convert it to and from a name with the protocol functions such as getprotoent and getprotobynumber, which we will come to in a moment. setservent resets the pointer for the next getservent to the start of the services file, if given a true value. endservent closes the internal filehandle. These are similar to their network and host counterparts, but are far less likely to open network connections for queries (DNS and NIS do not do this, but NIS+ conceivably might). Services are related to ports, not addresses, so the function getservbyport returns a list of parameters for the service defined for the named port, or an empty list otherwise. Since ports may be assigned to different services using different network protocols, we also need to supply the protocol we are interested in:

$protocol = getprotobyname('tcp');
($name, $aliases, $port, $proto) = getservbyport(80, $protocol);
($name, $aliases, $port, $proto) = getservbyname('http', $protocol);
Or, in scalar context, for our user-defined service:

$name = getservbyport(4444, $protocol);
$port = getservbyname('myserver', $protocol);
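As an aside, the built-in lookups can also be called with the protocol given directly by name, which is often the simplest form. This minimal sketch assumes a conventional /etc/services that maps http/tcp to port 80:

```perl
#!/usr/bin/perl
# Scalar-context service lookups with the protocol passed by name.
use warnings;
use strict;

my $port = getservbyname('http', 'tcp');   # usually 80
my $name = getservbyport(80, 'tcp');       # usually 'http' (or 'www')

printf "http/tcp => port %s\n", defined $port ? $port : 'unknown';
printf "80/tcp   => name %s\n", defined $name ? $name : 'unknown';
```

On a system with no services file at all, both calls simply return undef, which the prints above allow for.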
The object-oriented override module for service queries is Net::servent, and we saw examples of using it in both the object-oriented style and with imported variables earlier. Unlike the host and network modules, Net::servent does override getservent, so we can iterate through services without having to collect names or port numbers into a list first and then run through the result with getservbyname or getservbyport.
In addition, Net::servent defines getserv as a convenience subroutine that maps a passed argument onto either getservbyport or getservbyname, depending on whether the passed argument looks numeric or not.
Network Protocols

Protocols are defined in terms of numbers at the network level – 0 for IP, 6 for TCP, 17 for UDP, and so on. These are designated standards defined in RFCs. Locally, these numbers are associated with a user-friendly name, so instead of referring to 'protocol number 6' we can say 'tcp' instead. For example, here are a few entries from a typical /etc/protocols file on a UNIX server:

ip       0    IP         # internet pseudo protocol number
icmp     1    ICMP       # internet control message protocol
igmp     2    IGMP       # Internet Group Management
ipencap  4    IP-ENCAP   # IP encapsulated in IP (officially `IP')
tcp      6    TCP        # transmission control protocol
udp      17   UDP        # user datagram protocol
This file is not one we are expected ever to add to, although we can create a protocol if we want. We can, however, interrogate it. We can list the protocol definitions defined for the local system with getprotoent, which works in the familiar way, reading the first entry in the list and then each subsequent entry every time it is called. In scalar context, it returns the protocol name:

$firstprotoname = getprotoent;
In list context it returns a list of values consisting of the primary name for the protocol, any aliases defined for it, and the protocol number. This short script illustrates how we can use it to dump out all the protocol details defined on the system:

#!/usr/bin/perl
# listprot.pl
use warnings;
use strict;

while (my ($name, $aliases, $proto) = getprotoent) {
    print "$proto $name ($aliases)\n";
}
endprotoent closes the internal filehandle, and setprotoent resets the pointer for the next entry to be returned by getprotoent to the start of the file if given a true value, as usual. Also as usual, calling getprotobyname or getprotobynumber resets the pointer. As with their similarly named brethren, getprotobyname and getprotobynumber return the protocol number and protocol name respectively in scalar context:

$proto_name = getprotobynumber($proto_number);
$proto_number = getprotobyname($proto_name);
In list context, they return the same values as getprotoent.
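A minimal sketch of the scalar-context behavior just described, assuming a conventional /etc/protocols that maps tcp to 6 and udp to 17:

```perl
#!/usr/bin/perl
# Name-to-number and number-to-name protocol lookups in scalar context.
use warnings;
use strict;

my $tcp_number = getprotobyname('tcp');    # usually 6
my $udp_name   = getprotobynumber(17);     # usually 'udp'

printf "tcp => %s, 17 => %s\n",
    defined $tcp_number ? $tcp_number : '?',
    defined $udp_name   ? $udp_name   : '?';
```

As with the service lookups, either call returns undef if the protocol is not defined on the local system.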
The object-oriented override module for protocols is Net::protoent, which overrides the getprotoent, getprotobyname, and getprotobynumber functions with equivalents that return objects. Here is the object-oriented way of listing protocol information using Net::protoent:

#!/usr/bin/perl
# listobjproto.pl
use warnings;
use strict;
use Net::protoent;

while (my $protocol = getprotoent) {
    print "Protocol: ", $protocol->proto, "\n";
    print "Name    : ", $protocol->name, "\n";
    print "Aliases : ", join(', ', @{$protocol->aliases}), "\n\n";
}
As with the other Net:: modules that provide object-oriented wrappers for Perl's network information functions, we can also import variables that automatically take on the values of the last call. The equivalent field-based script is:

#!/usr/bin/perl
# listfldproto.pl
use warnings;
use strict;
use Net::protoent qw(:FIELDS);

while (my $protocol = getprotoent) {
    print "Protocol: ", $p_proto, "\n";
    print "Name    : ", $p_name, "\n";
    print "Aliases : ", join(', ', @p_aliases), "\n\n";
}
In addition, Net::protoent defines the getproto subroutine as a convenient way to produce a valid response from almost any passed value. Numeric values are passed to getprotobynumber; anything else is presumed to be a name and passed to getprotobyname.
Determining the Local Hostname

Determining the local hostname for the system can sometimes be a tricky thing. Firstly, a system can happily have more than one network interface and therefore more than one name. Secondly, the means by which the hostname is defined varies wildly from one operating system to another. If we just want to open a network connection to a service or application (such as a web server) running on the local machine, we can often get away with simply connecting to the local loopback IP address. However, this is not enough if we need to identify ourselves to the application at the other end. On most UNIX platforms, we can execute the hostname command to find the name of the host or, failing that, uname -n; alternatively we can use the gethostbyname function to try to identify the true name of the local host from one of its aliases. The localhost alias approach should work on any system that has it defined as the hostname for the loopback address in either DNS or the hosts file. Be aware, though, that it may not always return a fully qualified name (hostname.domain).
Because of this confusing mess of alternative approaches, Perl comes with the Sys::Hostname module, which attempts all of these methods and more in turn, with different tests depending on the underlying platform. It provides a single subroutine, hostname, which hides all of these mechanics from us and allows us simply to say:

#!/usr/bin/perl
# hostname.pl
use warnings;
use strict;
use Sys::Hostname;

print "We are: ", hostname, "\n";
This will try to produce the correct result on almost any platform (even, interestingly, Symbian's EPOC) and is probably the best solution for both simplicity and portability.
Summary

In this chapter, we covered networking with Perl in some detail. We started by explaining how the Internet protocols work, and how packets are routed from network to network. We then moved on to explaining how to use the Socket and IO::Socket modules to create network clients and servers. Specifically, we covered:

❑ Using Socket.pm; we covered how to create simple servers and clients that connect and listen to specified sockets. We showed how to do this using TCP, UDP, and UNIX domain sockets.

❑ Using IO::Socket, we repeated every example used above in the simpler format used with the IO::Socket module.

❑ Broadcasting with UDP, and how to send packets to multiple destinations simultaneously.

Later, we covered how to use forking, threading, and polling to create network servers that can deal with multiple requests from clients. We had examples of each type of server included in the text, which can be adapted to any other projects. In the last section of this chapter, we looked at retrieving different kinds of network information, and discussed the following topics:

❑ Getting host information, including a portable way to discover the local hostname, how to discover the IP address of a specified hostname, and how to acquire any aliases it might have.

❑ Getting network information and discovering what host or IP address is part of what network, and whether network names have been defined.

❑ We described how to discover what network services are available on our system.

❑ Finally, we saw how to examine the network protocols available to our system, even though it is very rare to ever update this information.

The examples throughout this chapter performed similar tasks in different ways (including writing the applications in an object-oriented style).
Writing Portable Perl

Perl runs on every major operating system, and most of the minor ones, from VOS and Cray to Mac OS and Windows, and, of course, UNIX. Most of the time, Perl code will work the same on any of these platforms, because Perl is one of the most portable programming languages around. It may be difficult to see how a language, or any program written in that language, could work the same across all of these disparate platforms. They are often significantly different in, amongst other things, memory management, file systems, and networking. The developers of Perl have done their best to hide these differences, but it is still possible to write Perl code that runs properly on only one platform. A portable Perl program may be portable only on all UNIX platforms (sometimes, depending on the task, that in itself is difficult or impossible). Most often though, 'portable Perl' is code that will work on all the platforms Perl runs on. Calling fork on Mac OS or symlink on MS Windows are two examples of non-portable code, since the functionality for those functions does not exist on those platforms. However, the great thing about Perl is that the overwhelming majority of Perl code is portable. Remember, no code is universal – no matter what it is, it will always fail on some platform, somewhere. However, where Perl does exist, most Perl code will run; or it can be made to run, if the author of the code keeps some portability guidelines in mind.
Chapter 24
Maybe We Shouldn't Bother?

Before we start with any programming task, we should give at least a moment of thought to the scope of the project. Think about where the program will be used, who will use it, and what its lifetime might be. If we are writing a ten line script for the cron daemon to export our CVS tree to our home directory on our Linux box every night, it is probably a waste of time bothering to make sure the program will run on Windows and Mac OS. It is probably even a waste of time to concentrate on making it portable to multiple UNIX platforms, with multiple versions of cvs, and with other CVS repositories and users. It is enough to know how to talk to our CVS server, get our CVS repository, and find our home directory. However, a module placed on CPAN for parsing XML should run anywhere Perl runs, if possible. That's a general task, not specific to any platform, because it may be used by any number of users in any number of ways. It will save ourselves and our users time and headaches if portability is taken care of at the beginning. There are many choices between these extremes – it might be tempting to use Mac-specific file specifications in a module for sending and receiving Apple events in Mac OS, but what happens when someone wants to use the module on Perl under Mac OS X, which uses UNIX-style file specifications? Sometimes it is best to write portably, even if we think it might not be necessary, if there's a chance that someone might be using our code on a different platform.
Newlines

Perhaps the most common portability problem is improper use of the newline character. In most operating systems, lines in files are terminated by a specific character sequence. This logical end of line is called the newline, and is represented in Perl, and many other languages, by the symbolic representation \n. Many portability problems are rooted in the incorrect assumption that a newline is the same thing as a line feed (octal 012, the LF in CRLF), which it isn't. The newline has no absolute, portable value; it is a logical concept. Unfortunately, there are plenty of historical reasons why we have a line feed, a carriage return (CR), and a newline. It is reasonable to suspect that if time travel were possible, going back to standardize a single newline character would be a worthwhile goal, but alas, we are stuck with the problem and need to deal with it. UNIX defines newline as the octal number 012. Mac OS uses 015 (carriage return / CR). On Windows, it is represented by two bytes together, 015 and 012 (CRLF). On Windows (and other DOS-like platforms), those bytes are automatically translated to the single newline character when reading and writing a file; this is done by Perl to aid portability. We can get the exact data in the file without translation by calling the binmode operator on the filehandle:

open $file, "< $binaryfile" or die "Can't open file: $!";
binmode($file);
Since binmode has no effect on other systems, it should be used for cross-platform code that deals with binary data (assuming it is known that the data will be binary). Note that because DOS-like platforms will convert two characters into one, using binmode can break the use of seek and tell. We will, however, be safe using seek if we stick to seeking to locations we obtained from tell and no others.
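To see the effect of binmode concretely, this self-contained sketch (the temporary file here just stands in for real binary data) writes a CRLF pair through a binmode'd handle and reads it back byte for byte:

```perl
#!/usr/bin/perl
# Write two raw bytes (CR LF) and read them back; with binmode in
# effect no newline translation happens on any platform.
use warnings;
use strict;
use File::Temp qw(tempfile);

my ($out, $tmpname) = tempfile(UNLINK => 1);
binmode($out);
print $out "data\015\012";
close $out or die "close: $!";

open my $in, '<', $tmpname or die "Can't open file: $!";
binmode($in);
local $/;                    # slurp the whole file
my $bytes = <$in>;
close $in;

printf "read %d bytes; ends in CRLF? %d\n",
    length($bytes), substr($bytes, -2) eq "\015\012" ? 1 : 0;
```

On a DOS-like platform, omitting the binmode calls would change both the byte count and the terminator seen here.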
Some of this may be confusing. Here's a handy reference for the ASCII CR and LF characters:

                       Octal   Hex   Decimal   Ctrl
LF (line feed)         012     x0A   10        \cJ
CR (carriage return)   015     x0D   13        \cM

       UNIX   DOS/Windows   Mac
\n     LF     CRLF          CR
\r     CR     CR            LF
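The values in the first table can be checked directly from Perl; this sketch uses the \012 and \015 literals so the result does not depend on what the local platform maps \n to:

```perl
#!/usr/bin/perl
# Print the octal and decimal values of LF and CR, and confirm the
# control-character spellings from the table above.
use warnings;
use strict;

printf "LF: octal %03o decimal %d (\\cJ matches: %d)\n",
    ord("\012"), ord("\012"), ("\cJ" eq "\012") ? 1 : 0;
printf "CR: octal %03o decimal %d (\\cM matches: %d)\n",
    ord("\015"), ord("\015"), ("\cM" eq "\015") ? 1 : 0;
```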
Often, a file we want to read will be taken from another platform. If the file we are reading is native – that is, uses the same newline as our operating system – then we can use the common idiom:

while (<$file>) {
    chomp;
    # ... do something with the data in $_
}
However, that won't work if the file is not native. chomp, <>, and other operators assume that the file is terminated with the native newline, so reading a DOS text file in UNIX will not work properly without some extra work. There are several ways to deal with this. We can preprocess the file before reading it, either by sucking in all the data and doing a global fix, or by changing it on disk before reading it in. A simpler method however, is to change the input record separator variable $/, which determines the definition for newline, and defaults to the native newline. So, to read a DOS file on a UNIX system, we would code something like this:

{
    local $/ = "\015\012";   # temporary change
    while (<$file>) {        # read file with \015\012
        chomp;               # chomp \015\012
        ...
    }
}                            # $/ reverts to previous value automatically
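Putting the idiom together, this self-contained sketch manufactures a small DOS-format file (a stand-in for a real file transferred from another platform) and reads it back with a localized $/:

```perl
#!/usr/bin/perl
# Create a CRLF-terminated file, then read it with $/ set to CRLF
# so that chomp removes the full terminator on a LF platform.
use warnings;
use strict;
use File::Temp qw(tempfile);

my ($out, $tmpname) = tempfile(UNLINK => 1);
binmode($out);
print $out "first\015\012second\015\012";
close $out or die "close: $!";

open my $file, '<', $tmpname or die "Can't open file: $!";
binmode($file);
my @lines;
{
    local $/ = "\015\012";   # temporary change
    while (<$file>) {
        chomp;               # removes the whole \015\012 pair
        push @lines, $_;
    }
}                            # $/ reverts automatically
close $file;

print "got: [@lines]\n";     # no stray carriage returns left behind
```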
We may need to use something like this, if we are reading a file from a network file system, or portable media (like CD-ROM), where the file is not native, or if we have a file transferred over the network from one system to another without text translation. A similar problem occurs when reading and writing sockets. Sockets are doubly troublesome because different protocols use different newlines, and some will even allow for more than one. For example, the HTTP protocol requires CRLF in headers, but the content itself may be terminated with any of CR, LF, or CRLF, or may be binary data without newlines at all. Even though the HTTP specification requires CRLF in the headers, an HTTP server might not be compliant, perhaps sending plain LF instead.
1017
Chapter 24
While Perl will automatically use the proper newline for the native platform when handling files, it will, by default, use the UNIX values for sockets and most other forms of I/O. With sockets, it is best to be specific, and use the absolute value instead of the symbolic representation. print $socket "Hi there client!\015\012"; # right print $socket "Hi there client!\r\n"; # wrong
To write to a socket, we need to know what newline is expected by the protocol. Reading from the socket is, however, a little bit more flexible. Most protocols will use either LF or CRLF, so we can adjust on the fly, converting either one to whatever our local newline happens to be. Since both end in LF, we can set $/ to LF. Bear in mind, however, that the Mac uses CR as its newline character so a Mac-based protocol may require that character. If we don't want to remember the absolute values, we can import $CR, $LF, and $CRLF variables from the Socket module:

#!/usr/bin/perl
# importabs.pl
use warnings;
use strict;
use Socket qw(:DEFAULT :crlf);

{
    local $/ = $LF;
    while (<$socket>) {
        s/$CR?$LF/\n/;   # not sure if CRLF or LF
        ...
    }
}
Similarly, functions that return text data (like a function that fetches a web page) may translate newlines before returning the data, so the caller doesn't have to worry about it:

sub fetch_page {
    ...
    $html =~ s/$CR?$LF/\n/g;
    return $html;
}
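Here is a runnable version of that idiom, using the $CR and $LF variables imported from Socket's :crlf tag as described above; the sample string merely stands in for fetched page data:

```perl
#!/usr/bin/perl
# Normalize CRLF or bare LF line endings to the local newline using
# the $CR/$LF variables from the Socket module.
use warnings;
use strict;
use Socket qw(:DEFAULT :crlf);

sub normalize_newlines {
    my $text = shift;
    $text =~ s/$CR?$LF/\n/g;
    return $text;
}

my $mixed = "one\015\012two\012three\015\012";
print normalize_newlines($mixed);
```

Note that a CR-only (Mac-style) terminator is deliberately left alone here, exactly as in the substitution above.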
Files and File Systems

Most platforms structure files in a hierarchical fashion, and support the notion of a unique path to a file. How that path is represented may differ significantly between operating systems. For example, UNIX is one of the few operating systems with the concept of a single root directory. UNIX uses / as path separator, while DOS-like systems use \. Mac OS uses :. VOS uses the native pathname characters >, <, and %. RISC OS Perl uses . for path separator and : for file systems and disk names. RISC OS, VOS, and DOS-like Perl interpreters can all emulate the use of /.
Thankfully, Perl ships with the File::Spec module, which provides simple functions to create file specifications for the current platform:

use File::Spec::Functions;                       # export functions

chdir(updir());                                  # change dir, up one dir
$file = catfile(curdir(), 'temp', 'file.txt');
open $fh, "< $file" or die $!;
On UNIX and DOS-like platforms, $file would be ./temp/file.txt. On Mac OS, it would be :temp:file.txt. On VMS, it would be [.temp]file.txt. This is one of those cases where using the module is often much better than trying to figure out all of the options for ourselves. For writing file specifications portably, use File::Spec (see below for more on File::Spec). For parsing file specs portably, use File::Basename:

use File::Basename;
($file, $dir) = fileparse($file);
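As a small, self-contained illustration of both halves – note that File::Spec::Unix is named explicitly here only so the result is the same on every platform, which is the opposite of what we would normally do in portable code:

```perl
#!/usr/bin/perl
# Build a relative path with the UNIX flavour of File::Spec and
# take it apart again with File::Basename.
use warnings;
use strict;
use File::Spec::Unix;
use File::Basename qw(fileparse);

my $path = File::Spec::Unix->catfile('temp', 'file.txt');   # 'temp/file.txt'
my ($name, $dir, $suffix) = fileparse($path, qr/\.[^.]*/);

print "path=$path dir=$dir name=$name suffix=$suffix\n";
```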
If newlines are the most common portability problem, then non-portable file specifications are not far behind. We should use these modules when we want to be portable, and not hardcode paths if possible. This is especially a problem in scripts like Makefile.PLs and test suites, which often assume / as a path separator for subdirectories. Even when on a single platform (if we can call UNIX a single platform), we must remember not to count on the existence or the contents of particular system-specific files or directories like /etc/passwd, /etc/sendmail.cf, /etc/resolv.conf, or even /tmp/. For example, /etc/passwd may exist but may not contain the encrypted passwords, because the system is using some form of enhanced security; or it may not contain all the accounts, because the system is using NIS. Instead it is better to use getpwnam and its relatives to get information about a user. If code does need to rely on such a file, we should include a description of the file and its format in the code's documentation, and then make it easy for the user to override the default location of the file. We must not have two files of the same name within different cases, like test.pl and Test.pl, as many file systems are case insensitive. Whitespace in filenames is tolerated on most systems, but not all, and many systems (DOS and VMS, for example) cannot have more than one . in their filenames. Others only allow a limited subset of characters in names. If filenames are limited to word characters (a-z, A-Z, 0-9, _), plus . for the extension, this should be sufficient on most systems. We should also keep them to the 8.3 convention for maximum portability, since some older platforms and file systems cannot use files that do not fit the convention. Likewise, when using the AutoSplit module, we should try to keep our functions to 8.3 naming and case-insensitive conventions; or, at the least, make it so that the resulting files have a unique, case-insensitive, first eight characters.
When a function is autosplit, it is saved to disk using the name of the function, and if we have functions called getSomething and getSomethingElse, then the subroutine names will be truncated and will therefore become indistinguishable. If it is possible that a filename contains strange characters, it is safest to open it with sysopen instead of open. The open function can translate characters like >, <, and |, which may be the wrong thing to do (of course, it may also be the right thing). If we do use open, and the filename in a variable is unknown, don't assume > won't be the first character of the filename. We must always use < explicitly to open a file for reading, unless we want the user to be able to specify a pipe open.
1019
Chapter 24
open(FILE, $file) or die $!; # wrong open(FILE, "< $file") or die $!; # right
Beyond path differences, there are many other fundamental differences between file systems. Some don't support hard links (link) and symbolic links (symlink, readlink, lstat, -l). Some don't support the access to, creation of, or changing of timestamps for files. Modifying timestamps is usually supported, but may have a granularity of something other than one second (FAT uses two seconds).
Portable File Handling with 'File::Spec' File::Spec provides a standard toolkit of filename conversion methods that work across all the common platforms that support Perl. The actual implementation is provided by a submodule that implements the details of file naming on the platform in question, as determined by the value of the special variable $^O. On a UNIX system for example, File::Spec automatically uses the File::Spec::UNIX module. The platforms supported by File::Spec are: '$^O'
'File::Spec' module
MacOS
File::Spec::Mac
os2
File::Spec::OS2
MSWin32
File::Spec::Win32
VMS
File::Spec::VMS
All others
File::Spec::UNIX
File::Spec provides two interfaces. The standard object interface imports no functions but provides class methods, which we can call in the usual manner, for example: use File::Spec; ... $newpath = File::Spec->canonpath($path);
For translations, we can also make use of two File::Spec submodules at the same time. For example, the following subroutine converts an arbitrary Macintosh pathname into a Windows one (see the table below for an explanation of the methods used in this example):

use File::Spec::Win32;
use File::Spec::Mac;

sub mac2win {
    my $mac_path = shift;
    my @ary = File::Spec::Mac->splitpath($mac_path);
    $ary[1] = File::Spec::Win32->catdir(File::Spec::Mac->splitdir($ary[1]));
    return File::Spec::Win32->catpath(@ary);
}
Writing Portable Perl
Alternatively, we can use File::Spec's methods as functions if we use the File::Spec::Functions module instead: use File::Spec::Functions; ... $newpath = canonpath($path);
Not all of File::Spec's methods are exported automatically by File::Spec::Functions. By default, the functions that are exported are: canonpath, catdir, catfile, curdir, file_name_is_absolute, no_upwards, path, rootdir, and updir. The functions that are exported only if we ask for them are: abs2rel, catpath, devnull, rel2abs, splitpath, splitdir, tmpdir. For example, to get the devnull and tmpdir functions we would use: use File::Spec::Functions qw(devnull tmpdir);
File::Spec provides eighteen routines for the conversion and manipulation of filenames: Subroutine
Action
canonpath($path)
Returns the filename cleaned up to remove redundant elements such as multiple directory separators or instances of the current directory (as returned by curdir), and any trailing directory separator if present. It does not collapse parent directory elements such as x/../y; for that we need a function like realpath from the Cwd module. For example: $canonical = canonpath($path);
catdir(@dirs)
Concatenates a list of directory names to form a directory path, using the directory separator of the platform. For example: $path = catdir(@directories);
catfile/join(@dirs, $file)
Concatenates a list of directory names followed by a filename to form a file path, using the directory separator of the platform. On platforms where files and directories are syntactically identical (DOS/Windows and UNIX both fall into this category), this produces exactly the same result as catdir. join is an alias for catfile, but as a plain function call it would collide with Perl's built-in join, so the catfile name is safer. For example: $filepath = catfile(@dirs, $filename);
curdir
Returns the filename of the current directory, . on DOS/Windows and UNIX systems.
devnull
Returns the filename of the null device, which reads but ignores any output sent to it. This is /dev/null under UNIX.
rootdir
Returns the filename of the root of the filesystem and the prefix to filenames that makes them absolute rather than relative. This is / on UNIX and \ on DOS/Windows. This is also frequently the name of the directory separator.
tmpdir
Returns the first writable directory found in the two-element list $ENV{TMPDIR} and /tmp. Returns '' if neither directory is writable.
updir
Returns the filename for the parent directory, .. on many platforms including DOS/Windows and UNIX.
no_upwards(@paths)

Given a list of filenames, returns the list with any entries that refer to the current or parent directory (as returned by curdir and updir) removed. It does not resolve symbolic links; we can use abs_path or realpath from the Cwd module for that. It can be used in combination with canonpath above. For example: @paths = no_upwards(@paths); $path = canonpath($path);
case_tolerant
Returns true if the case of filenames is not significant, or false if it is.
file_name_is_absolute ($path)
Returns true if the filename is absolute, that is it starts with an anchor to the root of the filesystem, as returned by rootdir. For example: $path = rel2abs($path) unless file_name_is_absolute($path);
path
Returns the directories of the search path as a list, derived from the PATH environment variable (that is, $ENV{'PATH'} split on the path separator of the platform).
splitpath($path, 0|1)
Returns a three-element list consisting of the volume, directory path, and filename. On systems that do not support the concept of a 'volume', such as UNIX, the first element is always undef. For platforms like DOS/Windows, it may be something like C: if the supplied pathname is prefixed with that volume identifier. On platforms that do not distinguish file names from directory names, the third element is simply the last element of the pathname, unless the second Boolean argument is true, or the path ends in a trailing directory separator (e.g. /), a current directory name (e.g. .), or a parent directory name (e.g. ..). ($vol, $dirs, $file) = splitpath($path, 1);
The directory path can be further split into individual directories using splitdir.
splitdir($dirs)
Returns a list of directories split out from the supplied directory path, as returned in the second element of the list produced by splitpath. It uses the directory separator of the platform. Note that splitdir requires directories only, so we use splitpath to remove the volume and file elements on platforms that support those concepts. For example: @dirs = splitdir($dirs);
Leading, trailing, and multiple directory separators will create empty strings in the resulting list. Using canonpath first will prevent all but the leading empty string.

catpath($vol, $dirs, $file)

Combines a three-element list containing a volume, directory path, and filename, and returns a complete path. On platforms that do not support the concept of volumes, the first element is ignored. The directory path can be generated by catdir or alternatively determined from another pathname via splitpath. For example: $path = catpath($vol, $dirs, $file);

abs2rel($abspath, $base)

Takes an absolute pathname and generates a relative pathname, relative to the base path supplied. If the base path is not specified, it generates a pathname relative to the current working directory as determined by cwd from the Cwd module. The volume and filename of the base path are ignored. For example: $relpath = abs2rel($abspath);

rel2abs($relpath, $base)

Combines a relative pathname and a base path and returns an absolute pathname based on the combination of the two. If the base path is not specified, the current working directory, as determined by cwd from the Cwd module, is used as the base path. The volume and filename of the base path are ignored. For example: $abspath = rel2abs($relpath);
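Tying a few of these routines together, a small sketch of building and splitting a path portably; the directory and file names are invented for illustration:

```perl
use strict;
use warnings;
use File::Spec;

# Build a path from components using the platform's own separator...
my $path = File::Spec->catfile('data', 'logs', 'today.log');

# ...and take it apart again into volume, directories, and filename.
my ($vol, $dirs, $file) = File::Spec->splitpath($path);
my @dirs = File::Spec->splitdir($dirs);

print "$file\n";   # today.log on any platform
```

Because the separator is never written out literally, the same code produces data/logs/today.log on UNIX and data\logs\today.log on Windows.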
Endianness and Number Width
Different types of CPUs store integers and floating-point numbers in different orders (called endianness) and widths (32-bit and 64-bit being the most common today). Any time that binary data is read or written with different architectures involved, these differences can become an issue. Most of the time this occurs when writing files with pack. For instance, we can write four bytes representing the hex number 0x12345678 (305419896 in decimal) to a file in this way:

print $file pack 'l', 0x12345678;
We can later read it back in like this:

$line = <$file>;
print unpack 'l', $line;
This prints the decimal value 305419896. However, if this is done on an Intel x86 computer and the file is transferred over to a Motorola 68K computer, and we then read it in with the same program, we would get 2018915346, because the order of the bytes is reversed; instead of 0x12345678 we get 0x78563412. To avoid this problem in network connections, the pack and unpack functions have the n (16-bit) and N (32-bit) formats for network order. These are guaranteed to be portable, as they are defined as big-endian, regardless of hardware platform.

Differing widths can cause truncation, even between platforms of equal endianness. The platform of shorter width loses the upper parts of the number. Most 32-bit platforms can only count up to 2**32-1. If we have a 64-bit platform and store 2**42 with pack (using the q and Q formats), we will lose badly if we try to unpack it on a 32-bit platform. Even seemingly innocuous pack formats like I can be dependent on the local C compiler's definitions, and may be non-portable. It's best to stick with n and N when porting between platforms. We can circumvent the problems of width and endianness by representing the data unambiguously, for example by writing numbers out in plain text. Using a module like Data::Dumper or Storable can also ensure the correct values are maintained.
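As an illustration of the network-order formats, a small round-trip sketch; the filename number.bin is invented:

```perl
use strict;
use warnings;

# Write a 32-bit value in network (big-endian) order, then read it back.
# The 'N' format is defined as big-endian everywhere, so the bytes on
# disk are identical regardless of the CPU's native endianness.
my $value = 0x12345678;

open my $out, '>', 'number.bin' or die "open: $!";
binmode $out;
print $out pack 'N', $value;
close $out;

open my $in, '<', 'number.bin' or die "open: $!";
binmode $in;
read $in, my $buf, 4 or die "read: $!";
close $in;

my $back = unpack 'N', $buf;
print "$back\n";   # 305419896 on any platform
unlink 'number.bin';
```

Had the file been written with the native 'l' format instead, the value read back would depend on which machine wrote it.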
System Interaction
Perl provides a nice environment to program in, but also relies on the system, and is limited by it. There are certain things we just shouldn't do in order to be portable:

❑ Don't delete or rename open files. Some platforms won't allow it. untie or close the file before doing anything to it.

❑ Don't open the same file more than once at the same time for writing. Some platforms place a mandatory lock on files opened for writing.

❑ Don't count on filename globbing. Use opendir, readdir, and closedir instead. As of Perl 5.6, glob is implemented with the portable File::Glob module by default; however, some people might not be using Perl 5.6, or might be using Perl built to use the old-style csh glob instead of File::Glob. It may be best to use the built-in directory functions.

❑ Don't count on per-program environment variables or current directories, such as $TEMP or $HOME.

❑ Don't count on a specific environment variable existing in %ENV. Don't count on %ENV entries being case sensitive, or even case preserving.

❑ Don't count on user accounts, whether it is specific information about a user, or the existence of users at all.

❑ Don't count on signals or %SIG for anything.

❑ Don't count on specific values of $!.
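The globbing advice above can be sketched with the built-in directory functions; the .pl filter is just an example pattern:

```perl
use strict;
use warnings;

# Portable directory listing with opendir/readdir/closedir rather than
# relying on the behaviour of glob, which varies between builds.
opendir my $dh, '.' or die "Cannot open directory: $!";

# Skip the '.' and '..' entries, which readdir always returns.
my @entries = grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

print scalar(@entries), " entries\n";
```

A pattern match such as grep { /\.pl$/ } can then be applied to the returned list, with ordinary regular expression semantics on every platform.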
Platforms that rely primarily on a Graphical User Interface (GUI) for user interaction do not all provide a command line. A program requiring a command line interface might not work everywhere. In most cases this is probably for the user of the program to deal with, so don't stay up late worrying about it too much. If a Mac developer wants to use our command line tool, they'll convert it to a droplet or rewrite it, so it will be usable. Even platforms that do have a command line might not be able to execute programs in the same way UNIX does, by putting the path to perl after #! on the first line and making the file executable.
Inter-Process Communication (IPC)
Whereas system interaction involves a big list of things not to do to retain maximum portability, in general, inter-process communication is something we just don't do at all. Don't directly access the system in code that is meant to be portable. That means no system, exec, fork, pipe, ``, qx//, open with a pipe, or any of the other things that make being a Perl and UNIX hacker really fun (for more on IPC, see Chapter 22). Commands that launch external processes are generally supported on most platforms (though many of them do not support any type of forking). The problem with using them arises from where we invoke them. External tools are often named differently on different platforms, they may not be available in the same location, might accept different arguments, can behave differently, and often present their results in a platform-dependent way. Consequently, we should seldom depend on them to produce consistent results. Then again, if we are calling netstat -a, we probably don't expect it to run on both UNIX and CP/M. Opening a pipe to sendmail is one especially common bit of Perl code:

open(my $mail, '|/usr/sbin/sendmail -t') or die "cannot fork sendmail: $!";
This is fine for systems programming when sendmail is known to be available, but it is not fine for many non-UNIX systems, and even some UNIX systems that may not have sendmail installed, or may have it installed elsewhere. If a portable solution is needed, see the various distributions on CPAN that deal with it. Mail::Mailer and Mail::Send in the MailTools distribution are commonly used and provide several mailing methods including mail, sendmail, and direct SMTP (via Net::SMTP) if a mail transfer agent is not available. Mail::Sendmail is a stand-alone module that provides simple, platform-independent mailing.

Note that the UNIX System V IPC (msg*, sem*, shm*) is not available even on all UNIX platforms, so we should also try to avoid using those. Instead of directly accessing the system, we should use a module if one is available; it may internally implement the functionality with platform-specific code, but expose a common interface.
External Subroutines (XS)
XS, explored in Chapter 21, is a language used to create an extension interface between Perl and some C library that we may wish to use with Perl. While XS code can usually be made to work with any platform, dependent libraries, header files, etc., might not be readily available or portable, or the XS code itself might be platform-specific, just as Perl code might be. If the libraries and headers are portable, then it is normally reasonable to make sure the XS code is portable, too.
A different type of portability issue arises when writing XS code – availability of a C compiler on the end user's system. Many Mac OS or Windows users might not have compilers, and those compilers will sometimes be expensive, hard to obtain, hard to use, or all of the above; and the initial setup to get the source working for the build may be prohibitive to many users. Furthermore, C brings with it its own portability issues, and writing XS code will expose us to some of those. Writing purely in Perl is an easier way to achieve portability.
Modules
In general, the standard modules, those included with the standard Perl distribution, work across platforms. Notable exceptions are the platform-specific modules (like ExtUtils::MM_VMS) and DBM modules. There is no one DBM module available on all platforms. SDBM_File and the others are generally available on all UNIX and DOS-like ports, but not in MacPerl, where only NDBM_File and DB_File are available. The good news is that at least some DBM modules should be available, and AnyDBM_File will use whichever module it can find. Of course, then the code needs to be fairly strict, dropping to the lowest common denominator (for example, not exceeding 1K for each record, and not using exists on a hash key) so that it will work with any DBM module.

One very useful module for achieving portability in programs is the POSIX module. It provides nearly all the functions and identifiers found in the POSIX 1003.1 specification. Now, there is no guarantee that all of the functions will work on a given platform. It depends on how well the platform has implemented POSIX 1003.1. Most of it will work, however, and can provide a good way to achieve portability.

Many of the modules on the CPAN are portable. However, many are not. Many of them could be portable, but are not, simply because the author only had access to one platform. Sometimes the module will have noted in its documentation what platforms it will work on, but one of the best places to look is the CPAN Testers web site (http://testers.cpan.org/). CPAN Testers is a group of volunteers that tests modules on CPAN and reports (usually PASS or FAIL) for the given module with a given distribution of Perl on a given platform. This is useful for module authors, not just so that they can get help in making the module as portable as possible, but also as a sanity check to make sure their module works even on the platforms they intended it to work on.
The CPAN Testers results are cross-referenced on the CPAN Search site (http://search.cpan.org/), so when we find a module on that site, we are told which platforms it has been tested on, and what the results were. CPAN Search is a great tool in its own right, providing a central location for information about the various module distributions on CPAN.
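To illustrate the AnyDBM_File approach described above, a minimal sketch; the database name portable_db is invented, and the files actually created on disk depend on which DBM module AnyDBM_File selects:

```perl
use strict;
use warnings;
use Fcntl;          # for the O_RDWR and O_CREAT flags
use AnyDBM_File;

# Tie a hash to whichever DBM implementation this platform provides.
tie my %db, 'AnyDBM_File', 'portable_db', O_RDWR | O_CREAT, 0644
    or die "Cannot tie DBM file: $!";

$db{user} = 'root';
print $db{user}, "\n";

untie %db;

# Clean up whatever files the chosen DBM module created
# (e.g. portable_db.pag and portable_db.dir for SDBM_File).
unlink glob 'portable_db*';
```

Keeping records small and avoiding exists on tied keys, as noted above, keeps this working under any of the possible back ends.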
Time and Date
The system's notion of time of day and calendar date is controlled in widely different ways. We can't assume the time zone is stored in $ENV{TZ}, and even if it is, we can't assume that we can control the time zone through that variable.
We shouldn't assume that the epoch starts at 00:00:00, January 1, 1970, because that is OS- and implementation-specific. It is better to store a date in an unambiguous representation. The ISO 8601 standard defines YYYY-MM-DD as the date format. A text representation (like 1979-07-12) can be easily converted into an OS-specific value using a module like Date::Parse. An array of values, such as those returned by localtime, can be converted to an OS-specific representation using Time::Local. Be careful, however: converting between unambiguous representations of times and dates can be very slow, so, if the calculations are internal to the program, then it is usually better to use the native values. When calculating specific times, such as for tests in time or date modules, it may be appropriate to calculate an offset for the epoch:

require Time::Local;
$offset = Time::Local::timegm(0, 0, 0, 1, 0, 70);
The value for $offset in UNIX will be 0, but in Mac OS it will be some large number. $offset can then be added to a UNIX time value to get what should be the proper value on any system:

$nativetime = $unixtime + $offset;
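For producing the unambiguous ISO 8601 text form mentioned above, POSIX::strftime formats the values returned by localtime, so the result does not depend on the platform's epoch. A minimal sketch:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# strftime works from the localtime value list, so the formatted
# result is the same regardless of where the platform's epoch starts.
my $iso = strftime '%Y-%m-%d', localtime;

die "unexpected format: $iso" unless $iso =~ /^\d{4}-\d{2}-\d{2}$/;
print "$iso\n";
```

The resulting string can be stored or exchanged between systems and parsed back with a module like Date::Parse.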
Character Sets and Character Encoding
We should assume little about character sets. We must assume nothing about the numerical values (ord, chr) of characters: the decimal value of a may be 97 in ASCII, but may be something entirely different in another character set, like EBCDIC. We shouldn't assume that the alphabetic characters are encoded contiguously (in the numeric sense). We shouldn't assume anything about the ordering of the characters. The lowercase letters may come before or after the uppercase letters; the lowercase and uppercase may be interlaced so that both a and A come before b; the accented and other international characters may be interlaced so that ä comes before b.

Furthermore, it may be unreasonable to assume that a single character is one byte. In various character sets, a character may constitute more than one byte. Unicode is becoming more prevalent, and its characters often occupy more than one byte too. If Unicode text will be used, then we use the utf8 pragma in Perl 5.6. With use utf8, regular expressions and many string functions will treat the characters in Unicode text as characters instead of individual bytes.

@bytes = split //, "\x{263A}"; # 3 elements in @bytes
{
    use utf8;
    @chars = split //, "\x{263A}"; # 1 element in @chars
}
Unicode handling in Perl 5.6 is still incomplete and subject to change in future revisions. See Chapter 25 for more on Unicode.
Internationalization
If we may assume POSIX (a rather large assumption), we may read more about the POSIX locale system from the perllocale and POSIX.pm documentation. The locale system at least attempts to make things a little bit more portable, or at least more convenient and native-friendly for non-English users. The system affects character sets and encoding, regular expressions, sort order, date and time formatting, and more. Simply adding use locale to the top of our program will make it respect the behavior of the current locale, if set. While the locale directive will do most of the work for us, one thing to keep especially in mind is that when using regular expressions, we don't use [a-zA-Z] to match letters and [0-9] to match numerals; instead, we use metacharacters like \w and \d, or the POSIX character classes like [:alpha:] and [:digit:]. Exactly what constitutes an alpha or a numeric changes depending on the locale, so we let Perl do the work for us. Internationalization is examined in Chapter 26.
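A small sketch of the locale-friendly character classes described above; the sample string is invented:

```perl
use strict;
use warnings;
use locale;

# [[:alpha:]] and \d adapt to the current locale, unlike hardcoded
# ranges such as [a-zA-Z] and [0-9], which silently exclude accented
# letters and non-ASCII digits.
my $text = 'User42';

my ($alpha)  = $text =~ /^([[:alpha:]]+)/;
my ($digits) = $text =~ /(\d+)/;

print "$alpha $digits\n";   # User 42
```

In a locale where letters extend beyond a-z, the same pattern would still capture the whole alphabetic prefix without any change to the code.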
System Resources
If our code is destined for systems with severely constrained (or missing) virtual memory systems, then we want to be especially mindful of avoiding wasteful constructs such as:

@lines = <FILE>;                # bad

while (<FILE>) {$file .= $_}    # sometimes bad
$file = join('', <FILE>);       # better
The difference between the last two constructs may appear unintuitive to some people. The first repeatedly grows a string, whereas the second allocates a large chunk of memory in one go. On some systems, the second is more efficient than the first.
Security
Most multiuser platforms provide basic levels of security, usually implemented at the filesystem level. Some, however, do not. Thus the notion of user ID, or home directory, or even the state of being logged in, may be unrecognizable on many platforms. If we write programs that are security-conscious, it is usually best to know what type of system we will be running under so that we can write code explicitly for that platform (or class of platforms).
Style
For those times when it is necessary to have platform-specific code, consider keeping the platform-specific code in one place; this will make porting to other platforms easier. We might keep the platform-specific code in its own module, or have a separate module for each platform, or keep it in the base code, with just different subroutines or code blocks for each case.
We use the Config module and the special variable $^O to differentiate between platforms (as described in 'Platforms' below), if necessary. However, we should probably only separate code based on platform if we know for sure that the different platforms have different requirements. For instance, the File::Spec module loads another platform-specific module depending on what platform is running, to construct proper file specifications for that platform. Its code looks something like this:

%module = (
    MacOS   => 'Mac',
    MSWin32 => 'Win32',
    os2     => 'OS2',
    VMS     => 'VMS'
);
$module = 'File::Spec::' . ($module{$^O} || 'Unix');
eval "require $module" or die $@;
@ISA = ($module);
In the above case, and in others, the choice is clear-cut: we know exactly what behavior a specific platform will require, and select the code based on which platform is in use. However, sometimes we don't know, or would be better off not guessing, which platforms support a given block of code. For example, the MS Windows operating system does not support fork. However, Perl on Windows, as of Perl 5.6, can emulate fork. So, if we had code like this:

die "Sorry, no fork! \n" if $^O eq 'MSWin32';
we would be excluding users of Perl 5.6 unnecessarily, while catching users of older versions of Perl. If a specific function is required, we can check for the existence of that function by trying to execute it with eval:

eval {fork()};
die $@ if $@ && $@ =~ /(?:not |un)implemented/;
There are problems with this method. The text returned by $@ may differ from platform to platform, or even function to function on one platform. Furthermore, a function may not return unimplemented until after it is processed, so if we don't give it the correct number of arguments, we can get a prototype error. In addition, of course, we may not want to actually use the function to make sure it is there. Therefore, we would need to pass it arguments that are acceptable to the prototype of the function, and it will either do nothing, or execute in the way we want. So, if we want to use getpwuid to get a user name from a user ID, and die if the function is unimplemented, we can try:

eval {getpwuid(0)};   # call 'getpwuid' with an argument to satisfy its prototype
die $@ if $@ && $@ =~ /(?:not |un)implemented/;
my $user = getpwuid($uid);
Or:

$user = eval {getpwuid($uid)};
die $@ if $@ && $@ =~ /(?:not |un)implemented/;
We might, of course, not wish to die if it is unimplemented, but we may wish to just ask for a user name if $user is false. Of course, there are many reasons why it could be false. It could be that there is no user for that user ID, or that getpwuid is not implemented, or that there is some bug in the OS, or a broken /etc/passwd file. We might not care why $user is false, we may only care about getting a name, so if it is false, we just prompt for a user name:

$user = eval {getpwuid($uid)};
unless ($user) {
    print 'What is the user name?';
    chomp($user = <STDIN>);
}
We could extend this a bit further. Even though this code already works on UNIX, MS Windows, Mac OS, and just about any other Perl platform, we might want to bring up a dialog box in Mac OS:

$user = eval {getpwuid($uid)} || query('What is the user name?');

sub query {
    my ($question) = @_;
    my $answer;
    if ($^O eq 'MacOS') {
        $answer = MacPerl::Ask($question);
    } else {
        print $question, ' ';
        chomp($answer = <STDIN>);
    }
    return $answer;
}
To add code for a Windows dialog box, we just add another test, this time for $^O eq 'MSWin32', and bring up a Windows dialog if it is true. Until then, Windows will just use the default. This is nice because it is not only very portable, but it is also easy to make future changes, because the code is well encapsulated. We must be careful not to depend on a specific output style for errors, such as when checking $! after a system call or $@ after an eval. Some platforms expect a certain output format, and Perl on those platforms may have been adjusted accordingly. Most specifically, we don't anchor a regular expression when testing an error value, because the error might be wrapped by the system in some other text. For example:

print $@ =~ /^Expected text/ ? 'OK' : 'Not OK';   # wrong
print $@ =~ /Expected text/ ? 'OK' : 'Not OK';    # right
We should be especially mindful to keep all of these issues in mind in the tests we supply with our module or program. Module code may be fully portable, but its tests might not be. This often happens when tests spawn off other processes or call external programs to aid in the testing, or when (as noted above) the tests assume certain things about the file system and paths.
Platforms
As of version 5.002, Perl is built with a $^O variable that indicates the operating system for which it was built. This was implemented to help speed up code that would otherwise have to use the Config module and get the value of $Config{osname}. Of course, to get more detailed information about the system, looking into %Config is certainly recommended. It might be useful for us to look at a platform's $^O or $Config{archname} value. This is often useful for setting platform-specific values. Below is an incomplete table of some of the more (or less) popular platforms. Note that how $^O and $Config{archname} are determined is platform dependent, and in some cases just hard-coded by the person who builds for that platform.

Operating System     '$^O'          '$Config{archname}'
AIX                  aix            aix
Amiga DOS            amigaos        m68k-amigos
BSD/OS               bsdos          i386-bsdos
Darwin               darwin         darwin
DG/UX                dgux           AViiON-dgux
DYNIX/ptx            dynixptx       i386-dynixptx
FreeBSD              freebsd        freebsd-i386
HP-UX                hpux           PA-RISC1.1
IRIX                 irix           irix
Linux                linux          arm-linux
Linux                linux          i386-linux, i586-linux, i686-linux
Linux                linux          ppc-linux
Mac OS               MacOS
MachTen PPC          machten        powerpc-machten
MPE/iX               mpeix          PA-RISC1.1
MS-DOS               dos
NeXT 3               next           next-fat
NeXT 4               next           OPENSTEP-Mach
OpenBSD              openbsd        i386-openbsd
OS/2                 os2
OS/390               os390          os390
OS400                os400          os400
OSF1                 dec_osf        alpha-dec_osf
PC-DOS               dos
POSIX-BC             posix-bc       BS2000-posix-bc
ReliantUNIX-N        svr4           RM400-svr4
RISC OS              riscos
SCO_SV               sco_sv         i386-sco_sv
SINIX-N              svr4           RM400-svr4
sn4609               unicos         CRAY_C90-unicos
sn6521               unicosmk       t3e-unicosmk
sn9617               unicos         CRAY_J90-unicos
SunOS                solaris        i86pc-solaris
SunOS                solaris        sun4-solaris
SunOS4               sunos          sun4-sunos
VM/ESA               vmesa          vmesa
VMS                  VMS
VOS                  VOS
Windows 95           MSWin32        MSWin32-x86
Windows 98           MSWin32        MSWin32-x86
Windows NT           MSWin32        MSWin32-x86, MSWin32-ALPHA, MSWin32-ppc
UNIX
Perl works on a bewildering variety of UNIX and UNIX-like platforms (see, for example, most of the files in the hints directory in the source code kit). On most of these systems, the value of $^O is determined either by lowercasing and stripping punctuation from the first field of the string returned by typing uname -a (or a similar command) at the shell prompt, or by testing the file system for the presence of uniquely named files such as a kernel or header file. Since the value of $Config{archname} may depend on the hardware architecture, it can vary more than the value of $^O.
DOS and Derivatives
Perl has long been ported to Intel-style microcomputers running under systems like PC-DOS, MS-DOS, OS/2, and most Windows platforms. Users familiar with COMMAND.COM or CMD.EXE style shells should be aware that there may be subtle differences in using \ or / as a path separator. System calls accept either / or \ as the path separator; however, many command line utilities of DOS vintage treat / as the option prefix, so they may get confused by filenames containing /. Aside from calling any external programs, / will work just fine, and probably better, as it is more consistent with popular usage, and avoids the problem of remembering what characters to escape in our filename strings, and what not to. So we should stick with / unless we know that we need to use \.

The DOS FAT file system can only accommodate 8.3 filenames. FAT32, HPFS (OS/2), and NTFS (NT) file systems can have longer filenames, and are case insensitive but case preserving. DOS also treats several filenames as special, such as AUX, PRN, NUL, CON, COM1, LPT1, LPT2, etc. Unfortunately, these filenames won't work even if we include an explicit directory prefix. It is best to avoid such filenames, if we want our code to be portable to DOS and its derivatives. Unfortunately, it's hard to know what these all are.
Users of these operating systems may also wish to make use of scripts such as pl2bat.bat, pl2cmd, or the commercial Perl2Exe to put wrappers around scripts. ActiveState, the company that releases ActivePerl, also wrote the Perl Package Manager, the system used primarily by Windows systems to get Perl modules that we mentioned in Chapter 1 of this book. This includes prebuilt binaries of modules that need a C compiler. See the documentation for PPM in ActivePerl, and download packages from http://www.activestate.com/packages/.

Mac OS
Mac OS is in some ways the most problematic platform, since there is no native, built-in command line. MacPerl, the version of Perl for Mac OS, does have several facilities for working around this limitation, such as droplets and run-times.

For more information on how to do really cool stuff in MacPerl, check out the various MacPerl mailing lists on http://www.macperl.org/, or the authoritative reference on MacPerl, MacPerl: Power and Ease, from Prime Time Freeware. In the MacPerl application, we can't run a program from the command line; programs that expect @ARGV to be populated can be edited with something like the following, which brings up a dialog box asking for the command line arguments:

if (!@ARGV) {
    @ARGV = split /\s+/, MacPerl::Ask('Arguments?');
}
A MacPerl script saved as a droplet will populate @ARGV with the full pathnames of the files dropped onto the script, just as a Perl script in UNIX called with filenames passed to it will have those names passed into @ARGV.
A command line interface does exist for Mac OS. Macintosh Programmer's Workshop (MPW) is currently a free development environment from Apple. MacPerl was first introduced as an MPW tool, and MPW can be used like a shell:

perl myscript.plx some arguments

MPW is dissimilar to a UNIX shell, having its own conventions, commands, and behaviors, but the basic concepts are the same. ToolServer is another free application from Apple that provides access to MPW tools from MPW and the MacPerl application, which allows MacPerl programs to use system, backticks, and even piped open.

Any module requiring XS compilation is right out for most Mac users, because compilers on Mac OS are not very common (and often expensive). Some XS modules that can work with MacPerl are built and distributed in binary form on the CPAN, and are available from MacPerl Module Porters at http://pudge.net/mmp/. MacPerl is currently at version 5.004. MacPerl 5.6 is currently in development, but any features of Perl from 5.004_01 and later, such as the new open my $file idiom, will not work under the current version of MacPerl.

Directories in Mac OS are specified with any path including the volume, such as volume:folder:file, being absolute, and any others, including :file, being relative. Files are stored in the directory in alphabetical order. Filenames are limited to 31 characters, and may include any character except for null and :, which is reserved as the path separator.

Mac OS has no concept of the UNIX flock, so flock is not implemented. Mac OS will put a mandatory lock on any file opened for writing (in most cases), and we can do a Mac OS lock on a file with the functions FSpSetFLock and FSpRstFLock in the Mac::Files module. Also, chmod(0444, $file) and chmod(0666, $file) (implementations of the UNIX chmod command using octal switches to specify the permissions) are mapped to FSpSetFLock and FSpRstFLock. Mac OS is the proper name for the operating system, but the value in $^O is MacOS.
To determine the architecture, the version, or whether Perl is running as the MacPerl application or the MPW tool, check:

($version) = $MacPerl::Version =~ /^(\S+)/;
$is_app    = $MacPerl::Version =~ /App/;
$is_tool   = $MacPerl::Version =~ /MPW/;
$is_ppc    = $MacPerl::Architecture eq 'MacPPC';
$is_68k    = $MacPerl::Architecture eq 'Mac68K';
Mac OS X, based on NeXT's OpenStep OS, can run MacPerl natively under the Classic environment. The new Carbon API, which provides much of the Classic API under the new environment, may be able to run a slightly modified version of MacPerl in the new environment, if someone wants it in the future. Mac OS X and its Open Source brother, Darwin, both run UNIX Perl natively. It is unclear what the $^O value will be for the released Mac OS X, since it is so far only available as a beta; in the betas it, like Darwin, was darwin. The MacPerl toolbox modules (Mac::Files, Mac::Speech, etc.) may be ported to Mac OS X at some point, so many Mac-specific scripts will work under either Mac OS or Mac OS X.
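The $^O values just described can be used to branch at runtime. A minimal sketch (the helper name and its categories are our own invention, not a standard API):

```perl
#!/usr/bin/perl
# mac_flavour.pl - classify an operating system name as reported by $^O.
# The function and its category names are illustrative, not a standard API.
use warnings;
use strict;

sub mac_flavour {
    my $os = shift;
    return 'classic' if $os eq 'MacOS';   # MacPerl on classic Mac OS
    return 'unix'    if $os eq 'darwin';  # Mac OS X betas and Darwin run UNIX Perl
    return 'other';
}

print "This Perl reports \$^O as '$^O', which we class as '",
      mac_flavour($^O), "'\n";
```

On classic MacPerl this reports 'classic'; on Mac OS X or Darwin, 'unix'; everywhere else, 'other'.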
Writing Portable Perl
Other Perls

Perl has been ported to many platforms that do not fit into any of the categories listed above. Some, such as AmigaOS, Atari MiNT, BeOS, HP MPE/iX, QNX, Plan 9, VOS, and VMS, have been well integrated into the standard Perl source code kit. We may need to consult the ports directory on CPAN for information, and possibly binaries, for the likes of aos, Atari ST, LynxOS, RISC OS, Novell Netware, Tandem Guardian, and so on. Some of these OSes may fall under the UNIX category.
Function Implementations

Listed below are functions that are either completely unimplemented, or else have been implemented differently on various platforms. For each function, the affected platforms are listed alongside a description of how they differ. Larry Wall once noted that, 'Perl is, in intent, a cleaned up and summarized version of that wonderful seminatural language known as UNIX.' Almost everything in core Perl will work in UNIX, so we note only those places where some platform might differ from the expected behavior for UNIX. When in doubt, we can consult the platform-specific README files in the Perl source distribution, and any other documentation resources accompanying a given port. We should be aware, moreover, that even among UNIX-like systems there are variations.

Function
Platform Variations
binmode(FILEHANDLE)
Win32: The value returned by tell may be affected after the call, and the file handle may be flushed. Mac OS, RISC OS: Meaningless. VMS: Reopens the file and restores the pointer; if function fails, the underlying file handle may be closed, or the pointer may be in a different position.
chmod(LIST)
Win32: Only good for changing 'owner' read/write access; 'group' and 'other' bits are meaningless, as is setting the executable bit. Mac OS: Only limited meaning. Disabling/enabling write permission is mapped to locking/unlocking the file. RISC OS: Only good for changing 'owner' and 'other' read/write access. VOS: Access permissions are mapped onto VOS access control list changes.
chown(LIST)
Win32: Does nothing, but won't fail. Mac OS, Plan9, RISC OS, VOS: Not implemented.
chroot(FILENAME)
Win32, Mac OS, VMS, Plan9, RISC OS, VOS, VM/ESA: Not implemented.
crypt(PLAINTEXT, SALT)
Win32: In rare cases, may not be available if library or source was not provided when building Perl. VOS: Not implemented.
dbmclose(HASH)
VMS, Plan9, VOS: Not implemented.
dbmopen(HASH, DBNAME, MODE)
VMS, Plan9, VOS: Not implemented.
dump(LABEL)
Win32: Not implemented. Mac OS, RISC OS: Not useful. VMS: Invokes VMS debugger.
endpwent
Win32, Mac OS, MPE/iX, VM/ESA: Not implemented.
exec(LIST)
Mac OS: Not implemented. VM/ESA: Implemented via Spawn.
fcntl(FILEHANDLE, FUNCTION, SCALAR)
Win32, VMS: Not implemented.
flock(FILEHANDLE, OPERATION)
Win32: Available only on Windows NT (not on Windows 95). Mac OS, VMS, RISC OS, VOS: Not implemented.
fork
Win32, Mac OS, AmigaOS, RISC OS, VOS, VM/ESA: Not implemented. As of Perl 5.6, however, the Win32 port does have support for fork.
getlogin
Mac OS, RISC OS: Not implemented.
getpgrp(PID)
Win32, Mac OS, VMS, RISC OS, VOS: Not implemented.
getppid
Win32, Mac OS, VMS, RISC OS: Not implemented.
getpriority(WHICH, WHO)
Mac OS, Win32, VMS, RISC OS, VOS, VM/ESA: Not implemented.
getpwnam(NAME)
Win32, Mac OS: Not implemented. RISC OS: Not useful.
getgrnam(NAME)
Win32, Mac OS, VMS, RISC OS: Not implemented.
getnetbyname(NAME)
Win32, Mac OS, Plan9: Not implemented.
getpwuid(UID)
Win32, Mac OS: Not implemented. RISC OS: Not useful.
getgrgid(GID)
Win32, Mac OS, VMS, RISC OS: Not implemented.
getnetbyaddr(ADDR, ADDRTYPE)
Win32, Mac OS, Plan9: Not implemented.
getprotobynumber(NUMBER)
Mac OS: Not implemented.
getservbyport(PORT, PROTO)
Mac OS: Not implemented.
getpwent
Win32, Mac OS, VM/ESA: Not implemented.
getgrent
Win32, Mac OS, VMS, VM/ESA: Not implemented.
gethostent
Win32, Mac OS: Not implemented.
getnetent
Win32, Mac OS, Plan9: Not implemented.
getprotoent
Win32, Mac OS, Plan9: Not implemented.
getservent
Win32, Plan9: Not implemented.
setpwent
Win32, Mac OS, RISC OS, MPE/iX: Not implemented.
setgrent
Win32, Mac OS, VMS, MPE/iX, RISC OS: Not implemented.
sethostent(STAYOPEN)
Win32, Mac OS, Plan9, RISC OS: Not implemented.
setnetent(STAYOPEN)
Win32, Mac OS, Plan9, RISC OS: Not implemented.
setprotoent(STAYOPEN)
Win32, Mac OS, Plan9, RISC OS: Not implemented.
setpgrp(PID, PGRP)
Win32, Mac OS, VMS, RISC OS, VOS: Not implemented.
setpriority(WHICH, WHO, PRIORITY) setservent(STAYOPEN)
Win32, Plan9, RISC OS: Not implemented.
setsockopt(SOCKET, LEVEL, OPTNAME, OPTVAL)
Mac OS, Plan9: Not implemented.
endpwent
Win32, Mac OS, MPE/iX, VM/ESA: Not implemented.
endgrent
Win32, Mac OS, MPE/iX, RISC OS, VM/ESA, VMS: Not implemented.
endhostent
Win32, Mac OS: Not implemented.
endnetent
Win32, Mac OS, Plan9: Not implemented.
endprotoent
Win32, Mac OS, Plan9: Not implemented.
endservent
Win32, Plan9: Not implemented.
getsockopt(SOCKET, LEVEL, OPTNAME)
Mac OS, Plan9: Not implemented.
glob(EXPR)
Win32: Features depend on external perlglob.exe or perlglob.bat. May be overridden with something like File::DosGlob, which is recommended.
glob
Mac OS: Globbing built-in, but only the * and ? meta characters are supported. RISC OS: Globbing built-in, but only the * and ? meta characters are supported. Globbing relies on operating system calls, which may return filenames in any order. As most file systems are case insensitive, even 'sorted' filenames will not be in case-sensitive order.
ioctl(FILEHANDLE, FUNCTION, SCALAR)
Win32: Available only for socket handles, and it does what the ioctlsocket call in the Winsock API does. VMS: Not implemented. RISC OS: Available only for socket handles.
kill(SIGNAL, LIST)
Win32: Unlike UNIX platforms, kill(0, $pid) will actually terminate the process. Mac OS, RISC OS: Not implemented, hence not useful for taint checking.
link(OLDFILE, NEWFILE)
Win32: Hard links are implemented on Win32 (Windows NT and Windows 2000) under NTFS only. Mac OS, MPE/iX, VMS, RISC OS: Not implemented. AmigaOS: Link count not updated, because hard links are not quite that hard (they are sort of halfway between hard and soft links).
lstat(FILEHANDLE), lstat(EXPR), lstat
Win32: Return values may be bogus. VMS, RISC OS: Not implemented.
msgctl(ID, CMD, ARG), msgget(KEY, FLAGS), msgsnd(ID, MSG, FLAGS), msgrcv(ID, VAR, SIZE, TYPE, FLAGS)
Win32, Mac OS, VMS, Plan9, RISC OS, VOS: Not implemented.
open(FILEHANDLE, EXPR), open(FILEHANDLE)
Win32, Mac OS, RISC OS: open to |- and -| are unsupported. Mac OS: The | variants are supported only if ToolServer is installed.
pipe(READHANDLE, WRITEHANDLE)
Mac OS: Not implemented. MiNT: Very limited functionality.
readlink(EXPR), readlink
Win32, VMS, RISC OS: Not implemented.
semctl(ID, SEMNUM, CMD, ARG), semget(KEY, NSEMS, FLAGS), semop(KEY, OPSTRING)
Win32, Mac OS, VMS, RISC OS, VOS: Not implemented.
shmctl(ID, CMD, ARG), shmget(KEY, SIZE, FLAGS), shmread(ID, VAR, POS, SIZE), shmwrite(ID, STRING, POS, SIZE)
Win32, Mac OS, VMS, RISC OS, VOS: Not implemented.
socketpair(SOCKET1, SOCKET2, DOMAIN, TYPE, PROTOCOL)
Mac OS, Win32, VMS, RISC OS, VOS, VM/ESA: Not implemented.
stat(FILEHANDLE)
Win32: device and inode are not meaningful.
stat(EXPR)
Mac OS: mtime and atime are the same, and ctime is creation time instead of inode change time.
stat
VMS: device and inode are not necessarily reliable. RISC OS: mtime, atime, and ctime all return the last modification time; device and inode are not necessarily reliable.
symlink(OLDFILE, NEWFILE)
Win32, VMS, RISC OS: Not implemented.
syscall(LIST)
Win32, Mac OS, VMS, RISC OS, VOS, VM/ESA: Not implemented.
sysopen(FILEHANDLE, FILENAME, MODE, PERMS)
Mac OS, OS/390, VM/ESA: The traditional 0, 1, and 2 modes are implemented with different numeric values on some systems; however, the flags exported by Fcntl (O_RDONLY, O_WRONLY, O_RDWR) should work everywhere.
system(LIST)
Win32: As an optimization, may not call the command shell specified in $ENV{PERL5SHELL}. system(1, @args) spawns an external process and immediately returns its process designator, without waiting for it to terminate; the return value may be used subsequently in wait or waitpid. Mac OS: Only implemented if ToolServer is installed. RISC OS: There is no shell to process meta characters, and the native standard is to pass a command line terminated by \n, \r, or \0 to the spawned program. Redirection, such as > foo, is performed (if at all) by the runtime library of the spawned program. system LIST will call the UNIX emulation library's exec emulation, which attempts to provide emulation of the stdin, stdout, and stderr in force in the parent, provided the child program uses a compatible version of the emulation library; system SCALAR will call the native command line directly, and no such emulation of a child UNIX program exists. Mileage will vary. MiNT: Far from being POSIX compliant. Because there may be no underlying /bin/sh, system tries to work around the problem by forking and execing the first token in its argument string; it handles basic redirection (< or >) on its own behalf.
times
Win32: 'cumulative' times will be bogus. On anything other than Windows NT, 'system' time will be bogus, and 'user' time is actually the time returned by the clock function in the C runtime library. Mac OS: Only the first entry returned is non-zero. RISC OS: Not useful.
truncate(FILEHANDLE, LENGTH), truncate(EXPR, LENGTH)
Win32: If a FILEHANDLE is supplied, it must be writable and opened in append mode (use open(FH, '>>filename') or sysopen(FH, ..., O_APPEND|O_RDWR)). If a filename is supplied, it should not be held open elsewhere. VMS: Not implemented. VOS: Truncation to zero length only.
umask(EXPR)
Returns undef where unavailable, as of version 5.005.
umask
AmigaOS: umask works, but the correct permissions are set only when the file is finally closed.
utime(LIST)
Win32: May not behave as expected. Behavior depends on the C runtime library's implementation of utime, and the file system being used. The FAT file system typically does not support an access time field, and it may limit timestamps to a granularity of two seconds. Mac OS, VMS, RISC OS: Only the modification time is updated.
wait, waitpid(PID, FLAGS)
Win32: Can only be applied to process handles returned for processes spawned using system(1, ...), unless we are using Perl 5.6, where there is support for fork. Mac OS, VOS: Not implemented. RISC OS: Not useful.
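In practice, the portable response to the table above is to guard non-portable calls with a check of $^O, or to wrap them in eval so that an unimplemented function does not kill the program. A minimal sketch (the list of platforms without flock is an illustrative subset, not the complete set):

```perl
#!/usr/bin/perl
# guarded_flock.pl - attempt flock only where it stands a chance of working.
use warnings;
use strict;
use Fcntl ':flock';

# $^O values where the table lists flock as not implemented (illustrative subset)
my %no_flock = map { $_ => 1 } ('MacOS', 'VMS', 'riscos', 'vos');

sub try_lock {
    my $fh = shift;
    return 0 if $no_flock{$^O};
    # eval guards against a fatal 'flock() unimplemented' error elsewhere
    my $ok = eval { flock($fh, LOCK_EX) };
    return $@ ? 0 : $ok;
}

open(my $fh, '>', 'guarded.tmp') or die "open: $!";
print try_lock($fh) ? "locked\n" : "no locking on $^O\n";
close $fh;
unlink 'guarded.tmp';
```

The eval also catches ports that compile flock but die at runtime, so the caller only ever sees a true/false result.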
Summary

In this chapter, we looked at many portability issues that we should be aware of if we want to write cross-platform programs. We have covered differences between various platforms in the way they handle newlines, filenames (and using the File::Spec module), endianness, system interaction, IPC, XS, modules, time and date, character sets and character encoding, internationalization, system resources, security, and style. We have also explored Perl's support for particular platforms, namely UNIX, DOS, MS Windows, and Mac. Finally, we looked at a number of Perl functions with peculiar implementations, which are not implemented, or which are meaningless on non-UNIX platforms.
Unicode
Symbols are used to convey and to record meaning. We have been using them since time immemorial – they have been the building blocks of communication, from the earliest inscriptions on cave walls to the multitude of scripts that are used today. The language that the modern PC can deal with at the lowest level has only two symbols, traditionally represented by us as zeros and ones. People in every culture have programmed the computer to interact with them using their own set of symbols. Certain combinations of zeros and ones were invented to stand for letters of the alphabet, monetary units, arithmetic symbols – things that made sense to humans.
Just as most microprocessors would like to see a language of fixed high and low voltages being spoken on their data buses instead of some fancy noise, humans would like to see the computer respond to them in their own native language. Things were fine until recently, when most of the people using computers belonged to the English-speaking world, or had previously taught themselves English. The Internet brought about a paradigm shift, with people and machines across cultures interacting with each other. Half a decade ago, a company could not have had its web pages in multiple languages rendered correctly on browsers all over the world. Due to the variety of ways in which information in various languages could be stored, it was also difficult for a program to deal with multiple language input. It used to be impossible to have an online document containing multiple scripts, for example, an English newspaper containing a Japanese haiku or a quote from a speech in Russian. We are looking at two problems here:

❑ We have many different ways of representing our symbols (character codes and encoding).
❑ We have many different sets of symbols (languages).
Chapter 25
Whom Does It Affect?

This affects almost everyone involved with creating software to be used across language boundaries. The world is a big place, and most of it is still untouched by information technology. In addition, people will prefer software that uses their native symbols and language – not someone else's! Even today, the reaches of the Internet and the World Wide Web are not really utilized if the audience is limited only to people who share a particular language – it does not make economic sense to people doing business using various mechanisms running on computers and over networks. Therefore, by extension, it affects the creators of those mechanisms.
What Are the Solutions?

These problems were recognized quite some time ago and now we have many solutions that have been implemented and are evolving, most of them working together to make sure that data is usable by everyone. We have technologies such as XML, locales, and Unicode to help us internationalize our applications and web pages. We don't have any magic fixes yet, but we are moving towards the solution. What we are going to look at in this chapter is a solution to the first problem mentioned above, and how Perl can help. We also take a look at a small example of how Perl can also help solve the second problem using Unicode.
Characters and Unicode 101

There is a lot of confusion in this field since the terminology varies and definitions are not universally accepted. Let us take a look at some of these. A set or a collection of distinct characters is called a character repertoire. It is defined by giving a name for each character and a visible presentation as an example. A sample of a character repertoire may look like:

Character                            Value
The English lowercase letter a.      a
The English lowercase letter b.      b
:                                    :
The English uppercase letter Z.      Z
A table that establishes a one-to-one correspondence between characters from a repertoire and a table of non-negative integers is called a character code or a character set. The integer corresponding to a character is also called its code position, since it also serves as an index into the character set. A sample character set may look like this (using decimal values):

The English lowercase letter y       121
The English lowercase letter z       122
The character {                      123
The character |                      124
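In Perl, the ord and chr functions move between a character and its code position, so a table like the one above can be checked directly:

```perl
#!/usr/bin/perl
# code_positions.pl - characters and their code positions via ord and chr.
use warnings;
use strict;

print ord('y'), "\n";  # prints 121, the code position of lowercase y
print ord('{'), "\n";  # prints 123
print chr(122), "\n";  # prints z, the character at code position 122
```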
The most well known character set so far has been the one defined by the ASCII standard. Many character sets were developed later to cater for a growing variety of character repertoires with a growing number of characters. The character sets specified by the ISO 8859 family of standards extend the ASCII character set in different ways for different cultures and their character repertoires. However, all of them are supersets of ASCII in the sense that code positions 0-127 contain the same characters as in ASCII. Furthermore, positions 128-159 are unused (reserved for control characters) and positions 160-255 are used differently in different members of the ISO 8859 family. For example, the ISO 8859-1 character set extends ASCII to cover West European characters – a repertoire identified as 'Latin alphabet No. 1'. ISO 8859-2 does the same for Central and East European characters – 'Latin alphabet No. 2' – and so on up to ISO 8859-15.

Standard       Name of alphabet          Characterization
ISO 8859-1     Latin alphabet No. 1      'Western', 'West European'
ISO 8859-2     Latin alphabet No. 2      'Central European', 'East European'
ISO 8859-3     Latin alphabet No. 3      'South European', 'Maltese and Esperanto'
ISO 8859-4     Latin alphabet No. 4      'North European'
ISO 8859-5     Latin/Cyrillic alphabet   (for Slavic languages)
ISO 8859-6     Latin/Arabic alphabet     (for the Arabic language)
ISO 8859-7     Latin/Greek alphabet      (for modern Greek)
ISO 8859-8     Latin/Hebrew alphabet     (for Hebrew and Yiddish)
ISO 8859-9     Latin alphabet No. 5      'Turkish'
ISO 8859-10    Latin alphabet No. 6      'Nordic' (Sámi, Inuit, Icelandic)
ISO 8859-11    Latin/Thai alphabet       (for the Thai language)
(Part 12 has not been defined)
ISO 8859-13    Latin alphabet No. 7      'Baltic Rim'
ISO 8859-14    Latin alphabet No. 8      'Celtic'
ISO 8859-15    Latin alphabet No. 9      An update to Latin1 with a few modifications and the addition of the Euro currency symbol. Nicknamed Latin0.
The Windows character set, more properly called Windows code page 1252, is similar to ISO 8859-1, but uses positions 128-159 (reserved for control characters in ISO 8859-1) for printable characters. MS-DOS had its own character sets called code pages. CP 437, the original American code page, and CP 850, extended for the West European languages, have largely the same repertoire as ISO 8859-1 but with characters in different code positions. Various character sets may be found all over the world, such as ISCII in India, Big5 in Taiwan, and JIS X 0208:1997 in Japan. Now back to our two problems. The first step towards solving the first problem would be to have a single character set or a table that served all of the character repertoires in the world. There have been two major efforts towards solving this problem.
The ISO 10646 project of the International Organization for Standardization (ISO, counter-intuitively) defines the mother of all character sets – the Universal Character Set (UCS). The ISO 10646 standard defines a 32-bit character set, with only the first 65534 positions assigned as of now. This subset of the UCS is also known as the Basic Multilingual Plane, or Plane 0. It has enough character space to encompass practically all known languages: not only Latin, Greek, Arabic, Hebrew, Armenian, and Georgian, but also scripts such as Devanagari, Bengali, Gurumukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Khmer, Lao, Runic, Cherokee, Sinhala, Katakana, Hiragana, Hangul, and Chinese, Japanese, and Korean ideographs. The UCS also has a lot of space outside the Basic Multilingual Plane for hieroglyphics and some selected artistic scripts such as Klingon, Cirth, and Tengwar, which may be implemented in the future. This space in the UCS was reserved for other 'planes' that could be switched by using escape sequences.

The other effort was the Unicode Project, organized by a consortium of multilingual software development houses. The Unicode Standard, published by the Unicode Consortium, also has information other than just the character sets, for example algorithms for handling bi-directional texts, say, a mixture of Latin and Hebrew (Hebrew is written right to left).

Around 1991, the ISO and the Unicode Consortium agreed that it would make more sense to have one universal standard instead of two different ones, and so now the Unicode Standard defines a character repertoire and character set corresponding to Plane 0 of the UCS, the Basic Multilingual Plane. Unicode characters are represented as U+xxxx, where xxxx is the code position in hexadecimal. For example, the code for a white smiley face is written as U+263A. That solved half of the first problem.
The other half is to do with the representation of those non-negative integers for storage and transmission encoding. Data is usually composed of octets, or bytes of 8 bits. For ASCII, 7 bits were enough to represent a code position; for larger character sets, such as those for the Chinese, Japanese, and Korean languages, or for the UCS, this is grossly insufficient. Therefore, we have many encoding schemes using more than a byte to represent a character. Now, for the Universal Character Set, the two most obvious choices would be to use a fixed size of two or four bytes, since any one of the 2^32 characters can be accommodated in 4 bytes, or 32 bits. These encoding schemes are known as UCS-2 and UCS-4 respectively. However, there are problems associated with them. Most importantly, programs expecting ASCII as input would fail even for plain (ASCII repertoire) text, since they have not been designed to handle multi-byte encoding. Plain ASCII repertoire text would also unnecessarily take up twice or four times the space required. Furthermore, strings encoded in UCS-2 or UCS-4 may contain characters like \, 0, or / having special significance in filenames in various operating systems. Therefore, UCS-2 or UCS-4 would not work well as a universal encoding scheme. Annex R of the ISO 10646-1 standard and RFC 2279 define a variable length encoding scheme called UTF-8. In this scheme, a character from the ASCII repertoire (U+0000 to U+007F) looks like an ASCII character encoded in a single byte (0x00 to 0x7F). Any higher character is encoded as a sequence of 2 to 6 bytes, according to a scheme where the first byte has the highest bit set and gives the count for the number of bytes to follow for that character. More information about UTF-8 can be acquired from http://www.ietf.org/rfc/rfc2279.txt.
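We can watch the variable-length scheme at work. The sketch below uses utf8::encode, which arrived with Perl 5.8 (so slightly after the 5.6 snapshot described here), to expose the raw UTF-8 bytes of a character:

```perl
#!/usr/bin/perl
# utf8_bytes.pl - inspect the UTF-8 byte sequences of two characters.
use warnings;
use strict;

for my $ch ("A", "\x{263A}") {       # U+0041 (ASCII) and U+263A (smiley face)
    my $bytes = $ch;
    utf8::encode($bytes);            # turn the characters into raw UTF-8 bytes
    printf "U+%04X -> %s\n", ord($ch),
           join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
# U+0041 -> 41
# U+263A -> E2 98 BA
```

The ASCII character occupies a single, unchanged byte, while the smiley face needs the three-byte sequence E2 98 BA, exactly as the scheme above prescribes.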
Data in Perl

From Perl 5.6 onwards, strings are not treated as sequences of bytes, but instead as sequences of numbers in the range 0 to (2^32)-1, using an encoding scheme based on UTF-8. Perl may store a string containing characters between 0x00 and 0x7F with fixed 8-bit encoding, but will promote it to UTF-8 if it is involved in an operation with another UTF-8 encoded string; it converts between UTF-8 and fixed 8-bit encoding wherever necessary. As a result, we should not rely upon the encoding used by Perl internally. The whole idea of Unicode support in Perl is to spare us the burden of such details.
Unless told otherwise, a typical operator will deal with characters, or sequences of characters. We can explicitly tell Perl to treat strings as byte sequences by placing the operator concerned within the scope of a use bytes directive. We have to explicitly tell Perl to allow UTF-8 encoded characters within literals and identifier strings by the use utf8 directive. Variable names could thus be in Hindi, Thai, Japanese, one's own native language, or any of the many supported symbols. For example:
$ийклмн = "Hello Unicode Chess Pieces";
If UTF-8 is in use, string literals in Unicode can be specified using the \x{} notation:

$string = "\x{48}\x{49}\x{263A}";
Note: use utf8 is a temporary device. Eventually it will be implicit, when Unicode support is completely transparent to the user. However, even if we do not explicitly tell Perl anything about our strings, the operators will still treat them correctly. That is, all operands with single-byte encoded characters will remain so until they get involved in expressions or operations with anything having a wider encoding; all programs that expected 8-bit encoded ASCII and gave out 8-bit encoded ASCII should still work as they did before. If an ASCII encoded operand is involved in an operation with a UTF-8 encoded operand, the 8-bit encoded operand is promoted to UTF-8.
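The character/byte distinction shows up directly in length, which counts characters by default and bytes inside the scope of use bytes. A minimal sketch (the byte count of 3 assumes the UTF-8 internal encoding described above):

```perl
#!/usr/bin/perl
# char_vs_bytes.pl - length counts characters, or bytes under 'use bytes'.
use warnings;
use strict;

my $string = "\x{263A}";          # a single character, U+263A

print length($string), "\n";      # 1 - one character
{
    use bytes;                    # lexically switch to byte semantics
    print length($string), "\n";  # 3 - its internal UTF-8 form is three bytes
}
```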
Unicode and Regular Expressions

Perl 5.6 has Unicode support for regular expressions. So, how does one specify Unicode characters in a regular expression? Unicode characters may be represented as \x{hex number here}, so that matching data with /\x{05D0}/ is the same as matching it with the Hebrew letter א (alef). In order to do this, we need to use the use utf8 directive to tell Perl that it may encounter Unicode characters in a literal. Also, even if the Unicode character lies between 0x80 and 0xFF, it must be enclosed in braces in order to be recognized as Unicode; \x80 is not the same as the Unicode character \x{80}. However, unless one writes poetry in hexadecimal, seeing many of those sprinkled throughout regular expressions will not be a pretty sight. This is where the use charnames directive comes in handy. The name specified for the character by the Unicode standard, enclosed within braces in a named character escape sequence \N{}, may be used instead of the code itself. For example, if we wanted to match the character for a white chess king, \x{2654}, we would use the following:

use utf8;
use charnames ":full"; # use Unicode with full names
$chess_piece = "Opponents move \N{WHITE CHESS KING}";
$chess_piece =~ /\N{WHITE CHESS KING}/; # this will match
Short names can also be used instead of full names:

use utf8;
use charnames ":full"; # use Unicode with full names
print " \N{CYRILLIC CAPITAL LETTER PSI} is a psi in Cyrillic \n";
use charnames ":short"; # use Unicode with short names
print " \N{Cyrillic:Psi} is also a capital psi in Cyrillic \n";
We can also restrict the language:

use utf8;
use charnames qw(cyrillic); # restricts charnames to the cyrillic script
print " \N{Psi} is also a capital psi in Cyrillic";
We can match characters based on Unicode character properties. A property is used as a shortcut for character classes in regular expressions. The properties are usually of the form \p{IsPROPERTY}, with complement \P{IsPROPERTY}. For example, we can use \p{IsCntrl} to match any character belonging to the set of control characters, whereas \P{IsCntrl} matches any character belonging to its complement set – any character that is not a control character is matched:

if ($var =~ /\p{IsCntrl}/) {print "control character found";}
The Unicode Consortium publishes data associated with these properties in machine-readable format. This data is used to create the tables and files Perl uses in order to have properties that can be used as character classes. There are certain standard Unicode properties where no character can belong to more than one category. A few examples of standard Unicode properties are:

❑ IsLu – An uppercase letter.
❑ IsNd – A decimal digit number.
❑ IsPd – A dash (punctuation character).
Perl also has certain composite properties implemented as combinations of basic standard Unicode properties, other composite properties, and other basic pattern elements. For example:

❑ IsASCII – Equivalent to the character class [\x00-\x7f].
❑ IsSpace – Equivalent to the character class [\t\n\f\r\p{IsZ}] where \p{IsZ} is given below.
❑ IsZ – Same as the set consisting of IsZl (line separator), IsZp (paragraph separator) and IsZs (space separator).
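A short sketch of these property classes in a match (the sample characters are arbitrary):

```perl
#!/usr/bin/perl
# property_match.pl - classify sample characters with Unicode properties.
use warnings;
use strict;

for my $ch ('A', '7', '-', 'a') {
    my @classes;
    push @classes, 'uppercase letter (IsLu)' if $ch =~ /\p{IsLu}/;
    push @classes, 'decimal digit (IsNd)'    if $ch =~ /\p{IsNd}/;
    push @classes, 'dash (IsPd)'             if $ch =~ /\p{IsPd}/;
    print "'$ch': ", (@classes ? join(', ', @classes) : 'none of the above'), "\n";
}
```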
Unicode block properties can be used to determine if a certain character belongs to a certain script, or to test the range of a script. The block properties are of the form \p{InSCRIPTNAME}, with complement \P{InSCRIPTNAME}, where SCRIPTNAME may be replaced by Arabic, Armenian, Bopomofo, and so on. So we could write:

if ($var =~ /\P{InHebrew}/) {warn "I only understand the Hebrew script!";}
A complete list of the properties available on our system can be found by exploring the contents of the Unicode subdirectory where the Perl library files are stored. To find out where these libraries are stored, we can issue this command at the prompt:

> perl -V
All the files corresponding to the InSCRIPTNAME properties can be found in OUR_PERL_LIBPATH\unicode\In\. We can have a look at the files that actually implement these properties. For example, let us look at one such file, Devanagari.pl, residing in this subdirectory:

# !!!!!!! DO NOT EDIT THIS FILE !!!!!!!
# This file is built by mktables.PL from e.g. Unicode.300.
# Any changes made here will be lost!
return <<'END';
0900	097F
END
The second-to-last line is the important one: the file simply returns the range delimiters (in hex) for the block that constitutes the Devanagari script.
Writing Our Own Character Property

If for some reason we need a character property that we cannot locate, it is possible to write our own. We do this by writing a subroutine using the name of the property desired. In a manner similar to what we saw in the file Devanagari.pl, we return a list of characters, or character ranges (in hexadecimal), that we want the property to match, one per line. Existing property names may also be used instead of returning the ranges in hexadecimal. For example:

+utf8::InBasicLatin

would be the same as:

0000	007F
For specifying a character that should not match a certain property, we prefix the property with a bang (!) to signify negation, or use the - sign. So if we want a property to match a character that is not Greek, say, we should create a subroutine called IsUnderstandable (assuming that we understand all other scripts) as follows: sub IsUnderstandable { return <<'END'; !utf8::InGreek END }
Or using the alternative method:

sub IsUnderstandable {
    return <<'END';
-utf8::InGreek
END
}
These scripts create a character property that matches any character that is not Greek to us (or to anyone who agrees with the Greek script as defined by the Unicode standard). The tr operator now supports the conversion of Latin-1 to UTF-8 and vice versa:

tr/\x00-\xff//CU;   # convert $_ from Latin-1 into Unicode
tr/\x00-\x{ff}//UC; # convert $_ from Unicode into Latin-1
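A user-defined property like those above can be used in a pattern as soon as the subroutine exists. A self-contained sketch with a made-up IsSigil property covering Perl's three sigil characters (the property name and its contents are our own invention):

```perl
#!/usr/bin/perl
# custom_property.pl - define and use our own character property.
use warnings;
use strict;

# A made-up property matching $, % and @ (code positions 24, 25 and 40 hex)
sub IsSigil {
    return <<'END';
0024
0025
0040
END
}

for my $ch ('$', '@', '%', 'x') {
    print "'$ch' ", ($ch =~ /\p{IsSigil}/ ? "is" : "is not"), " a sigil\n";
}
```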
Another question that could arise is how Perl decides whether to match data as bytes or as characters. The answer is that if a regexp contains Unicode characters, the string to be matched is treated as a sequence of characters; otherwise, it is treated as a sequence of bytes. This is decided when the pattern is compiled. The use utf8 pragma may be used to tell the engine to match a Unicode character for a period (.) in an expression. The \X metasymbol tells the engine to match a combining character sequence, consisting of a base character followed by one or more mark characters (diacritical marks like the umlaut or a cedilla). We can force the engine to match a byte even in the middle of a UTF-8 string by using the \C escape sequence, which matches any byte with a value from 0 to 255. Postponing the byte-versus-character decision until runtime, when the matching happens, is on the 'to do' list for Perl as of 5.6: support for matching as characters whenever the string being matched actually contains them is still awaited.
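The \X metasymbol just mentioned can be demonstrated with a base character plus a combining mark:

```perl
#!/usr/bin/perl
# combining.pl - \X matches a whole combining character sequence.
use warnings;
use strict;

my $e_acute = "e\x{301}";   # 'e' followed by COMBINING ACUTE ACCENT

if ($e_acute =~ /^(\X)$/) {
    # The base character and its mark are consumed as one sequence
    print "matched one sequence of ", length($1), " characters\n";
}
```

Although the string holds two characters, \X consumes them as a single combining sequence, so the anchored match succeeds.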
Bi-directional Scripts
Languages such as Arabic, Hebrew, Farsi, Urdu, and Yiddish are written from right to left. However, embedded segments of text in other languages or numeric data, such as phone numbers, are written from left to right. Since it is possible to have segments of data written in different directions, the scripts of these languages are called bi-directional (or bidi) scripts. For example:

HHHHHHHH HHHH HHHH phone 12345 HHHHHHHHHHHHHHHH
<------- <--- <--- ----> ----> <---------------
Here we have the Hebrew characters, H, written right to left, and Latin characters and numerals written left to right. The --> and <-- represent the directions of the text. In these scripts, the data may be stored in one of two ways: with the characters in visual (physical) order, that is, ordered as they are presented on the screen; or with the characters in logical order, in other words the order in which they are read and spoken.
Unicode
Since it would be much easier to process data stored in the way it is read and spoken, the logical order is preferred. This becomes clear after the following, interesting exercise: try searching for a word in a string if it were stored in the order opposite to the order in which it is spoken! Now try the same thing with the following sentence: !nekops si ti hcihw ni redro eht ot etisoppo redro eht ni derots erew ti fi gnirts a ni drow a rof gnihcraes yrT The Unicode standard specifies an algorithm called the bidi algorithm, according to which the rendering process is responsible for switching from the logical order of the characters to the visual order for displaying the glyphs. Perl provides character properties that classify characters based on their bidi properties. A complete list of the bidi properties on our system can be determined by looking at the files in OUR_PERL_LIBPATH\unicode\Is\; the corresponding property names are IsBidiAN, IsBidiB, IsBidiCS, and so on. People who perform bi-directional rendering can use these character properties for things such as determining whether a character is to be rendered left to right or right to left.
Rendering bidi
There are a few other things that need to be known in order to understand how rendering of bidi scripts is handled. A detailed description of the bidi algorithm is available in an annex to the Unicode standard, called Technical Report #9 (available from http://www.unicode.org/). This report defines a few special codes, called Directional Formatting Codes, to associate segments of text with the directions in which they should be rendered. These codes affect only the directional properties of characters and should therefore be ignored for all processing or analysis of the text other than rendering for display. Some of the basic codes and their meanings are as follows:

❑ LRE (left to right embedding) – The following text is embedded left to right.
❑ RLE (right to left embedding) – The following text is embedded right to left.
❑ RLO (right to left override) – Force following characters to be right to left.
❑ LRO (left to right override) – Force following characters to be left to right.
❑ PDF (pop directional format) – Restore the direction to what it was before the last LRE, RLE, RLO, or LRO. The word pop is used since the bidi algorithm requires a stack onto which Directional Formatting Codes are pushed as they are encountered.
An easy way to embed right to left text within, say, English text is to enclose the text between an RLE code and a PDF code:

eeeee eeeeee eeeee RLE RIGHT-TO-LEFT-TEXT PDF eeeeee eeeeeee eee eee
There are some other implicit formatting codes – implicit in the sense that they do not result in the creation of a directional boundary. This will be clear with the help of a simple example. For the purposes of this example, assume that all characters shown in UPPER CASE are in the Arabic script. Suppose that a word processor stores a sentence in the logical order, typed in as follows: "WELCOME TO THE DESERT!", said the camel, in Arabic, to the world.
In the absence of any formatting codes, implicit bidi re-ordering would render it thus: "TRESED EHT OT EMOCLEW!", said the camel, in Arabic, to the world. Note that the exclamation mark was not treated as a part of the Arabic sentence. To rectify this, the word processor could have stored it within an explicitly created directional boundary (the RLE and PDF codes are invisible, so they are shown here as visible markers) as follows: "<RLE>WELCOME TO THE DESERT!<PDF>", said the camel, in Arabic, to the world.
This would have been rendered as we wanted it: "!TRESED EHT OT EMOCLEW", said the camel, in Arabic, to the world. Now, the creation of an explicit directional boundary (and the stack pushing and popping that goes with it) could have been avoided by adding an implicit formatting code (RLM – right to left directional mark) after the exclamation mark. The sentence is then stored as follows (again with the invisible RLM shown as a visible marker): "WELCOME TO THE DESERT!<RLM>", said the camel, in Arabic, to the world.
This would also have been rendered as desired: "!TRESED EHT OT EMOCLEW", said the camel, in Arabic, to the world. Every Unicode character has been assigned a Bi-directional Character Type that determines one or more of the following:

❑ How it is to be rendered (directly important in the absence of any explicit directional formatting codes).
❑ Whether its directional properties are affected by those of the adjacent characters (in other words, whether it belongs to the weaker bidi character types). Examples include the plus sign, minus sign, degree sign, currency symbols, colon, comma, full stop, numerals, and so on.
❑ Whether it has anything to do with direction at all – whether it is a neutral character. Examples include the paragraph separator, tab, line separator, form feed, general punctuation, space, and so on.
There are various embedding levels that specify how deeply the associated text has been embedded. For example, consider the following sentence: This sentence in Latin script has a "QUOTE IN ARABIC SCRIPT" embedded within it.
The text outside the quotes has an embedding level of 0, whereas the text within the quotes (shown in ALL CAPITALS and assumed to be in the Arabic script) has an embedding level of 1. Each level has a default direction called the embedding direction. This direction is L (left to right) if the level number is even and R (right to left) if the level number is odd. Every paragraph has a default embedding level, and thus a default direction associated with it. This is also called the base direction of the paragraph. For example: A paragraph with a beginning like this in the Latin script would have a default embedding level as Level 0, and hence its base direction would be left to right.
What the bidi Algorithm Does
The bidi algorithm uses all these formatting codes and embedding levels to analyze text and decide how it should be rendered. Here, briefly, is how it goes about doing it:

❑ It breaks up the text into paragraphs by locating the paragraph separators. This is necessary because all the directional formatting codes are only effective within a paragraph. Furthermore, this is where the base direction is set. The rest of the algorithm treats the text on a paragraph-by-paragraph basis.
❑ The directional character types and the explicit formatting codes are used to resolve all the levels of embedding in the text.
❑ The text is then broken up into lines, and the characters are re-ordered on a line-by-line basis for rendering on the screen.
Perl and bidi
Since Perl is a language frequently used for text processing, it is natural that Perl should have bidi capabilities. We have an implementation of the bidi algorithm on Linux that can be used by Perl.
We require a C library named FriBidi, which is basically a free implementation of the bidi algorithm, written by Dov Grobgeld. A Perl module has also been written by the same author, acting as an interface to the C library, and is available as FriBidi-0.03.tar.gz from http://imagic.weizmann.ac.il/~dov/freesw/FriBidi. The FriBidi module enables us to do the following:

❑ Convert an ISO 8859-8 string to a FriBidi Unicode string:

iso8859_8_to_unicode($string);

❑ Perform a logical to visual transformation, in other words, run the string obtained above through the bidi algorithm:

log2vis($UniString, $optionalBaseDirection);

This calculates the base direction if it is not passed as the second argument, returns the re-ordered string in scalar context, and additionally returns the base direction as the second element of the list in list context.
❑ Convert the string obtained above back to the ISO 8859-8 character set:

unicode_to_iso8859_8($toDisplay);

This makes sure that it is in a 'ready-to-display' format, assuming the terminal can display ISO 8859-8 characters (such as xterm).

❑ Translate a string from a FriBidi Unicode string to capRTL and vice versa:

caprtl_to_unicode($capRTLString);
unicode_to_caprtl($fribidiString);

The capRTL format is where the CAPITAL LETTERS are mapped as having a strong right to left character property (RTL). This format is frequently used for illustrating bidi properties on displays with limited ability, such as ASCII-only displays.
The following is a small example to demonstrate FriBidi's capabilities. First, we create a small file named bidisample with the following text:

THUS, SAID THE CAMEL TO THE MEN, "...there is more than one way to do it." AND THE MEN REPLIED "...now we see what you mean by bidi", RISING WITH CONTENTMENT WRIT ON THEIR FACES.

This is the code to render the above file in bidi fashion:

#!/usr/bin/perl
# bidirender.pl
use warnings;
use strict;
use FriBidi;

open (BIDISAMPLE, "bidisample") or die "Cannot open bidisample: $!";
while (<BIDISAMPLE>) {
    chop;                                     # remove line separator
    my $uniStr = caprtl_to_unicode($_);       # convert line to FriBidi string
    my $visStr = log2vis($uniStr);            # run it through the bidi algorithm
    my $outStr = unicode_to_caprtl($visStr);  # convert it back to a format that can be
                                              # displayed on a usual ASCII terminal
    print $outStr, "\n";
}
> perl bidirender.pl
"theres more than one way to do it..." ,NEM EHT OT LEMAC EHT DIAS SUHT
,"now we see what you mean by bidi..." DEILPER NEM EHT DNA
.SECAF RIEHT NO TIRW TNEMTNETNOC HTIW GNISIR
Perl, I18n and Unicode
Now let us take a brief look at a solution to the problem of language barriers. A more extensive view of internationalization can be found in Chapter 26. Unicode helps us out in this matter by providing a uniform way of representing all possible characters of all the living languages in this world. We are about to see how easy it is to enable people all over the world to understand what we are saying, in their own language. This example may be tried out by anyone with a day or two of Perl experience. Although it is in no way complete, with no error checking and no pretense of handling any real-world complexity, it demonstrates the ease with which Perl handles Unicode.

Let us imagine the following scenario: an airport wants to have information kiosks at various locations outside the arrival lounge for foreign tourists. They need the information to be displayed in Arabic, Japanese, Russian, Greek, English, Spanish, Portuguese, and a whole host of other languages. They would like the kiosks to enable the user to view information about the city, events, weather, flight schedules, and sight-seeing tours, and also to be able to make and confirm reservations in affiliated hotels. Our task here is obviously to create a Perl program that is able to handle Unicode and, therefore, to an extent, solve this problem. The first thing we need to do is create a template HTML file containing a few HTML tags, but with the text replaced by text 'markers' – M1, M2, M3, and so on.
We need one file for each language, in the following format (obviously, all the files should contain Unicode text encoded in UTF-8):

M1:charset "string corresponding to charset"
M2:title "string corresponding to title"
M3:heading "string corresponding to heading"
M4:text "text string"

To put the task another way, we need to write a program that takes the language name as input and accordingly generates an output HTML file by filling in the template file with the strings in the language requested. The output file should be UTF-8 encoded Unicode. This involves a few things such as installing Unicode fonts, installing a Unicode editor, creating template HTML files, writing scripts, and so on. Let us look at these step-by-step.
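For instance, an english.str file might look like this (the marker and the value are separated by a tab, which is what the script's split("\t", ...) expects; the strings themselves are, of course, just placeholders of our own):

```
M1:charset	UTF-8
M2:title	Airport Information Kiosk
M3:heading	Welcome to the airport
M4:text	Please choose a service: city guide, weather, flights, tours, hotels.
```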
Installing Unicode Fonts
For UNIX with the X Window System and Netscape Navigator, information regarding Unicode fonts for X11 in general can be found at http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html. The latest version of the UCS fonts package is available from http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz. For Windows with IE 5.5, Unicode fonts can be selected during installation or can be downloaded from http://www.microsoft.com/typography/multilang/default.htm. Another good place for links to fonts is http://www.ccss.de/slovo/unifonts.htm.
Installing a Unicode Editor
For UNIX with the X Window System, Yudit is a good choice of editor that supports UTF-8; it is available from http://www.yudit.org/. For Windows 95 and 98, Sharmahd Computing's UniPad is a good editor, available from http://www.sharmahd.com/unipad/. For Windows NT and 2000, Notepad is able to handle Unicode.
Creating the HTML Template
Now we can go about creating the template HTML files and the string resource files. The next listing is simply an HTML template, called templateLeft.html, that we will use with our program:

<meta http-equiv="Content-Type" Content="text/html" Charset=M1:charset>
M2:title
M3:heading
M4:text
Note that this is intuitively correct for languages written from left to right, but for languages written the other way we need a modified template to follow suit. So for languages such as Arabic, we can simply right-justify the displayed text using the ALIGN attribute and setting its value to RIGHT. This should be done for all text in the body of the document so that it is displayed in the correct direction, that is, from right to left. This is the template templateRight.html that we will use with right-to-left languages:

<meta http-equiv="Content-Type" Content="text/html" Charset=M1:charset>
M2:title
M3:heading
M4:text
This next image is the sample string resource file for the English language using UniPad:
The following image is a screenshot of a sample string file for the Arabic language. Note the rendering of Arabic text from left to right:
Processing the Resource Files
The fourth stage in our solution to the problem is creating the Perl script. This script will process the resource files it is given and generate the localized pages:

#!/usr/bin/perl
# Xlate.pl
use warnings;
use strict;

my ($langname, $filename, $marker, $mark, $value, $wholefile, $thisval, $template, %valueof);

print "Enter the language for the output html file: \n";
$langname = lc <>;               # get language name and turn it into lowercase
chomp $langname;
$filename = $langname . ".str";  # generate filename from language name
open(LANGFILE, "$filename");

# read in the markers & values into a hash
while (<LANGFILE>) {
    chomp($_);
    ($marker, $value) = split("\t", $_);
    $valueof{$marker} = $value;
}
close(LANGFILE);

# use the correct template
if ($langname =~ /arabic|hebrew/) {
    $template = 'templateRight.html';
} else {
    $template = 'templateLeft.html';
}

open(TMPLT, $template);
open(OUTFILE, ">$langname.html");
$wholefile = join('', <TMPLT>);  # slurp entire file into a string
close TMPLT;

foreach $mark (keys %valueof) {
    $thisval = $valueof{$mark};        # get the value related to the marker
    $wholefile =~ s/$mark/$thisval/g;  # do the replacement
}

print OUTFILE $wholefile;              # write out the complete langname.html file
print "output written to $langname.html \n";
close OUTFILE;
This is the big surprise – the script looks too simple. In fact, no extra processing is required to handle Unicode.
Running the Script
Now we can execute the code in the usual way and provide the language required:

> perl Xlate.pl
Enter the language for the output html file:
ENGLISH
output written to english.html

> perl Xlate.pl
Enter the language for the output html file:
Arabic
output written to arabic.html
The Output Files
After running the script and having it produce the *.html files, we can open them in a browser and see what has been written. The following is a screenshot of the english.html file generated by the script:
Next is the arabic.html file generated by the script. Note the direction of Arabic script as rendered by the browser:
There are more examples of the same phrase written in different languages at http://www.trigeminal.com/samples/provincial.html, thanks to Michael Kaplan. A few more such sites are http://www.columbia.edu/kermit/utf8.html (hosted by the Kermit Project), http://www.unicode.org/unicode/standard/WhatIsUnicode.html (hosted by the Unicode Consortium) and http://hcs.harvard.edu/~igp/glass.html (hosted by the IGP). This simple method of replacing text markers is still widely used. However, localizing a large web site takes much more than just being able to handle Unicode strings. Things such as cultural preferences, date format, currency (which are covered in Chapter 26) need to be taken into consideration. This means we should probably turn to using methods such as HTML::Template, HTML::Mason, HTML::Embperl or maybe something like XML::Parser in order to create an industrial strength multilingual site.
Work in Progress
There are still a few things about Unicode support in Perl that are under development. For instance, it is not possible right now to determine whether an arbitrary piece of data is UTF-8 encoded or not. We cannot force the encoding used when performing I/O to be anything other than UTF-8, and we will have a problem if the pattern in a match does not contain Unicode but the string to be matched at runtime does. Also, the use utf8 pragma is on its way out. In order to follow the current state of Unicode support in Perl, one can join the perl-unicode mailing list by sending a blank message to [email protected] with a subject line saying subscribe perl-unicode.
Summary
All said and done, we should not need to worry ourselves about the support of Unicode within Perl unless we really have to. All the code we create will work just as well as it did before Unicode support came on the scene, so we only need to use it when dealing with foreign scripts, for example; there is more on using Perl around the world in Chapter 26. In this chapter we have looked at the details concerning the use of Perl with non-standard characters, such as symbols or some foreign writing systems. We began with the problems we face as Perl is used across the globe, and then we looked at what can be done to provide solutions. In summary, we have:

❑ Seen how people have tackled the issue of providing an international coding system.
❑ Looked at how Unicode can be used in regular expressions and tried our hand at writing our own character property.
❑ Demonstrated how Perl can be used to deal with texts in languages that are written from right to left, as opposed to left to right such as English.
❑ Provided a real world example of how we can deal with language barriers across the world.
Locale and Internationalization
Not everybody understands English, and not everybody expresses dates in the standard US format (for example, 2/20/01). Instead, people worldwide speak hundreds of different languages and express themselves in almost as many alphabets. Furthermore, even if they do speak the same language, they may have different ways of expressing dates, times, and so on. This chapter aims to provide us with the tools to write Perl programs in many different languages, and to see the kinds of problems we may encounter along the way.
We will take a look at the kinds of ways we can develop a multilingual application in Perl, be it for a web site or for something entirely different. Going multilingual does not only mean expressing messages in different languages, but also adapting the minor details of our site, for instance, to obey the conventions of other cultures. To use the example mentioned above, it means that we must change the presentation of a date or time. Catering for other languages will always be a good thing to do, attracting more visitors to our website. As an example, we may be surprised with the results if we translated a personal home page from Spanish into German. Its popularity and number of hits would rise rapidly, and visitors will be especially surprised at how the page may be instantly translated into German – if someone in Germany accessed it. The same goes for all multilingual websites: the site does its best to guess the country of origin of the user, and show messages accordingly. For instance, if we connect to Google from a Spanish domain (.es), for example melmac.ugr.es we will immediately get the URL http://www.google.com/intl/es/, which is in Spanish. This is a neat trick that deduces the location from the Top-Level Domain (TLD) of our machine, and it will score points with the Spanish-speaking population. However, localization is not as easy as showing two different pages depending on where the client comes from. It is also a matter of knowing the language that the user, or more correctly, the user's application's client, is immersed in. The site will then show information in such a way that the client, and therefore the user, can understand it. For example, a quantity such as 3'5 will not mean much to an English person, but it means 3.5 to a Spaniard. The application must be aware of its cultural 'location', and act accordingly.
Chapter 26
If we live in a country whose native language is not English, undergoing this process is a must. We will need it for tasks as simple as alphabetical sorting, showing quantities, or matching regular expressions. Our program will not work correctly, at least from the point of view of the user, if we do not use the right locale settings. The good news is that many people have already thought about this problem in depth. There are several frameworks that make writing multilingual and localized applications easier. It will come as even better news to readers of this book that Perl has excellent support for this, as we shall soon see. In this chapter, we will see several ways of creating Perl applications for a multilingual environment, from simple tricks such as storing messages in different languages, and sorting according to local uses, to recognizing a foreign language or conjugating foreign verbs. At this early stage it is important to note that most of the programs presented in this chapter will not work on non-POSIX machines, including Win9x and Macs.
Why Go Locale?
Suppose we want to create a Spanish web site to show the number of firms investing in a particular venture. Users should be able to access the site and find out the names of the primary investors, and the site should be designed accordingly. One basic hurdle to overcome would be the need to show a list of those firms in alphabetical order. First, we need to create a plain text file in which we can list the firms. Note that these are in no alphabetical order. We call this file firms.txt:

Chilindrina
Ántico
Cflab.org
Cantamornings
Cántaliping
Andalia
Zinzun.com
We can use a simple Perl command line that prints a sorted list on the screen (note that on Windows the special characters are not displayed properly):

> perl -e 'print sort <>;' firms.txt
Andalia
Cantamornings
Cflab.org
Chilindrina
Cántaliping
Zinzun.com
Ántico

This is not ideal though. In Spanish, 'Ch' is used as a stand-alone letter, coming after 'C' in the alphabet (although admittedly, this is slowly being phased out). The letters 'A' and 'Á' are in fact the same, the only difference being that the latter has an acute accent, so they should go together. To help solve this problem, we can find a useful tutorial included with Perl itself, by typing:

> perldoc perllocale
This should tell us most of what we need to know about locale, a method that most versions of UNIX use for taking alphabets and time/date expressions into account. Once we are happy with the basic principles and concepts, we can leave the reading for later on, and just add a simple part to our command line. When applied to the same file, and when issued from a computer with the Spanish locale installed, we can expect something like the following output:

> perl -e 'use locale; print sort <>;' firms.txt
Andalia
Ántico
Cántaliping
Cantamornings
Cflab.org
Chilindrina
Zinzun.com

There is a great likelihood that something slightly different will be displayed on our computer (if it uses Linux or another strain of UNIX), but this will allow us to gauge differences between English, Spanish, and whatever locale sorting we have. In any case, this is not completely correct even for traditional Spanish sorting, although it is an improvement on the last attempt. The 'ch' was still not considered to be a different letter, but at least accented vowels were not alphabetized as different characters from their unaccented counterparts. The bad ordering of the 'ch' could be a bug in the locale implementation of our system (or maybe a bug in our understanding of the local alphabetization rules, as we will see later on). It could be improved in other Spanish locale implementations, so in order to fix it, we have to delve a little deeper into what locale actually means. The locale framework follows the phrase 'When in Rome, do as the Romans do' – computers have to use the local alphabet and local numbers. This is the reason why localization was included in the POSIX standard (that is, the set of functions and files that should be understood by all UNIX platforms).
There are some non-UNIX platforms, such as Windows NT (which has a slightly flawed implementation of the POSIX standard), which can use the locale set of functions, also known as NLS (National Language Support). If one arrives at Perl from the C world, all the functions, constants, and so on related to locale are declared in the locale.h header file. This header distributes localization elements into several categories:

LC_COLLATE – Collation, or sorting.
LC_CTYPE – Character types, distinguishing whether a character is alphanumeric or not.
LC_MONETARY – Handling monetary amounts.
LC_TIME – Displaying time.
LC_MESSAGES – Messages, not used in Perl by default.
There are a few more categories that are not so widely used, whose names should give away their meanings: LC_PAPER, LC_NAME, LC_ADDRESS, LC_MEASUREMENT, and LC_IDENTIFICATION.
These constants can be used as environment variables, or else using the POSIX setlocale function. This means we have to use POSIX to have them available in our program. Going back to our problem regarding the alphabet of the Spanish language, we can locate a locale definition on the Internet which effectively includes the traditional sorting, that is, 'ch' after 'c'. One place such a locale definition can be found, amongst many others, is under the name locales-es-2.1-1mdk.noarch.rpm, at http://www.ping.be/linux/locales/. We can install it by issuing the following command:

> rpm -Uvhf locales-es-2.1-1mdk.noarch.rpm

The f option is needed since one of the files, LC_COLLATE, conflicts with the existing settings; the file was renamed and then moved back to its original value (we need to do this, or the locale settings will not work properly in Spanish). One of the locales included in that package is es@tradicional. This should sort the 'ch' in the traditional way. However, this implementation does not work very well, as we will see. There is a more complete list of locales for Linux available from IBM DeveloperWorks, at http://oss.software.ibm.com/developerworks/opensource/locale/download.html. This includes locales such as Maltese, which will be required later. These locales are still in beta stages, and thus look set to change considerably in the future. After installing the package, we can create the following short script:

#!/usr/bin/perl
# sort.pl
use warnings;
use strict;
use POSIX qw(LC_COLLATE setlocale strcoll);

setlocale(LC_COLLATE, 'es_US');
print sort {strcoll($a, $b);} <>;
When we run this program we will obtain the following output:

> perl sort.pl firms.txt
Andalia
Ántico
Cántaliping
Cantamornings
Cflab.org
Chilindrina
Zinzun.com

Before we continue, this program probably needs a bit of explanation. Instead of using the use locale pragma, it relies on POSIX calls to do the sorting correctly. By default, locale uses the locale settings contained in the LC_* UNIX environment variables. However, in this case we want to be able to change the locale at runtime. The POSIX setlocale function allows us to do this; it takes the locale category we want to change as the first argument, and the locale we want to change it to as the second. The second argument must be a valid locale, that is, it must correspond to a file already installed on our system. If it works correctly, setlocale returns the name of the locale it has been set to; in this case, es_US. The reason for using this locale is that, for some strange reason, it seems to be the only one to use the 'traditional' Spanish ordering.
The problem with setting locales using this function (instead of the pragma), is that despite what is said in the perllocale documentation, the Perl functions cmp and sort do not use it by default. For this reason we need to use an alternative form of sort, sort {expr}, and the POSIX function strcoll, which compares (or collates, hence the name) using the setting specified in the setlocale call.
Delving Deeper into Local Culture
Now, suppose we have a scenario where a user views the site and wants to add another firm to the list of investors. Using the current code, they would have to undertake this process in Spanish, therefore eliminating the likelihood of non-Spanish speaking firms investing through the site. At first this may seem a difficult task, because it is impossible to find out where the user is from, since Spanish autonomous regions do not have their own top level domains. It would, however, be possible if there was only one user in each region responsible for liaising through our web site. So, how can we use those different locales? To start with, we need to find out how locales are named. Locale names usually have three, sometimes four, parts, written in the following way: xx_YY.charset@dialect. Breaking this down makes it easier to understand:

❑ xx represents the language.
❑ YY represents the country.
❑ charset is the character set, such as ISO8859-15 (for Latin languages) or KOI8 (for Russian).
❑ @dialect is a particular modality, for instance, a dialectal variety, such as no@nynorsk in Norwegian. It can also represent special symbols, such as ca_ES@euro (a variant of the ca_ES locale including the euro symbol).
Following this form, a typical Spanish locale would be es_ES.iso885915, or es_ES@euro, which can be found on any of the above-mentioned web sites if they are not already installed on the system. On the same page we can find locales for Basque, Catalan, and Galician – the three other official languages in Spain. Getting back to the example, we decide to greet the visitors to our site in their own language and show them the local time (of our site) using the format favored by them. The site will then show them how much money each company has invested. In order to do this, we start by creating a new plain text file called firms_money.txt:

Ántico 1000000
Chilindrina 2000.35
Cflab.org 123456.7
Cantamornings 6669876
Cántaliping 46168.5
Andalia 4567987
Zinzun.com 33445

We also need to remember that a Spanish language web site can be accessed from all over the world, and that there are large Spanish-speaking populations in South America, Central America, and North America. This adds an extra issue: different countries will more than likely have different currencies. Not everyone would understand the value of a given number of Pesetas offhand, and users may want to use the site to find out the amount of money invested in their own currency. A more general solution would be to consider the names of the people connecting to our site from different parts of the world.
1067
Chapter 26
    #!/usr/bin/perl
    # invest.pl
    use warnings;
    use strict;

    use Finance::Quote;
    use POSIX qw(localeconv setlocale LC_ALL LC_CTYPE strftime strcoll);

    my %clientMap = (
        'Jordi'      => ['ca_ES', 'Hola',  "aquí está el teu inform per"],
        'Patxi'      => ['eu_ES', 'Kaixo', "egunkaria hemen hire"],
        'Miguelanxo' => ['gl_ES', 'Olá',   "aquí está o seu relatório pra"],
        'Orlando'    => ['es_AR', 'Holá',  "aquí está tu informe para"],
        'Arnaldo'    => ['es_CO', 'Holá',  "aquí está tu informe para"],
        'Joe'        => ['en_US', 'Hi',    "here's your report for"],
    );
This script takes as its arguments the name of the client and the name of the file that contains the firms and the amount of money invested; in our case, the file is firms_money.txt.

    # Firm Quantity
    die "Usage: $0 <client> <file>\n" if $#ARGV < 1;
    my ($clientName, $fileName) = @ARGV;

    # default
    my ($locale, $hello, $notice) = ('es_ES', 'Hola', 'aquí está tu informe para');
    if ($clientMap{$clientName}) {
        ($locale, $hello, $notice) = @{$clientMap{$clientName}};
    }
    setlocale(LC_ALL, $locale);    # use settings
Arguments are read in the lines above, and locales assigned. The default value in Spain is (somewhat unsurprisingly) Spanish. If the name of the client given as the first argument on the command line is a key in %clientMap, the values are assigned to the three variables representing the locale name, the hello message, and the introductory message. The locale name is then passed to the setlocale function to assign that value to all categories.

    my %investments;
    open(F, "<$fileName") or die "Cannot open $fileName: $!\n";
    while (<F>) {    # process the file
        chomp;
        my ($firm, $invest) = split;
        $investments{$firm} = $invest;
    }
    close F;
The file has been read, and its contents entered into the %investments hash, keyed by the name of the firm.

    # obtain time info
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = localtime(time);
    print "$hello $clientName, $notice ",
          strftime("%c", $sec, $min, $hour, $mday, $mon, $year, $wday), "\n";
Currency conversion is used in the lines below. The objective of these lines is to show how information about the local international currency name and symbol is also contained in the locale settings; it is retrieved using the localeconv POSIX function. This function returns a hash reference, but we only use the keys int_curr_symbol, which references the three-letter international name of the currency, such as USD (US dollars) or ESP (Spanish pesetas), and currency_symbol, which references the symbol of that currency, such as $ for the US dollar (amongst others), or £ for the British pound.

    # Set up currency conversion
    my $q = Finance::Quote->new;
    $q->timeout(60);

    my $lconv = localeconv();
    chop($lconv->{int_curr_symbol});    # remove trailing space
    my $conversion_rate = $q->currency("ESP", $lconv->{int_curr_symbol}) || 1.0;

    for (sort { strcoll($a, $b) } keys %investments) {
        printf("%s %.2f ESP %.2f %s %s \n",
               $_,
               $investments{$_},
               $investments{$_} * $conversion_rate,
               $lconv->{int_curr_symbol},
               $lconv->{currency_symbol});
    }
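The localeconv function returns many more fields than the two used above; a quick way to see everything the current locale provides is to dump the whole hash. This is a sketch; the fields present depend on the locales installed on the system:

```perl
#!/usr/bin/perl
# lconv.pl - dump all localeconv() fields for the current locale (illustrative)
use warnings;
use strict;
use POSIX qw(setlocale localeconv LC_ALL);

setlocale(LC_ALL, '');    # adopt the locale from the environment
my $lconv = localeconv();
foreach my $field (sort keys %$lconv) {
    print "$field => '$lconv->{$field}'\n";
}
```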
Now, we can use the script and obtain output such as the following. Of course, the exact output will depend on the country we are in, the date, time, currency rates, etc.

    > perl invest.pl Arnaldo firms_money.txt
    Holá Arnaldo, aquí está tu informe para 13 dic 2000 14:56:22 CET
    Andalia 4567987.00 ESP 52530160.34 USD $
    Ántico 1000000.00 ESP 11499630.00 USD $
    Cántaliping 46168.50 ESP 530920.67 USD $
    Cantamornings 6669876.00 ESP 76701106.15 USD $
    Chilindrina 2000.35 ESP 23003.28 USD $
    Cflab.org 123456.70 ESP 1419706.37 USD $
    Zinzun.com 33445.00 ESP 384605.13 USD $

    > perl invest.pl Patxi firms_money.txt
    Kaixo Patxi, egunkaria hemen hire 00-12-13 14:57:38 CET
    Andalia 4567987.00 ESP 4567987.00 ESP Pts
    Ántico 1000000.00 ESP 1000000.00 ESP Pts
    Cántaliping 46168.50 ESP 46168.50 ESP Pts
    Cantamornings 6669876.00 ESP 6669876.00 ESP Pts
    Chilindrina 2000.35 ESP 2000.35 ESP Pts
    Cflab.org 123456.70 ESP 123456.70 ESP Pts
    Zinzun.com 33445.00 ESP 33445.00 ESP Pts

    > perl invest.pl Orlando firms_money.txt
    Holá Orlando, aquí está tu informe para 13 dic 2000 14:57:59 CET
    Andalia 4567987.00 ESP 23995.27 USD $
    Ántico 1000000.00 ESP 5252.92 USD $
    Cántaliping 46168.50 ESP 242.52 USD $
    Cantamornings 6669876.00 ESP 35036.33 USD $
    Chilindrina 2000.35 ESP 10.51 USD $
    Cflab.org 123456.70 ESP 648.51 USD $
    Zinzun.com 33445.00 ESP 175.68 USD $
    > perl invest.pl Joe firms_money.txt
    Hi Joe, here's your report for Wed 13 Dec 2000 02:59:49 PM CET
    Andalia 4567987.00 ESP 24019.30 USD $
    Cantamornings 6669876.00 ESP 35071.41 USD $
    Chilindrina 2000.35 ESP 10.52 USD $
    Cflab.org 123456.70 ESP 649.16 USD $
    Cántaliping 46168.50 ESP 242.76 USD $
    Zinzun.com 33445.00 ESP 175.86 USD $
    Ántico 1000000.00 ESP 5258.18 USD $

Barring bugs such as the alphabetic ordering in Basque (whose alphabet does not include the letter 'c', so no word starting with 'c' can possibly go before 'd') and the dancing of the 'ch' from one place to another, this should look like something native speakers would feel comfortable with.

The first thing we put into this program is the Finance::Quote module (available from CPAN). The main use of this module is for stock quotes, but it is basically a front end for doing searches over the Yahoo! Finance servers. This also means that we must be on-line to take advantage of it. Each time we try to get a currency conversion rate, the request is forwarded to a user agent, which requests the rate from Yahoo!, manipulates it, and gives it back to us. It might also take a while, depending on the speed of the connection, and it could result in additional errors, depending on the state of the Yahoo! site. The program would work perfectly well without this currency conversion, but the point of including it here is to show another aspect of localization: local currencies.

The subsequent lines create a hash of arrays with the (theoretically unique) names of the customers, and set a different locale, currency, and greeting, according to the local language. If the name given is none of the above, it defaults to the es_ES locale and greetings. From that line on, the following actions take place: the locale is set, the file is read, and the system time is read and converted to local time.
The strftime function is used; this function formats time into a string using different options. In this case, %c instructs it to use the preferred date format for the local settings. This format will even translate weekday names, in addition to putting date and time in the locally favored arrangement; for instance, Americans prefer 12-hour clocks plus AM/PM, while many Europeans opt for 24-hour clocks.

A Finance::Quote object is then created, setting the timeout so that it will only wait 60 seconds for an answer. The object is used in the next line, where it is asked for the conversion rate between the Spanish peseta (ESP) and the local currency. If the quantities in the file were in euros, we could have used the es_ES@euro locale instead, and conversions would have been made from euros instead of from pesetas. Finally, the hash containing the firm names and investments is printed in three columns: name, investment in pesetas, and investment translated into the local currency. It should be noted, however, that the period (.) is used as the decimal separator in English-speaking countries, whilst the comma is used in Latin countries.

We could even mix and match several settings; for instance, in the case of our friend Alberto, who lives in Miami, we would have used en_US for currency and numbers, and es_MX (for México) or es_CU (for Cuba) for sorting:

    setlocale(LC_COLLATE,  'es_MX');
    setlocale(LC_NUMERIC,  'en_US');
    setlocale(LC_MONETARY, 'en_US');
    setlocale(LC_TIME,     'en_US');
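The effect of setting categories independently can be sketched in isolation (illustrative only; it assumes the en_US and es_MX locales are installed, and note that `use locale` is needed for LC_NUMERIC to affect Perl's own number formatting):

```perl
#!/usr/bin/perl
# mixed.pl - mixing locale categories for one user (illustrative sketch)
use warnings;
use strict;
use POSIX qw(setlocale strftime LC_NUMERIC LC_TIME);
use locale;    # make number formatting honor LC_NUMERIC

setlocale(LC_NUMERIC, 'en_US');    # US-style decimal point
setlocale(LC_TIME,    'es_MX');    # Mexican Spanish month and weekday names

printf "%.2f\n", 1234.56;                  # formatted per LC_NUMERIC
print strftime("%c", localtime), "\n";     # formatted per LC_TIME
```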
We will probably be better off using the es_US locale, which can be downloaded from one of the above-mentioned pages if it is not already on the system. However, there are several things that may cause problems. Every time a new language shows up, we have to go back to the original code, work out what was there, and modify it a little. What would happen if, all of a sudden, the user needed another phrase? Would we have to modify the whole hash table from scratch?

This is why we start to look for something better: a whole framework for the internationalization of programs. The idea is to have a way to store program messages in several languages, and retrieve them using a key, preferably the short version of the message. In reality there are two such frameworks:

❑   gettext – a GNU standard tool for internationalization, widely used by many GNU applications. The current version at the time of writing is 0.10.35. It is directly available from http://www.gnu.org/software/gettext/gettext.html and is supported by several tools, and an EMACS mode. For more information on this subject, refer to a book such as Professional Linux Programming from Wrox Press, ISBN 1861003013.

❑   Locale::Maketext – a complete and purely Perl-based solution. At the time of writing the latest version is 0.18. The documentation (available by typing > perldoc Locale::Maketext after installation) includes a synopsis. It is important to note that at present, Locale::Maketext is still in its early stages.
For the purposes of our site, we have decided upon gettext, since it has very good support in Perl. The first thing we have to do is to create a so-called Portable Object (PO) file, which contains the information necessary to translate messages. The main elements of PO files are the keywords msgid and msgstr, which contain the plain (in this case, Spanish) string and the translated string, respectively. A file can be created using a text editor (or EMACS plus PO mode) in this form:

    msgid "Hola"
    msgstr "Olá"
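A catalogue usually carries a little more than bare pairs. A slightly fuller sketch of such a file might look like this (the header entry and comment conventions shown are standard gettext practice, though they are not required by the Perl modules used below):

```po
# Galician messages for the investment site (illustrative sketch)
msgid ""
msgstr ""
"Content-Type: text/plain; charset=ISO-8859-1\n"

# a translator comment may precede each entry
msgid "Hola"
msgstr "Olá"

msgid "aquí está tu informe para"
msgstr "aquí está o seu relatório pra"
```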
A set of these messages in a file, along with other information such as comments and context, is called a catalogue, and corresponds to a domain, which is usually a language. For instance, the file above could be saved as GA.po, for the Galician language. There are also a couple of editors we can use to edit PO files. POedit, which is available from http://www.volny.cz/v.slavik/poedit/, has a simple-to-use interface based on the GTK library. As an alternative, if we favor the desktop environment, KBabel is available as a part of the KDE software development kit, which can be obtained from http://www.kde.org.

For the time being, we opt for the Perl way, and choose to use Locale::PO, an object-oriented class for creating PO files. The following program is written using this module:

    #!/usr/bin/perl
    # pocreate.pl
    use warnings;
    use strict;

    use Locale::PO;

    my $i;
    my %hash = (
        "EN" => ['Hello', "here's your report for"],
        "EU" => ['Kaixo', "egunkaria hemen hire"],
        "CA" => ['Hola',  "aquí está el teu inform per"],
        "GA" => ['Olá',   "aquí está o seu relatório pra"],
        "FR" => ['Salut', "voici votre raport pour"],
        "DE" => ['Hallo', "ist hier ihr Report für"],
        "IT" => ['Ciao',  "qui è il vostro rapporto per"],
    );
my @orig = ("Hola", "aquí está tu informe para");
For each element in this hash, a PO file is created. The file receives a reference to an array of Locale::PO objects, whose main elements are a msgid, containing the key for the message, and a msgstr, the translation of the message into the language represented by the file. After saving the file using the Locale::PO->save_file_fromarray function, it is converted to an internal format using the MsgFormat command. We are using the Spanish equivalents as keys, but any other language, such as English, could be used. Theoretically, these are the files that should be handled by the team of translators in charge of localizing a program.

    for (keys %hash) {
        my @po;
        for ($i = 0; $i <= $#orig; $i++) {
            $po[$i] = new Locale::PO();
            $po[$i]->msgid($orig[$i]);
            $po[$i]->msgstr($hash{$_}->[$i]);
        }
        Locale::PO->save_file_fromarray("$_.po", \@po);
        `MsgFormat . $_ < $_.po`;
    }
Running this program produces files similar to the following CA.po (for the Catalan language):

    msgid "Hola"
    msgstr "Hola"

    msgid "aquí está tu informe para"
    msgstr "aquí está el teu inform per"
These files can be used by any program compliant with gettext (that is, any program that uses the libgettext library). In the case of Perl, there are two modules that use them: Locale::PGetText and Locale::gettext. The main difference between them is that the first is a pure Perl implementation of gettext, using its own machine readable file formats, while the second needs to have gettext installed. In both cases, .po files have to be compiled to a machine readable file. We will opt for the first, but both are perfectly valid. Indeed, a program that belongs to the Locale::PGetText module is called in the following line of the above code: MsgFormat . $_ < $_.po;
The following program produces a database file, which can then be read by Locale::PGetText. The implementation of this module comes down to a few lines; it merely reads the DB file and ties it to a hash. The hash is accessed using a couple of functions, shown in the following example, which is an elaboration of the previous invest.pl program:

    #!/usr/bin/perl
    # pouse.pl
    use warnings;
    use strict;

    use Finance::Quote;
    use POSIX qw(localeconv setlocale LC_ALL LC_CTYPE strftime strcoll);
    use Locale::PGetText;    # this is new

    # in this case only the locale is needed, the rest is stored in PO files
    my %clientMap = (
        'Jordi'      => 'ca_ES',
        'Patxi'      => 'eu_ES',
        'Miguelanxo' => 'gl_ES',
        'Orlando'    => 'es_AR',
        'Arnaldo'    => 'es_CO',
        'Joe'        => 'en_US',
    );
    die "Usage: $0 <client> <file>\n" if $#ARGV < 1;
    my ($clientName, $fileName) = @ARGV;

    my $locale = $clientMap{$clientName} || 'es_ES';
    setlocale(LC_ALL, $locale);    # use settings
    my %investments;
    open(F, "<$fileName") or die "Cannot open $fileName: $!\n";
    while (<F>) {    # process the file
        chomp;
        my ($firm, $invest) = split;
        $investments{$firm} = $invest;
    }
    close F;
Local messages are retrieved from the PO files using functions from the Locale::PGetText module. The directory where the files are located is first set with setLocaleDir, and then the corresponding file is loaded.

    my ($lang) = ($locale =~ /^(\w{2})/);
    Locale::PGetText::setLocaleDir('/root/txt/properl');  # change to our own dir here
    Locale::PGetText::setLanguage(uc($lang));

    # Spanish for 'hello'
    my $hello = gettext("Hola");
    # Spanish for 'here is your report for' - the key must match the msgid
    # exactly, including case
    my $notice = gettext("aquí está tu informe para");
    # The rest is the same as invest.pl
    # Set up currency conversion
    my $q = Finance::Quote->new;
    $q->timeout(60);

    my $lconv = localeconv();
    chop($lconv->{int_curr_symbol});
    my $conversion_rate = $q->currency("ESP", $lconv->{int_curr_symbol}) || 1.0;
This will output exactly the same as the invest.pl example, the only difference being that, in this case, we did not need to code the translated messages explicitly within the program. Adding new languages and messages is just a matter of adding the name of a person, the locale, and the currency they use; the program selects the relevant file itself, extracting the language code from the first two letters of the locale name. Note that the directory /root/txt/properl should be replaced with the directory in which our own .po files reside.

The main change from the previous version is around the lines where the Locale::PGetText functions are used: the two-letter code for the language is extracted from the locale name and converted to upper case; this code is used to set the language, using the function setLanguage. This in turn loads the corresponding database file; messages are then obtained from the DB, and used as they were in the previous example. This is not a huge improvement, but at least we can now build, bit by bit, a multilingual dictionary of words and phrases that can be used and reused in all of our applications, with a common (and simple) interface.

We soon realize that looking for an archetypal name in each place and deducing the user's mother tongue (and everything else) from it is not very practical. Besides, it is not very efficient to have to translate every single message by hand. What is needed is a way of automating the process, for example, by using the Internet to do the job. Suppose now that we want to go one step further and design a web site (which we will call El Escalón), enabling users all over the world to log in using their own specific username, and then to type a short report.
The following program implements these ideas (note that it will not work on Windows, as it uses the WWW::Babelfish module, which is only available on UNIX systems):

    #!/usr/bin/perl -T
    # babelfish.pl
    use warnings;
    use strict;

    use CGI qw(:standard *table);
    use Locale::Language;
    use WWW::Babelfish;

    print header, start_html(-bgcolor => 'white', -title => "El Escalón");
We find out the language from the TLD, or by using any of the other tricks contained in the grokLanguage subroutine. If nothing works, we can always query the user.

    my ($TLD) = (remote_host() =~ /\.(\D+)$/);
    print start_form;
    # $language holds the English name of the language
    my $language = grokLanguage($TLD)
        || print "Your language", textfield('language');
We now ask the user to type in their name and report; messages are translated using the translate subroutine.

    if ($language) {
        # language has been set
        print start_table({bgcolor => 'lightblue', cellspacing => 0,
                           cellpadding => 5}),
              Tr(td([translate("What is your code name?", $language),
                     textfield('codename')])), "\n",
              Tr(td([translate('Type in your report', $language),
                     textfield('report')])), "\n",
              Tr(td([translate('What is your assessment of the situation', $language),
                     popup_menu('situation',
                                [translate('Could not be worse', $language),
                                 translate('Pretty good', $language),
                                 translate('Could be worse', $language)])])), "\n",
              end_table;
    }
    print p({align => 'center'}, submit()), end_form, end_html;
For a fuller description of the methods supported by the WWW::Babelfish module, we can issue the usual command:

    > perldoc WWW::Babelfish

In the next subroutine, we use one of these methods, hiding the call to the WWW::Babelfish module. The arguments it takes are the message in English, and the name of the language into which the message is to be translated.

    sub translate {
        my ($toTranslate, $language) = @_;
        my $tr;
        eval { $tr = new WWW::Babelfish(); };
        if (!$tr) {
            print("Error connecting to babelfish: $@");
            return $toTranslate;
        }
        my $translated = $tr->translate('source'      => 'English',
                                        'destination' => $language,
                                        'text'        => $toTranslate);
        print $language, " ", $tr->error if $tr->error;
        $translated = $toTranslate unless defined($translated);
        return $translated;
    }
This subroutine tries to find out the language of the client using several tricks, which will be explained later.

    sub grokLanguage {
        # returns the name of the language, in English, if possible
        my $TLD = shift;
        return param('language') if (param('language'));    # try predefined
        my ($ua) = user_agent() =~ /\[(\w+)\]/;    # and now from the user agent
        return code2language($ua) if ($ua);
        # or from the top-level domain
        return code2language($TLD) if (defined($TLD));
        return;
    }
This file is rather long, but it performs many functions. First, using the subroutine grokLanguage, it tries to find out what language we speak, deducing it via the following channels (in order of precedence):

❑   The declaration of the user; for example, by calling the script using something like http://myhost/cgi-bin/babelfish.pl?language=German or http://myhost/cgi-bin/babelfish.pl?language=Spanish.

❑   The user_agent string, sent from the browser (at least, from Netscape Navigator and MS Internet Explorer) to the server, which in some cases takes this form: Mozilla/4.75 [es] (X11; U; Linux 2.2.16-22smp i686).
The [es] means it is a client compiled for the Spanish language. Not all languages are represented, and this might not be valid for browsers outside the Mozilla realm, but if this information is there, we can take advantage of it. This code is translated to a language name using code2language, which is part of the Locale::Language module.

The subroutine then tries to deduce our language from the top-level domain of our computer. Our server receives, along with the name of the file the client requires, a wealth of information from the client; for example, it receives the host name of the client's computer. In some cases the host name will not be resolved, that is, converted from its numerical address, and we will only get an address of the form 154.67.33.3, from which it would be difficult to ascertain the native language spoken in the area of the calling computer; in many other cases, generic TLDs such as .com or .net are used. Sometimes, however, the computer is revealed as something like somecomputer.somedomain.it, which means that the calling computer is most probably in Italy. In this case we use the Perl module Locale::Language, which comes along with Locale::Country, to deduce the language from the two-letter TLD code. This is a good hunch at best, since in most countries more than one language is spoken, but at least we can use it as a fallback method if the other two fail.

The code2language function in this module understands most codes, and translates them to the English name of the language that is spoken. However, there is a small caveat here: the language code need not be the same as the TLD code; for instance, UK is the top-level domain for the United Kingdom, but it is the code for the Ukrainian language. However, we can expect that most English clients will be caught by one of the other methods. The right way to proceed would be to convert the TLD to a country name, and then to one or several language codes or names.
This is not practical, however, so we have to make do with the above method, which will work in at least some cases.
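A more reliable channel, not used by grokLanguage above, is the Accept-Language header that most browsers send with every request. A sketch of reading it from a CGI script follows (the header may be absent or list several languages, so the fallbacks above are still useful):

```perl
#!/usr/bin/perl
# acceptlang.pl - deduce the language from Accept-Language (illustrative sketch)
use warnings;
use strict;
use CGI qw(:standard);
use Locale::Language;

# e.g. "es-ES,es;q=0.9,en;q=0.8" - take the first two-letter code we find
my $header = http('Accept-Language') || '';
my ($code) = $header =~ /([a-z]{2})/i;
my $language = $code ? code2language(lc $code) : undef;
print $language ? "Language: $language\n" : "No Accept-Language header\n";
```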
Once the language is known (or guessed), it is used to translate messages from English to several other languages (the translate subroutine performs this task). It uses the WWW::Babelfish module (available on CPAN), which is a front end to the translating site at http://babelfish.altavista.com/. This site takes phrases in several languages, translates them to English, and back again. All this implies two requirements:

❑   The user must be on-line; each call to translate means a call to Babelfish.

❑   The language requiring translation must be one of those supported by Babelfish.
In summary, the program can determine which language the user is accustomed to speaking. Of course, this is not the best solution in terms of speed or completeness for a multilingual web site; if we receive more than a couple of requests an hour, we will be placing a heavy load on Babelfish, and our access may be cut off. A solution based on gettext can suffice for a few languages and a few messages, but it will not scale up particularly well. The best solution is to keep messages in a relational database, indexed by message key and language, and have a certified translator do all the translation. Another possible solution on Windows (and other operating systems with shared libraries) is to use one file for each language, and load the file that best suits the language of the user at run-time.
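One easy way to lighten the load on a remote translator like Babelfish is to cache translations on disk so that each phrase is only fetched once per language. Here is a sketch using a tied DBM hash; the cache filename and the stand-in remote_translate subroutine are our own inventions, not part of WWW::Babelfish:

```perl
#!/usr/bin/perl
# trcache.pl - cache translations so each phrase is fetched once (sketch)
use warnings;
use strict;
use Fcntl;
use SDBM_File;

tie my %cache, 'SDBM_File', 'trcache', O_RDWR | O_CREAT, 0644
    or die "Cannot tie cache: $!\n";

sub cached_translate {
    my ($text, $language) = @_;
    my $key = "$language\0$text";
    # only call the (slow, remote) translator on a cache miss
    $cache{$key} = remote_translate($text, $language)
        unless exists $cache{$key};
    return $cache{$key};
}

# stand-in for the real WWW::Babelfish call, so the sketch is self-contained
sub remote_translate {
    my ($text, $language) = @_;
    return "[$language] $text";
}

print cached_translate("Hello", "Spanish"), "\n";
untie %cache;
```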
It's About Time: Time Zones

It goes without saying that different areas of the globe fall into different time zones, and users from different countries will log in at different local times. To take this into account, we can create a Perl program to remind us of the local time. Just by inputting the name of the city and the general zone we are in, our computer returns the local time:

    #!/usr/bin/perl
    # zones.pl
    use warnings;
    use strict;

    use Time::Zone;
    use POSIX;

    die "Usage = $0 City Zone \n" if $#ARGV < 1;
    my ($city, $zone) = @ARGV;
    $ENV{TZ} = "$zone/$city";
    tzset();
The following POSIX call returns the time zone, and the DST (Daylight Saving Time) zone, the computer is set to:

    my ($thisTZ, $thisDST) = tzname();
    my $offset = tz_offset($thisTZ);
    my ($sec, $min, $hour) = localtime(time + $offset);
    print qq(You are in zone $thisTZ
    Difference with respect to GMT is ), $offset/3600, qq( hours
    And local time is $hour hours $min minutes $sec seconds
    );
When issued with the following example commands, the program outputs something similar to:

    > perl zones.pl Cocos Indian
    You are in zone CCT
    Difference with respect to GMT is 8 hours
    And local time is 14 hours 34 minutes 38 seconds

    > perl zones.pl Tokyo Asia
    You are in zone JST
    Difference with respect to GMT is 9 hours
    And local time is 18 hours 4 minutes 42 seconds

    > perl zones.pl Tegucigalpa America
    You are in zone CST
    Difference with respect to GMT is -6 hours
    And local time is 12 hours 4 minutes 48 seconds

Time zone notation is hairy stuff. There are at least three ways of denoting it:

❑   Using a (normally) three-letter code, such as GMT (Greenwich Mean Time), CET (Central European Time), or JST (Japan Standard Time). A whole list can be found in the file /usr/share/zoneinfo/zone.tab (on a Red Hat >= 6.x system; we might have to look for it in a different place, or not find it at all, on a different system). The Locale::Zonetab module also provides a list, and comes with examples. Here is a list of the codes for all time-lines. These are not exactly the same as time zones, as a country might choose not to follow the time-line it falls into:
    Change with respect to GMT    Time zone denomination
    -11                           NT
    -10                           HST
    -9                            YST
    -8                            PST
    -7                            MST
    -6                            CST
    -5                            EST
    -4                            EWT
    -3                            ADT
    -2                            AT
    -1                            WAT
    0                             GMT
    1                             CET
    2                             EET
    3                             BT
    4                             ZP4
    5                             ZP5
    6                             ZP6
    7                             (No denomination)
    8                             WST
    9                             JST
    10                            EAST
    11                            EADT
    12                            NZT
❑   Using a delta with respect to GMT; GMT+1, for instance, or GMT-12.

❑   Using the name of a principal city and a zone; note that zones are not exactly continents. There are ten zones: Europe, Asia, America, Antarctica, Australia, Pacific, Indian, Atlantic, Africa, and Arctic. For instance, the time zone a Spaniard is in would be Europe/Madrid. This denomination has the advantage of also including Daylight Saving Time information (more on this later).

❑   There are several other notations: A to Z (used by the US military), national notations (for instance, in the US), and a whole lot more. There is plenty of information in the Date::Manip pod page.
The whole time zone aspect is a bit more complicated, since local time changes according to Daylight Saving Time, a French invention to make people go to work by night and come back in broad daylight. We therefore have two options: go back to old times, when people worked by the sun, or use local rules by asking the users, which is the approach most operating systems take. Most operating systems include information for all possible time zones; for instance, Linux keeps local time and time zone information in the /usr/share/zoneinfo directory, which has been compiled using the zic compiler from time zone rule files.

The bottom line is that we really have to go out of our way, and ask for a lot of local information, to find out what the local time zone is. Most people just ask, or take the time from the local computer. Ideally, a GPS-equipped computer would have a lookup table allowing it to know the local time for any co-ordinates in the world. In our case, we solve the problem the fast way: in most cases, knowing the name of the most populous city in the surroundings, and the general zone out of the ten mentioned above, is all we need. From this information, the three-letter code of the time zone, along with DST information, is obtained using the POSIX functions tzset and tzname. The first sets the time zone to the one contained in the TZ environment variable, and the second deduces the time zone and DST information from it. This information is later used to compute the offset in seconds with respect to GMT, which is then added to the actual time to compute the new time. Finally, all the above information is printed out. We will skip formatting it in the way required by the locale, since we have seen that already in a previous section; besides, city and general zone might not be enough information to find out which locale the user prefers.
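The TZ round trip described above can be seen in isolation with a short sketch (illustrative; each zone name must exist in the system's zoneinfo database):

```perl
#!/usr/bin/perl
# tzdemo.pl - effect of changing TZ on localtime (illustrative sketch)
use warnings;
use strict;
use POSIX qw(tzset tzname strftime);

foreach my $tz ('Europe/Madrid', 'Asia/Tokyo', 'America/Tegucigalpa') {
    local $ENV{TZ} = $tz;    # change the zone for this iteration only
    tzset();                 # make the C library re-read TZ
    my ($std, $dst) = tzname();
    print "$tz ($std): ", strftime("%H:%M:%S", localtime), "\n";
}
```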
Looks Like a Foreign Language

Continuing with our scenario, we were not too convinced by the setup of our reporting web site. There were many reports coming in, from many well-known (and not so well-known) domains, and we could not find out which language they were written in. Even though we cannot translate between these languages, we would like at least to know which language is being used. The following script goes some way to remedying this situation:

    #!/usr/bin/perl -T
    # langident.pl
    use warnings;    # note the lack of use strict here

    use CGI qw(:standard);
    use Locale::Language;
    use Lingua::Ident;

    print header, start_html;
    if (!param('phrase')) {
        print start_form, "->", textfield('phrase'), "<-", end_form;
    } else {
        my @dataFiles = qw(data.de data.eu data.en data.fr data.it
                           data.mt data.nl data.ko);
        for (@dataFiles) {
            if (! -e $_) {
                print "Will not work: File $_ does not exist";
                exit;
            }
        }
        print em(param('phrase')), br, "Looks ";
        my $i = new Lingua::Ident(@dataFiles);
        my $lang = $i->identify(param('phrase'));
        my ($langCode) = $lang =~ /^(\w{2})/;
        print code2language($langCode), " to me!";
    }
    print end_html;
This program prints a text field, which invites the user to key in a phrase, in a language-independent way. It then takes the input and tries to find out what language it is written in, using the Lingua::Ident module. To do that, a prior step is needed:

    > trainlid xx < sample.xx > data.xx

This has to be done for each sample.* file we have. Some samples are supplied with the Lingua::Ident module, but we also use three samples extracted from appropriate web pages, namely Maltese, Basque, and Dutch (sample.mt, sample.eu, and sample.nl). The trainlid program, which comes with the module, creates a data.* file of letter combination frequencies. As an argument, it takes the name of the locale the language refers to; as input, a file with a sample of the language; it then outputs the frequency file.
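The statistical idea behind this kind of identification can be sketched in a few lines: score each candidate language by how often its characteristic letter combinations appear in the input. This is a toy illustration with hand-picked trigrams, not the module's actual trained profiles:

```perl
#!/usr/bin/perl
# ngramid.pl - toy trigram-based language guesser (illustrative only)
use warnings;
use strict;

# tiny hand-picked trigram lists; a real profile is trained from samples
my %profiles = (
    English => [' th', 'the', 'ing', 'and'],
    Spanish => ['que', 'ión', ' de', 'los'],
);

sub guess_language {
    my $text = lc shift;
    my ($best, $bestScore) = ('unknown', 0);
    while (my ($lang, $trigrams) = each %profiles) {
        my $score = 0;
        foreach my $tri (@$trigrams) {
            # count occurrences of this trigram in the text
            $score++ while $text =~ /\Q$tri\E/g;
        }
        ($best, $bestScore) = ($lang, $score) if $score > $bestScore;
    }
    return $best;
}

print guess_language("the cats and the dogs are singing"), "\n";
print guess_language("los informes que llegaron de madrid"), "\n";
```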
The data.* files are then used to initialize the Lingua::Ident module. This enables it to identify as many languages as it has been given data files for. Unfortunately, the identification is statistical, so it is not guaranteed to pick the correct language; Dutch and German are often mistaken for each other. Still, it works for most languages, as can be seen in the following figures:
Conjugating in Portuguese

A rather striking way of learning a foreign language would be to use Perl. Surprisingly, a very useful module called Lingua::PT::Conjugate has been written for the purpose of conjugating Portuguese verbs. It includes two utilities, conjug and treinar: the first for conjugating, and the second for interactive training with Portuguese verbs. For instance, we could type:

    > conjug i ficar pres
    ficar : pres fico ficas fica ficamos ficais ficam

This is the equivalent of the verb 'to stay' in English, conjugated in the present tense, from the first person singular to the third person plural. It does not use ISO-8859-1 encoding (due to the i in conjug i). The following script utilizes this module, although it should be noted that it is not Windows compliant, because the WWW::Babelfish module cannot be installed on Windows systems:

    #!/usr/bin/perl -T
    # conjug.pl
    use warnings;    # note the lack of use strict here

    use CGI qw(:standard *table);
    use WWW::Babelfish;
    use Lingua::PT::Conjugate;

    print header, start_html(-bgcolor => 'white', -title => "Conjugando");

    if (!param('verb')) {
        my %tenses = (pres  => 'Present',
                      perf  => 'Perfect',
                      imp   => 'Imperfect',
                      fut   => 'Future',
                      mdp   => 'Pluscuamperfect',
                      cpres => 'Subjunctive Present',
                      cimp  => 'Subjunctive Imperfect',
                      cfut  => 'Subjunctive Future',
                      ivo   => 'Imperative',
                      pp    => 'Past Participle',
                      grd   => 'Gerund');
        my @labels = keys %tenses;
        my %persons = (1 => 'I', 2 => 'You', 3 => 'He/She/It',
                       4 => 'We', 5 => 'You', 6 => 'They');
        my @labelsP = keys %persons;
After setting up data for the combo boxes (popup_menu), here comes the program itself:

    print start_table({bgcolor => 'lightblue', cellspacing => 0,
                       cellpadding => 5}),
        start_form,
        Tr(td(["English Verb", textfield('verb')])), "\n",
        Tr(td(['Tense', popup_menu('tense', \@labels, 'pres', \%tenses)])), "\n",
        Tr(td(['Person', popup_menu('person', \@labelsP, 1, \%persons)])),
        Tr(td({colspan => 2, align => 'center'}, submit())),
        end_form, end_table;
This is the part of the script that processes the form, translates the verb, and conjugates it:

} else {
    print h1("Translating ", param('verb'), " to ", param('tense'),
             " in person ", param('person')),
          p(em(conjug(translate(param('verb')), param('tense'),
                      param('person'))));
}
print end_html;
sub translate {
    my ($toTranslate) = @_;
    my $tr = new WWW::Babelfish() || return $toTranslate;
    my $translated = $tr->translate('source' => 'English',
                                    'destination' => 'Portuguese',
                                    'text' => "$toTranslate");
    print $tr->error if $tr->error;
    $translated = $toTranslate unless defined($translated);
    return $translated;
}
In this script, our knowledge of the WWW::Babelfish module is used to translate the gerund (that is, the English form of the verb ending in '-ing') of the verb from English to Portuguese, and then conjugate the verb in Portuguese. Verbs in Portuguese are much like those of other Romance languages: they have to be conjugated, and have different forms for different tenses and different persons, not just a couple of forms as in English. Thus, we have to state the person (first, second, or third, singular or plural) and the tense. The script follows two paths, depending on whether it is called for the first time or called with a verb. If it is called directly, it prints a form that queries for the verb, which should be entered in gerund form, since this seems to be the way of getting infinitives out of Babelfish. The tense is then requested; this takes the English form as a parameter, and the script outputs the Portuguese form (for instance, ivo as in imperativo, which means, obviously enough, imperative). Once the verb has been set, the script, using a version of the translate subroutine locked to Portuguese, translates the verb into Portuguese, then conjugates it by calling the Lingua::PT::Conjugate::conjug method. This method takes the verb, tense, and person as arguments, and returns the verb form. For instance, typing in driving and selecting the first person plural in the future tense (we will drive) will be processed by the program as:
The only problem is that no such module exists for other languages, though if we want to work with other languages, CPAN includes several 'national' modules that take national peculiarities into account. The two main groups are the No::* set and the Cz::* set. The Cz::* set includes Cz::Cstocs, Cz::Sort, and Cz::Time. These modules exist essentially because the locale settings for the Czech and Slovak languages seem to be broken, so they include their own sorting and time display routines. Cz::Cstocs converts between il2 (ISO Latin 2) and ASCII. These modules will probably be made obsolete by better locale modules and Unicode/UTF-8 encoding. The No::* set also includes modules for dates and sorting in Norway, No::Dato and No::Sort respectively. In addition, several modules that might be useful for e-commerce applications are available:

❑ No::KontoNr – checks Norwegian bank account numbers.

❑ No::PersonNr – checks Norwegian personal identification (social security) numbers.

❑ No::Telenor – computes phone call prices for different telephone companies, times, and days of the week.
The 'Lingua::*' Modules

We can extend our login/report based web site so that users can be known by a pseudonym or nickname (this may not seem very useful, but it is not an uncommon feature on many sites these days). The major new feature, though, is that users should now be able to type their report into the site, and the site can give them feedback on their report, judging it on its readability rating. The following code implements these new ideas (note that the Lingua::EN::Fathom module cannot be installed on a Windows machine as yet):

#!/usr/bin/perl -T
# report.pl
use warnings;
# lack of use strict
use CGI qw(:standard *table *ul);
use Lingua::EN::Syllable;
use Lingua::EN::Nickname;
use Lingua::EN::Fathom;
print header, start_html(-bgcolor => white, -title => "Reporting");
if (!param(report)) {
    print start_table({bgcolor => lightblue, cellspacing => 0,
                       cellpadding => 5}),
        start_form,
        Tr(td(["Name", textfield('name')])), "\n",
        Tr(td(["Report", textarea('report')])), "\n",
        Tr(td({colspan => 2, align => center}, submit())),
        end_form, end_table;
} else {
    my $text = new Lingua::EN::Fathom;
    $text->analyse_block(param(report));
    my $numSyll = syllable(param(report));
    print h1("Received report by ", nickroot(param(name)), ":"),
        p(em(param(report))),
        p("Which has received the following ratings"),
        start_ul,
        li("Fog = ", $text->fog),
        li("Kincaid = ", $text->kincaid),
        li("Flesch = ", $text->flesch),
        end_ul,
        p("And you will be paid ", $numSyll*0.25,
          "\$ for $numSyll syllables");
}
print end_html;
This script manages to use all three Lingua modules mentioned above (we may have some problems building the Lingua::EN::Fathom module on Windows, but we can try to build it on Linux and then copy the resulting files over to Windows). The main topic of the Lingua modules is language (chiefly English); there are modules for syntactic parsing, recognizing gender, and so forth. There are also a few modules that deal with Russian (Lingua::RU::Charset), and the Portuguese module mentioned above. Hopefully, this will be extended in the future, but for the moment, we have to make do with the many modules that deal with the English language. In this case, we are using modules that count syllables (Lingua::EN::Syllable), and find the real name that corresponds to a nickname such as 'Joe' or 'Jimmy' (Lingua::EN::Nickname). Most curious of all, Lingua::EN::Fathom analyzes text for readability, assigning it three standard indices called Kincaid, Flesch, and Fog. Kincaid returns the Flesch-Kincaid grade level for the text. Flesch grades the text from 0 to 100 according to its readability, a score between 60 and 70 being ideal. Fog returns the (theoretical) mental age needed to understand the text. The structure of the script is very similar to the others in this chapter. It checks whether the CGI parameter report is present. If it is not, it prints the form, requesting the nickname and the report. If it is, the report is processed: a Fathom object is created to analyze the text, the syllables are counted, and the name corresponding to the nickname is found using nickroot, in the following lines (taken from the above code):

$text->analyse_block(param(report));
my $numSyll = syllable(param(report));
print h1("Received report by ", nickroot(param(name)), ":"),
    p(em(param(report))),
The rest is printing, and calling the fog, kincaid, and flesch functions to get the respective indices. The expected style of result is shown in the following figure:
This does not say much about the linguistic reporting skills of 'Tamara'. A fog index of 21 is way out (18 would already be pretty difficult to understand), and the Kincaid index is also quite far from the optimal (between 7 and 8). A small piece of advice here: nickroot does not appear in the Nickname POD; rootname is mentioned instead. The documentation (version 1.1) is badly out of sync with the current release.
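To make the ratings more concrete, the Flesch reading ease score can be computed directly from the standard formula. The word, sentence, and syllable counts used here are invented for illustration; Lingua::EN::Fathom derives them from the text itself:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# The standard Flesch reading ease formula, computed from raw counts.
# Higher is easier to read; 60-70 is considered ideal.
sub flesch_score {
    my ($words, $sentences, $syllables) = @_;
    return 206.835
         - 1.015 * ($words / $sentences)      # average sentence length
         - 84.6  * ($syllables / $words);     # average syllables per word
}

# 100 words in 8 sentences with 130 syllables: fairly easy prose
printf "%.1f\n", flesch_score(100, 8, 130);
```

A long-winded sentence structure or polysyllabic vocabulary drags the score down, which is exactly what the report form above penalizes.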
Spelling and Stemming

A simple way to improve the readability ratings of the reports submitted by users would be to check the spelling for errors. The previous program has therefore been modified to check English spelling and correct it, if possible:

This code is not Windows compliant. Although the Lingua::Ispell module can be installed on Windows, the code will not work, because the Windows spellchecker does not work in the same way as the ispell program on UNIX.

#!/usr/bin/perl -T
# spell.pl
use warnings;
# lack of use strict
use CGI qw(:standard *table *ul);
use Lingua::EN::Syllable;
use Lingua::EN::Nickname;
use Lingua::EN::Fathom;
use Lingua::Ispell qw(spellcheck);
print header, start_html(-bgcolor => white, -title => "Reporting");
if (!param(report)) {
    print start_table({bgcolor => 'lightblue', cellspacing => 0,
                       cellpadding => 5}),
        start_form,
        Tr(td(["Name", textfield('name')])), "\n",
        Tr(td(["Report", textarea('report')])), "\n",
        Tr(td({colspan => 2, align => 'center'}, submit())),
        end_form, end_table;
} else {
    my $text = new Lingua::EN::Fathom;
    my $report = param('report');
    $text->analyse_block($report);
    my $numSyll = syllable($report);
    $report =~ s/\n//g;    # needed by spellcheck
Most of the new code is below. The correct path to ispell is set, and then the text is spell checked:

    $Lingua::Ispell::path = '/usr/bin/ispell';
    my @checks = spellcheck($report);
    my %errors;
    for (@checks) {
        $errors{$_->{term}} = $_->{misses} if ($_->{type} eq 'miss');
    }
    print h1("Received report by ", nickroot(param(name)), ":"),
        p(em(param(report))),
        p("Which has received the following ratings"),
        start_ul,
        li("Fog = ", $text->fog),
        li("Kincaid = ", $text->kincaid),
        li("Flesch = ", $text->flesch),
        end_ul,
        p("And had the following misspellings "),
        start_ul;
    for (keys %errors) {
        print li($_, " Suggestions :", join(" ", @{$errors{$_}}));
    }
    print end_ul,
        p("And you will be paid ", $numSyll*0.25,
          "\$ for $numSyll syllables ");
}
print end_html;
As we can see, this program is quite similar to the previous example. The path to the ispell program is set in the following line:

$Lingua::Ispell::path = '/usr/bin/ispell';
and spellcheck is called in the following line:

my @checks = spellcheck($report);
This function returns an array of references to hashes, each of which has several keys describing a word in the input string: term contains the original word, and type is the flag set by ispell. We should pay attention when type is none or miss. In the latter case, an extra key, misses, can be used; it contains an array of possible correct spellings of the word. We only flag near misses, place them in the %errors hash, and print them afterwards. Take into account that the ispell utility seems to be a bit dated, and may not be widely available. Indeed, it has been replaced by aspell in current RedHat Linux distributions, but there is no difference in results, since aspell includes an ispell wrapper in the package. If we type in an English name and a report composed of the following phrases, the program will return something like this:
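To make the returned structure concrete, here is a minimal self-contained sketch that filters a hand-built list shaped like spellcheck's result; the sample words and suggestions are invented for illustration:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hand-built stand-in for the structure returned by
# Lingua::Ispell::spellcheck: one hash reference per flagged word, with
# 'term', 'type', and (for near misses) a 'misses' array of suggestions.
my @checks = (
    { term => 'reprot', type => 'miss', misses => ['report', 'reprint'] },
    { term => 'Perl',   type => 'none' },
);

# Keep only the near misses, keyed by the misspelled word
my %errors;
for my $check (@checks) {
    $errors{$check->{term}} = $check->{misses}
        if $check->{type} eq 'miss';
}

for my $term (sort keys %errors) {
    print "$term: suggestions: ", join(' ', @{$errors{$term}}), "\n";
}
```

Words flagged as none (unknown, with no suggestions) are deliberately skipped, exactly as in the spell.pl loop above.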
Perl can play other word games besides the spell check. For instance, in many information retrieval applications, it is sometimes useful to know the root of a word. Take the case where somebody is searching for the word 'birds'. They may also be looking for 'birdies', 'bird', or 'bird's nest'. The operation that extracts the root of a word is called stemming, and it is a language-dependent operation. Incidentally, this operation is very easy in English, with conjugations that have only a few forms, no declensions, and no male/female variation in adjectives. Difficulties may arise in other languages such as Basque, which has 14 different declensions or cases, and 3 different numbers: singular, plural, and mugagabe (indefinite or collective). There is, however, a Lingua::Stem module that can help (in the English case at least):

#!/usr/bin/perl
# stem.pl
use warnings;
use strict;
use Lingua::Stem qw(stem);

my @words = qw(birds birdie flying flown worked working);
my $stemmer = Lingua::Stem->new(-locale => 'EN-US');
my $stems = $stemmer->stem(@words);
print "Word \tStem \n";
while (@$stems) {
    print pop @words, "\t", pop @$stems, "\n";
}
This returns something like this:

> perl stem.pl
Word    Stem
working work
worked  work
flown   flown
flying  fly
birdie  birdi
birds   bird

Well, it is not perfect. It cannot spot irregular verbs, or familiar diminutives such as 'birdie'. However, it would be useful in a search engine looking for all strings in a text containing the stem, so that more matches would be returned. Of course, this is not the ultimate information retrieval pre-processing system; synonyms, numbers, and names will have to be taken into account, as well as the weight of words within the text, but in any case, stemming is the first step in many information retrieval applications.
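As a sketch of how stemming serves a search engine, the following self-contained snippet builds an inverted index keyed by stem. A crude suffix-stripping rule stands in for Lingua::Stem here, purely for illustration:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Crude stand-in stemmer, for illustration only; a real application
# would call Lingua::Stem instead.
sub crude_stem {
    my $word = lc shift;
    $word =~ s/(?:ing|ed|s)$//;    # strip one common English suffix
    return $word;
}

# Build an inverted index mapping each stem to the words that share it
my %index;
for my $word (qw(birds bird worked working work)) {
    push @{ $index{ crude_stem($word) } }, $word;
}

# A query for 'worked' now finds every word with the same stem
my @matches = @{ $index{ crude_stem('worked') } };
print "@matches\n";
```

Querying on the stem rather than the literal word is what lets a search for 'worked' also return 'working' and 'work'.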
Writing Multilingual Web Pages

In order to give an example of a multilingual web page, we are going to create two forms. The first allows users to input their name and the language they understand. User information will be stored in cookies by the first script for use in the second script. The second script will then display, amongst other things, a greeting in the user's chosen language, and will also allow users to type in text and have it translated into Spanish.

#!/usr/bin/perl -T
# form.pl
use warnings;
# lack of use strict;
use CGI qw(:standard *table);
use Locale::Language;
use Locale::Country;

if (!param('name')) {
    print header,
        start_html(-bgcolor => 'white', -title => "ID card"),
        start_table({bgcolor => 'lightblue', cellspacing => 0,
                     cellpadding => 5}),
        start_form,
        Tr(td("Name"), td({colspan => 3}, textfield('name'))), "\n",
        Tr(td(["Language you understand",
               radio_group('lang', [English, French, Spanish])])), "\n",
        Tr(td({colspan => 4, align => 'center'}, submit())),
        end_form, end_table;
} else {
    # Process form and set cookie
    my @cookies;
    push(@cookies, cookie(-name => 'name', -value => param('name'),
                          -expires => '+3M', -path => '/cgi-bin'));
    push(@cookies, cookie(-name => 'lang',
                          -value => language2code(param('lang')),
                          -expires => '+3M', -path => '/cgi-bin'));
    print header(-cookie => \@cookies),
        start_html(-bgcolor => 'white', -title => "ID card accepted"),
        "Application accepted \n";
}
print end_html;
This example is pretty straightforward: by default, it prints a form, and if the name field is already filled in, it is processed. The language is also converted to a language code: en, fr, or es. The path in the cookie function will have to be changed to whatever path our script resides in. The expiry date can also be changed to suit other needs; in our case it was set to 3 months. Next, we will use a comma-separated flatfile called messages.txt to store our multilingual messages. Our application will then search through it, displaying the correct language on our report form. Of course, this depends on the language that the user has indicated they understand. All messages are normalized, with no capitals, so that capitalization can be handled by the application:

hello, hola, salut
please, por favor, s'il vous plait
here is, aquí está, voici
welcome, bienvenido, bienvenu
today's date is, la fecha del día es, aujourd'hui est le
your name, su nombre, votre nom
report, informe, rapport
send, enviar, envoyer
received, recibido, reçu
by, por, par
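Rather than scanning the file on every lookup, the flatfile can also be parsed once into a hash of hashes. This is a sketch assuming the English/Spanish/French column order shown above; the here-document stands in for reading messages.txt:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# A few lines in the messages.txt format, inlined for illustration
my $table = <<'END';
hello, hola, salut
report, informe, rapport
send, enviar, envoyer
END

# Parse the comma-separated columns into %messages{key}{lang};
# the English column doubles as the lookup key.
my %messages;
my @langs = qw(en es fr);
for my $line (split /\n/, $table) {
    my @cols = split /,\s*/, $line;
    @{ $messages{ $cols[0] } }{ @langs } = @cols;
}

print ucfirst($messages{'report'}{'fr'}), "\n";
```

Every lookup is then a hash access, at the cost of holding the whole message table in memory; for a file of this size either approach is fine.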
Since languages are taken from cookies, and we set the cookies ourselves, we would not expect this to give an error (but there will always be invalid input where one expects valid input). Again note that the script will not work on Windows, since the WWW::Babelfish module cannot be installed:

#!/usr/bin/perl -T
# babel.pl
use warnings;
# lack of strict
use CGI qw(:standard *table);
use locale;
use IO::File;
use POSIX;
use WWW::Babelfish;
use Locale::Language;
if (cookie('name')) {
    my $name = cookie('name');
    my $language = cookie('lang');
    print header, start_html(-bgcolor => 'white', -title => "Reporting");
    setlocale(LC_TIME, cookie('locale'));
    # If it does not exist, defaults to local
So far, common processing is used for both showing and processing the form. Now, if the report CGI parameter is not present, the script will show the form. If it is present, it will process it.

    if (!param('report')) {
        # Show form
        my ($sec, $min, $hour, $mday, $mon, $year, $wday) = localtime(time);
        print h2(ucfirst(getMessage('hello', $language)), " $name,",
                 getMessage("today's date is", $language),
                 strftime("%c", $sec, $min, $hour, $mday, $mon, $year, $wday),
                 "\n");
        print start_table({bgcolor => 'lightblue', cellspacing => 0,
                           cellpadding => 5}),
            start_form,
            Tr(td([ucfirst(getMessage('report', $language)),
                   textarea(-rows => 10, -columns => 50,
                            -name => 'report')])), "\n",
            Tr(td({colspan => 2, align => 'center'},
                  submit(ucfirst(getMessage('send', $language))))),
            end_form, end_table;
If the form has been completed, we have the parameter report. We then have to process it:

    } else {
        # process it and show it in English
        print h2(ucfirst(getMessage('received', $language)),
                 getMessage('report', $language),
                 getMessage('by', $language), " $name"),
            p(b("Original"), param('report')),
            p(em(translate(param('report'), code2language($language))));
    }
} else {
    # ask login
    print redirect(-uri => '/cgi-bin/form.pl');
}
print end_html;
The getMessage subroutine (below) opens up a filehandle to the flatfile database, taking the key and the language code as parameters. It then places each phrase into an array, and then searches it, returning the desired element. The other subroutine, translate, does exactly as expected – it translates our report into Spanish. Beware, however, that Babelfish cannot translate every language into any other language. In our case, users selecting French as the language they understand, will not have their report translated into Spanish. For a full listing of languages Babelfish can translate to and from, visit http://www.infotektur.com/demos/babelfish/en.html.
This is the getMessage subroutine used in the babel.pl script:

sub getMessage {
    my ($key, $language) = @_;
    chomp($key, $language);
    my $filename = 'messages.txt';
    my $counter = -1;
    my $fh = new IO::File;
    $fh->open("< $filename") or die "Can't open '$filename': $!\n";
    my (@messages, @final);
    while (<$fh>) {
        @messages = (split ', ');
        push @final, @messages;
    }
    chomp @final;
    foreach (@final) {
        my $msg;
        $counter++;
        if ($_ =~ /$key/ and $language =~ /es/) {
            $msg = $final[$counter+1];
            return $msg;
        } elsif ($_ =~ /$key/ and $language =~ /fr/) {
            $msg = $final[$counter+2];
            return $msg;
        } elsif ($_ =~ /$key/ and $language =~ /en/) {
            $msg = $final[$counter];
            return $msg;
        } else {
            next;
        }
    }
}
and here is the translate subroutine:

sub translate {
    my ($toTranslate, $language) = @_;
    my $tr = new WWW::Babelfish() || return $toTranslate;
    my $translated = $tr->translate('source' => $language,
                                    'destination' => 'Spanish',
                                    'text' => $toTranslate);
    print $language, " ", $tr->error if $tr->error;
    $translated = $toTranslate unless defined($translated);
    return $translated;
}
This script is basically a form-processing CGI; the only difference is that we make use of the locale to show the date, and of the selected language for the messages. Recall that the user selected their language in the sign-in script. Cookies are retrieved using standard CGI.pm instructions. If the name cookie is not present, we redirect to form.pl. If it is present, there are two different paths the script can take, depending on whether the parameter report has been set or not. We have taken the 'atomic' approach to messages, putting words or short phrases into the database and building up phrases little by little. We could also insert whole phrases, but in that case we would have to include every possible message. For instance, in this script, the phrase 'Received report by' is made up of three different atoms: received, report, and by.
We now briefly show how these scripts are implemented in, say, a simple multilingual web site. First we run the form.pl script, and the user inputs their name and the language they understand:
After submitting the form, a screen showing acceptance is displayed. We then move on to the second script, babel.pl, which performs the translation from the selected language into Spanish, as hardcoded in the script:
And the result of the call to Babelfish is as below:
Creating our own Local Perl module

Despite all the precautions we have taken thus far to make our site multilingual, some countries still have peculiarities that we have not catered for. This can be important when writing a program or representing any user interface. For instance, local geographical divisions are often quite specific to a place: there are states in the US, provinces in Canada, etc. Along similar lines, postal or ZIP codes are also specific to a place – routines for converting from postal code to geographical place (and back) could be quite useful when processing, for example, a mail order. Other modules could serve as front ends to national news services, stock tickers, search engines, etc. At the time of writing, only two countries have been fortunate enough to have some of these modules written for them: the Czech Republic and Norway (as we have seen in previous sections). We will try to take the first steps towards the creation of other such sets; in this case, the Es::* set. The first problem to arise is how we should name the module. We could use the continent::country convention (especially in Europe, where everybody is part of one big family), or just go for the national two-letter code; but in this case, should the language code or the national ISO code be used? Our suggestion is that it will depend upon the topic of the module. Matters relative to the whole Spanish state should be called Es::*, while matters peculiar to an autonomous region should follow the locale convention: languagecode_COUNTRYCODE. For instance, national Catalan modules could be named Ca_ES::*. This is only a suggestion, and the right way to baptize a module is to issue a request for comments in the news:comp.lang.perl.modules newsgroup and/or other national discussion groups, such as [email protected]. More details can also be found in the Modules FAQ, available by entering > perldoc perlmodlib. One of the aspects that varies between countries is the id card number.
In Spain, it is the same number as the national revenue service code, and includes a letter that acts as a checksum. The following ES::Nif package can be used to check an NIF (Número de Identificación Fiscal, or fiscal identification number).

#!/usr/bin/perl
# ES/Nif.pm
use strict;

package ES::Nif;

# Check that the letter of the Spanish NIF is correct. A Spanish personal
# NIF is a set of numbers followed by a letter. It can be expressed in upper
# or lower case, separated by a dash or not.
use Carp qw(carp);

sub check {
    my $nif = shift;
    carp "Incorrect NIF format: $nif \n" if $nif !~ /^(\d+)\-?(\w)$/;
    my ($number, $letter) = ($1, $2);
    return letter($number) eq uc($letter);
}

sub letter {
    my $number = shift;
    my $mod = sprintf "%02d", $number % 23;    # pad so single-digit remainders index row 0
    my ($digit1, $digit2) = split(//, $mod);
    my $letterComputed = $::lookup[$digit1][$digit2];
    return $letterComputed;
}
This routine is called when the package is declared in a program. It reads the letter codes and builds up the lookup table:

INIT {
    my $i = 0;
    while (<DATA>) {
        my @data = split;
        push(@{$::lookup[$i++]}, @data);
    }
}

"No te lo digo";    # this is a true value

# taken from http://www.terra.es/personal2/bomb009/telefonos.htm
__DATA__
T R W A G M Y F P D
X B N J Z S Q V H L
C K E

The letter in the fiscal number is found using the remainder of dividing the id number by 23, that is, DNI % 23. The result is then looked up in a double-entry table; the first digit corresponds to the row, the second digit to the column, remembering that we start counting from zero. For instance, if DNI % 23 = 12, we would look up the second row and third column, that is, N. There are only two routines: one computes the letter from the number, and the other checks it against the letter that comes with the NIF. They can be used like this:
The letter in the fiscal number is found using the remainder of dividing the id number by 23, that is, DNI % 23. The result is then looked up in a double entry table; the first digit corresponds to the row, the second digit to the column, remembering that we start counting from zero. For instance, if DNI % 23 = 12, we would look up second row and third column, that is, N. There are only two routines, one computes the letter from the number, and the other checks it against the letter that comes with the NIF. They can be used like this: #!/usr/bin/perl # ES/Nif.pl use warnings; use strict; use ES::Nif; print "Checking $ARGV[0] "; if (ES::Nif::check($ARGV[0])) { print "Correct"; } else { chop($ARGV[0]); # This chops off letter print "Incorrect. Letter should be ", ES::Nif::letter($ARGV[0]); } print "\n";
This would produce an output similar to:

> perl ES/Nif.pl 2345678U
Checking 2345678U Incorrect. Letter should be T
> perl ES/Nif.pl 3332210Q
Checking 3332210Q Correct
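Since the row-and-column lookup amounts to indexing the 23-letter sequence directly, the same computation can be sketched with a flat string:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# The 23 NIF control letters, indexed directly by DNI % 23
my @letters = split //, 'TRWAGMYFPDXBNJZSQVHLCKE';

sub nif_letter {
    my $number = shift;
    return $letters[$number % 23];
}

print nif_letter(2345678), "\n";    # the number from the example above
print nif_letter(3332210), "\n";
```

This produces T and Q for the two example numbers, matching the output of ES/Nif.pl; the two-dimensional table in the module is simply this sequence laid out in rows of ten.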
Summary

Over the course of this chapter, we have taken a look at Perl internationalization and localization from different angles. As an overview we have:

❑ Seen how to use the locale framework in order to present information according to local conventions.

❑ Covered modules and tools available that deal with languages, alphabets, time zones, etc.

❑ Looked at how we can use Perl to help us identify and manipulate unknown languages when we come across them.

❑ Created a basic translation page using flatfiles incorporating three common languages.

❑ Illustrated guidelines and possible topics for module creation, as well as writing our own simple module.
Command Line Options

The following is a list of the available switches, which can be appended when calling Perl from the command line. The exact syntax for calling Perl from the command line is as follows:

> perl (switches) (--) (programfile) (arguments)

Switch
Function
-0(octal)
This sets the record separator $/ by specifying the character's ASCII value in octal. For example, if we wanted to set our separator to the character A (ASCII 65, octal 101), we would say perl -0101. The default is the null character, and $/ is set to this if no argument is given.
-a
-a can be used in conjunction with -n or -p. It enables autosplit mode, using whitespace as the default delimiter: each input line is split, and the resulting fields are always stored in the array @F. Using -p will also print out each line.
-C
Enables native wide character system interfaces.
-c
This is a syntactic test only. It stops Perl executing, but reports on any compilation errors that a program has before it exits. Any other switches that have a runtime effect on your program will be ignored when -c is enabled.
-d filename
This switch invokes the Perl debugger, which will only run once you have got your program to compile. Enabling -d allows you to enter debugging commands, such as installing breakpoints, and many others.
-D(number)
-D(list)
-D will set debugging flags, but only if debugging support has been compiled into your perl binary. The following table shows the arguments that may be used with -D, and the resulting meaning of the switch:

    Number    Character    Operation
    1         p            Tokenizing and parsing
    2         s            Stack snapshots
    4         l            Label stack processing
    8         t            Trace execution
    16        o            Object method lookup
    32        c            String/numeric conversions
    64        P            Print preprocessor command for -P
    128       m            Memory allocation
    256       f            Format processing
    512       r            Regular expression processing
    1024      x            Syntax tree dump
    2048      u            Tainting checks
    4096      L            Memory leaks
    8192      H            Hash dump
    16384     X            Scratchpad allocations
    32768     D            Cleaning up
    65536     S            Thread synchronization
-e
This allows you to write one line of script, by instructing Perl to execute text following the switch on the command line, without loading and running a program file. Multiple calls may be made to -e in order to build up scripts of more than one line.
-F/pattern/
Causes -a to split using the pattern specified between the delimiters. The delimiters may be / /, " ", or ' '.
-h
Prints out a list of all the command line switches.
-i(extension)
Enables in-place editing of files processed by the <> operator, making a backup file if an argument is given. The argument is treated as the extension to be appended to the backup file's name.
-I (directory)
Causes a directory to be added to the search path when looking for files to include. This path will be searched before the default paths, one of which is the current directory, the other is generally /usr/local/lib/Perl on UNIX and C:\perl\bin on Windows.
-l(octal)
-l adds line endings, and defines the line terminator by specifying the character's number in the ASCII table in octal. If it is used with -n or -p, it will chomp the line terminator. If the argument is omitted, then $\ is given the current value of $/. The default value of the special variable $/ is newline.
-(mM)(-)module
Causes the import of the given module for use by your script, before executing the program.
-n
Causes Perl to assume a while (<>) {My Script} loop around your script. Basically it will iterate over the filename arguments. It does no printing of lines.
-p
This is the same as -n, except that it will print lines.
-P
-P causes your program to be run through the C preprocessor before it is compiled. Bear in mind that the preprocessor directives begin with #, the same as comments, so preferably use ;# to comment your script when you use the -P switch.
-s
This defines variables with the same name as the switches that follow on the command line. The other switches are also removed from @ARGV. The newly-defined variables are set to 1 by default. Some parsing of the other switches is also enabled.
-S
Causes Perl to look for a given program file using the PATH environment variable. In other words, it acts much like #!
-T
Stops data entering a program from performing unsafe operations. It is a good idea to use this when there is a lot of information exchange occurring, like in CGI programming.
-u
This will perform a core dump after compiling the program.
-U
This forces Perl to allow unsafe operations.
-v
Prints the version of Perl that is currently being used (includes VERY IMPORTANT Perl info).
-V(:variable)
Prints out a summary of the main configuration values used by Perl during compiling. It will also print out the value of the @INC array.
-w
Invokes the raising of many useful warnings based on the (poor or bad) syntax of the program being run. This switch has been deprecated in Perl 5.6, in favor of the use warnings pragma.
-W
Enables all warnings.
-X
This will disable all warnings. We already know that we always use use warnings when writing our programs. So you will not need this.
-x(directory)
Tells Perl to get rid of extraneous text that precedes the shebang line. All switches on the shebang line will still be enabled.
Special Variables
This is a categorized listing of the predefined variables provided by Perl. The longer and more descriptive 'English' names become available when using the English module with use English.
Default Variables and Parameters

Variable
English Name
Description

$_
$ARG
This global scalar acts as a default variable for function arguments and pattern-searching space – with many common functions, if an argument is left unspecified, $_ will be automatically assigned. For example, the following statements are equivalent:
Variable
chop($_) and chop $_ =~ m/expr/ and m/expr/ @_
n/a
The elements of this array are used to store function arguments, which can be accessed (from within a function definition) as $_[num]. The array is automatically local to each function.
@ARGV
n/a
The elements of this array contain the command line arguments intended for use by the script.
$ARGV
n/a
This contains the name of the current file when reading from the null filehandle <>. <> is a literal, and defaults to standard input, <STDIN>, if no arguments are supplied from elements in @ARGV.
Appendix B
Regular Expression Variables (all read-only)

Variable
English Name
Description
$(num)
n/a
The scalar $n contains the substring matched to the n'th grouped subpattern in the last pattern match, and remains in scope until the next pattern match with subexpressions. It ignores matched patterns occurring in nested blocks that are already exited. If there are no corresponding groups, then the undefined value is returned.
$& ($MATCH)
This scalar contains the string matched by the last successful pattern match, excluding any matches made in nested blocks that have already been exited. For example:

    'UnicornNovember' =~ /Nov/;
    print $&;

will print Nov. For versions of Perl since 5.005, this is not an expensive variable to use.

$' ($POSTMATCH)
This scalar holds the substring following whatever was matched by the last successful pattern match. For example, after:

    'UnicornNovember' =~ /Nov/;

$' will return ember.

$` ($PREMATCH)
This scalar holds the substring preceding whatever was matched by the last successful pattern match. For example, after:

    'UnicornNovember' =~ /Nov/;

$` will return Unicorn.

$+ ($LAST_PAREN_MATCH)
This scalar holds the last substring matched to a grouped subpattern in the last search. It comes in handy if you are not sure which of a set of alternative subpatterns matched. For example, if you successfully match on /(ab)*|(bc*)/, then $+ holds either $1 or $2, depending on which grouped subpattern matched. For example, after:

    'UnicornNovember' =~ /(Nov)|(Dec)/;

$+ will return Nov.
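The four variables above can be inspected together after a single match; a minimal sketch:

```perl
#!/usr/bin/perl
# matchvars.pl - $`, $&, $' and $+ after a successful match
use warnings;
use strict;

'UnicornNovember' =~ /Nov/;
print "pre:   $`\n";   # prints: pre:   Unicorn
print "match: $&\n";   # prints: match: Nov
print "post:  $'\n";   # prints: post:  ember

'UnicornNovember' =~ /(Nov)|(Dec)/;
print "last:  $+\n";   # prints: last:  Nov
```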
$* ($MULTILINE_MATCHING)
Set this to 1 to do multi-line matching within a string, or 0 to tell Perl that it can assume strings contain a single line, for the purpose of optimizing pattern matches. Pattern matches on strings containing multiple newlines can produce confusing results when $* is 0. The default is 0. (Mnemonic: * matches multiple things.) This variable influences the interpretation of only ^ and $; a literal newline can be searched for even when $* == 0. Use of $* is deprecated in modern Perl, supplanted by the /s and /m modifiers on pattern matching.
@+ (@LAST_MATCH_END)
This array holds the end offsets (in the matched string) of the last successful match. The first element, $+[0], is the offset just past the end of the whole match; each subsequent value is the offset just past the end of the corresponding grouped subpattern. For example, after:

    'UnicornNovember' =~ /(U)\w+?(N)/;

@+ will contain (8, 1, 8), while after:

    'UnicornNovember' =~ /(Uni)\w+?(Nov)/;

@+ will contain (10, 3, 10).

@- (@LAST_MATCH_START)
This array holds the start offsets (in the matched string) of the last successful match. The first element, $-[0], is the offset of the start of the whole match; each subsequent value is the offset of the start of the corresponding grouped subpattern. For example, after:

    'UnicornNovember' =~ /(Uni)\w+?(Nov)/;

@- will contain (0, 0, 7), while after:

    'UnicornNovember' =~ /(Uni)(\w+?)(Nov)/;

@- will contain (0, 0, 3, 7).
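A short sketch of using the offsets in @- and @+ to recover a submatch by hand:

```perl
#!/usr/bin/perl
# offsets.pl - extracting a submatch via @- and @+
use warnings;
use strict;

my $str = 'UnicornNovember';
if ($str =~ /(Uni)\w+?(Nov)/) {
    # $-[2] and $+[2] bracket the second grouped subpattern
    print substr($str, $-[2], $+[2] - $-[2]), "\n";  # prints: Nov
    print "match ran from $-[0] to $+[0]\n";         # 0 to 10
}
```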
Input/Output Variables
$. ($INPUT_LINE_NUMBER)
This scalar holds the current line number of the last filehandle on which you performed a read, seek, or tell. It is reset when the filehandle is closed. NB: <> never does an explicit close, so line numbers increase across ARGV files. Also, localizing $. has the effect of localizing Perl's notion of 'the last read filehandle'.
$/ ($INPUT_RECORD_SEPARATOR)
This scalar stores the input record separator, which by default is the newline \n. If it is set to the empty string "", input is read one paragraph at a time.
$\ ($OUTPUT_RECORD_SEPARATOR)
This scalar stores the output record separator for print. Normally print outputs consecutive records without any separation (unless explicitly included); this variable allows you to set one yourself. For example:

    $\ = "-"; print "one"; print "two";

will print: one-two-.
$OUTPUT_AUTOFLUSH
This corresponds to an internal flag used by Perl to determine whether buffering should be used on a program's write/read operations to/from files. If the value is TRUE ($| is greater than 0), buffering is disabled.
$, ($OUTPUT_FIELD_SEPARATOR)
This is the output field separator for print. Normally print outputs consecutive fields without any separation (unless explicitly included); this variable allows you to set one yourself. For example:

    $, = "-"; print "one", "two";

will print: one-two.

$" ($LIST_SEPARATOR)
This is the separator used when array values are interpolated into a double-quoted string (or similar interpreted string); the default is a space. For example:

    $" = "-"; @ar = ("one", "two", "three"); print "@ar";

will print: one-two-three.
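The output separators above can be demonstrated together; a minimal sketch, using local so the changes do not leak out of each block:

```perl
#!/usr/bin/perl
# separators.pl - $,, $\ and $" control print and interpolation output
use warnings;
use strict;

my @ar = ('one', 'two', 'three');

{
    local $, = '-';          # field separator between print's arguments
    local $\ = "!\n";        # record separator appended by print
    print 'a', 'b', 'c';     # prints: a-b-c!
}

{
    local $" = ', ';         # separator for interpolated arrays
    print "@ar\n";           # prints: one, two, three
}
```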
$; ($SUBSCRIPT_SEPARATOR)
This is the subscript separator for multidimensional array emulation. If you refer to a hash element as:

    $foo{$a, $b, $c}

it really means:

    $foo{join($;, $a, $b, $c)}
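A sketch of the emulation described above:

```perl
#!/usr/bin/perl
# subsep.pl - multidimensional hash emulation via $;
use warnings;
use strict;

my %grid;
my ($x, $y) = (3, 4);
$grid{$x, $y} = 'point';     # same as $grid{join($;, $x, $y)}

# both spellings reach the same element
print "yes\n" if exists $grid{join($;, $x, $y)};   # prints: yes
print $grid{3, 4}, "\n";                           # prints: point
```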
Filehandle/Format Variables
$# ($OFMT)
This holds the output format for printed numbers. NB: the use of this variable is deprecated.

$| ($OUTPUT_AUTOFLUSH)
This corresponds to an internal flag used by Perl to determine whether buffering should be used on writes to the currently selected filehandle. If its value is TRUE ($| is greater than 0), buffering is disabled.

$% ($FORMAT_PAGE_NUMBER)
The current page number of the currently selected output channel.
$= ($FORMAT_LINES_PER_PAGE)
The current page length, measured in printable lines; the default is 60. This only becomes important when a top-of-page format is invoked: if a write does not fit into the number of lines remaining, the top-of-page format is used before any printing past the page length continues.

$- ($FORMAT_LINES_LEFT)
The number of lines left on a page. When a page is finished, it is reset to the value of $=, and is then decremented for each line output.
$~ ($FORMAT_NAME)
The currently selected format name; the default is the name of the filehandle.

$^ ($FORMAT_TOP_NAME)
The name of the top-of-page format.

$: ($FORMAT_LINE_BREAK_CHARACTERS)
The set of characters after which a string may be broken to fill continuation fields (starting with ^) in a format. The default is ' \n-', to break on whitespace or hyphens.

$^L ($FORMAT_FORMFEED)
This holds the character used by a format's output to request a form feed; the default is \f.

$^A ($ACCUMULATOR)
This is the current value of the write() accumulator for format() lines. A format contains formline() calls that put their result into $^A. After calling its format, write() prints out the contents of $^A and empties it. We never really see the contents of $^A unless we call formline() ourselves and then look at it.
Error Variables
$? ($CHILD_ERROR)
This holds the status value returned by the last pipe close, backtick (``) command, or system() operator.
$@ ($EVAL_ERROR)
This holds the error message from the last eval() command; it is the empty string if the last eval() parsed and executed correctly (although the operations it invoked may have failed in the normal fashion).
$! ($ERRNO)
If used in a numeric context, this returns the current value of errno, with all the usual caveats (so you should not depend on $! having any particular value unless you have a specific error return indicating a system error). If used in a string context, it returns the corresponding system error string. We can assign an errno value to $! if, for instance, we want it to return the string for that error number, or we want to set the exit value for the die() operator.
$^E ($EXTENDED_OS_ERROR)
This returns an extended error message, with information specific to the current operating system. At the moment, this only differs from $! under VMS, OS/2, and Win32 (and for MacPerl). On all other platforms, $^E is always the same as $!.
System Variables
$$ ($PROCESS_ID)
The process ID (pid) of the Perl process running the current script.

$< ($REAL_USER_ID)
The real user ID (uid) of the current process.

$> ($EFFECTIVE_USER_ID)
The effective uid of the current process. NB: $< and $> can only be swapped on machines supporting setreuid().

$( ($REAL_GROUP_ID)
The real group ID (gid) of the current process.

$) ($EFFECTIVE_GROUP_ID)
The effective group ID (gid) of the current process.

$0 ($PROGRAM_NAME)
The name of the file containing the Perl script being executed.

$^X ($EXECUTABLE_NAME)
The name that the Perl binary itself was executed as.
$] (n/a)
The version number of the Perl interpreter, including patchlevel/1000; this can be used to determine whether the interpreter executing a script is within the right range of versions. See also use VERSION and require VERSION for a way to fail if the interpreter is too old.

$[ (n/a)
The index of the first element in an array, and of the first character in a substring. The default is 0. As of release 5 of Perl, assignment to $[ is treated as a compiler directive and cannot influence the behavior of any other file. Its use is highly discouraged.
$^O ($OSNAME)
The name of the operating system under which this copy of Perl was built, as determined during the configuration process; identical to $Config{'osname'}.

$^T ($BASETIME)
The time at which the current script began running, in seconds since the beginning of 1970. Values returned by the -M, -A, and -C file tests are based on this value.

$^W ($WARNING)
The current value of the warning switch, either TRUE or FALSE.

%ENV (n/a)
Your current environment; altering its values changes the environment for child processes.

%SIG (n/a)
Used to set handlers for various signals.
$^C ($COMPILING)
The current value of the flag associated with the -c switch. Mainly of use with -MO=... to allow code to alter its behavior when being compiled, such as to AUTOLOAD at compile time rather than using normal, deferred loading. See perlcc. Setting $^C = 1 is similar to calling B::minus_c.

$^D ($DEBUGGING)
The current value of the debugging flags.

$^F ($SYSTEM_FD_MAX)
The maximum system file descriptor, ordinarily 2.

$^I ($INPLACE_EDIT)
The current value of the inplace-edit extension. Use undef to disable inplace editing.
$^M (n/a)
Perl can use the contents of $^M as an emergency memory pool after die()ing, provided that your Perl was compiled with -DPERL_EMERGENCY_SBRK and uses Perl's malloc. For example:

    $^M = 'a' x (1 << 16);

would allocate a 64KB buffer for use in an emergency.

$^P ($PERLDB)
The internal variable for debugging support.
$^R ($LAST_REGEXP_CODE_RESULT)
The result of evaluation of the last successful regular expression assertion.
$^S ($EXCEPTIONS_BEING_CAUGHT)
The current state of the interpreter: undefined if parsing of the current module/eval is not finished (which may happen in $SIG{__DIE__} and $SIG{__WARN__} handlers), true if inside an eval(), otherwise false.

$^V ($PERL_VERSION)
The revision, version, and subversion of the Perl interpreter, represented as a string composed of characters with those ordinals. This can be used to determine whether the Perl interpreter executing a script is in the right range of versions.
Others
@INC (n/a)
A list of places to look for Perl scripts for evaluation by the do EXPR, require, or use constructs.

%INC (n/a)
Contains an entry for each filename that has been included via do or require. The key is the specified filename, and the value is the location of the file actually found. The require command uses this hash to determine whether a given file has already been included.
Function Reference
The first table in this appendix lists the file test operators. The syntax of these tests can take one of the following forms:

    -X filehandle
    -X expression
    -X

X here is one of the following letters: ABCMORSTWX bcdefgkloprstuwxz. If the filehandle or expression argument is omitted, the file test is performed against $_, with the exception of -t, which tests STDIN. Here is a complete rundown of what each file test checks for:
-A
How long, in days, between the last access to the file and the time the script started.
-B
True if the file is a binary file (Compare with -T).
-C
How long, in days, between the last inode change and the time the script started.
-M
How long, in days, between the last modification to the file and the time the script started.
-O
True if the file is owned by the real uid.
-R
True if the file is readable by the real uid/gid.
Appendix C
-S
True if the file is a socket.
-T
True if the file is a text file (compare with -B).
-W
True if the file is writable by a real uid/gid.
-X
True if the file is executable by a real uid/gid.
-b
True if the file is a block special file.
-c
True if the file is a character special file.
-d
True if the file is a directory.
-e
True if the file exists.
-f
True if the file is a plain file, not a directory.
-g
True if the file has the setgid bit set.
-k
True if the file has the sticky bit set.
-l
True if the file is a symbolic link.
-o
True if the file is owned by the effective uid.
-p
True if the file is a named pipe or if the filehandle is a named pipe.
-r
True if the file is readable by an effective uid/gid.
-s
True if the file has non-zero size, returns size of file in bytes.
-t
True if the filehandle is opened to a tty.
-u
True if the file has the setuid bit set.
-w
True if the file is writable by an effective uid/gid.
-x
True if the file is executable by an effective uid/gid.
-z
True if the file has zero size.
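A few of these tests in action; a minimal sketch that examines the running script itself:

```perl
#!/usr/bin/perl
# filetest.pl - common file test operators applied to this script
use warnings;
use strict;

my $file = $0;                          # the running script's own filename
print "$file exists\n"          if -e $file;
print "$file is readable\n"     if -r $file;
print "$file is a plain file\n" if -f $file;
print "size: ", -s $file, " bytes\n" if -s $file;
print "$file is a directory\n"  if -d $file;   # a script is not a directory
```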
The following table includes an alphabetical list of every function in Perl 5.6. Marked against its name will be the syntax for the function, a brief description of what it does, and any directly related functions:
abs
    abs value
    abs
Returns the absolute (non-negative) value of a number. For example, abs(-1) and abs(1) both return 1. If no value argument is given, abs returns the absolute value of $_.

accept
    accept newsocket, genericsocket
Accepts an incoming socket connection, attaching it to newsocket.
alarm
    alarm num_seconds
    alarm
Starts a timer with num_seconds seconds on the clock before it trips a SIGALRM signal. Before the timer runs out, another call to alarm cancels it and starts a new one with num_seconds on the clock. If num_seconds equals zero, the previous timer is cancelled without starting a new one.
atan2
    atan2 y, x
Returns the arctangent of y/x, within the range -π to π.
bind
bind socket, name
Binds a network address (TCP/IP, UDP, etc) to a socket, where name should be the packed address for the socket.
binmode
    binmode filehandle
Sets the specified filehandle to be read in binary mode explicitly, for those systems that cannot do this automatically. UNIX and MacOS can, and thus binmode has no effect under these operating systems.
bless
    bless ref, classname
    bless ref
Takes the variable referenced by ref and makes it an object of class classname (or of the current package if classname is omitted).

caller
    caller expression
    caller
Called within a subroutine, caller returns a list of information outlining what called it, that is, the sub's context: the caller's package name, its filename, and the line number of the call. Returns the undefined value if not in a subroutine. If expression is used, some extra debugging information is also returned, to make a stack trace.

chdir
    chdir new_directory
    chdir
Changes your current working directory to new_directory. If new_directory is omitted, the working directory is changed to the one specified in $ENV{HOME}.
chmod
chmod list
Changes the permissions on a list of files. The first element of list must be the octal representation of the permissions to be given to it.
chomp
    chomp variable
    chomp list
    chomp
Usually removes \n from a string. Strictly, removes the trailing record separator (as set in $/) from a string, or from each string in a list, and then returns the number of characters deleted. If no argument is given, chomp acts on $_.

chop
    chop variable
    chop list
    chop
Removes the last character from a string, or from each string in a list, returning the (last) character chopped. If no argument is given, chop acts on $_.
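The difference between the two is easy to see side by side; a minimal sketch:

```perl
#!/usr/bin/perl
# chopvchomp.pl - chomp removes a trailing record separator, chop any last char
use warnings;
use strict;

my $line = "hello\n";
chomp $line;                # removes only the trailing "\n"
print "[$line]\n";          # prints: [hello]

my $word = "hello";
my $gone = chop $word;      # removes and returns the last character
print "[$word] [$gone]\n";  # prints: [hell] [o]
```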
chown
chown list
Changes the ownership on a list of files. Within list, the first two elements must be the user ID and group ID of that user and group to get ownership, followed by any number of filenames. Setting -1 for either ID means, 'Leave this value unchanged.'
chr
    chr number
    chr
Returns the character represented by number in the character set. If number is omitted, $_ is used.

chroot
    chroot directory
    chroot
Changes the root directory for all further path lookups to directory. If directory is not given, $_ is used as the new root directory.

close
    close filehandle
    close
Closes the file, pipe, or socket associated with the nominated filehandle, resetting the line counter $. as well. If filehandle is not given, closes the currently selected filehandle. Returns true on success.
closedir
closedir dirhandle
Closes the directory opened by opendir() given by dirhandle.
connect
connect socket, address
Tries to connect to a socket at the given address.
continue
continue block
A flow control statement rather than a function. If there is a continue attached to a block (typically in a while or foreach), it is always executed just before the conditional is about to be evaluated again.
cos
cos num_in_radians
Calculates and returns the cosine of a number given in radians. If num_in_radians is not given, calculates the cosine of $_.
crypt
    crypt plaintext, salt
A one-way encryption function (there is no decrypt function) that takes some plaintext (usually a password) and encrypts it using a two-character salt.
dbmclose
dbmclose hash
Deprecated in favor of untie(). Breaks the binding between a dbm file and the given hash.
dbmopen
    dbmopen hash, dbname, mode
Deprecated in favor of tie(). Binds the specified hash to the database dbname. If the database does not exist, it is created with the specified read/write mode, given as an octal number.
defined
    defined expression
    defined
Checks whether the value, variable, or function in expression is defined. If expression is omitted, $_ is checked.
delete
    delete $hash{key}
    delete @hash{keys %hash}
Deletes one or more specified keys and their corresponding values from the hash. Returns the associated values.

die
    die message
Writes message to the standard error output and then exits the currently running program with $! as its return value.
do
do filename
Executes the contents of filename as a Perl script. Returns undef if it cannot read the file.
dump
    dump label
    dump
Initiates a core dump that can be undumped into a new binary executable file, which when run will start at label. If label is left out, the executable will start from the top of the file.
each
each hash
Returns the next key/value pair from a hash as a two-element list. When the hash is fully read, an empty list is returned.
endgrent
endgrent
Frees the resources used to scan the /etc/group file or system equivalent.
endhostent
endhostent
Frees the resources used to scan the /etc/hosts file or system equivalent.
endnetent
endnetent
Frees the resources used to scan the /etc/networks file or system equivalent.
endprotoent
endprotoent
Frees the resources used to scan the /etc/protocols file or system equivalent.
endpwent
endpwent
Frees the resources used to scan the /etc/passwd file or system equivalent.
endservent
endservent
Frees the resources used to scan the /etc/services file or system equivalent.
eof
    eof filehandle
    eof()
    eof
Returns 1 if filehandle is either not open or will return end-of-file on the next read. eof() checks for the end of the pseudo-file containing the files listed on the command line as the program was run. eof with no argument checks the last file to be read.

eval
    eval string
    eval block
    eval
Parses and executes string as if it were a mini-program and returns its result. If no argument is given, it evaluates $_. If an error occurs or die() is called, eval returns undef. eval block works similarly, except that the block is parsed only once, whereas eval string is re-parsed each time eval executes.
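A common use of eval block is as an exception trap, with $@ examined afterwards; a minimal sketch:

```perl
#!/usr/bin/perl
# trap.pl - eval block as an exception trap, with $@ holding the error
use warnings;
use strict;

my $zero   = 0;
my $result = eval { 10 / $zero };    # division by zero would normally be fatal
print "caught: $@" if $@;            # e.g. 'Illegal division by zero at ...'

$result = eval { 10 / 2 };
print "ok: $result\n" unless $@;     # prints: ok: 5
```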
exec
exec command
Abandons the current program to run the specified system command.
exists
exists $hash{$key}
Returns true if the specified key exists within the specified hash.
exit
exit status
Terminates current program immediately with return value status. N.B. The only universally recognized return values are 1 for failure and 0 for success.
exp
exp number
Returns the value of e to the power of number (or $_ if number is omitted).
fcntl
    fcntl filehandle, function, args
Calls the fcntl function, to use on the file or device opened with filehandle.

fileno
    fileno filehandle
Returns the file descriptor for filehandle.
flock
flock filehandle, locktype
Tries to lock or unlock a write-enabled file for use by the program. Note that this lock is only advisory and that other systems not supporting flock will be able to write to the file. locktype can take one of four values; LOCK_SH (new shared lock), LOCK_EX (new exclusive lock), LOCK_UN (unlock file) and LOCK_NB (do not block access to the file for a new lock if file not instantly available). Returns true for success, false for failure.
for
    for (init; condition; increment) block
Perl's C-style for loop works exactly like the corresponding while loop, with one minor difference: it implies a lexical scope for variables declared with my in the initialization expression.

foreach
    foreach var (list) block
The foreach loop iterates over a normal list value, setting the variable var to each element of the list in turn. The foreach keyword is actually a synonym for the for keyword, so you can use foreach for readability, or for for brevity.
fork
fork
System call that creates a new system process also running this program from the same point the fork was called. Returns the new process's ID to the original program, 0 to the new process, or undef if the fork did not succeed.
format
format
Declares an output template for use with write().
formline
formline template, list
An internal function used for formats. Applies template to the list of values and stores the result in $^A. Always returns true.
getc
    getc filehandle
    getc
Retrieves the next character from filehandle's file (on a terminal, input is typically line-buffered, so nothing arrives until Return is pressed). Returns undef at the end of a file. If filehandle is omitted, STDIN is used.
getgrent
getgrent
Gets the next group record from /etc/group or the system equivalent, returning an empty record when the end of the file is reached.
getgrgid
getgrgid gid
Gets the group record from /etc/group or the system equivalent whose ID field matches the given group number gid. Returns an empty record if no match occurs.
getgrnam
getgrnam name
Gets the group record from /etc/group or the system equivalent whose name field matches the given group name. Returns an empty record if no match occurs.
gethostbyaddr
    gethostbyaddr address, addrtype
    gethostbyaddr address
Returns the hostname for a packed binary network address of a certain address type. By default, addrtype is assumed to be IP.

gethostbyname
    gethostbyname hostname
Returns the network address given its corresponding hostname.
gethostent
gethostent
Gets the next network host record from /etc/hosts or the system equivalent, returning an empty record when the end of the file is reached.
getlogin
getlogin
Returns the login name of the currently logged-in user.
getnetbyaddr
    getnetbyaddr address, addrtype
    getnetbyaddr address
Returns the net name for a given network address of a certain address type. By default, addrtype is assumed to be IP.
getnetbyname
getnetbyname name
Returns the net address given its corresponding net name.
getnetent
getnetent
Gets the next entry from /etc/networks or the system equivalent, returning an empty record when the end of the file is reached.
getpeername
getpeername socket
Returns the address for the other end of the connection to this socket.
getpgrp
    getpgrp process_id
    getpgrp
Returns the process group in which the specified process is running. Assumes the current process if process_id is not given.
getppid
getppid
Returns the process ID of the current process's parent process.
getpriority
getpriority type, id
Returns current priority for a process, process group, or user as determined by type.
getprotobyname
getprotobyname name
Returns the number for the protocol given as name.
getprotobynumber
getprotobynumber number
Returns the name of the protocol given its number.
getprotoent
getprotoent
Gets the next entry from /etc/protocols or the system equivalent, returning an empty record when the end of the file is reached.
getpwent
getpwent
Gets the next entry from /etc/passwd or the system equivalent, returning an empty record when the end of the file is reached.
getpwnam
getpwnam name
Gets the password record whose login name field matches the given name. Returns an empty record if no match occurs.
getpwuid
getpwuid uid
Gets the password record whose user ID field matches the given uid. Returns an empty record if no match occurs.
getservbyname
    getservbyname name, protocol
Returns the port number for the named service on the given protocol.

getservbyport
    getservbyport port, protocol
Returns the service name for the given port on the given protocol.
getservent
getservent
Gets the next entry from /etc/services or the system equivalent, returning an empty record when the end of the file is reached.
getsockname
getsockname socket
Returns the address for this end of the connection to this socket.
getsockopt
getsockopt socket, level, optname
Returns the specified socket option or undef if an error occurs.
glob
    glob expression
    glob
Returns a list of filenames in the current directory matching the shell-style wildcard pattern in expression. If expression is omitted, the comparison is made with $_.
gmtime
    gmtime time
Returns a nine-element integer list representing the given time (or time() if not given) converted to GMT. By index order, the nine elements (all zero-based) represent:

    0  Number of seconds in the current minute
    1  Number of minutes in the current hour
    2  Current hour
    3  Current day of month
    4  Current month
    5  Number of years since 1900
    6  Weekday (Sunday = 0)
    7  Number of days since January 1
    8  Whether daylight savings time is in effect
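A sketch unpacking that list for time zero, the epoch:

```perl
#!/usr/bin/perl
# gm.pl - unpacking the list returned by gmtime
use warnings;
use strict;

# gmtime(0) is the epoch: 00:00:00 on January 1st 1970, GMT
my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = gmtime(0);

# note the month is zero-based and the year counts from 1900
printf "%02d:%02d:%02d %d/%d/%d\n",
    $hour, $min, $sec, $mday, $mon + 1, $year + 1900;  # 00:00:00 1/1/1970
```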
goto
    goto tag
    goto expression
    goto &subroutine
Looks for tag, either given literally or derived dynamically by evaluating expression, and resumes execution of the program there, provided the target is not inside a construct that requires initialization, such as a for loop. Alternatively, goto &subroutine substitutes a call to subroutine for the currently running subroutine.
grep
    grep expression, list
    grep {block} list
Evaluates the given expression or block of code against each element in list and returns a list of those elements for which the evaluation returned true.
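Both calling forms in a minimal sketch:

```perl
#!/usr/bin/perl
# filter.pl - grep with a block and with an expression
use warnings;
use strict;

my @nums  = (1 .. 10);
my @evens = grep { $_ % 2 == 0 } @nums;     # block form
my @big   = grep $_ > 7, @nums;             # expression form
print "@evens\n";   # prints: 2 4 6 8 10
print "@big\n";     # prints: 8 9 10
```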
hex
    hex string
    hex
Reads in string as a hexadecimal number and returns the corresponding decimal equivalent. Uses $_ if string is omitted.

if
    if expression block
Executes block if expression is true.

if..else
    if expression block1 else block2
Executes block1 if expression is true, otherwise executes block2.

if..elsif
    if expression1 block1 elsif expression2 block2 else block3
If expression1 is true, block1 is executed; otherwise, if expression2 is true, block2 is executed. If both expression1 and expression2 are false, block3 is executed.

import
    import module list
    import module
Patches a module's namespace into your own, incorporating the listed subroutines and variables into your own package (or all of them if list is not given).

index
    index string, substring, position
    index string, substring
Returns the zero-based position of the first occurrence of substring in string at or after character number position. Assumes position equals zero if not given. Returns -1 if no match is found.

int
    int number
    int
Returns the integer part of number, or of $_ if number is omitted.

ioctl
    ioctl filehandle, function, argument
Calls the ioctl function, to use on the file or device opened with filehandle.

join
    join expression, list
Returns a single string comprising the elements of list, separated from each other by the value of expression.
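A minimal sketch:

```perl
#!/usr/bin/perl
# glue.pl - join glues list elements together with a separator string
use warnings;
use strict;

my @fields = ('root', 'x', '0', '0');
print join(':', @fields), "\n";   # prints: root:x:0:0
print join(', ', 1 .. 3), "\n";   # prints: 1, 2, 3
```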
keys
keys hash
Returns a non-ordered list of the keys contained in hash.
kill
kill signal, process_list
Sends a signal to the processes and/or process groups in process_list. Returns number of signals successfully sent.
last
    last label
    last
Causes the program to break out of the labeled loop (or the innermost enclosing loop, if label is not given) and to continue with the statement immediately following the loop.
lc
lc string
Returns string in lower case, or $_ in lower case if string is omitted.
lcfirst
lcfirst string
Returns string with the first character in lower case. Works on $_ if string is omitted.
length
    length expression
Evaluates expression and returns the number of characters in that value. Returns length $_ if expression is omitted.
link
link thisfile, thatfile
Creates a hard link in the filesystem, from thatfile to thisfile. Returns true on success, false on failure.
listen
    listen socket, max_connections
Listens for connections to a particular socket on a server, queuing up to max_connections pending connections.

local
    local var
Declares a 'private' variable, which is available to the subroutine in which it is declared and to any other subroutines that may be called by it. Strictly, local creates a temporary value for a global variable, lasting for the duration of the subroutine's execution.

localtime
    localtime time
Returns a nine-element list representing the given time (or time() if not given) converted to system local time. See gmtime() for a description of the elements.
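The dynamic scoping that local provides (as opposed to my's lexical scoping) can be seen in a short sketch; the variable and subroutine names are invented:

```perl
#!/usr/bin/perl
# scoping.pl - local gives a global a temporary value visible to called subs
use warnings;
use strict;

our $global = 'outer';

sub show { print "seen: $global\n" }

sub demo {
    local $global = 'inner';   # temporary value, restored on return
    show();                    # prints: seen: inner
}

demo();
show();                        # prints: seen: outer (value restored)
```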
log
    log number
Returns the natural logarithm of number; that is, returns x where e**x equals number.
lstat
    lstat filehandle
    lstat expression
    lstat
Returns a thirteen-element status list for the symbolic link to a file, and not the file itself. See stat() for further details.
m//
    m//
Tries to match a regular expression pattern against a string.

map
    map expression, list
    map {block} list
Evaluates the given expression or block of code against each element in list and returns the list made up of the results of each evaluation.
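Both calling forms in a minimal sketch:

```perl
#!/usr/bin/perl
# transform.pl - map evaluates the block for each element and collects results
use warnings;
use strict;

my @words = ('one', 'two', 'three');
my @upper = map { uc } @words;              # block form
my @lens  = map length, @words;             # expression form
print "@upper\n";   # prints: ONE TWO THREE
print "@lens\n";    # prints: 3 3 5
```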
mkdir
mkdir dirname, mode
Creates a directory called dirname and gives it the read\write permissions as specified in mode (an octal number).
msgctl
msgctl id, cmd, arg
Calls the System V IPC msgctl function.
msgget
msgget key, flags
Calls the System V IPC msgget function.
msgrcv
    msgrcv id, var, size, type, flags
Calls the System V IPC msgrcv function.

msgsnd
    msgsnd id, msg, flags
Calls the System V IPC msgsnd function.
my
my variable_list
Declares the variables in variable_list to be lexically local to the block or file it has been declared in.
next
    next label
    next
Causes the program to start the next iteration of the labeled loop (or the innermost enclosing loop, if label is not given).
no
no module_name
Removes the functionality and semantics of the named module from the current package. Compare with use() which does the opposite.
oct
    oct string
    oct
Reads in string as an octal number and returns the corresponding decimal equivalent. Uses $_ if string is omitted.

open
    open filehandle, filename
    open filehandle
Opens the file called filename and associates it with filehandle. If filename is omitted, the name of the file is taken from the scalar variable with the same name as the filehandle.
opendir
opendir dirhandle, dirname
Opens the directory called dirname and associates it with dirhandle.
ord
ord expression
Returns the numerical ASCII value of the first character in expression.
Function Reference
pack
pack template, list
Takes a list of values and puts them into a binary structure, using template (a sequence of characters, as shown below) to give the structure an ordered composition. The possible characters for template are:
a  Null-padded ASCII string
A  Space-padded ASCII string
b  A bit string (low-to-high)
B  A bit string (high-to-low)
c  A signed char value
C  An unsigned char value
d  A double-precision float in the native format
f  A single-precision float in the native format
h  A hexadecimal string, low to high
H  A hexadecimal string, high to low
i  A signed integer
I  An unsigned integer
l  A signed long value
L  An unsigned long value
n  A big-endian short (16-bit) value
N  A big-endian long (32-bit) value
p  A pointer to a null-terminated string
P  A pointer to a fixed-length string
q  A signed quad (64-bit) value
Q  An unsigned quad (64-bit) value
s  A signed short (16-bit) value
S  An unsigned short (16-bit) value
v  A little-endian short (16-bit) value
V  A little-endian long (32-bit) value
u  A uuencoded string
w  A BER compressed integer (an unsigned integer in base 128, high bit first)
x  A null byte
X  Back up a byte
Z  A null-terminated, null-padded ASCII string
@  Null-fill to absolute position
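As a brief sketch of how a few of these template characters behave (the field names and values here are our own, not from the text):

```perl
# Pack a space-padded string, a native unsigned short, and a
# big-endian long into one binary record
my $record = pack('A5 S N', 'perl', 42, 65536);

# Unpack it again with the same template; 'A5' strips the padding
my ($name, $count, $size) = unpack('A5 S N', $record);

print "$name $count $size\n";    # prints "perl 42 65536"
```

The same template string drives both directions, which is why pack() and unpack() are documented as a pair.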
package
package namespace
Declares that the following block of code is to be defined within the specified namespace.
pipe
pipe readhandle, writehandle
Opens a pair of connected filehandles, such that data written to writehandle can be read back from readhandle.
pop
pop array
pop
Removes and returns the last element (at the largest index) from array. Pops @ARGV if array is not specified.
pos
pos scalar
pos
Returns the position in scalar of the character following the last m//g match. Uses $_ if scalar is omitted.
print
print filehandle list
Prints a list of comma-separated strings to the file associated with filehandle or STDOUT if not specified. If both arguments are omitted, prints $_ to the currently selected output channel.
print list
print
printf
printf filehandle format, list
As print() but prints to the output channel using a specified format.
printf format, list
prototype
prototype function
Returns the prototype of a function as a string or undef if the prototype does not exist.
push
push array, list
Adds the elements of list to the end of array, after the element at the highest index.
q/ /
q/string/
Alternative method of putting single quotes around a string.
qq/ /
qq/string/
Alternative method of putting double quotes around a string.
quotemeta
quotemeta expression
Scans through expression and returns it having prefixed all the metacharacters with a backslash.
qr/ /
qr/strings/
Method used for quoting and compiling its string element as a regular expression.
qw/ /
qw/strings/
Returns a list of strings, created by splitting the text inside qw// on whitespace.
qx/ /
qx/string/
Alternative method of backtick-quoting a string (which now acts as a commandline command).
rand
rand expression
Evaluates expression and then returns a random value x where 0 <= x < the value of expression.
read
read filehandle, scalar, length, offset
read filehandle, scalar, length
Reads length bytes from filehandle, placing them in scalar. If offset is given, the data is placed at that position within scalar rather than at its beginning. Returns the number of bytes read, zero at end of file, or undef if an error occurred.
readdir
readdir dirhandle
Returns the next entry in the directory specified by dirhandle, or if being used in list context, the entire contents of the directory.
readline
readline filehandle
Returns a line from filehandle's file if in scalar context, else returns a list containing all the lines of the file as its elements.
readlink
readlink linkname
Returns the name of the file at the end of symbolic link linkname.
readpipe
readpipe command
Executes command on the command line and then returns the standard output generated by it as a string. Returns a list of lines from the standard output if in list context.
recv
recv socket, scalar, length, flags
Receives a length byte message over the named socket and reads it into a scalar string.
redo
redo label
redo
Causes the program to restart the current iteration of the labeled loop (or the innermost loop, if label is not given) surrounding the command without checking the while condition.
ref
ref reference
ref
Returns the type of object being referenced by reference, or a package name if the object has been blessed into a package.
rename
rename oldname, newname
Renames file oldname as newname. Returns 1 for success, 0 otherwise.
require
require file
require package
require num
require
Ensures that the named package or file is included at runtime. If num is the argument, ensures that the version of Perl currently running is greater than or equal to num. Uses $_ if no argument is given.
reset
reset expression
Resets all variables in current package beginning with one of the characters in expression and all ?? searches to their original state.
return
return expression
Returns the value of expression from a subroutine or eval().
reverse
reverse list
Returns either list with its elements in reverse order if in list context, or a string consisting of the elements of list concatenated together and then written backwards.
rewinddir
rewinddir dirhandle
Resets the point of access for readdir() to the top of directory dirhandle.
rindex
rindex string, substring, position
rindex string, substring
Returns the zero-based position of the last occurrence of substring in string at or before character number position. Returns -1 if match not found.
rmdir
rmdir dirname
If directory dirname (or that given in $_ if omitted) is empty, it is removed. Returns true on success, false otherwise.
s///
s/matchstring/replacestring/
Searches for matchstring in $_ and replaces it with replacestring if found.
scalar
scalar expression
Evaluates expression in scalar context and returns the resultant value.
seek
seek filehandle, position, flag
Sets the position (character number) in the file denoted by filehandle at which the file will next be read or written. flag tells seek whether to go to the absolute character position given (flag = 0), to the current position plus position (flag = 1), or to the end of file plus position (flag = 2, with position usually negative). Returns 1 on success, 0 otherwise.
seekdir
seekdir dirhandle, position
Sets the position (entry number) in a directory denoted by dirhandle from which directory entries will be read.
select
select filehandle
select
Changes the current default filehandle (starts as STDOUT) to filehandle. Returns the current default filehandle if filehandle is omitted.
select
select rbits, wbits, ebits, timeout
Calls the system select function, waiting up to timeout seconds for any of the filehandles marked in the bitmasks rbits, wbits, and ebits to become ready for reading, ready for writing, or to have an error condition pending. Returns the number of ready filehandles.
semctl
semctl id, semnum, command, argument
Calls the System V IPC semctl function.
semget
semget key, nsems, flags
Calls the System V IPC semget function.
semop
semop key, opstring
Calls the System V IPC semop function.
send
send socket, message, flags, destination
send socket, message, flags
Sends a message over socket to the connected socket or, if socket is unconnected, to destination. Takes account of any system flags given to it.
setgrent
setgrent (void)
The setgrent function rewinds the file pointer to the beginning of the /etc/group file.
sethostent
sethostent stayopen
The sethostent function specifies, if stayopen is true (1), that a connected TCP socket should be used for the name server queries and that the connection should remain open during successive queries. Otherwise, name server queries will use UDP datagrams.
setnetent
setnetent stayopen
The setnetent function opens and rewinds the database. If the stayopen argument is non-zero, the connection to the net database will not be closed after each call to getnetent (either directly, or indirectly through one of the other getnet* functions).
setpgrp
setpgrp process_id, process_group
Sets the process_group in which the process with the given process_id should run. The arguments default to 0 if not given.
setpriority
setpriority which, id, priority
Adds to or diminishes the priority of either a process, process group, or user, as determined by which and specifically identified by its id.
setprotoent
setprotoent stayopen
The setprotoent function opens and rewinds the /etc/protocols file. If stayopen is true (1), then the file will not be closed between calls to getprotobyname or getprotobynumber.
setpwent
setpwent (void)
The setpwent function rewinds the file pointer to the beginning of the /etc/passwd file.
setservent
setservent stayopen
The setservent function opens and rewinds the /etc/services file. If stayopen is true (1), then the file will not be closed between calls to getservbyname and getservbyport.
setsockopt
setsockopt socket, level, option, optional_value
Sets the option for the given socket. Returns undef if an error occurs.
shift
shift array
shift
Removes and returns the element at position 0 of array; returns undef if the array is empty. Shifts @_ within subroutines and formats, and @ARGV otherwise, if array is omitted.
shmctl
shmctl id, command, argument
Calls the System V IPC shmctl function.
shmget
shmget key, size, flags
Calls the System V IPC shmget function.
shmread
shmread id, variable, position, size
Calls the System V IPC shmread function.
shmwrite
shmwrite id, string, position, size
Calls the System V IPC shmwrite function.
shutdown
shutdown socket, manner
Shuts down the socket in the specified manner:
0  Stop reading data
1  Stop writing data
2  Stop using this socket altogether
sin
sin expression
sin
Evaluates expression as a value in radians and returns the sine of that value. Returns the sine of $_ if expression is omitted.
sleep
sleep n
sleep
Causes the running script to 'sleep' for n seconds, or forever if n is not given.
socket
socket filehandle, domain, type, protocol
Opens a socket and associates it to the given filehandle. This socket exists within the given domain of communication, is of the given type and uses the given protocol to communicate.
socketpair
socketpair sock1, sock2, domain, type, protocol
Creates a pair of sockets named sock1 and sock2. These sockets exist within the given domain of communication, are of the given type and use the given protocol to communicate.
sort
sort subroutine list
sort block list
sort list
Returns list with its elements sorted into order. If subroutine is given, it is used to compare elements; if block is given, it is used as an anonymous comparison subroutine. If neither is given, list is sorted by simple string comparison.
splice
splice array, offset, length, list
splice array, offset, length
splice array, offset
Removes the elements of array from index offset to offset+length-1 and replaces them with the elements of list, if any. If length is omitted, removes all elements of array from index offset onwards; if length is negative, that many elements are left at the end of the array. If offset is negative, splice counts backwards from the end of the array. Returns the last element removed in scalar context, or undef if nothing was removed.
split
split /pattern/, string, limit
split /pattern/, string
split /pattern/
split
Breaks string into a list of substrings, using matches of pattern as the delimiter. If given, limit is the maximum number of fields the string will be split into. If string is omitted, $_ is split; if pattern is also omitted, $_ is split on whitespace.
sprintf
sprintf format, list
As printf(), but returns the formatted string instead of printing it.
sqrt
sqrt expression
sqrt
Evaluates expression and returns its square root, or the square root of $_ if expression is omitted.
srand
srand expression
srand
Seeds the random number generator.
stat
stat filehandle
stat expression
stat
Returns a thirteen-element array of information about the file named by expression, represented by filehandle, or named in $_. The elements are (by index number):
0  $dev  Device number of filesystem
1  $ino  Inode number
2  $mode  File mode
3  $nlink  Number of links to the file
4  $uid  User ID of file's owner
5  $gid  Group ID of file's owner
6  $rdev  Device identifier
7  $size  Total size of file
8  $atime  Last time file was accessed
9  $mtime  Last time file was modified
10  $ctime  Last time inode was changed
11  $blksize  Preferred block size for file I/O
12  $blocks  Number of blocks allocated to file
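The list functions described above (split, sort, and splice) are often used together; a short illustrative sketch, with arbitrary data of our own:

```perl
# Break a comma-separated string into a list
my @words = split /,/, 'pear,apple,banana';

# Sort it with an explicit comparison block
my @sorted = sort { $a cmp $b } @words;      # ('apple', 'banana', 'pear')

# Replace one element starting at index 1 with two new ones
splice @sorted, 1, 1, 'cherry', 'damson';

print "@sorted\n";    # prints "apple cherry damson pear"
```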
study
study string
Tells Perl to optimize itself for repeated searches on string or on $_ if string is omitted.
study
sub
sub subname block
sub subname
sub block
Declares a block of code to be a subroutine with the name subname. If block is omitted, this is just a forward reference to a later declaration; if subname is omitted, this is an anonymous function declaration.
substr
substr string, position, length, replacement
substr string, position, length
substr string, position
Returns the substring of string that is length characters long, starting with the character at index position. If replacement is given, that substring is then silently replaced with it in string. If length is not given, substr assumes the entire string from position onwards.
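For example, the four-argument form of substr replaces the extracted substring in place (a sketch with arbitrary data):

```perl
my $string = 'Hello, world';

my $sub = substr($string, 7, 5);    # extract 'world'
substr($string, 7, 5, 'Perl!');     # four-argument form: replace in place

print "$sub / $string\n";           # prints "world / Hello, Perl!"
```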
symlink
symlink oldfile, newfile
Creates newfile as a symbolic link to oldfile. Returns 1 on success, 0 on failure.
syscall
syscall list
Takes the first element of list as the system call to invoke (typically a SYS_* constant) and calls it, taking the other elements of the list to be arguments to that call.
sysopen
sysopen filehandle, filename, mode, permissions
sysopen filehandle, filename, mode
Opens the file filename under the specified mode and associates it with filehandle. permissions is the octal value representing the permissions to assign to the file if it is created; if not given, the default is 0666.
sysread
sysread filehandle, scalar, length, offset
sysread filehandle, scalar, length
Reads length bytes from filehandle into scalar using the read system call, bypassing buffered I/O. If offset is given, the data is placed at that position within scalar. Returns the number of bytes read, zero at end of file, or undef if an error occurred.
sysseek
sysseek filehandle, pos, flag
Sets the system read/write position for the file denoted by filehandle, bypassing buffered I/O. flag tells sysseek whether to go to the absolute position pos (flag = 0), to the current position plus pos (flag = 1), or to the end of file plus pos (flag = 2). Returns the new position on success, undef on failure.
system
system list
Forks a child process to run the specified system command, waits for it to complete, and then resumes the current program. The command is the first element of list; any arguments to it are stored in subsequent list elements. Returns the command's exit status.
syswrite
syswrite filehandle, scalar, length, offset
syswrite filehandle, scalar, length
syswrite filehandle, scalar
Writes length bytes from scalar to the file denoted by filehandle, starting at character number offset within scalar if specified. If length is not given, writes the entire scalar to the file.
tell
tell filehandle
tell
Returns the current read/write position for the file marked by filehandle. If filehandle is not given, the information is given for the last accessed file.
telldir
telldir dirhandle
Returns the current readdir position for the directory marked by dirhandle.
tie
tie variable, classname, list
Binds the named variable to package class classname, which implements a variable of that type. Passes any arguments in list to the class's constructor (TIESCALAR, TIEHASH, or TIEARRAY).
tied
tied variable
Returns a reference to the object tied to the given variable.
time
time
Returns the number of non-leap seconds elapsed since Jan 1, 1970. Can be translated into recognizable time values with gmtime() and localtime().
times
times
Returns a four-element list holding the user and system CPU times (in seconds) for both the current process and its child processes:
$user  Current process user time
$system  Current process system CPU time
$cuser  Child process user time
$csystem  Child process system time
tr///
tr/string1/string2/
Transliterates a string (also known as y///).
truncate
truncate filehandle, length
Truncates the file given by filehandle or named literally by expression to length characters. Returns true on success, false otherwise.
truncate expression, length
uc
uc string
uc
Returns string in upper case, or $_ in upper case if string is omitted.
ucfirst
ucfirst string
ucfirst
Returns string with the first character in upper case. Works on $_ if string is omitted.
umask
umask expression
umask
Returns the current umask and then sets it to expression if this is given. The umask is a group of three octal digits specifying the permission bits to be withheld from newly created files and directories, for the owner, group, and other users respectively (execute = 1, write = 2, read = 4). A umask of 0777 would therefore deny all permissions, while the common default of 022 denies write permission to group and other users but leaves the owner unrestricted.
undef
undef expression
undef
Removes the value of expression, leaving it undefined; with no argument, simply returns the undefined value.
unless
unless condition block
Similar to the simple if statement, the statement runs along the lines of 'unless this exists, then {…}'.
unlink
unlink list
Deletes the files specified in list (or $_ if not given), returning the number of files deleted. NB: for UNIX users, unlink() removes a link to each file, but not the files themselves if other links to them still exist.
unpack
unpack template, string
The reverse of pack(). Takes a packed string and then uses template to read through it and return an array of the values stored within it. See pack() for how template is constructed.
unshift
unshift array, list
Adds the elements of list in the same order to the front (index 0) of array, returning the number of elements now in array.
untie
untie variable
Unbinds the variable from the package class it had previously been tied to.
until
until expression block
Executes block until expression becomes true.
use
use module version list
Requires and imports the (listed elements of the) named module at compile time. Checks module being used is the specified version if combined with module and list. use version meanwhile makes sure that the Perl interpreter being used is no older than the stated version.
use module list
use module
use version
utime
utime atime, mtime, filelist
Sets the access (atime) and modification (mtime) times on files listed in filelist, returning the number of successful changes that were made.
values
values hash
Takes the named hash and returns a list containing copies of each of the values in it.
vec
vec string, offset, bits
Takes string and regards it as a vector of unsigned integers, then returns the value of the element at position offset, where each element is bits bits wide.
wait
wait
Waits for a child process to die and then returns the process ID of the child process that did or -1 if there are no child processes.
waitpid
waitpid pid, flags
Waits for the child process with process ID pid to die.
wantarray
wantarray
Returns true if the subroutine currently running is running in list context. Returns false if not. Returns undef if the subroutine's return value is not going to be used.
warn
warn message
Prints message to STDERR, but does not throw an error or exception.
while
while expression block
Executes block as long as expression is true.
write
write filehandle
Writes a formatted record to filehandle, the file named after evaluating expression, or the current default output channel if neither are given.
write expression
write
y///
y/string1/string2/
Transliterates a string (also known as tr///).
Regular Expressions

Pattern Matching Operators

Match – 'm//'
Syntax: m/pattern/
If a match is found for the pattern within a referenced string (default $_), the expression returns true. Note that if the delimiters // are used, the preceding m is not required.
Modifiers: g, i, m, o, s, x
Substitution – 's///'
Syntax: s/pattern1/pattern2/
If a match is found for pattern1 within a referenced string (default $_), the relevant substring is replaced by the contents of pattern2, and the expression returns true.
Modifiers: e, g, i, m, o, s, x
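A short sketch of both operators in use (the sample strings are our own):

```perl
my $text = 'the quick brown fox';

# m// in boolean context
if ($text =~ m/quick/) {
    print "matched\n";                 # prints "matched"
}

# s/// bound to a copy, leaving the original untouched
(my $copy = $text) =~ s/brown/red/;
print "$copy\n";                       # prints "the quick red fox"
```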
Quoting – 'qr//'
Syntax: qr/string/imosx
This operator quotes and compiles its string as a regular expression. string is interpolated the same way as pattern in m/pattern/. It returns a Perl value which may be used instead of the corresponding /string/imosx expression.
Appendix D
Since Perl may compile the pattern at the moment of execution of qr() operator, using qr() may have speed advantages in some situations, notably if the result of qr() is used standalone.
Modifiers: i, m, o, s, x
Transliteration – 'tr///' or 'y///'
Syntax: tr/pattern1/pattern2/
        y/pattern1/pattern2/
If any characters in pattern1 match those within a referenced string (default $_), instances of each are replaced by the corresponding character in pattern2, and the expression returns the number of characters replaced. Note that if one character occurs several times within pattern1, only the first occurrence will be used. For example, tr/abbc/xyz/ is equivalent to tr/abc/xyz/.
Modifiers: c, d, s
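For example (an illustrative sketch; note the second tr// exploits the return value to count characters without changing anything):

```perl
my $string = 'hello world';

# Transliterate a copy to upper case
(my $upper = $string) =~ tr/a-z/A-Z/;

# Count the 'o's using tr's return value, leaving the string unchanged
my $count = ($string =~ tr/o/o/);

print "$upper $count\n";    # prints "HELLO WORLD 2"
```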
Delimiters
Patterns may be delimited by the character pairs <>, (), [], {}, or any other non-word character. For example, s<pattern1><pattern2> and s#pattern1#pattern2# are both equivalent to s/pattern1/pattern2/.
Binding Operators

Binding Operator – '=~'
Syntax: $refstring =~ m/pattern/
Binds a match operator to a variable other than $_. Returns true if a match is found.
Negation Operator – '!~'
Syntax: $refstring !~ m/pattern/
Binds a match operator to a variable other than $_. Returns true if a match is not found.
Modifiers

Match and Substitution
The following can be used to modify the behavior of the match and substitution operators:

Cancel Position Reset – '/c'
Used only with global matches, that is, as m//gc, to prevent the search cursor from returning to the start of the string when a match cannot be found; instead, it remains at the end of the last match found.
Evaluate Replacement – '/e'
Evaluates the second argument of the substitution operator as an expression.

Global Match – '/g'
Finds all the instances in which the pattern matches the string, rather than stopping at the first match. Multiple matches will be numbered in the operator's return value.

Case-insensitive – '/i'
Matches pattern against string while ignoring the case of the characters in either pattern or string.

Multi-line Mode – '/m'
The string to be matched against is regarded as a collection of separate lines, with the result that the metacharacters ^ and $, which would otherwise just match the beginning and end of the entire text, now also match the beginning and end of each line.

One-time Pattern Compilation – '/o'
If a pattern to match against a string contains variables, these are interpolated to form part of the pattern. Later, the variables may change and the pattern will change with them when next matched against. By adding /o, the pattern is formed once and will not be recompiled even if the variables within have changed value.
Single-line Mode – '/s'
The string to be matched against is regarded as a single line of text, with the result that the metacharacter '.' will match against the newline character, which it would not do otherwise.

Free-form – '/x'
Allows the use of whitespace and comments inside a match, to expand and explain the expression.
Transliteration
The following can be used to modify the behavior of the transliteration operator:

Complement – '/c'
Uses the complement of pattern1: substitutes all characters except those specified in pattern1.

Delete – '/d'
Deletes all the characters found but not replaced.

Squash – '/s'
Squashes runs of replaced characters, so each run is returned only once to the transliterated string.
Localized Modifiers
Syntax: /CaseSensitiveTxt((?i)CaseInsensitiveTxt)CaseSensitiveTxt/
        /CaseInsensitiveTxt((?-i)CaseSensitiveTxt)CaseInsensitiveTxt/i
The following inline modifiers can be placed within a regular expression to enforce or negate the relevant matching behavior on limited portions of the expression:

Modifier  Description       Inline enforce  Inline negate
/i        case insensitive  (?i)            (?-i)
/s        single-line mode  (?s)            (?-s)
/m        multi-line mode   (?m)            (?-m)
/x        free-form         (?x)            (?-x)
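For instance, an inline (?i) can make just the middle of a pattern case-insensitive (an illustrative sketch):

```perl
# Only the 'def' portion of this pattern is case-insensitive
my $pattern = qr/abc((?i)def)ghi/;

print "match\n"    if     'abcDEFghi' =~ $pattern;    # prints "match"
print "no match\n" unless 'ABCdefghi' =~ $pattern;    # prints "no match"
```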
Metacharacters

Metacharacter  Meaning
[abc]    Any one of a, b, or c.
[^abc]   Anything other than a, b, and c.
\d \D    A digit; a non-digit.
\w \W    A 'word' character; a non-'word' character.
\s \S    A whitespace character; a non-whitespace character.
\b       The boundary between a \w character and a \W character.
.        Any single character (apart from a newline).
(abc)    The phrase 'abc' as a group.
?        Preceding character or group may be present 0 or 1 times.
+        Preceding character or group is present 1 or more times.
*        Preceding character or group may be present 0 or more times.
{x,y}    Preceding character or group is present between x and y times.
{,y}     Preceding character or group is present at most y times.
{x,}     Preceding character or group is present at least x times.
{x}      Preceding character or group is present exactly x times.
Non-greediness for Quantifiers
Syntax: (pattern)+?
        (pattern)*?
The metacharacters + and * are greedy by default, and will try to match as much as possible of the referenced string (while still achieving a full pattern match). This 'greedy' behavior can be turned off by placing a ? immediately after the respective metacharacter. A non-greedy match finds the minimum number of characters matching the pattern.
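The difference is easiest to see against a string with repeated delimiters (a sketch with our own sample data):

```perl
my $html = '<b>bold</b> and <i>italic</i>';

my ($greedy)    = $html =~ /<(.+)>/;     # .+ grabs as much as possible
my ($nongreedy) = $html =~ /<(.+?)>/;    # .+? grabs as little as possible

print "$greedy\n";       # prints "b>bold</b> and <i>italic</i"
print "$nongreedy\n";    # prints "b"
```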
Grouping and Alternation

'|' for Alternation
Syntax: pattern1|pattern2
By separating two patterns with |, we can specify that a match on one or the other should be attempted.
'()' for Grouping and Backreferences ('capturing')
Syntax: (pattern)
This groups the elements in pattern; if those elements are matched, a backreference is made to one of the numeric special variables ($1, $2, $3, etc.).
'(?:)' for Non-backreferenced Grouping ('clustering')
Syntax: (?:pattern)
This groups the elements in pattern without making backreferences.
Lookahead/Behind Assertions

'(?=)' for positive lookahead
Syntax: pattern1(?=pattern2)
This lets us look for a match on 'pattern1 followed by pattern2', without back-referencing pattern2.
'(?!)' for negative lookahead
Syntax: pattern1(?!pattern2)
This lets us look for a match on 'pattern1 not followed by pattern2', without back-referencing pattern2.
'(?<=)' for positive lookbehind
Syntax: pattern1(?<=pattern2)
This lets us look for a match on 'pattern1 preceded by pattern2', without back-referencing pattern2. This only works if pattern2 is of fixed width.
'(?<!)' for negative lookbehind
Syntax: pattern1(?<!pattern2)
This lets us look for a match on 'pattern1 not preceded by pattern2', without back-referencing pattern2. This only works if pattern2 is of fixed width.
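A short sketch of both assertion directions (the sample string is our own):

```perl
my $text = 'perl5 perl6 python3';

# Positive lookahead: a word only when it is followed by '5'
my ($m1) = $text =~ /\b(\w+)(?=5)/;     # $m1 is 'perl'

# Negative lookbehind: the first digit not preceded by an 'l'
my ($m2) = $text =~ /(?<!l)(\d)/;       # $m2 is '3'

print "$m1 $m2\n";    # prints "perl 3"
```

In neither case does the asserted text become part of the match itself.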
Backreference Variables

Variable  Description
\num (num = 1, 2, 3…)  Within a regular expression, \num returns the substring that was matched by the numth grouped pattern in that regexp.
$num (num = 1, 2, 3…)  Outside a regular expression, $num returns the substring that was matched by the numth grouped pattern in that regexp.
$+  Returns the substring matched by the last grouped pattern in the regexp.
$&  Returns the string that matched the whole regexp; this includes portions that matched (?:) groups, which are otherwise not backreferenced.
$`  Returns everything preceding the matched string in $&.
$'  Returns everything following the matched string in $&.
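A brief sketch of these variables in use, with arbitrary data of our own:

```perl
my $date = '2001-05-14';
my ($year, $month, $day);

if ($date =~ /(\d{4})-(\d{2})-(\d{2})/) {
    ($year, $month, $day) = ($1, $2, $3);   # numbered capture variables
    print "$year/$month/$day\n";            # prints "2001/05/14"
    print "$&\n";                           # whole match: "2001-05-14"
    print "$+\n";                           # last group:  "14"
}

# \1 refers back to the first group within the same pattern
my $doubled = ('abab' =~ /^(ab)\1$/) ? 1 : 0;    # 1
```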
Other

'(?#)' for Comments
Syntax: (?#comment_text)
This lets us place comments within the body of a regular expression, as an alternative to the /x modifier.
Standard Pragmatic Modules

The following is a list of the pragmatic modules installed with Perl 5.6, ordered alphabetically. Note that these module names are case-sensitive and are given as they should be written in a use statement. Using pragmatic modules affects the compilation of Perl programs. These modules are lexically scoped, so including them with use, or excluding them with no, like this:

use attrs;
use warnings;
no integer;
no diagnostics;
is effective only for the duration of the block in which the declaration was made. Furthermore, these declarations may be reversed within any inner blocks in the program.
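The integer pragma makes this lexical scoping easy to see (a minimal sketch):

```perl
use strict;
use integer;          # integer arithmetic from here to the end of the file

my ($inner, $outer);
{
    no integer;       # the pragma is reversed within this inner block
    $inner = 7 / 2;   # 3.5
}
$outer = 7 / 2;       # 3, because use integer is back in effect

print "$inner $outer\n";    # prints "3.5 3"
```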
Appendix E
Module
Function
attributes
Gets or sets the attribute values for a subroutine or variable.
attrs
Gets or sets the attribute values for a subroutine. Deprecated in Perl 5.6 in favor of attributes.
autouse
Moves the inclusion of modules into a program from compile time to runtime. Specifically, it postpones the module's loading until one of its functions is called.
base
Takes a list of modules, requires them and then pushes them onto @ISA. Essentially, it will establish an 'IS-A' relationship with these classes at compile time.
blib
Used on the command line as -Mblib switch to test your scripts against an uninstalled version of the package named after the switch.
caller
Causes program to inherit the pragmatic attributes of the program, which has called it.
charnames
Allows you to specify a long name for a given string literal escape.
constant
Allows you to define constants as a name=>value pair.
diagnostics
Returns verbose output when errors occur at runtime. This verbose output consists of the message that Perl would normally give, plus any accompanying text for that error in the perldiag manpage.
fields
Takes a list of valid class fields for the package and enables the class fields at compile time.
filetest
Changes the operation of the -r, -w, -x, -R, -W, and -X file test operators.
integer
Changes the mathematical operators in a program to work with integers only and not floating point numbers.
less
Currently not implemented.
lib
Adds the listed directories to @INC.
locale
Enables/disables POSIX locales for built-in operations.
open
Sets default disciplines for input and output.
ops
Restricts potentially harmful operations occurring during compile time.
overload
Allows you to overload built-in Perl operations with your own subroutines.
re
Allows you to alter the way regular expressions behave.
sigtrap
Enables some simple signal handlers.
strict
Enforces the declaration of variables before their use.
subs
Allows you to predeclare subroutine names.
utf8
Enables/disables Unicode support. Note that at the time of writing, Unicode support in Perl was incomplete.
vars
Allows you to predeclare global variable names.
warnings
Switches on the extra syntactic error warning messages.
Standard Functional Modules

The standard modules are those installed with our distribution of Perl. They are listed here alphabetically.

Module
Function
AnyDBM_File
Acts as a universal virtual base class for those wanting to access any of the five types of DBM file.
AutoLoader
Works with Autosplit to delay the loading of subroutines into the program until they are called. These subroutines are defined following the __END__ token in a package file.
AutoSplit
Splits a program into files suitable for autoloading or selfloading.
B
The Perl compiler module.
B::Asmdata
Contains autogenerated data about Perl ops used in the generation of bytecode.
B::Assembler
Assembles Perl bytecode for use elsewhere.
B::Bblock
Used by B::CC to walk through 'basic blocks' of code.
B::Bytecode
Compiler backend for generating Perl bytecode.
B::C
Compiler backend for generating C source code.
B::CC
Compiler backend for generating optimized C source code.
Appendix F
Module
Function
B::Debug
Walks the Perl syntax tree, printing debug info about ops.
B::Deparse
Compiler backend for generating Perl source code from compiled Perl.
B::Disassembler
Disassembles Perl bytecode back to Perl source.
B::Graph
Compiler backend for generating graph-description documents that show the program's structure.
B::Lint
Module to catch dubious constructs.
B::Showlex
Shows the file-scope variables for a file or the lexical variables for a subroutine if one is specified.
B::Stackobj
Helper module for CC backend.
B::Stash
Shows what stashes are loaded.
B::Terse
Walks the Perl syntax tree, printing terse info about ops.
B::Xref
Compiler backend for generating cross-reference reports.
Benchmark
Contains a suite of routines that let you benchmark your code.
ByteLoader
Used to load byte-compiled Perl code.
bytes
Perl pragma to force byte semantics rather than character semantics.
CGI
The base class that provides the basic functionality for generating web content and CGI scripting.
CGI::Apache
Backward compatibility module. Deprecated in Perl 5.6.
CGI::Carp
Holds the CGI equivalents of the Carp module's error logging functions, routines for writing time-stamped entries to the HTTPD (or other) error log.
CGI::Cookie
Allows access and interaction with Netscape cookies.
CGI::Fast
Allows CGI access and interaction to a FastCGI web server.
CGI::Pretty
Generates 'pretty' HTML code on server in place of slightly less pretty HTML written in the CGI script file.
CGI::Push
Provides a CGI interface to server-side push functionality. For example, as used with channels.
CGI::Switch
Backward compatibility module. Deprecated in Perl 5.6.
CPAN
Provides you with the functionality to query, download, and build Perl modules from any of the CPAN mirrors.
CPAN::FirstTime
Utility for CPAN::Config to ask a few questions about the system and then write a config file.
CPAN::Nox
As CPAN module, but does not use any compiled extensions during its own execution.
Carp
Provides warn() and die() functionality with the added ability to say in which module something failed and what it was.
Carp::Heavy
Carp guts. For internal use only.
Class::Struct
Lets you declare C-style struct-like complex datatypes and manipulate them accordingly.
Config
Allows access to the options and settings used by Configure to build this installation of Perl.
Cwd
Gets the pathname of the current working directory.
DB
Programmatic interface to the Perl debugger's API (Application Programming Interface). NB: This may change.
DB_File
Provides an interface for access to Berkeley DB version 1.x. Note that you can access versions 2.x and 3.x of Berkeley DB with this module but will have only the version 1.x functionality.
Data::Dumper
Returns a stringified version of the contents of an object, given a reference to it.
Devel::DProf
A Perl code profiler. Generates information on the frequency of calls to subroutines and on the speediness of the subroutines themselves.
Devel::Peek
A debugging tool for those trying to write C programs interconnecting with Perl programs.
Devel::SelfStubber
Stub generator for a SelfLoading module.
DirHandle
Provides an alternative set of functions to opendir(), closedir(), readdir() and rewinddir().
Dumpvalue
Dumps info about Perl data to the screen.
DynaLoader
Dynamically loads C libraries when required into your Perl code.
English
Allows you to call Perl's special variables by their 'English' names.
Env
Allows you to access the key/value pairs in the environment hash %ENV as arrays or scalar values.
Errno
Exports (to your code) the contents of the errno.h include file. This contains all the defined error constants on your system.
Exporter
Implements the default import method for modules.
Exporter::Heavy
The internals of the Exporter module.
ExtUtils::Command
Contains equivalents of the common UNIX system commands for Windows users.
ExtUtils::Embed
Contains utilities for embedding a Perl interpreter into our C/C++ programs.
ExtUtils::Install
Contains three functions for installing, uninstalling, and installing-as-autosplit/autoload programs.
ExtUtils::Installed
Keeps track of what modules are and are not installed.
ExtUtils::Liblist
Determines which libraries should be used in an install and how to use them, and sends its findings for inclusion in a Makefile.
ExtUtils::MakeMaker
Used to create makefiles for an extension module.
ExtUtils::Manifest
Utilities to write and check a MANIFEST file.
ExtUtils::Miniperl
Contains one function to write perlmain.c, a bootstrapper between Perl and C libraries.
ExtUtils::Mkbootstrap
Contains one function to write a bootstrap file for use by DynaLoader.
ExtUtils::Mksymlists
Contains one function to write linker options files for dynamic extension.
ExtUtils::MM_Cygwin
Contains methods to override those in ExtUtils::MM_Unix when ExtUtils::MakeMaker is used on a Cygwin system.
ExtUtils::MM_OS2
Contains methods to override those in ExtUtils::MM_Unix when ExtUtils::MakeMaker is used on an OS/2 system.
ExtUtils::MM_Unix
Contains the methods used by ExtUtils::MakeMaker to work.
ExtUtils::MM_VMS
Contains methods to override those in ExtUtils::MM_Unix when ExtUtils::MakeMaker is used on a VMS system.
ExtUtils::MM_Win32
Contains methods to override those in ExtUtils::MM_Unix when ExtUtils::MakeMaker is used on a Windows system.
ExtUtils::Packlist
Contains a standard .packlist file manager.
ExtUtils::testlib
Adds blib/* directories to @INC.
Fatal
Provides a way to replace functions that return false with functions that raise an exception on failure.
Fcntl
Loads the libc fcntl.h defines.
File::Basename
Provides functions that work on a file's full path name.
File::CheckTree
Allows you to specify file tests to be made on directories and files within a directory all at once.
File::Compare
Compares the contents of two files.
File::Copy
Copies files or directories.
File::DosGlob
Implements DOS-like globbing but also accepts wildcards in directory components.
File::Find
Searches/traverses a file tree for requested file.
File::Glob
Implements the FreeBSD glob routine.
File::Path
Creates or deletes a series of directories.
File::Spec
Group of functions to work on file properties and paths.
File::Spec::Functions
Support module for File::Spec.
File::Spec::Mac
Contains methods to override those in File::Spec::Unix when File::Spec is used on a MacOS system.
File::Spec::OS2
Contains methods to override those in File::Spec::Unix when File::Spec is used on an OS/2 system.
File::Spec::Unix
Methods used by File::Spec.
File::Spec::VMS
Contains methods to override those in File::Spec::Unix when File::Spec is used on a VMS system.
File::Spec::Win32
Contains methods to override those in File::Spec::Unix when File::Spec is used on a Win32 system.
File::stat
A by-name interface to Perl's built-in stat() functions.
FileCache
Allows you to keep more files open than the system allows.
FileHandle
Provides an OO-style implementation of filehandles.
FindBin
Locates the directory holding the currently running Perl program.
GDBM_File
Provides access to the GNU gdbm library via tied hashes.
Getopt::Long
Enables the parsing of long switch names on the command line.
Getopt::Std
Enables the parsing of single-character switches and clustered switches on the command line.
I18N::Collate
Compares 8-bit scalar data according to the current locale. Deprecated in Perl 5.004.
IO
Front-end to load the IO modules listed below.
IO::Dir
Provides an OO-style implementation for directory handles.
IO::File
Based on FileHandle, it provides an OO-style implementation of filehandles.
IO::Handle
Provides an OO-style implementation for I/O handles.
IO::Pipe
Provides an OO-style implementation for pipes.
IO::Poll
Provides an OO-style implementation to system poll calls.
IO::Seekable
Provides seek(), sysseek() and tell() methods for I/O objects.
IO::Select
Provides an OO-style implementation for the select system call.
IO::Socket
Provides an OO-style implementation for socket communications.
IO::Socket::INET
Provides an OO-style implementation for AF_INET domain sockets.
IO::Socket::UNIX
Provides an OO-style implementation for AF_UNIX domain sockets.
IPC::Msg
Implements a System V Messaging IPC object class.
IPC::Open2
Opens a process for both reading and writing.
IPC::Open3
Opens a process for reading, writing, and error handling.
IPC::Semaphore
Implements a System V Semaphore IPC object class.
IPC::SysV
Exports all the constants needed by System V IPC calls as defined in your system's libraries.
Math::BigFloat
Enables the storage of arbitrarily long floating-point numbers.
Math::BigInt
Enables the storage of arbitrarily long integers.
Math::Complex
Enables work with complex numbers and associated mathematical functions.
Math::Trig
Provides all the trigonometric functions not defined in the core of Perl.
NDBM_File
Provides access to 'new' DBM files via tied hashes.
Net::Ping
Provides the ability to ping a remote machine via TCP, UDP and ICMP protocols.
Net::hostent
Replaces the core gethost*() functions with those that return Net::hostent objects.
Net::netent
Replaces the core getnet*() functions with those that return Net::netent objects.
Net::protoent
Replaces the core getproto*() functions with those that return Net::protoent objects.
Net::servent
Replaces the core getserv*() functions with those that return Net::servent objects.
O
This is the generic front-end for the Perl compiler. The back-ends in the B module group are all addressed with this.
Opcode
Allows you to disable named opcodes when compiling Perl code.
open
Pragma that sets default disciplines for input and output.
Pod::Checker
Provides a syntax error checker for pod documents. Note that this was still in beta at the time of publication.
Pod::Html
pod to HTML converter.
Pod::InputObjects
A set of objects that can be used to represent pod files.
Pod::Man
pod to *roff converter.
Pod::Parser
Base class for creating pod filters and translators.
Pod::Select
Used to extract selected sections of pod from input.
Pod::Text
pod to formatted ASCII text converter.
Pod::Text::Color
pod to formatted, colored ASCII text converter.
Pod::Text::Termcap
Converts pod data to ASCII text with format escapes.
Pod::Usage
Prints a usage message from embedded pod documentation.
POSIX
Provides access to (nearly) all the functions and identifiers named in the POSIX international standard 1003.1.
Safe
Creates a number of 'safe' compartments in memory in which Perl code can be tested and the functions for this testing.
SDBM_File
Provides access to sdbm files via tied hashes.
Search::Dict
Provides function to look for a key in a dictionary file.
SelectSaver
Selects a filehandle on creation, saves it and restores it on destruction.
SelfLoader
Like AutoLoader, works with AutoSplit to delay the loading of subroutines into the program until they are called. These subroutines are defined following the __DATA__ token in a package file.
Shell
Allows shell commands to be run transparently within Perl programs.
Socket
Imports the definitions from libc's socket.h header file and makes available some network manipulation functions.
Symbol
Qualifies variable names and creates anonymous globs.
Sys::Hostname
Makes several attempts to get the system hostname and then caches the result.
Sys::Syslog
Perl's interface to the libc syslog(3) calls.
Term::Cap
Provides the interface to a terminal capability database.
Term::Complete
Provides word completion on the word list in an array.
Term::ReadLine
Provides access to various 'readline' packages.
Test
Provides a simple framework for writing test scripts.
Test::Harness
Implements a test harness to run a series of test scripts and returns the results.
Text::Abbrev
Takes a list and returns a hash containing the elements of the list as the values and unambiguous abbreviations of each element as their respective keys.
Text::ParseWords
Provides functions for parsing a text file into an array of tokens or an array of arrays.
Text::Soundex
Implementation of the Soundex Algorithm.
Text::Tabs
Works through lines of text replacing tabs with spaces, or, to save space, replacing spaces with tabs.
Text::Wrap
Simple paragraph formatter. Takes text, wraps lines around text boundaries, and controls the indenting of the text.
Tie::Array
Base class for tied arrays.
Tie::Handle
Base class definitions for tied handles.
Tie::Hash
Base class definitions for tied hashes.
Tie::RefHash
Allows you to use references as the keys in a hash if it is tied to this module.
Tie::Scalar
Base class definitions for tied scalars.
Tie::SubstrHash
Allows you to rigidly define key and value lengths within the hash for the entire time it is tied to this module.
Time::gmtime
Object-based interface to Perl's built-in gmtime() function.
Time::Local
Provides efficient conversion functions between GMT and local time.
Time::localtime
Object-based interface to Perl's built-in localtime() function.
Time::tm
Internal object used by Time::gmtime and Time::localtime.
UNIVERSAL
The base class for ALL classes (blessed references).
User::grent
Object-based interface to Perl's built-in getgr*() functions.
User::pwent
Object-based interface to Perl's built-in getpw*() functions.
Win32::ChangeNotify
Monitors events related to files and directories.
Win32::Console
Use the Win32 console and character mode functions.
Win32::Event
Use Win32 event objects from Perl.
Win32::EventLog
Process Win32 event logs from Perl.
Win32::File
Manage file attributes in Perl.
Win32::FileSecurity
Manage FileSecurity discretionary access control lists in Perl.
Win32::IPC
The base class for Win32 synchronization objects.
Win32::Internet
Access to WININET.DLL functions.
Win32::Mutex
Use Win32 mutex objects from Perl.
Win32::NetAdmin
Manage network groups and users in Perl.
Win32::NetResource
Manage network resources in Perl.
Win32::ODBC
Use the ODBC extension for Win32.
Win32::OLE
Use the OLE Automation extensions.
Win32::OLE::Const
Extract the constant definitions from TypeLib.
Win32::OLE::NLS
Use the OLE national language support.
Win32::OLE::Variant
Create and modify OLE VARIANT variables.
Win32::PerfLib
Access the Windows NT performance counter.
Win32::Process
Create and manipulate processes.
Win32::Semaphore
Use Win32 semaphore objects from Perl.
Win32::Service
Manage system services in Perl.
Win32::Sound
Plays with Windows sounds.
Win32::TieRegistry
Manipulate a registry.
Win32API::File
Low-level access to Win32 system API calls for files/dirs.
Win32API::Net
Manage Windows NT LanManager API accounts.
Win32API::Registry
Low-level access to Win32 system API calls from WINREG.H.
XSLoader
Dynamically loads C or C++ libraries as Perl extensions.
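As a hedged example of two of the modules listed above working together (the sample password-file line below is invented for the demonstration), Text::ParseWords can split quoted fields and Data::Dumper can display the result:

```perl
#!/usr/bin/perl
# functional_demo.pl - sketch combining Text::ParseWords and Data::Dumper
use strict;
use warnings;
use Text::ParseWords;    # quotewords: split text respecting quotes
use Data::Dumper;        # Dumper: stringify a data structure

my $line = q{root:x:0:0:"Super User":/root:/bin/bash};    # invented sample
my @fields = quotewords(':', 0, $line);    # keep flag 0 => strip quotes
print "user=$fields[0] uid=$fields[2]\n";  # prints "user=root uid=0"
print Dumper(\@fields);                    # full structure for inspection
```

Because the keep flag is 0, the quotes around "Super User" are removed once the field has been parsed, so `$fields[4]` contains just `Super User`.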
Perl Resources
Books

This is a list of books that the reader may find interesting.
Perl Programming
Beginning Perl, Simon Cozens, Wrox Press (ISBN 1861003439).
CGI Programming with Perl (Second Edition), Scott Guelich, Shashir Gundavaram, and Gunther Birznieks, O'Reilly and Associates (ISBN 1565924193).
Learning Perl, Randal L. Schwartz and Tom Christiansen, O'Reilly and Associates (ISBN 1565922840).
Learning Perl/Tk, Nancy Walsh, O'Reilly and Associates (ISBN 1565923146).
Mastering Algorithms with Perl, Jon Orwant, Jarkko Hietaniemi, and John Macdonald, O'Reilly and Associates (ISBN 1565923987).
Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly and Associates (ISBN 1565922573).
Object Oriented Perl, Damian Conway, Manning (ISBN 1884777791).
Perl Cookbook, Tom Christiansen and Nathan Torkington, O'Reilly and Associates (ISBN 1565922433).
Programming Perl (Third Edition), Larry Wall, Tom Christiansen, and Jon Orwant, O'Reilly and Associates (ISBN 0596000278).
Programming the Perl DBI, Alligator Descartes and Tim Bunce, O'Reilly and Associates (ISBN 1565926994).
Appendix G
Linux Based Books

Beginning Linux Programming, Neil Matthew and Richard Stones, Wrox Press (ISBN 1861002971).
Professional Apache, Peter Wainwright, Wrox Press (ISBN 1861003021).
Professional Linux Programming, Neil Matthew and Richard Stones, Wrox Press (ISBN 1861003013).
Web Sites

There are a great many resources for Perl on the Internet; here are just a few places we might want to start looking.
Support, Errata, and p2p.wrox.com

One of the most irritating things about any programming book is when you find that bit of code you've just spent an hour typing simply doesn't work. You check it a hundred times to see if you've set it up correctly and then you notice the spelling mistake in the variable name on the book page. Of course, you can blame the authors for not taking enough care and testing the code, the editors for not doing their job properly, or the proofreaders for not being eagle-eyed enough, but this doesn't get around the fact that mistakes do happen.

We try hard to ensure no mistakes sneak out into the real world, but we can't promise that this book is 100% error free. What we can do is offer the next best thing by providing you with immediate support and feedback from experts who have worked on the book, and try to ensure that future editions eliminate these gremlins. We also now commit to supporting you not just while you read the book, but once you start developing applications as well, through our online forums where you can put your questions to the authors, reviewers, and fellow industry professionals.

In this appendix we'll look at how to:

❑ Enroll in the Programmer To Programmer™ forums at http://p2p.wrox.com.
❑ Post and check for errata on our main site, http://www.wrox.com.
❑ E-mail technical support queries or feedback on our books in general.
Between all three of these support procedures, you should get an answer to your problem in no time at all.
Appendix H
The Online Forums at p2p.wrox.com

Join the Professional Perl Programming mailing list for author and peer support. Our system provides Programmer To Programmer™ support on mailing lists, forums, and newsgroups, all in addition to our one-to-one e-mail system, which we'll look at in a minute. Be confident that your query is not just being examined by a support professional, but by the many Wrox authors and other industry experts present on our mailing lists.
How to Enroll for Support

Just follow these simple instructions:

1. Go to http://p2p.wrox.com in your favorite browser. Here you'll find any current announcements concerning P2P – new lists created, any removed, and so on.
2. Click on the Open Source button in the left hand column.
3. Choose to access the Pro Perl list.
4. If you are not a member of the list, you can choose to either view the list without joining it or create an account in the list, by hitting the respective buttons.
5. If you wish to join, you'll be presented with a form in which you'll need to fill in your e-mail address, name, and a password (of at least 4 alphanumeric characters). Choose how you would like to receive the messages from the list and then hit Save.
6. Congratulations. You're now a member of the Professional Perl Programming mailing list.
Why this System Offers the Best Support

You can choose to join the mailing lists and receive mails as they are contributed, as a daily digest, or as a weekly digest. If you don't have the time or facility to receive the mailing list, then you can search our online archives, where you'll find the ability to search on specific subject areas or keywords. As these lists are moderated, you can be confident of finding good, accurate information quickly. Mails can be edited or moved by the moderator into the correct place, making this a most efficient resource. Junk and spam mail are deleted, and your own e-mail address is protected by the unique Lyris system from web-bots that can automatically hoover up newsgroup mailing list addresses. Any queries about joining or leaving lists, or any query about the list, should be sent to: [email protected].
Checking the Errata Online at www.wrox.com

The following section will take you step by step through the process of posting errata to our web site to get that help. The sections that follow, therefore, are:

❑ Finding a list of existing errata on the web site.
❑ Adding your own erratum to the existing list.

There is also a section covering how to e-mail a question for technical support. This comprises:

❑ What your e-mail should include.
❑ What happens to your e-mail once it has been received by us.
Finding an Erratum on the Web Site

Before you send in a query, you might be able to save time by finding the answer to your problem on our web site – http://www.wrox.com.
Each book we publish has its own page and its own errata sheet. You can get to any book's page by clicking on Books on the left hand navigation bar. To view the errata for that book, click on the Book errata link on the right hand side of the book information pane, underneath the book information. We update these pages regularly to ensure that you have the latest information on bugs and errors.
Add an Erratum

If you wish to point out an erratum to put up on the web site, or directly query a problem in the book with an expert who knows the book in detail, then e-mail [email protected], with the title of the book and the last four numbers of the ISBN in the subject field of the e-mail. Clicking on the submit errata link on the web site's errata page will send an e-mail using your e-mail client. A typical e-mail should include the following things:

❑ The name, last four digits of the ISBN, and page number of the problem in the Subject field.
❑ Your name, contact info, and the problem in the body of the message.
We won't send you junk mail. We need the details to save both your time and ours. If we need to replace a disk or CD we'll be able to get it to you straight away. When you send an e-mail it will go through the following chain of support:
Customer Support

Your message is delivered to one of our customer support staff, who will be the first people to read it. They have files on most frequently asked questions and will answer anything general immediately. They answer general questions about the book and the web site.
Editorial

Deeper queries are forwarded to the technical editor responsible for that book. They have experience with the programming language or particular product and are able to answer detailed technical questions on the subject. Once an issue has been resolved, the editor can post the erratum to the web site.
The Authors

Finally, in the unlikely event that the editor can't answer your problem, they will forward the request to the author. We try to protect the author from any distractions from writing. However, we are quite happy to forward specific requests to them. All Wrox authors help with the support on their books. They'll mail the customer and the editor with their response, and again all readers should benefit.
What We Can't Answer

Obviously, with an ever-growing range of books and an ever-changing technology base, there is an increasing volume of data requiring support. While we endeavor to answer all questions about the book, we can't answer bugs in your own programs that you've adapted from our code. So, while you might have loved the chapters on file handling, don't expect too much sympathy if you cripple your company with a routine that deletes the contents of your hard drive. But do tell us if you're especially pleased with the routine you developed with our help.
How to Tell Us Exactly What You Think

We understand that errors can destroy the enjoyment of a book and can cause many wasted and frustrated hours, so we seek to minimize the distress that they can cause. You might just wish to tell us how much you liked or loathed the book in question. Or you might have ideas about how this whole process could be improved. If this is the case, you should e-mail [email protected]. You'll always find a sympathetic ear, no matter what the problem is. Above all, you should remember that we do care about what you have to say and we will do our utmost to act upon it.