, and list tag < li >. Omini computes the percentage of times each tag is used as an SRR separator in a set of sample Web pages and then ranks the tags in descending order of their percentages.
4. SEARCH ENGINE INCORPORATION
• Sibling tag heuristic. This heuristic counts pairs of tags that are immediate siblings in the minimal subtree and ranks the tag pairs in descending order of the counts. If two pairs have the same count, the one that appears first is preferred. This heuristic is motivated by the observations that, given a minimal subtree, the separator should appear the same number of times as the SRRs, and that tag pairs have stronger meanings than individual tags.
• Partial path heuristic. This heuristic identifies all tag paths from each candidate tag to any reachable node in the subtree rooted at the candidate tag and ranks candidate tags in descending order of the number of identical paths from them. If two tags have the same number of partial paths, the one with the longer path is preferred. This heuristic is based on the observation that the SRRs returned by the same search engine usually have similar tag structures.
Omini employs a probabilistic method to combine the five heuristics into an integrated solution. Specifically, Omini first estimates the success rate of each heuristic based on a set of sample pages (it is the percentage of the sample pages in which the top-ranked candidate separator of the heuristic is correct), then evaluates the overall success rates of different combinations of the five heuristics based on the assumption that the heuristics are independent. The evaluation determined that the combination using all of the five heuristics has the best performance (Buttler et al., 2001).
Omini’s SRR separator discovery method was heavily influenced by an earlier work (Embley et al., 1999). The first two heuristics used in Omini’s solution were first proposed in the earlier work, and the third heuristic was revised from a heuristic in the earlier work.
Step 3: SRR extraction. Once the SRR separator is identified, it is straightforward to use it to segment the query result section into SRRs.
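The sibling tag heuristic and the segmentation in Step 3 can be sketched as follows. This is a minimal illustration, not Omini's implementation: it assumes the children of the minimal subtree are given as a flat list of tag names, the function names `rank_sibling_pairs` and `segment_by_separator` are hypothetical, and it assumes the separator tag marks the start of each SRR, which the text does not specify.

```python
from collections import Counter


def rank_sibling_pairs(child_tags):
    """Rank pairs of immediately adjacent sibling tags by occurrence count.

    Pairs are sorted in descending order of count; ties are broken in
    favor of the pair that appears first.
    """
    counts = Counter()
    first_seen = {}
    for i, pair in enumerate(zip(child_tags, child_tags[1:])):
        counts[pair] += 1
        first_seen.setdefault(pair, i)
    return sorted(counts, key=lambda p: (-counts[p], first_seen[p]))


def segment_by_separator(child_tags, separator):
    """Split the children into records, opening a new record at each separator.

    Assumption: the separator tag is the first tag of a record.
    """
    records = []
    for tag in child_tags:
        if tag == separator or not records:
            records.append([])
        records[-1].append(tag)
    return records


# Hypothetical children of a minimal subtree: three records of <dt><dd>.
children = ["dt", "dd", "dt", "dd", "dt", "dd"]
ranked = rank_sibling_pairs(children)
records = segment_by_separator(children, ranked[0][0])
```

Here the pair (dt, dd) occurs three times and ranks first, and splitting on its leading tag yields three records, one per SRR.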
However, it is possible that the separator is not perfect, which leads to some incorrectly extracted SRRs. Two problematic situations are identified (Buttler et al., 2001). The first is that an SRR may have been broken into multiple pieces by the separator. In this case, these pieces need to be merged together to reconstruct the SRR. The second is that some extraneous SRRs (e.g., advertisements) may have been extracted. Omini identifies extraneous SRRs by looking for SRRs that differ from the majority of the extracted SRRs, either in their set of tags or in their size. The possibility that multiple SRRs may have been incorrectly grouped into a single SRR was not addressed in this work. Wrapper rules were not explicitly discussed by Buttler et al. (2001), but it is easy to see that the tag path to the root of the minimal subtree and the SRR separator tag identified for a response page returned by a search engine can be stored as the extraction wrapper for extracting SRRs from new response pages returned by the same search engine.
ViNTs
ViNTs (Zhao et al., 2005) is an automatic wrapper generator that was specifically designed to extract SRRs from search engine returned response pages. It is also one of the earliest automatic wrapper generators that utilize both the visual information on response pages and the tag structures of the HTML source documents to generate wrappers. ViNTs makes use of visual features and tag
4.2. SEARCH RESULT EXTRACTION
structures as follows. It first uses visual features to identify candidate SRRs. It then derives candidate wrappers from the tag paths related to the candidate SRRs. Finally, it selects the most promising wrapper using both visual features and tag structures. ViNTs takes one or more sample response pages from a search engine as input and generates, as output, a wrapper for extracting the SRRs from new response pages returned by the same search engine. The sample response pages can be generated automatically by the system, which submits automatically generated sample queries to the search engine. For each input sample response page, its tag tree is built to analyze its tag structures, and the page itself is rendered in a browser to extract its visual information.
The content line is the basic building block of the ViNTs approach. It is a group of characters that visually form a horizontal line in the same section on the rendered page. In ViNTs, eight types of content lines are differentiated, such as link line (the line is the anchor text of a hyperlink), text line, link-text line (has both text and link), blank line, and so on. Each type of content line is assigned a code, called the type code. Each content line has a rendering box, and the left x coordinate of the rendering box is called the position code of the content line. Thus, each content line is represented as a (type code, position code) pair.
Example 4.1 In Fig. 4.6, the first SRR has 5 content lines: the first line is a link line with type code 1, the second and third lines are text lines with type code 2 (in ViNTs, adjacent content lines of the same type are merged into a single line, which means that the second and third content lines will be treated as a single text line), the fourth line is a link-text line with type code 3, and the fifth line is a blank line with type code 8. All these lines have the same position code, say 50, as no line is indented.
Figure 4.6: A Segment of a Sample Response Page (content lines labeled L1 through L12).
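The content line representation in Example 4.1 can be sketched in a few lines. The class name `ContentLine` and the function `merge_adjacent` are hypothetical, only the four type codes given in the example are used, and the sketch assumes a merged line keeps the position code of its first line, which the text does not state.

```python
from dataclasses import dataclass

# Type codes from Example 4.1; the other four types are not listed in the text.
LINK, TEXT, LINK_TEXT, BLANK = 1, 2, 3, 8


@dataclass(frozen=True)
class ContentLine:
    type_code: int      # kind of content line
    position_code: int  # left x coordinate of the rendering box


def merge_adjacent(lines):
    """Merge runs of adjacent content lines that share a type code.

    Assumption: the merged line keeps the first line's position code.
    """
    merged = []
    for line in lines:
        if merged and merged[-1].type_code == line.type_code:
            continue  # same type as the previous line: treated as one line
        merged.append(line)
    return merged


# The first SRR of Example 4.1: five raw content lines, none indented.
first_srr = [
    ContentLine(LINK, 50),
    ContentLine(TEXT, 50),
    ContentLine(TEXT, 50),
    ContentLine(LINK_TEXT, 50),
    ContentLine(BLANK, 50),
]
pairs = [(c.type_code, c.position_code) for c in merge_adjacent(first_srr)]
```

After merging, the two text lines collapse into one, so the SRR is represented by four (type code, position code) pairs.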
To remove useless content lines from a sample response page, another response page for a nonexistent query string, called a no-result page, is utilized. Basically, those content lines in the sample response page that also appear in the no-result page are removed. The remaining content lines are grouped into blocks based on candidate content line separators (CCLSs), which are content lines that appear at least three times. ViNTs requires each sample response page used for wrapper generation to contain at least four SRRs. Each CCLS may contain one or more consecutive content lines, and the content lines of a CCLS form the ending part of each block.
Example 4.2 Consider Fig. 4.6. If the blank line is a CCLS, or if a link-text line followed by a blank line together form a CCLS, the yielded blocks will correspond to the correct SRRs. However, when the link line is a CCLS, the following blocks will be formed: (L1), (L2, L3, L4, L5), (L6, L7, L8, L9). Note that (L10, L11, L12) do not form a block as the last line is not a link line.
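The block formation in Example 4.2 can be reproduced with a short sketch. It assumes, based on Examples 4.1 and 4.2 rather than on the figure itself, that the twelve content lines of Fig. 4.6 carry the type codes 1, 2, 3, 8 repeated three times after merging. The function name `partition_by_ccls` is hypothetical.

```python
def partition_by_ccls(type_codes, ccls):
    """Group content lines into blocks, each ending with the CCLS.

    type_codes: type code of each content line, in page order.
    ccls: tuple of one or more consecutive type codes forming the separator.
    Returns lists of 1-based line numbers; trailing lines that do not end
    with the CCLS are dropped.
    """
    ccls = tuple(ccls)
    n = len(ccls)
    blocks = []
    start = 0  # index of the first line of the block being built
    i = 0
    while i + n <= len(type_codes):
        if tuple(type_codes[i:i + n]) == ccls:
            blocks.append(list(range(start + 1, i + n + 1)))
            start = i + n
            i += n
        else:
            i += 1
    return blocks


# Assumed type codes for L1..L12: link, text, link-text, blank, three times.
page = [1, 2, 3, 8] * 3

by_blank = partition_by_ccls(page, (8,))
by_link_text_blank = partition_by_ccls(page, (3, 8))
by_link = partition_by_ccls(page, (1,))
```

With the blank line, or the link-text line followed by the blank line, as the CCLS, the blocks are L1 to L4, L5 to L8, and L9 to L12. With the link line as the CCLS, the blocks are (L1), (L2, L3, L4, L5), (L6, L7, L8, L9), and the last three lines form no block, as in the example.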
In ViNTs, each block is characterized by three features: type code – the sequence of type codes of the content lines of the block, position code – the smallest x coordinate of the block, i.e., the one closest to the left boundary of the rendered page, and shape code – the ordered list of the position codes of the content lines of the block. Given two blocks, distances between their type codes, position codes, and shape codes can be defined. Two blocks are visually similar if these distances are all below their respective thresholds. Each CCLS that yields a sequence of visually similar blocks is kept. Such a sequence of blocks is called a block group. Intuitively, each block group corresponds to a section on the response page. There may be multiple such sections, for example, one for the real SRRs (i.e., the query result section) and one or more for advertisement records.
In general, these blocks do not necessarily correspond to the actual SRRs. For example, when the link line is used as a CCLS in Fig. 4.6, two of the produced blocks, i.e., (L2, L3, L4, L5) and (L6, L7, L8, L9), are visually similar, but they do not correspond to actual SRRs. To find the blocks that correspond to the correct records, ViNTs identifies the first line of each SRR in each of the current blocks. These first lines in different blocks repartition the content lines in each section into new blocks. ViNTs uses several heuristic rules to identify the first line of a record from a given block. For example, two of these heuristic rules are (1) the only content line that starts with a number is a first line; and (2) if there is only one blank line in a block, the line immediately following the blank line is the first line. For the example in Fig. 4.6, the link line immediately following the blank line in each block will be identified as the first line.
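The three block features and first-line rule (2) can be sketched as follows. The feature definitions follow the text, but the distance functions and thresholds in `visually_similar` are assumptions made only for illustration, since the text does not define them. All names (`Block`, `visually_similar`, `first_line_index`) are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import zip_longest

BLANK = 8  # type code of a blank line, from Example 4.1


@dataclass(frozen=True)
class Block:
    lines: tuple  # non-empty tuple of (type_code, position_code) pairs

    @property
    def type_code(self):
        return tuple(t for t, _ in self.lines)

    @property
    def position_code(self):
        return min(x for _, x in self.lines)

    @property
    def shape_code(self):
        return tuple(x for _, x in self.lines)


def visually_similar(a, b, max_type=0.25, max_pos=10, max_shape=10):
    """Check that all three distances fall below their thresholds.

    The distances and default thresholds are illustrative assumptions.
    """
    type_dist = 1 - SequenceMatcher(None, a.type_code, b.type_code).ratio()
    pos_dist = abs(a.position_code - b.position_code)
    # Compare shapes as indentation offsets relative to each block's left edge.
    offs_a = [x - a.position_code for x in a.shape_code]
    offs_b = [x - b.position_code for x in b.shape_code]
    shape_dist = max(abs(p - q) for p, q in zip_longest(offs_a, offs_b, fillvalue=0))
    return type_dist < max_type and pos_dist < max_pos and shape_dist < max_shape


def first_line_index(block):
    """Rule (2): if the block has exactly one blank line, the next line is a first line."""
    blanks = [i for i, (t, _) in enumerate(block.lines) if t == BLANK]
    if len(blanks) == 1 and blanks[0] + 1 < len(block.lines):
        return blanks[0] + 1
    return None


# Blocks (L2, L3, L4, L5) and (L6, L7, L8, L9) from the link-line CCLS,
# assuming type codes text, link-text, blank, link at position 50.
b1 = Block(((2, 50), (3, 50), (8, 50), (1, 50)))
b2 = Block(((2, 50), (3, 50), (8, 50), (1, 50)))
similar = visually_similar(b1, b2)
first = first_line_index(b1)
```

Under these assumptions the two blocks are visually similar, and rule (2) selects the fourth line of the block, the link line that follows the blank line, as the first line of a record.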
Based on the new blocks in each block group, ViNTs generates a candidate wrapper from the tag paths from the root of the tag tree to the beginning of each new block in the block group. In ViNTs, wrappers are regular expressions of the format: prefix (X (separator1 | separator2 | …))[min, max], where prefix is a tag path, X is a wildcard for any sequence of tags representing subtrees rooted at these tags (called a sub-forest) of the tag tree of the response page, each separator is also a sequence of tags representing a sub-forest of the tag tree¹⁴, “|” is the alternation operator, the concatenation of X and a separator corresponds to a record (i.e., each occurrence of the child sub-forest rooted at the tags in X and a separator corresponds to a record), and min and max are used to select SRRs within a range from a list of SRRs (usually, min and max are set to 1 and the maximum number of records that may appear in a response page, respectively).
¹⁴ Note that the tag-based separators here are different from the content line based separators (i.e., CCLSs) that are used earlier for content line partition. Sometimes more than one separator may be needed (Zhao et al., 2005).
The prefix determines the minimal subtree t that contains all records in the block group. The separators are used to segment all descendants of t into records. The following example illustrates how the prefix and separator are derived from a sequence of tag paths, from the root of the tag tree to the beginning of each block, in a block group. Each tag path is a sequence of tag nodes, and each tag node consists of a tag followed by a direction code, which can be either C or S (Zhao et al., 2005). Tag node
Consider the following four tag paths for three consecutive blocks:
P1: < html >C< head >S< body >C< img >S< center >S< hr > SS< hr > S< dl >C< dt >C< strong >CC
P2: < html >C< head >S< body >C< img >S< center >S< hr >SS< hr > S< font >S