| | | •HTML/BODY/TABLE/TR/TD[2] •HTML/BODY/TABLE/TR[1]/TD[2] •HTML/BODY/TABLE[1]/TR[1]/TD[2] •HTML/BODY[1]/TABLE[1]/TR[1]/TD[2] •HTML[1]/BODY[1]/TABLE[1]/TR[1]/TD[2] •HTML[1]/BODY[1]/TABLE[1]/TR[1]/TD[2] •HTML[1]/BODY[./FORM]/TABLE[1]/TR[1]/TD[2] •HTML[1]/BODY[./FORM/INPUT]/TABLE[1]/TR[1]/TD[2] •HTML[1]/BODY[1]/TABLE[../FORM]/TR[1]/TD[2] •HTML[1]/BODY[1]/TABLE[../FORM/INPUT]/TR[1]/TD[2] •HTML[1]/BODY[1]/TABLE[1]/TR[../../FORM][1]/TD[2] •HTML[1]/BODY[./FORM]/TABLE[../FORM]/TR[../../FORM][1]/TD[2]
…
Visual Environment for DOM-Based Wrapping and Client-Side Linkage of Web Applications
Figure 6. An HTML document with its DOM tree
fore, the restriction of the language to HTML paths solves the problem of choosing among expressions. Another approach to this problem might be to exploit the common substructure of the XPath expressions for more than two HTML elements. Sugibuchi and the second author of this chapter have developed an interactive method for constructing Web wrappers that extract relational information from Web documents (Sugibuchi et al., 2004). Their approach employs a generalization method for XPath expressions and succeeds in interactively extracting the intended portions of information from Web documents.
Figure 6 shows an HTML document of a Web application with its DOM tree representation. For example, the circled portion of the document in Figure 6 corresponds to the circled node whose HTML path is:
HTML[1]/BODY[1]/FORM[1]/INPUT[1]
This is the HTML path of an input element of this Web application.
Algorithms to Construct HTML Paths
Next, we show two algorithms for finding an HTML path to a given HTML element. There are two basic strategies for finding an HTML path: top-down and bottom-up. Figure 7 illustrates these two strategies. Algorithms 1 and 2 construct an HTML path in a top-down way and in a bottom-up way, respectively. Both algorithms are written in C# style. The class HTMLPath represents HTML paths and provides methods to select nodes in an HTML document.
In general, the bottom-up approach of Algorithm 2 runs faster than the top-down approach of Algorithm 1. However, because of the DOM implementation in Internet Explorer 6.0, the bottom-up approach often constructs a wrong path. For this reason, we currently adopt the top-down approach in our implementation.
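To make the bottom-up strategy concrete, the following is a minimal sketch and not the chapter's Algorithm 2. It assumes a hypothetical in-memory `Node` class with reliable parent links in place of the MSHTML DOM interfaces; the names `Node`, `BottomUp`, and `PathTo` are introduced only for illustration. Starting at the target, it climbs to the root and, at each level, counts the siblings of the same name up to and including the current node to obtain the position index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal element node; it stands in for a real DOM node type.
public class Node
{
    public string Name;
    public Node Parent;
    public List<Node> Children = new List<Node>();

    public Node(string name) { Name = name; }

    // Appends a child element and returns it so that calls can be chained.
    public Node Add(string name)
    {
        Node child = new Node(name);
        child.Parent = this;
        Children.Add(child);
        return child;
    }
}

public static class BottomUp
{
    // Builds a path such as "HTML[1]/BODY[1]/FORM[1]/INPUT[1]" by walking
    // from the target up through its ancestors, stopping at 'root'.
    // Returns null when the target is not a descendant of 'root'.
    public static string PathTo(Node root, Node target)
    {
        List<string> steps = new List<string>();
        for (Node n = target; n != null && n != root; n = n.Parent)
        {
            if (n.Parent == null) return null; // reached the top without meeting root

            // Position = number of same-named siblings up to and including n.
            int position = 0;
            foreach (Node sibling in n.Parent.Children)
            {
                if (sibling.Name == n.Name) position++;
                if (sibling == n) break;
            }
            steps.Insert(0, n.Name + "[" + position + "]");
        }
        return string.Join("/", steps);
    }
}
```

Each level costs only a scan of one sibling list, so the work is proportional to the depth of the target rather than to the size of the whole tree, which is consistent with the speed advantage noted above. The sketch is only as reliable as the parent links it follows.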
Evaluation of an HTML Path
The value identified by the HTML path
HTML[1]/BODY[1]/A[1]/@href
is the string “http://ca.meme.hokudai.ac.jp”.
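As an illustration of how such a path might be evaluated, the sketch below walks a path of `NAME[k]` steps that ends in an `@attribute` step. It is written under stated assumptions: `Element` and `PathEvaluator` are hypothetical names, the tree is a plain in-memory structure rather than the MSHTML DOM, and only this restricted path form is handled. It is not the HTMLPath class described in this chapter.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal element type with attributes; not the MSHTML interfaces.
public class Element
{
    public string Name;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
    public List<Element> Children = new List<Element>();

    public Element(string name) { Name = name; }

    // Appends a child element and returns it so that calls can be chained.
    public Element Add(string name)
    {
        Element child = new Element(name);
        Children.Add(child);
        return child;
    }
}

public static class PathEvaluator
{
    // Evaluates steps of the form NAME[k], ending in an @attribute step.
    // Returns the attribute value, or null when a step does not match
    // or when the path does not end in an attribute step.
    public static string Evaluate(Element documentNode, string path)
    {
        Element current = documentNode;
        foreach (string step in path.Split('/'))
        {
            if (step.StartsWith("@", StringComparison.Ordinal))
            {
                string value;
                return current.Attributes.TryGetValue(step.Substring(1), out value) ? value : null;
            }

            // Split "NAME[k]" into the element name and the 1-based position.
            int open = step.IndexOf('[');
            if (open < 0 || !step.EndsWith("]", StringComparison.Ordinal)) return null;
            string name = step.Substring(0, open);
            int position;
            if (!int.TryParse(step.Substring(open + 1, step.Length - open - 2), out position)) return null;

            // Select the k-th child with the requested name.
            Element match = null;
            int seen = 0;
            foreach (Element child in current.Children)
            {
                if (child.Name == name && ++seen == position) { match = child; break; }
            }
            if (match == null) return null;
            current = match;
        }
        return null; // the path selected an element, not an attribute value
    }
}
```

With a tree whose only anchor element carries the attribute `href="http://ca.meme.hokudai.ac.jp"`, evaluating `HTML[1]/BODY[1]/A[1]/@href` from the document node yields that string, while a path whose position index has no matching element yields null.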
Algorithm 1. Top-down Approach

    public static HTMLPath TopDownSearch(IHTMLDOMNode from_node, IHTMLDOMNode target_node){
        HTMLPath path=null;
        // Counts, per node name, how many children have been visited so far;
        // the count is the position index of the current child.
        Hashtable hash=new Hashtable();
        foreach(IHTMLDOMNode child in (IHTMLDOMChildrenCollection)from_node.childNodes){
            if(hash.ContainsKey(child.nodeName)){
                hash[child.nodeName]=((int)hash[child.nodeName])+1;
            }else{
                hash[child.nodeName]=1;
            }
            if(child==target_node){
                // Target found: start a new path with the final location step.
                path=new HTMLPath();
                LocationStep step=
                    new LocationStep("child", child.nodeName, (int)hash[child.nodeName]);
                path.Add(step);
                break;
            }else{
                // Recurse only into element nodes (nodeType 1).
                if(child.nodeType==1){
                    path=TopDownSearch(child,target_node);
                    if(path!=null){
                        // Target lies below this child: prepend the step for it.
                        LocationStep step=
                            new LocationStep("child", child.nodeName, (int)hash[child.nodeName]);
                        path.Insert(0,step);
                        break;
                    }
                }
            }
        }
        return path;
    }
Figure 7. Top-down search and bottom-up search for the HTML-Path to a given element