|
DiamondSilk's purpose is to discover structure in web documents and to make this structured data available to users and other websites. DiamondSilk goes through three separate phases to accomplish this: discovery, acquisition, and querying.
Phase Ia: Link Filter Discovery
In this first phase, we interact with a human operator to discover two types of rules: which sorts of URLs to follow from a web site's front page, and how we should parse the resulting pages. Since we need valid URLs to train / discover rules for filtering pages, we will first attempt to discover a rule for following URLs on a given site.
The users starts us off with the starting URL of the site; often the name by which the site is known (e.g., "News.com"). They also provide an English name ("C|Net's News.com"), a category ("News: Tech") and the frequency with which the site should be checked by our engine.
After this, they are displayed pages, one at a time, and asked whether or not those pages contain homogenous content; i.e., are "leaf nodes" and not just links to other pages. After the user selects "Yes" or "No" to eight given pages, DiamondSilk starts trying to form hypotheses as to what common rule holds for all leaf node URLs. Currently, this is implemented by attempting a handful of heuristics that worked well across several sites.
Once we get five guesses right in a row with a hypothesis, we store the hypothesis and move on to filter discovery.
Architecture Diagram
Phase Ib: Filter Discovery
At this point, the user is presented with a page that we believe to be a "leaf node" of the site. The HTML is stripped down to its bare essentials to minimize download time and to keep the user from accidentally wandering out of the training process. (Also because we will need to modify it, as seen shortly.) We then iterate through each of the fields appropriate to the site's category. Sites in the News category, for instance, require Title, Author, and Body fields. The user is asked to select the text that corresponds to each subsequent field. Javascript allows us a bit of a shortcut: instead of forcing the user to cut and paste the text, our page can simply read in the text that the user has highlighted on the page.
On the subsequent page, the user's text is in turn highlighted by the system (using the <SPAN> tag with style sheets allows us to modify the background color of a chunk of text in a web document) to ensure the accuracy of the selection. The user can either redo the selection or continue on with the next field.
When all of the fields have been entered into the system, the system forms a hypothesis about how page structure should be parsed. It then executes this hypothesis on a second page that it believes to be a leaf node and displays the results, field by field, to the user for their verification.
The original algorithm that we used relied upon incremental training to successively realize accurate filter results. While our UI continues to support incremental training, the current algorithm is of the "one shot" variety; training either works or does not. It is possible, likely even, that future versions of this algorithm will support iterative training.
The user has now completed their interactive session and is shown a "thank you" page. For the entirety of the session, state is maintained on the client (web browser) end and in the database, but not on the application level. This should mean that the service is robust against application server downtime and also that a large number of currently "open" sessions will not bog down the system. Additionally, since the state is stored on the client side, the "back" button actually acts as expected in the browser. (This is an issue for many less-than-completely stateless web applications.)
Architecture Diagram
Phase II: Content Acquisition
Once a training session has "graduated" into the database, it is spidered proportionally to the frequency of its updates. We can have a number of spiders running, all on different machines. Each spider runs in an infinite loop: it asks the scheduler what site to spider next, spiders it, and notes to the scheduler that it is done. It then immediately requests the next site to spider.
The spider first retrieves the link filter from the database. Then it fetches the main page from the site, enumerating all of the links found therein. It tests each link with the link filter and each link that it has never before visited but which passes the link filter is scraped. The spider applies the site filter to each page in succession and dumps the content to the database. The spider uses a simple localized cache to avoid fetching the same page twice.
Phase III: Queries
At this point, we now have data in our database that is structured consistently across all categorical sites. We present an interface to the user to allow her to pose queries to this structured database. The user chooses a category of sites to which she will pose the query. Queries can be composed of constraints on any field to contain, not contain, start with, end with, be, or not be a given value. Additionally, the specific sites included in the query and the recency of the information can further constrain the query.
A SQL statement is dynamically constructed from the query constraints and is sent to the database for processing. The user is then presented with a sortable table (in groups of 10) of the resulting matches. Entries with excessive data (such as the body of a news story) are truncated with the option to be viewed in their entirety.
Architecture Diagram
EXPERIENCE AND LESSONS
The task of structuring web data as a general proposition is ludicrously enormous. We sensed that this would be a large project from the start, but did not fully grasp the precise enormity of the task at hand. The project touches upon, and could greatly derive benefit from: Parsing, Semi-structured Data Storage & Retrieval, Recognition Algorithms, Machine Learning, Networking & Optimization Theory, and many other cutting edge disciplines.
Some of the tasks that we believed would be hardest turned out to be trivial; some of those that we hardly gave thought to ended up dominating our time sufficiently to make full completion of the project an impossibility despite weeks of 10-20 hour workdays.
Filters - Common Prefix
One key such example is the algorithm used for automatic filter discovery and filter parsing. (The two naturally go hand in hand!) Our first take at a filter discovery algorithm was simple: look at N pages of text and, for a given field, see what prefix they all had in common -- e.g., what text would always come before the author in all documents in a given site? Similarly, what common suffix do they share? Such a filter would have been quite efficient to parse: look for the common prefix and suffix in each document and simply grab the text in between. When it became clear that this naïve approach might not work well, we allowed for a prefix to occur a fixed number of times: if "<b><p>" always preceded the author's name, for instance, and was always the third occurrence of "<b><p>" in the document.
While we spent a great deal of time testing and honing this algorithm, we did not spend a sufficient amount of time investigating its practical utility across a range of websites. Consequently, it was not until 10 days before our deadline that we discovered that this algorithm was a complete flop.
Filters - Tag Stack
Within 24 hours, however, we had engineered a prototype of a completely new algorithm that relied not on the particular text surrounding a field, but the fundamental structure of the HTML document and where its beginning and end were located in that structure. Handed a single document and user-input answer, the algorithm first discovers the location in the document from whence the user's snippet came. This part of the algorithm had to be able to skip over recursively embedded tables, HTML tags, variance in whitespace, high-ASCII characters, etc. in the document while matching the pasted text.
Once we have the field's beginning and end, we can begin extracting the document's structure. Starting just above the field's beginning and working our way upwards, we note every tag we run across. If we run across a /TABLE tag, then we should skip the subsequent nest of tables - namely, we should skip up to the TABLE tag that corresponds to the close tag that we saw and skip any further tables embedded therein. We perform a similar (but inverted) algorithm for the text starting just after the field's end.
This "tag stack" technique worked reasonably well, and could parse structure in relatively crude documents. One thing that became increasingly apparent was that the smaller the tag stack was, the more general our algorithm would be. That is to say, the smallest possible tag stack for a given document location is not only computationally faster to parse, but is more general and thus will accommodate more inter-page variations. Our algorithm reduced a page with over a thousand tags to a tag stack a few hundred long.
Filters - Tag Stack w/FSA
It next became clear that we should take further advantage of table structure by skipping over adjacent rows and columns: if the content in a page lies in the second row, second column, the HTML in the first row, first column (or indeed in any other cell) should not affect the parsing of the tag stack and consequently should not be a part of it. This schema at first appeared complex, but then we realized that what we really wanted for the parsing was a simple Finite State Automaton (FSA). This in mind, we went back and recoded the algorithm as an FSA with HTML table intelligence. This reduced the 1000+ tag page to about 130 tags. This algorithm could parse any very consistently structured site.
Filters - Tag Stack w/FSA + 2nd pass optimization
The web is not, however, known for its prolific usage of thoroughly structured HTML, even on major sites. It shortly became clear that, for non-self generated documents, we were going to need to optimize/shorten our tag stack considerably. Integrating further optimization directly into the existing parsing design would have been very difficult; instead we realized that we could use an optimizer in a second stage that would remove unnecessary tags.
We ended up implementing optimization in two phases. The first eliminates excess HTML tags outside of the table cell in which the field is located: in the example "<body><table></table>FOO<table><tr><td>" FOO is extraneous: the tag stack could look like <td>, <tr> , <table>, (table). The tag stack is parsed "backwards" so starting at the bottom of this stack, the parser would know to first skip over a table and then locate the first row, first column of the next table. The second phase attempts to eliminate unneeded tags from the same cell in which the field is located. It does this by linearly searching for a unique tag within the localized context (the first such tag in the cell); once found, all other tags in the localized context are dropped. In the example "<p><b><i></i><a></a></b><br>" I really only need to store "<br>" and I'd be all done, since I'd still be able to authoritatively relocate the beginning of the field, which is the whole point of the algorithm.
This enhancement was greatly needed and dropped our tag count to 39 tags when combined with the Finite State Automaton: a factor of four improvement. This algorithm is able to parse a website that has a reasonable amount of consistency in its structure; it is what DiamondSilk is currently using. Insofar as the authors of this document are aware, this represents a unique approach to the problem of automated filter discovery.
Link Filters
When mounting our first serious assault upon automated link filter discovery, it became clear that doing a provably optimal job would require an algorithm that seems as if it could be NP-complete: the correct rule might be best represented as the logical inclusion and/or exclusion of any number of rules. Simply enumerating these combinations is an exponentially difficult task.
As such, we were going to have to do something a touch easier; we ended up deciding on trying a number of handmade heuristics. With the user's help, we divide a sample of the site's links into a "yes pile" and a "no pile." Then, by testing each URL for a series of heuristics, such as number of digits in the name, number of digits in the query, directory name, directory depth, and file extension, we find a characteristic that all the good links have in common (and none of the bad ones do, or vice versa) and test each new link with that heuristic and see if it fits the commonality of either pile. Fitting into one pile and not the other allows the link filter to make a reasonable guess about the link.
Following Frames
While parsing a web page to find the appropriate links or to reformat it for training purposes, we found we will most likely run into a frameset. Our parser uses a clever recursive mechanism to follow through to the frames in the frame set, any number of layers deep, to find the content.
Dynamic SQL Generation
In order to make the user's query, which can take many combinations of clauses, an accurate and effective SQL query, we've designed the query interface such that the user's request for each of the fields is packaged into a neat triplet (field number, type of containing, value to search for) and stuffed into an array, which can then be interated and, triplet by triplet, formed into SQL WHERE clauses.
FUTURE WORK
During the course of the project, we experienced many moments of exciting and creative invention. Above and beyond the basics of the project, we discovered that there are dozens of features that would make DiamondSilk more interesting, efficient, or thorough. Due to time and/or resource constraints, we were unable to implement many of our exciting ideas. However, we kept track of them so that in the future they may be integrated into DiamondSilk v2.0 (or higher!).