What is DiamondSilk?
DiamondSilk applies structure to the web.
Most data on the web today is unstructured: the HTML for a page on
buy.com doesn't tell a program what item is being sold, or how much it costs.
However, most of these sites have a simple mechanism for converting their
databases into HTML pages. DiamondSilk, with a user's help, finds the
inverse of this mechanism to turn HTML back into knowledge. Our
intelligent spider then periodically revisits the site to scrape new data
into our knowledge warehouse, and our query engine allows users (or other
websites) to exploit this information to create anything from a price
comparison engine to a news service. For more information, check out the DiamondSilk Technical Documentation.
Why is it called DiamondSilk?
The purpose of DiamondSilk is to create a solid structure out of the unstructured data on the Web, and to offer a smooth, clean search experience. Hence, it is tough as diamonds and smooth as silk. We hope. :)
What is a filter?
A filter is an integral part of the functionality of the DiamondSilk system.
It is used to automatically determine which URLs in a site link to valid content
in a given category, and then it automatically "filters" data from a page on
the site and puts that data into the DiamondSilk database with the proper category
and attribute name associated with it. By giving the system some examples of
suitable links and showing it what parts of a page should be in the database,
you "train" a filter to do these things automatically.
Why should I add a filter?
The more filters that exist within the DiamondSilk system, the more complete
the database will be. This is because DiamondSilk uses the filters to decide
which sites to visit and what information to pull from them. By creating a
new filter, you can help DiamondSilk offer more useful data to its users.
How do I add a filter?
There are two training steps required to add a filter. The first is to teach
the filter about the site you want filtered. You choose a site (or an area
of a large site) and specify the category that suits this site. The best
page in a site to use for this is the front page or an index that contains links
to the content that you want to filter. Specifying how often the site gets
updated tells the system how often to look at it for new data. Once you submit
this information, you are shown a random series of pages that were linked from the site and asked
to specify whether they contain valid content. If the page shown does not contain
content that you want searched by the filter you are creating, click No. If the
page shown contains attributes that you want to filter, click Yes.
Be careful with
this step, as a mistake will confuse the system and may cause it to try to filter
pages that don't contain the right kind of data. (Note: You can watch the system "learn" by
noting its guesses after giving it a few examples to train on!)
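The Yes/No training in this first step can be pictured as teaching a small classifier about a site's URLs. This FAQ does not say how DiamondSilk actually learns link patterns, so the sketch below is only one simple possibility: it counts which URL tokens appear in accepted versus rejected links. The `LinkClassifier` class and the example URLs are assumptions made for illustration.

```python
import re
from collections import Counter

def tokens(url):
    """Split a URL into lowercase word tokens, with each digit run replaced by '#'."""
    return re.findall(r"[a-z]+|#", re.sub(r"\d+", "#", url.lower()))

class LinkClassifier:
    """Toy classifier: tokens seen in Yes links count for, tokens in No links against."""

    def __init__(self):
        self.yes = Counter()
        self.no = Counter()

    def train(self, url, is_content):
        (self.yes if is_content else self.no).update(set(tokens(url)))

    def score(self, url):
        return sum(self.yes[t] - self.no[t] for t in set(tokens(url)))

    def guess(self, url):
        return self.score(url) > 0

clf = LinkClassifier()
clf.train("http://shop.example/item/1234.html", True)    # user clicked Yes
clf.train("http://shop.example/item/877.html", True)     # user clicked Yes
clf.train("http://shop.example/about.html", False)       # user clicked No
clf.train("http://shop.example/help/contact.html", False)

print(clf.guess("http://shop.example/item/42.html"))
print(clf.guess("http://shop.example/help/faq.html"))
```

A single wrong answer shifts the counts for every token in that URL, which is one way to see why a mistake in this step can mislead the filter.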
The second step is to teach the filter what kind of data to look for. You are given
a page that the filter judges to be valid, based on the patterns it detected in
the pages you marked in the previous step. (If the page given does not contain valid
content, click Reject to be given a different one.) For each attribute associated
with this site's category, you are asked for the content fitting this attribute.
Simply highlight this text with your mouse and click Submit. The highlighted text
will be used to teach the filter how to find that attribute in a page. You will also
be shown what you have submitted; if you find you have made a mistake, simply click
Try Again to go back a step and redo it.
After you have submitted a complete example page, the filter will highlight its own
guess for each attribute so you can check whether it has learned correctly.
Click Correct if the highlighted text fits the attribute shown. When the filter guesses
correctly for all the attributes in the category, it is done training and is ready
to filter automatically.
Click Incorrect if the wrong text is highlighted. Unfortunately, in this version
of DiamondSilk, an incorrect guess means that this site is not suitable for filtering.
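One way to picture how a single highlighted example can teach a filter to find an attribute: remember the text immediately before and after the highlight, then look for the same surroundings on other pages. This FAQ does not describe DiamondSilk's real learning method, so treat the following as a sketch of a simple delimiter-based approach. `learn_rule`, `apply_rule`, and the sample pages are hypothetical.

```python
def learn_rule(page, highlighted, context=12):
    """Record the text just before and just after the highlighted example."""
    start = page.find(highlighted)
    if start < 0:
        raise ValueError("highlighted text not found in page")
    end = start + len(highlighted)
    return page[max(0, start - context):start], page[end:end + context]

def apply_rule(rule, page):
    """Return the text between the learned delimiters, or None if they are absent."""
    prefix, suffix = rule
    start = page.find(prefix)
    if start < 0:
        return None
    start += len(prefix)
    end = page.find(suffix, start)
    return page[start:end] if end >= 0 else None

training_page = "<html><b>Price:</b> <i>$19.99</i><br></html>"
rule = learn_rule(training_page, "$19.99", context=8)

other_page = "<html><b>Price:</b> <i>$4.50</i><br></html>"
print(apply_rule(rule, other_page))
```

A rule this simple fails whenever the surrounding markup differs from page to page, which is consistent with the remark above that some sites are not suitable for filtering.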
Add a filter
What's so special about searching the DiamondSilk database?
Most search engines on the Web only search for one thing: a keyword found in the
content of a page. DiamondSilk is different because of its automated categorization.
Instead of searching the whole Web for a certain word or phrase, DiamondSilk
allows you to search within certain categories of sites and certain parts of pages,
enabling a more refined and informative search. Once you've found the content
you want, you can also sort the results by their different attributes so you can
quickly and easily get the data you want, without wading through the results of a
traditional search engine.
How do I search the DiamondSilk database?
You first get to choose which category to search. Subcategories are shown indented;
clicking a parent category allows you to search all of its subcategories. You may
then choose which attributes of that category to search, what kind of matching
should be applied and what word or phrase to match. How long ago a page was
"harvested" means how long ago DiamondSilk first found it on its site; you can specify
how far back in the records to search. You may also choose which
sites within this category to search. Clicking Select All will check all of the
checkboxes for you, and Select None will uncheck them; be sure at least one box is checked,
otherwise you will get no results!
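The search options described above map naturally onto a filter over structured records. The sketch below is an illustration only: the `Record` type, the category tree, the matching modes, and the sample data are all assumptions, not DiamondSilk's actual schema or query engine, and the category tree is only one level deep.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Record:
    category: str
    site: str
    harvested: datetime   # when the entry was first found
    attributes: dict

# Hypothetical category tree: parent -> subcategories.
SUBCATEGORIES = {"Shopping": ["Books", "Electronics"]}

# Hypothetical matching modes.
MATCHERS = {
    "contains": lambda value, phrase: phrase in value,
    "exact": lambda value, phrase: value == phrase,
    "starts with": lambda value, phrase: value.startswith(phrase),
}

def search(records, category, attributes, matching, phrase, max_age, sites, now):
    # Choosing a parent category also searches its subcategories.
    categories = {category, *SUBCATEGORIES.get(category, [])}
    match = MATCHERS[matching]
    phrase = phrase.lower()
    return [
        r for r in records
        if r.category in categories
        and r.site in sites                  # no sites checked -> no results
        and now - r.harvested <= max_age     # how far back to search
        and any(match(str(r.attributes.get(a, "")).lower(), phrase) for a in attributes)
    ]

now = datetime(2000, 6, 1)
records = [
    Record("Books", "books.example", datetime(2000, 5, 30), {"title": "Learning Perl"}),
    Record("Electronics", "gadgets.example", datetime(2000, 5, 1), {"title": "Perl-powered toaster"}),
    Record("News", "news.example", datetime(2000, 5, 31), {"headline": "Perl conference announced"}),
]
all_sites = {"books.example", "gadgets.example", "news.example"}
hits = search(records, "Shopping", ["title"], "contains", "perl", timedelta(days=7), all_sites, now)
print([r.site for r in hits])
```

Here only the Books entry is returned: the Electronics entry was harvested too long ago, and the News entry is outside the chosen category.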
Once you receive your search results, you may sort them by attribute by clicking on
the corresponding column heading. To view a page found in the search, simply click
on the link to its URL and it will open in a new window. If a result is too long
to completely fit in the table, click on the [...] to see that entry in its entirety.
If there are more than ten pages found which match your criteria, click "Next 10 Matches" to see the rest; click "Last 10 Matches" to go back to the previous results.
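The results-table behaviour described above (sorting by a column, ten matches per page, and shortening long entries) can be sketched in a few lines. The function names, the sample rows, and the 20-character cell width are assumptions chosen for illustration; only the page size of ten comes from the text.

```python
PAGE_SIZE = 10   # ten matches per page, as described above
MAX_CELL = 20    # hypothetical display width for a table cell

def sort_results(results, attribute):
    """Sort rows by the attribute whose column heading was clicked."""
    return sorted(results, key=lambda row: row[attribute])

def page_of(results, page):
    """Return one page of results; page 0 is the first ten matches."""
    start = page * PAGE_SIZE
    return results[start:start + PAGE_SIZE]

def cell(text):
    """Shorten an entry that is too long for the table, marking it with [...]."""
    return text if len(text) <= MAX_CELL else text[:MAX_CELL] + " [...]"

results = [{"title": f"Item {i}", "price": (i * 7) % 25} for i in range(23)]
by_price = sort_results(results, "price")

print([row["price"] for row in page_of(by_price, 0)])
print(len(page_of(by_price, 2)))
print(cell("A very long product description that will not fit"))
```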
Search the database
Where did DiamondSilk come from?
DiamondSilk is the Stanford University Computer Science Senior Project of
David Weekly and Valerie Kucharewski. It was conceived by David Weekly in
the Autumn of 1999 under the guidance of Stanford CS professor Armando Fox.
Valerie Kucharewski joined the team
in January of 2000. David and Valerie worked together through the Spring
of 2000 to implement the complete system.
Acknowledgements
Thank you to Armando Fox for taking a chance on some undergrad advisees and
for being our guiding light and voice of reason. Thanks to Grant for the
food and love, and to Vanessa for the love and seaweed.
Valerie would especially like to thank Grant for his devotion, Mari for the hugs, Jen for the understanding tolerance, Lauren for the interest, Hannah and Craig for the help, Cam and Lindsay for the surprise visit, Dad for caring, Mom for listening, Nick for the encouragement, and finally David for not hating her when she cracked the whip.
David would like to thank Vanessa, who provided encouragement (electronic and non) at all times; his family, ceaselessly praying for him; the friendly residents of Maison Francaise with an ever-so-tasty assortment of things to eat; and Valerie, without whom