Extracting information from microformatted websites




Modeling microformats
 
We propose modeling the microformats of the lowercase semantic web by means of semantic networks. As an example, consider the web site http://upcoming.yahoo.com/tag/sports, which contains some microformats related to events. We take each microformat class found in the page and use it to produce a semantic network (see Figure):



A fragment of its code is:

<tr class="vevent">
  <td>
         <abbr class="dtstart" title="2010-08-13T16:16:00-04:00">Aug 13+</abbr>
  </td>
  <td>
        <a href="http://upcoming.yahoo.com/event/407814" class="url summary">
                     Amplafi first test event
        </a>
  </td>
  <td>
        <a href="http://upcoming.yahoo.com/place/hVUWVhqbBZlZSrZU">
             <abbr class="location" title="Corner Billiards Bar, 110 East 11th Street,
                                     between 3rd and 4th Avenues, New York City, 10003">
                 New York
             </abbr>
        </a>
   </td>
</tr>


Observe the class attributes: they are used to create the concepts (nodes) of the corresponding semantic network. The Figure below shows a semantic network composed of a list of veventn microformats. The children of urlP1 are the microformats of the web site in the example above.
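
The mapping from class attributes to nodes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analyzer's actual implementation: the names VEventParser and build_network, the "contains" relation label, and the edge-list representation are all hypothetical, and only four hCalendar properties are handled. The sample input is a shortened copy of the fragment above.

```python
from html.parser import HTMLParser

# Illustrative subset of hCalendar property names.
PROPERTIES = {"dtstart", "summary", "url", "location"}


class VEventParser(HTMLParser):
    """Collect the properties of every element marked class="vevent"."""

    def __init__(self):
        super().__init__()
        self.events = []       # one dict of properties per vevent
        self._current = None   # vevent currently being read, if any
        self._root_tag = None  # tag name of that vevent's root element
        self._depth = 0        # nesting depth of tags named like the root
        self._pending = []     # properties whose value is the next text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._current is None and "vevent" in classes:
            self._current = {}
            self._root_tag, self._depth = tag, 0
            self.events.append(self._current)
        if self._current is None:
            return
        if tag == self._root_tag:
            self._depth += 1
        for name in PROPERTIES.intersection(classes):
            if name == "url" and attrs.get("href"):
                self._current[name] = attrs["href"]
            elif tag == "abbr" and attrs.get("title"):
                # abbr pattern: the machine-readable value is in title
                self._current[name] = " ".join(attrs["title"].split())
            else:
                self._pending.append(name)

    def handle_data(self, data):
        text = " ".join(data.split())
        if self._current is not None and text and self._pending:
            for name in self._pending:
                self._current[name] = text
            self._pending = []

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._root_tag:
            self._depth -= 1
            if self._depth == 0:
                self._current, self._pending = None, []


def build_network(page_id, html):
    """Return the network as (source, relation, target) edges (hypothetical layout)."""
    parser = VEventParser()
    parser.feed(html)
    edges = []
    for i, event in enumerate(parser.events, start=1):
        node = f"vevent{i}"
        edges.append((page_id, "contains", node))  # page node -> microformat node
        for prop, value in sorted(event.items()):
            edges.append((node, prop, value))      # microformat node -> property value
    return edges


FRAGMENT = """
<tr class="vevent">
  <td><abbr class="dtstart" title="2010-08-13T16:16:00-04:00">Aug 13+</abbr></td>
  <td><a href="http://upcoming.yahoo.com/event/407814" class="url summary">
        Amplafi first test event
  </a></td>
</tr>
"""

edges = build_network("urlP1", FRAGMENT)
for edge in edges:
    print(edge)
```

Storing the network as a flat list of edges keeps the sketch short; a graph library or an RDF store would be a natural alternative when relations across many pages have to be queried.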



A semantic network modeling microformats of a set of web pages allows us to represent and discover semantic relationships between information located in those web pages. This is useful to build more accurate tools for web searching.

Extracting information

Let us consider a tourist visiting New York. If she wishes to attend some sports events, she can check the newspaper or search the web. If she chooses the web, she must still read many web pages and decide which events she prefers. Typically, she types a query into a search page and waits for a list of results related to the keywords she entered. Each result must then be analyzed separately.

A modern approach, based on the semantic web, could be the following: the user launches a software agent and makes a query, and the agent automatically extracts the upcoming sports events from a set of web pages. Additionally, because the events can be processed automatically, each event can be added to an electronic appointment book. Microformats make this possible.
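
As an illustration of the appointment-book step, an extracted event can be serialized in the iCalendar format that most calendar applications import. The sketch below is an assumption about how such an export could look, not part of the tool described here: the function names, the PRODID string, and the UID are hypothetical, and the line folding that iCalendar requires for long lines is omitted for brevity.

```python
from datetime import datetime, timezone


def escape_text(value):
    """Escape the characters that are special in iCalendar TEXT values."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def to_icalendar(event, uid, now=None):
    """Serialize one extracted event (a dict of hCalendar properties)."""
    fmt = "%Y%m%dT%H%M%SZ"  # UTC date-time form
    # dtstart carries an ISO 8601 timestamp with a UTC offset; convert to UTC.
    start = datetime.fromisoformat(event["dtstart"]).astimezone(timezone.utc)
    stamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//microformat-sketch//EN",  # hypothetical identifier
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{stamp.strftime(fmt)}",
        f"DTSTART:{start.strftime(fmt)}",
        f"SUMMARY:{escape_text(event['summary'])}",
    ]
    if "location" in event:
        lines.append(f"LOCATION:{escape_text(event['location'])}")
    if "url" in event:
        lines.append(f"URL:{event['url']}")
    lines += ["END:VEVENT", "END:VCALENDAR"]
    return "\r\n".join(lines) + "\r\n"  # iCalendar lines end with CRLF


event = {
    "dtstart": "2010-08-13T16:16:00-04:00",
    "summary": "Amplafi first test event",
    "url": "http://upcoming.yahoo.com/event/407814",
    "location": "Corner Billiards Bar, 110 East 11th Street, "
                "between 3rd and 4th Avenues, New York City, 10003",
}
ics = to_icalendar(event, uid="407814@example.invalid",
                   now=datetime(2010, 8, 1, tzinfo=timezone.utc))
print(ics)
```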

A typical session begins with a search for semantic relations. To do this, we launch the semantic relation searcher, as depicted in the following Figure.



Then, the filtered URLs from the Google search are collected to prepare the sample for analysis; the user can choose the sample size. For this, the tool's interface offers a page called Sample, where users can edit the list of URLs (see Figure).



Once we have a well-defined sample, the next step is to click the Sem. Analysis button in the tool's interface. The tool loads each web page and traverses its (X)HTML code searching for microformats. The interface has an area that reports the number of microformats found on each visited web page (see Figure).
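
The per-page report can be approximated by counting the root class names of known microformats in each page. The following is a minimal sketch under assumptions: the names MicroformatCounter and count_microformats are hypothetical, only three microformats are recognized, and the pages are passed in as already-downloaded text, so the loading step is left out.

```python
from collections import Counter
from html.parser import HTMLParser

# Root class names of a few common microformats (illustrative subset).
ROOT_CLASSES = {"vevent": "hCalendar", "vcard": "hCard", "hreview": "hReview"}


class MicroformatCounter(HTMLParser):
    """Count elements whose class attribute names a microformat root."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in ROOT_CLASSES:
                self.counts[ROOT_CLASSES[name]] += 1


def count_microformats(pages):
    """Map each URL to a Counter of the microformats found in its HTML text."""
    report = {}
    for url, html in pages.items():
        parser = MicroformatCounter()
        parser.feed(html)
        report[url] = parser.counts
    return report


# Made-up sample pages; the URLs are placeholders.
pages = {
    "http://example.invalid/a": (
        '<table><tr class="vevent"><td>one</td></tr>'
        '<tr class="vevent featured"><td>two</td></tr></table>'
        '<div class="vcard">someone</div>'
    ),
    "http://example.invalid/b": "<p>No microformats here.</p>",
}
report = count_microformats(pages)
for url, counts in report.items():
    print(url, dict(counts))
```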



Now the user can click the Sem. relations button to view a friendlier presentation of the results report. If the user clicks on the presented report (View microcodes button), a list of the microformats corresponding to the web pages is shown.



Finally, we wish to view the results of the web slicing process; therefore, we click the Display extract button, and a web page with the extracted microformats is produced. The Figure below presents some snapshots of the slices. The first and second snapshots show some discovered artistic events, and the third contains a list of sports events that answers the motivating example.




Downloads

Download the semantic analyzer... soon