General description
Related papers
Team
Implementation
Experiments
Downloads
Nowadays, the Internet is the main source of information for millions of people and enterprises. However, the information on the Internet has not been classified and, consequently, searching for information is one of the most important tasks performed by users and systems. In particular, for human WWW users, searching for information is the main (and most time-consuming) task. To address this problem, both the industrial and the academic communities have developed many methods and tools to index and search web pages. The most widespread solution is the use of search engines such as Google and Yahoo; however, while current search engines are a suitable means of finding a particular webpage, they are of no help in finding the relevant information within that page. Hence, once a webpage is found, the user must search within it to verify whether the needed information is there. This problem has not been satisfactorily solved so far and, thus, no widespread solution exists. In this paper we present a tool able to automatically extract from a webpage the information (text, images, etc.) related to a filtering criterion, without the use of semantic specifications or lexicons and without the need for offline parsing or compilation processes.
The following are real examples of information filtering with the Web Filtering toolbar. In our first example, we consider a user who is browsing the Apple Store in order to buy an iPhone. When we open the main webpage of the Apple Store, we see that there is a lot of information (including images and menus) not related to iPhones. Therefore, the user is forced to read unnecessary information in order to find what she is looking for.
Main webpage of the Apple Store
Now, consider that a tool is available that can filter out all the information not related to iPhones. Our algorithm filters a webpage and shows only the information relevant to the given filtering criterion. For instance, with the Apple Store webpage, the algorithm would produce the new filtered webpage shown below.
Main webpage of the Apple Store filtered with tolerance 0
Observe that even the images and the main horizontal menu have been filtered out. If the user considers that this information is not enough, or that too much has been filtered out, she can augment the information shown. In this case, the webpage is automatically reprocessed to include more information:
Main webpage of the Apple Store filtered with tolerance 1
Now let us consider another scenario, where a user has loaded the Facebook page of its creator, Mark Zuckerberg:
Mark Zuckerberg’s Facebook page
Observe that Facebook uses the left area of its pages for advertising, with distracting auto-changing advertisements. In this case, we can use the filtering tool in the opposite way: we can delete all the information related to a given word. For instance, we can specify that we want to delete all the “ads”. The resulting webpage is the following:
Mark Zuckerberg’s Facebook page filtered
Information Retrieval from single webpages
In our third example, let us assume (again, a real example) that we are looking on the Internet for the list of papers co-authored by the researchers Germán Vidal and Josep Silva. A first solution could be to find Germán Vidal's webpage and look for his list of publications. To do so, we can use the Google search engine and specify a search criterion; for instance, we can type "Germán Vidal papers". The output of Google is shown below.
Unfortunately, the result produced by Google, and by the rest of current web search engines, is only a list of webpages. Therefore, once a webpage is selected, we have to manually review the information contained in it in order to find the information sought.
As can be seen in the following figure, this process is time-consuming: there are more than 100 publications on this webpage, and only 20 are in common with Josep Silva. Therefore, the user must review all the publications to know which of them are Josep Silva's.
Now, let us assume that a tool is available which allows us to filter a webpage according to a filtering criterion. Since we are looking for Josep Silva's publications, we can type the filtering criterion "Josep Silva". Then, we press the filtering button and a new page is automatically generated from the previous one by deleting all the information that is not related to Josep Silva. The result is the list of publications co-authored by Germán Vidal and Josep Silva:
This tool does not need proxies [7] or pre-processing [8] of the filtered webpages; it can work online with any webpage.
Information Retrieval from multiple webpages
In our last example we show a more powerful application: we use the toolbar to automatically extract information from several webpages and put it together, in a pleasant and understandable way, in a single webpage.
Consider the following three interconnected webpages. They contain information about grants from the Spanish Ministry of Education. In order to find this information, a user must read the first page and find the links to the other webpages.
The last webpage has been automatically extracted from the original webpages with the criterion "becas" (in English, "grants"). Observe also that the toolbar can work with any language. The information on these pages has been extracted automatically by the tool after an analysis of the links. For instance, in order to produce the final webpage, the following links were explored and analyzed:
[1] J.M. Gómez Hidalgo, F. Carrero García, E. Puertas Sanz. Named Entity Recognition for Web Content Filtering. International Conference on Applications of Natural Language, NLDB 2005, pages 286-297, 2005.
[2] W3C Consortium, Resource Description Framework (RDF). www.w3.org/RDF
[3] W3C Consortium, Web Ontology Language (OWL). www.w3.org/2001/sw/wiki/OWL
[4] Microformats.org. The Official Microformats Site. http://microformats.org/, 2009.
[5] R. Khare, T. Çelik. Microformats: a Pragmatic Path to the Semantic Web. Proceedings of the 15th International Conference on World Wide Web, Poster Sessions, pages 865-866, 2006.
[6] R. Khare. Microformats: The Next (Small) Thing on the Semantic Web? IEEE Internet Computing, 10(1):68–75, 2006.
[7] Suhit Gupta, Gail E. Kaiser et al. Automating Content Extraction of HTML Documents. World Wide Web, vol. 8, issue 2, pages 179-224, 2005.
[8] Po-Ching Lin, Ming-Dao Liu, Ying-Dar Lin, Yuan-Cheng Lai. Accelerating Web Content Filtering by the Early Decision Algorithm. IEICE Transactions on Information and Systems, vol. E91-D, pages 251-257, 2008.
[9] W3C Consortium, Document Object Model (DOM). www.w3.org/DOM
[10] J. Silva, Information Filtering and Information Retrieval with the Web Filtering Toolbar. Electronic Notes in Theoretical Computer Science, vol. 235, pages 125-136, 2008.
[11] R. Baeza-Yates, C. Castillo, Crawling the Infinite Web: Five levels are enough. WAW, Lecture Notes in Computer Science, vol.3243, pages 156-167. Ed. Springer 2004.
[12] A. Micarelli, F. Gasparetti. Adaptive Focused Crawling. The Adaptive Web, pages 231-262, 2007.
[13] Jakob Nielsen. Designing Web Usability: The Practice of Simplicity. New Riders Publishing, Indianapolis, ISBN 1-56205-810-X, 1999.
This work has been partially supported by
Josep Silva (Coordinator)
Mercedes García
Tatiana Tomás
Sergio López
Carlos Castillo
Héctor Valero
Architecture
Modules of the Web Filtering toolbar
How to install the Web Filtering Toolbar
The Web Filtering Toolbar is distributed as an XPI file. An XPI package is basically a ZIP file that, when opened by the browser, installs a browser extension. Currently, this extension applies to both the Mozilla and Firefox browsers. Since an XPI package contains all the information necessary to install an extension, the user only has to drag the XPI file and drop it over the browser window. The extension will be installed automatically.
How to use the Web Filtering Toolbar
In the normal case, using the filter is as easy as typing a text t inside the text box and pressing the button "Filter". The tool then produces a slice of the current webpage with respect to the filtering criterion (t,f), where f is a set of flags representing the default options of the tool. These options can easily be changed by the user when needed.
In its current version, our tool can be parameterized with four flags which determine the shape of the final slice:
[Keep Structure] When activated, "keep structure" ensures that the components of the slice keep the same position as in the original webpage; i.e., the webpage's structure is kept and, thus, the final slice will contain blank areas (see the next two figures). If the structure is not kept, all the data in the final slice is reorganized so that no empty spaces remain.
[Format Size iFrames] Iframes allow us to embed a webpage inside another webpage. If "format size iframes" is activated, the size of the iframes of the original webpage is adapted to the slice. Otherwise, the original size is kept. Usually, the webpage inside an iframe is bigger than the area reserved for the iframe; hence, the iframe uses scrollbars. Often, the slice extracted from an iframe is small and, thus, reformatting the size of the iframes avoids unnecessary empty areas produced by the scrollbars.
[Keep Tree] When a node in the tree representing a document belongs to the slice, it is possible to also include all the nodes on the path between this node and the root (its ancestors). To do this, "keep tree" must be activated. For instance, in the slice of figure (b) below, keep tree was activated, while in the slice of figure (c) it was not.
(a) Original webpage |
(b) keep tree activated |
(c) keep tree deactivated |
[Tolerance] The default tolerance of the slicer is 0. With a tolerance of 0, only the relevant nodes and their descendants are included in the slice. If the tolerance is incremented, those nodes which are close to the relevant nodes are also set as relevant. Concretely, each time the tolerance is incremented by one, the parents of relevant nodes (together with their descendants) are set as relevant. For instance, in figure (a) above, with a tolerance of 0 and keep tree deactivated, the slice produced is the one shown in figure (c). With a tolerance of 1, node 1 (the parent of node 5) and node 4 (a descendant of node 1) would be included in the slice. With a tolerance of 2, all the nodes would be included.
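To make the flag semantics concrete, here is a minimal sketch of tolerance-based slicing over a plain node tree. It is an illustrative model only (the node structure, names, and matching predicate are assumptions), not the toolbar's actual source code, which operates on the browser's DOM [9]:

```javascript
// Illustrative model of slicing with "tolerance" and "keep tree".
// makeNode builds a simple tree; this is NOT the toolbar's real code.
function makeNode(id, text, children = []) {
  const node = { id, text, children, parent: null };
  children.forEach(c => { c.parent = node; });
  return node;
}

// Collect a node together with all of its descendants.
function subtree(node, acc = new Set()) {
  acc.add(node);
  node.children.forEach(c => subtree(c, acc));
  return acc;
}

// slice: nodes matching the criterion plus their descendants;
// each tolerance increment also marks the parents of relevant
// nodes (with their descendants); "keep tree" adds all ancestors.
function slice(root, matches, tolerance, keepTree) {
  let relevant = new Set();
  [...subtree(root)].filter(matches).forEach(n => subtree(n, relevant));

  for (let t = 0; t < tolerance; t++) {
    const expanded = new Set(relevant);
    relevant.forEach(n => { if (n.parent) subtree(n.parent, expanded); });
    relevant = expanded;
  }

  if (keepTree) {
    for (const n of [...relevant]) {
      for (let a = n.parent; a !== null; a = a.parent) relevant.add(a);
    }
  }
  return new Set([...relevant].map(n => n.id));
}

// Hypothetical page: a root with a menu and an iPhone section.
const page = makeNode(1, "root", [
  makeNode(2, "menu"),
  makeNode(3, "iphone news", [makeNode(4, "iphone 4 specs")]),
]);

const ids = slice(page, n => n.text.includes("iphone"), 0, false);
console.log([...ids]); // the matching node (3) and its descendant (4)
```

With "keep tree" activated, the root (node 1) would also be included; with a tolerance of 1, node 2 would enter the slice as well, because the whole subtree of node 3's parent becomes relevant.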
Download the official Web Filtering Toolbar addon from Firefox
We conducted several experiments to measure the performance of the tool. The goal of the experiments was to identify the information in a given domain related to a particular user query. For each domain, the tool explored several webpages with a timeout of 10 seconds and extracted the relevant parts from them. We determined the actual relevant content of each webpage by downloading it and manually selecting the relevant content (both text and multimedia objects).
A summary of the results follows:
URL | Query | Pages | Retrieved | Correct | Missing | Recall | Precision | F1 |
www.ieee.org | student | 10 | 4615 | 4594 | 68 | 98.54% | 99.54% | 99.03% |
www.upv.es | student | 19 | 8618 | 8616 | 232 | 97.37% | 99.97% | 98.65% |
www.un.org/en | Haiti | 8 | 6344 | 6344 | 2191 | 74.32% | 100% | 85.26% |
www.esa.int | launch | 14 | 4860 | 4860 | 417 | 92.09% | 100% | 95.88% |
www.nasa.gov | space | 16 | 12043 | 12008 | 730 | 94.26% | 99.70% | 96.90% |
www.mityc.es | turismo | 14 | 12521 | 12381 | 124 | 99% | 98.88% | 98.93% |
www.mozilla.org | firefox | 7 | 6791 | 6791 | 14 | 99.79% | 100% | 99.89% |
www.edu.gva.es | universitat | 28 | 10881 | 10856 | 995 | 91.60% | 99.79% | 95.51% |
www.unicef.es | Pakistán | 9 | 5415 | 5415 | 260 | 95.41% | 100% | 97.65% |
www.ilo.org | projects | 14 | 1269 | 1269 | 544 | 69.99% | 100% | 82.34% |
www.mec.es | beca | 24 | 5527 | 5513 | 286 | 95.06% | 99.74% | 97.34% |
www.who.int | medicines | 14 | 8605 | 8605 | 276 | 96.89% | 100% | 98.42% |
www.si.edu | asian | 18 | 26301 | 26269 | 144 | 99.45% | 99.87% | 99.65% |
www.sigmaxi.org | scientist | 8 | 26482 | 26359 | 241 | 99.08% | 99.54% | 99.30% |
www.scientificamerican.com | sun | 7 | 5795 | 5737 | 97 | 98.33% | 98.99% | 98.65% |
ecir2011.dcu.ie | news | 8 | 1659 | 1503 | 18 | 98.81% | 90.59% | 94.52% |
dsc.discovery.com | arctic | 9 | 29097 | 29043 | 114 | 99.60% | 99.81% | 99.70% |
www.nationalgeographic.com | energy | 12 | 41624 | 33830 | 428 | 98.75% | 81.27% | 89.16% |
physicsworld.com | nuclear | 15 | 10249 | 10240 | 151 | 98.54% | 99.91% | 99.22% |
The meaning of every column is:
URL: initial webpage used in the search.
Query: term used in the search.
Pages: number of pages explored by the Web Filtering toolbar.
Retrieved: number of DOM nodes retrieved by the Web Filtering toolbar.
Correct: number of retrieved nodes that were relevant.
Missing: number of relevant nodes not retrieved by the Web Filtering toolbar.
Recall: number of relevant nodes retrieved divided by the total number of relevant nodes (in all the analyzed webpages).
Precision: number of relevant nodes retrieved divided by the total number of retrieved nodes.
F1: computed as (2 * P * R) / (P + R), where P is the precision and R the recall.
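As a sanity check, these formulas can be applied directly to the counts in the Retrieved, Correct, and Missing columns. The sketch below (plain JavaScript with illustrative names) reproduces the first row of the table, up to rounding in the last decimal:

```javascript
// Precision, recall and F1 from retrieved/correct/missing node counts.
function metrics(retrieved, correct, missing) {
  const precision = correct / retrieved;        // correct / all retrieved
  const recall = correct / (correct + missing); // correct / all relevant
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// First row of the table: www.ieee.org, query "student".
const m = metrics(4615, 4594, 68);
console.log((m.recall * 100).toFixed(2) + "%");    // "98.54%"
console.log((m.precision * 100).toFixed(2) + "%"); // "99.54%"
console.log((m.f1 * 100).toFixed(2) + "%");        // close to the table's 99.03%
```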
In a second experiment we used the toolbar without any time limit. The goal of this experiment was to study how many pages the toolbar can retrieve in a given search space. We limited the search space to a specific domain so that we could know the number of pages in the search space. This number was computed with the Apache crawler Nutch: the whole domain was indexed starting from the initial webpage, and the number of indexed documents was counted. Finally, we compared the indexed search space with the search space explored by the tool.
URL | Query | Toolbar | Nutch | Recall |
www.ieee.org | competition | 20 | 57 | 35.08% |
www.upv.es | architecture | 19 | 70 | 27.14% |
www.un.org | violence | 10 | 129 | 7.75% |
www.esa.int | venus | 142 | 803 | 17.68% |
www.nasa.gov | astronaut | 144 | 527 | 27.32% |
www.mityc.es | programa | 43 | 106 | 40.56% |
www.edu.gva.es | universitat | 33 | 55 | 60% |
www.unicef.es | niños | 12 | 87 | 13.79% |
www.ilo.org | child | 121 | 755 | 16.02% |
www.mec.es | estudiante | 28 | 47 | 59.57% |
www.who.int | alcohol | 14 | 31 | 45.16% |
www.si.edu | biology | 8 | 238 | 3.3% |
www.dsc.discovery.com | tornado | 165 | 246 | 67.07% |
www.nationalgeographic.com | projects | 27 | 282 | 9.57% |
The meaning of every column is:
URL: initial webpage used in the search.
Query: term used in the search.
Toolbar: number of pages explored by the Web Filtering toolbar.
Nutch: number of pages in the domain (size of the search space).
Recall: recall of the Web Filtering toolbar with respect to the whole search space (Nutch).
Here you can download the Web Filtering Toolbar:
Last update: 03/04/2010 18:09:12