Version 1.5 available, implementing information retrieval

T h e   W e b   F i l t e r i n g   P r o j e c t

Index:

General description
Related papers
Team
Implementation
Experiments
Downloads


  G e n e r a l   D e s c r i p t i o n

Nowadays, the Internet is the main source of information for millions of people and enterprises. However, the information on the Internet has not been classified yet and, consequently, searching for information is one of the most important tasks performed by users and systems. In particular, for human WWW users the search for information is the main (and most time-consuming) task performed. In order to face this problem, both the industrial and the academic communities have developed many methods and tools to index and search web pages. The most widespread solution is the use of search engines such as Google and Yahoo; however, while current search engines can be a suitable solution to find a particular webpage, they are useless for finding the relevant information inside such a page. Hence, once a webpage is found, the user must search within it in order to verify whether the needed information is there. This problem has not been satisfactorily solved so far and, thus, there is no widespread solution. Here we present a tool able to automatically extract from a webpage the information (text, images, etc.) related to a filtering criterion, without the use of semantic specifications or lexicons and without the need for offline parsing or compilation processes.

Examples

Information Filtering

The following examples are real examples of information filtering using the Web Filtering toolbar. In our first example, we consider a user who is browsing the Apple Store in order to buy an iPhone. When we open the main webpage of the Apple Store, we see that there is a lot of information (including images and menus) not related to iPhones. Therefore, the user is forced to read unnecessary information in order to find what she is looking for.

[Figure: Main webpage of the Apple Store]

Now, consider that we have available a tool able to filter out all the information not related to iPhones. Our algorithm is able to filter a webpage and show only the relevant information according to the given filtering criterion. For instance, with the Apple Store webpage, the algorithm would produce the new filtered webpage shown below.

[Figure: Main webpage of the Apple Store filtered with tolerance 0]

Observe that even images and the main horizontal menu have been filtered out. If the user considers that this information is not enough, or that too much has been filtered, she can increase the amount of information shown. In this case, the webpage is automatically reprocessed to include more information:

[Figure: Main webpage of the Apple Store filtered with tolerance 1]

Now let us consider another scenario where a user has loaded the Facebook page of its creator, Mark Zuckerberg:

[Figure: Mark Zuckerberg's Facebook page]

Observe that Facebook uses the left area of the page for advertising, with annoying auto-changing advertisements. In this case, we can use the filtering tool in the opposite way to before: we can delete all the information related to a given word. For instance, we can specify that we want to delete all the "ads". The resulting webpage is the following:

[Figure: Mark Zuckerberg's Facebook page filtered]

 

Information Retrieval from single webpages

In our third example, let us assume (again, this assumption is a real example) that we are searching the Internet for a list of papers published jointly by the researchers Germán Vidal and Josep Silva. A first solution could be to find Germán Vidal's webpage and look for his list of publications. To do so we can use the Google search engine and specify a search criterion; for instance, we can type "Germán Vidal papers". The output of Google is shown below.

[Figure: Germán Vidal's papers (Google search results)]

Unfortunately, the result produced by Google (and by the rest of current web search engines) is only a list of webpages. Therefore, once a webpage is selected, we have to manually review the information contained in this webpage in order to find the information we are searching for.
As can be seen in the following figure:

[Figure: Germán Vidal's papers (publications webpage)]

this process is time-consuming because there are more than 100 publications on this webpage, and only 20 of them are in common with Josep Silva. Therefore, the user must review all the publications to find out which of them are joint publications with Josep Silva.

Now, let us assume that we have available a tool which allows us to filter a webpage according to a filtering criterion. Since we are looking for Josep Silva's publications, we can type the filtering criterion "Josep Silva". Then, we press the filtering button and a new page is automatically generated from the previous one by deleting all the information which is not related to Josep Silva.
The result is the list of publications in common by Germán Vidal and Josep Silva:

[Figure: Germán Vidal and Josep Silva's joint publications]

This tool does not need to use proxies [7] or to pre-process [8] the filtered webpages. It can work online with any webpage.

Information Retrieval from multiple webpages

In our last example we show a more powerful application: we use the toolbar to automatically extract information from several webpages and combine it, in a pleasant and understandable way, into a single webpage.

Consider the following three interconnected webpages. They contain information about grants from the Spanish Ministry of Education. In order to find this information, a user must read the first page and find the links to the other webpages.

[Figures: the three original webpages and the automatically extracted webpage]

The last webpage has been automatically extracted from the original webpages with the criterion "becas" (in English, "grants"). Observe also that the toolbar can work with any language. The information on these pages has been extracted automatically by the tool after an analysis of the links between them; for instance, in order to produce the final webpage, the tool explored and analyzed the links reachable from the initial page.
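
As an illustration of this link analysis, the following Python sketch uses a tiny invented in-memory "web" (the real tool works on live webpages and their DOM): it starts from one page, follows its links, and gathers the text fragments that mention the criterion "becas":

    import re

    # A tiny invented "web": page name -> (links to other pages, text fragments).
    PAGES = {
        "index":         (["convocatorias", "idiomas"],
                          ["Becas y ayudas al estudio", "Noticias del ministerio"]),
        "convocatorias": (["idiomas"],
                          ["Becas de colaboración 2010", "Calendario escolar"]),
        "idiomas":       ([],
                          ["Becas para cursos de idiomas en el extranjero"]),
    }

    def retrieve(start, criterion, limit=10):
        """Breadth-first exploration of the links, collecting matching fragments."""
        pending, visited, results = [start], set(), []
        while pending and len(visited) < limit:
            page = pending.pop(0)
            if page in visited:
                continue
            visited.add(page)
            links, fragments = PAGES[page]
            results += [f for f in fragments if re.search(criterion, f, re.IGNORECASE)]
            pending += [link for link in links if link not in visited]
        return results

    print(retrieve("index", "becas"))   # the three fragments about grants, combined

Here the exploration is bounded by an invented limit parameter; the actual tool bounds the exploration by a timeout (see the Experiments section below).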

 

 R e l a t e d   P a p e r s  a n d  P r o j e c t s

[1] J.M. Gómez Hidalgo, F. Carrero García, E. Puertas Sanz. Named Entity Recognition for Web Content Filtering. International Conference on Applications of Natural Language, NLDB 2005, pages 286-297, 2005.

[2] W3C Consortium. Resource Description Framework (RDF). www.w3.org/RDF

[3] W3C Consortium. Web Ontology Language (OWL). www.w3.org/2001/sw/wiki/OWL

[4] Microformats.org. The Official Microformats Site. http://microformats.org/, 2009.

[5] R. Khare, T. Çelik. Microformats: a Pragmatic Path to the Semantic Web. Proceedings of the 15th International Conference on World Wide Web (Poster Sessions), pages 865-866, 2006.

[6] R. Khare. Microformats: The Next (Small) Thing on the Semantic Web? IEEE Internet Computing, 10(1):68-75, 2006.

[7] S. Gupta, G.E. Kaiser et al. Automating Content Extraction of HTML Documents. World Wide Web, vol. 8, issue 2, pages 179-224, 2005.

[8] P.-C. Lin, M.-D. Liu, Y.-D. Lin, Y.-C. Lai. Accelerating Web Content Filtering by the Early Decision Algorithm. IEICE Transactions on Information and Systems, vol. E91-D, pages 251-257, 2008.

[9] W3C Consortium. Document Object Model (DOM). www.w3.org/DOM

[10] J. Silva. Information Filtering and Information Retrieval with the Web Filtering Toolbar. Electronic Notes in Theoretical Computer Science, vol. 235, pages 125-136, 2008.

[11] R. Baeza-Yates, C. Castillo. Crawling the Infinite Web: Five Levels Are Enough. WAW 2004, Lecture Notes in Computer Science, vol. 3243, pages 156-167, Springer, 2004.

[12] A. Micarelli, F. Gasparetti. Adaptive Focused Crawling. The Adaptive Web, pages 231-262, 2007.

[13] J. Nielsen. Designing Web Usability: The Practice of Simplicity. New Riders Publishing, Indianapolis, ISBN 1-56205-810-X, 1999.

This work has been partially supported by

 T e a m

Josep Silva (Coordinator)

Mercedes García

Tatiana Tomás

Sergio López

Carlos Castillo

Héctor Valero

I m p l e m e n t a t i o n

Architecture

[Figure: Modules of the Web Filtering toolbar]

How to install the Web Filtering Toolbar

The Web Filtering Toolbar is distributed as an xpi file. An xpi package is basically a ZIP file that, when opened by the browser, installs a browser extension. Currently, this extension works in both the Mozilla suite and the Firefox browser. Since an xpi package contains all the information necessary to install the extension, the user only has to drag the xpi file and drop it onto the browser window. The extension is then installed automatically.
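
Since an xpi package is just a ZIP archive, its contents can be inspected with any ZIP tool. As a small illustration (the file name below is hypothetical), the following Python snippet lists the files bundled inside the package:

    # List the files bundled in an xpi package (a plain ZIP archive).
    import zipfile

    XPI_PATH = "webfilteringtoolbar.xpi"   # hypothetical file name

    with zipfile.ZipFile(XPI_PATH) as xpi:
        for entry in xpi.infolist():
            print(f"{entry.file_size:>8}  {entry.filename}")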

How to use the Web Filtering Toolbar

In the normal case, using the filter is as easy as typing a text t inside the text box and pressing the "Filter" button. Then, the tool produces a slice of the current webpage with respect to the filtering criterion (t, f), where f is the set of flags that represent the default options of the tool. However, these options can be easily changed by the user when needed.

Options

In its current version, our tool can be parameterized with four flags which determine the shape of the final slice:

[Keep Structure] When activated, "keep structure" ensures that the components of the slice keep the same position as in the original webpage; i.e., the webpage's structure is kept and, thus, the final slice will contain blank areas (see the next two figures). If the structure is not kept, all the data in the final slice is reorganized so that no empty spaces remain.

[Figures: a slice of the Cambridge webpage with and without keep structure]

[Format Size iFrames] Iframes allow us to embed a webpage inside another webpage. If "format size iframes" is activated, the size of the iframes of the original webpage is adapted to the slice. Otherwise, the original size is kept. Usually, the webpage inside an iframe is bigger than the area reserved for the iframe; hence, the iframe uses scrollbars. Often, the slice extracted from an iframe is small, and thus reformatting the size of the iframes avoids unnecessary empty areas produced by the scrollbars.

[Keep Tree] When a node in the tree representing a document belongs to the slice, it is possible to also include all the nodes on the path between this node and the root (its ancestors). To do this, keep tree must be activated. For instance, in the slice of figure (b) below keep tree was activated, while in the slice of figure (c) it was not.

[Figure: (a) original webpage tree, (b) slice with keep tree activated, (c) slice with keep tree deactivated]

[Tolerance] The default tolerance of the slicer is 0. With a tolerance of 0, only the relevant nodes and their descendants are included in the slice. If the tolerance is incremented, then nodes which are close to the relevant nodes are also marked as relevant. Concretely, if the tolerance is incremented by one, the parents (and their descendants) of relevant nodes are marked as relevant. For instance, in the previous figure (a), with a tolerance of 0 and keep tree deactivated, the slice produced is the one shown in figure (c). With a tolerance of 1, node 1 (the parent of 5) and node 4 (a descendant of 1) would be included in the slice. With a tolerance of 2, all the nodes would be included in the slice. The sketch below illustrates how the tolerance and keep tree flags shape a slice.
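
The following sketch (in Python; the Node class and the sample tree are invented for the example and do not correspond to the figure above) shows how the tolerance and keep tree flags shape the slice of a small tree with a single relevant node:

    # Minimal sketch of slicing a node tree with the tolerance and keep tree flags.
    class Node:
        def __init__(self, label, relevant=False, children=()):
            self.label, self.relevant = label, relevant
            self.children, self.parent = list(children), None
            for child in self.children:
                child.parent = self

    def descendants(node):
        for child in node.children:
            yield child
            yield from descendants(child)

    def ancestors(node):
        while node.parent is not None:
            node = node.parent
            yield node

    def slice_tree(root, tolerance=0, keep_tree=False):
        # Relevant nodes are those matching the filtering criterion.
        relevant = {n for n in [root, *descendants(root)] if n.relevant}
        # Each unit of tolerance also marks the parents of relevant nodes
        # (together with their descendants) as relevant.
        for _ in range(tolerance):
            parents = {n.parent for n in relevant if n.parent is not None}
            relevant |= parents | {d for p in parents for d in descendants(p)}
        # The slice contains the relevant nodes and their descendants,
        # plus their ancestors when keep tree is activated.
        in_slice = relevant | {d for n in relevant for d in descendants(n)}
        if keep_tree:
            in_slice |= {a for n in relevant for a in ancestors(n)}
        return sorted(n.label for n in in_slice)

    # A small example tree: node 5 is the only relevant node.
    tree = Node(1, children=[Node(2, children=[Node(3)]),
                             Node(4, children=[Node(5, relevant=True), Node(6)])])
    print(slice_tree(tree))                     # [5]
    print(slice_tree(tree, keep_tree=True))     # [1, 4, 5]
    print(slice_tree(tree, tolerance=1))        # [4, 5, 6]
    print(slice_tree(tree, tolerance=2))        # [1, 2, 3, 4, 5, 6]

With keep structure, the discarded nodes would be replaced by blank areas instead of being removed; that is purely a presentation matter and is not modelled in this sketch.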

Note: download the official Web Filtering Toolbar add-on from the Firefox Add-ons site.

 

 

  E x p e r i m e n t s

We conducted several experiments to measure the performance of the tool. The goal of the experiments was to identify the information in a given domain that is related to a particular query of the user. For each domain, the tool explored several webpages with a timeout of 10 seconds and extracted the relevant parts from them. We determined the actual relevant content of each webpage by downloading it and manually selecting the relevant content (both text and multimedia objects).

A summary of the results follows:

URL | Query | Pages | Retrieved | Correct | Missing | Recall | Precision | F1
www.ieee.org | student | 10 | 4615 | 4594 | 68 | 98.54% | 99.54% | 99.03%
www.upv.es | student | 19 | 8618 | 8616 | 232 | 97.37% | 99.97% | 98.65%
www.un.org/en | Haiti | 8 | 6344 | 6344 | 2191 | 74.32% | 100% | 85.26%
www.esa.int | launch | 14 | 4860 | 4860 | 417 | 92.09% | 100% | 95.88%
www.nasa.gov | space | 16 | 12043 | 12008 | 730 | 94.26% | 99.70% | 96.90%
www.mityc.es | turismo | 14 | 12521 | 12381 | 124 | 99% | 98.88% | 98.93%
www.mozilla.org | firefox | 7 | 6791 | 6791 | 14 | 99.79% | 100% | 99.89%
www.edu.gva.es | universitat | 28 | 10881 | 10856 | 995 | 91.60% | 99.79% | 95.51%
www.unicef.es | Pakistán | 9 | 5415 | 5415 | 260 | 95.41% | 100% | 97.65%
www.ilo.org | projects | 14 | 1269 | 1269 | 544 | 69.99% | 100% | 82.34%
www.mec.es | beca | 24 | 5527 | 5513 | 286 | 95.06% | 99.74% | 97.34%
www.who.int | medicines | 14 | 8605 | 8605 | 276 | 96.89% | 100% | 98.42%
www.si.edu | asian | 18 | 26301 | 26269 | 144 | 99.45% | 99.87% | 99.65%
www.sigmaxi.org | scientist | 8 | 26482 | 26359 | 241 | 99.08% | 99.54% | 99.30%
www.scientificamerican.com | sun | 7 | 5795 | 5737 | 97 | 98.33% | 98.99% | 98.65%
ecir2011.dcu.ie | news | 8 | 1659 | 1503 | 18 | 98.81% | 90.59% | 94.52%
dsc.discovery.com | arctic | 9 | 29097 | 29043 | 114 | 99.60% | 99.81% | 99.70%
www.nationalgeographic.com | energy | 12 | 41624 | 33830 | 428 | 98.75% | 81.27% | 89.16%
physicsworld.com | nuclear | 15 | 10249 | 10240 | 151 | 98.54% | 99.91% | 99.22%

The meaning of each column is:

URL: initial webpage used in the search.
Query: term used in the search.
Pages: number of pages explored by the Web Filtering toolbar.
Retrieved: number of DOM nodes retrieved by the toolbar.
Correct: number of retrieved nodes that were relevant.
Missing: number of relevant nodes not retrieved by the toolbar.
Recall: number of relevant nodes retrieved divided by the total number of relevant nodes (in all the analyzed webpages).
Precision: number of relevant nodes retrieved divided by the total number of retrieved nodes.
F1: computed as (2 * P * R) / (P + R), where P is the precision and R the recall (see the check below).
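
As a quick check of these formulas, the following snippet recomputes recall, precision, and F1 from the counts of the first row of the table (www.ieee.org, query "student"); small differences in the last decimal with respect to the table come from rounding:

    # Counts taken from the first row of the table above.
    retrieved, correct, missing = 4615, 4594, 68

    recall = correct / (correct + missing)       # relevant retrieved / all relevant
    precision = correct / retrieved              # relevant retrieved / all retrieved
    f1 = 2 * precision * recall / (precision + recall)

    print(f"recall={recall:.2%} precision={precision:.2%} F1={f1:.2%}")
    # recall=98.54% precision=99.54% F1=99.04% (reported in the table as 99.03%)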

In a second experiment we used the toolbar without any time limit. The goal of this experiment was to study how many pages the toolbar can retrieve from a given search space. We limited the search space to a specific domain so that we could know the number of pages it contains. This number was computed with the Apache crawler Nutch: the whole domain was indexed starting from the initial webpage and the number of indexed documents was counted. Finally, we compared the indexed search space with the search space explored by the tool.

URL | Query | Toolbar | Nutch | Recall
www.ieee.org | competition | 20 | 57 | 35.08%
www.upv.es | architecture | 19 | 70 | 27.14%
www.un.org | violence | 10 | 129 | 7.75%
www.esa.int | venus | 142 | 803 | 17.68%
www.nasa.gov | astronaut | 144 | 527 | 27.32%
www.mityc.es | programa | 43 | 106 | 40.56%
www.edu.gva.es | universitat | 33 | 55 | 60%
www.unicef.es | niños | 12 | 87 | 13.79%
www.ilo.org | child | 121 | 755 | 16.02%
www.mec.es | estudiante | 28 | 47 | 59.57%
www.who.int | alcohol | 14 | 31 | 45.16%
www.si.edu | biology | 8 | 238 | 3.3%
www.dsc.discovery.com | tornado | 165 | 246 | 67.07%
www.nationalgeographic.com | projects | 27 | 282 | 9.57%

The meaning of each column is:

URL: initial webpage used in the search.
Query: term used in the search.
Toolbar: number of pages explored by the Web Filtering toolbar.
Nutch: number of pages in the domain (size of the search space).
Recall: recall of the toolbar with respect to the whole search space indexed by Nutch. For instance, for www.edu.gva.es the toolbar explored 33 of the 55 pages indexed by Nutch, i.e., a recall of 60%.

  D o w n l o a d s

Here you can download the Web Filtering Toolbar:


Web Filtering Toolbar 1.5 (Beta Version) [Download]
This version:
+ Introduces the functionality of information retrieval from multiple webpages.
+ Improves the search functionality with highlighting.
+ Uses a new, improved tolerance algorithm.
+ Implements new keep structure algorithms.

Web Filtering Toolbar 1.4 [Download]
This version:
+ Completely restructures the source code with an object-oriented architecture.

Web Filtering Toolbar 1.3 [Download]
This version:
+ Corrects some small bugs detected in the source code.
+ Adds multi-language support.
+ Adds frames support.

Web Filtering Toolbar 1.2 [Download]
This version:
+ Optimizes some algorithms to speed up the filtering process.
+ Adds new functionality, such as the "highlight" button.

Web Filtering Toolbar 1.1 [Download]
This version solves several problems detected in the previous version:
+ A unique prefix has been added to all functions (name clashes are now avoided).
+ The options panel is now available through the Firefox Add-ons window.
+ The size of the add-on is now 1 MB instead of 2 MB.
+ Several unnecessary files have been removed.
+ A problem in the XUL interface specification has been corrected.
+ Several execution warnings have been solved.

Web Filtering Toolbar 1.0 [Download]

  

MiST GPLIS DSIC UPV
Last update: 03/04/2010 18:09:12