Downloads

Each entry below gives a description, a category, a developer/maintainer, and a download link.
Geo-WordNet: annotation of the WordNet 2.0 and WordNet 3.0 geographical location synsets with their coordinates.
Category: Geographical Information Retrieval, Toponym Disambiguation. Developer/Maintainer: Davide Buscaldi.
GeoSemCor 2.0: a geographically annotated version of SemCor for WordNet 2.0.
Category: Toponym Disambiguation. Developer/Maintainer: Davide Buscaldi.
EmotiCorpus: a set of humour-annotated Wikiquote pages (in Italian).
Category: Automatic Humour Recognition. Developer/Maintainer: Davide Buscaldi.
CiaoSenso/CDD: a Conceptual Density-based Word Sense Disambiguation tool.
Category: Word Sense Disambiguation. Developer/Maintainer: Davide Buscaldi.
SVM model: an SVM model trained for Arabic Named Entity Recognition on newswire documents. The input file should be:

1- In romanized characters, using the Buckwalter mapping table;
2- With clitics segmented, i.e. tokenized text (Mona Diab's tokenizer can be used for this purpose);
3- One word per line.

You should have YamCha installed and use the following command:

yamcha -m SVMmodel.model < inputFile > outputFile

The output file will contain two columns: the first one for words and the second one for tags.
Category: Arabic NER. Developer/Maintainer: Yassine Benajiba.
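As an illustration of the usage above, here is a minimal Python sketch that feeds a tokenized, one-word-per-line file to YamCha and reads back the word/tag columns. The file names are placeholders, and YamCha is assumed to be installed and on the PATH.

# Hypothetical wrapper around the command shown above:
#   yamcha -m SVMmodel.model < inputFile > outputFile
import subprocess

def tag_file(model_path, input_path, output_path):
    """Run YamCha on a one-word-per-line file and return (word, tag) pairs."""
    with open(input_path, "rb") as fin, open(output_path, "wb") as fout:
        subprocess.run(["yamcha", "-m", model_path], stdin=fin, stdout=fout, check=True)
    pairs = []
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            cols = line.split()
            if len(cols) >= 2:                     # skip sentence-boundary blank lines
                pairs.append((cols[0], cols[-1]))  # first column: word, last column: tag
    return pairs

if __name__ == "__main__":
    for word, tag in tag_file("SVMmodel.model", "inputFile", "outputFile"):
        print(word, tag)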
ANERCorp: a corpus of more than 150,000 words annotated for the NER task.
Category: Arabic NER. Developer/Maintainer: Yassine Benajiba.
ANERGazet: a collection of three gazetteers: (i) Locations, containing names of continents, countries, cities, etc.; (ii) People, containing names of people collected manually from different Arabic websites; and (iii) Organizations, containing names of organizations such as companies, football teams, etc.
Category: Arabic NER. Developer/Maintainer: Yassine Benajiba.
Documents: more than 11,000 Arabic Wikipedia articles in SGML format (the format adopted in CLEF and also the one accepted by the JIRS system).
Category: Arabic QA/IR. Developer/Maintainer: Yassine Benajiba.
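A minimal sketch for iterating over the SGML articles, assuming the usual CLEF-style <DOC>, <DOCNO> and <TEXT> tags; check the actual files and adjust the tag names if they differ. The file name is a placeholder.

import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.S)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.S)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.S)

def read_sgml(path):
    """Yield (docno, text) pairs from a CLEF-style SGML file (assumed tag names)."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    for block in DOC_RE.findall(raw):
        docno = DOCNO_RE.search(block)
        text = TEXT_RE.search(block)
        yield (docno.group(1).strip() if docno else "",
               text.group(1).strip() if text else "")

for docno, text in read_sgml("arabic_wikipedia.sgml"):  # placeholder file name
    print(docno, text[:60])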
Arabic JIRS: the Arabic JIRS experiments were carried out using the set of Arabic documents, questions and answers that can be found on this same page. The results have shown that much better coverage can be achieved when the Arabic data is light-stemmed.
Category: Arabic QA. Developer/Maintainer: Yassine Benajiba.
List of Questions: a list of 200 questions of different types. The proportion of each question type is the same as the proportion adopted in CLEF.
Category: Arabic QA. Developer/Maintainer: Yassine Benajiba.
Enriched List of Questions: the set of TREC and CLEF questions in Arabic (see the List of Questions above) enriched with a query expansion process.

These questions have been expanded using an Arabic WordNet-based semantic query expansion process divided into four types: by synonyms, by definitions, by subtypes and by supertypes.

(We would like to thank Lahsen Abouenour, Ecole Mohammadia d'Ingenieurs, Rabat, Morocco, for allowing us to make this collection available.)
Category: Arabic QA. Developer/Maintainer: Lahsen Abouenour / Paolo Rosso.
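A minimal sketch of the four expansion types (synonyms, definitions, subtypes, supertypes), shown here with NLTK's English WordNet as a stand-in for the Arabic WordNet actually used for this collection.

from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def expand(term):
    """Return the four expansion types for a query term."""
    synsets = wn.synsets(term)
    return {
        "synonyms":    sorted({l.name() for s in synsets for l in s.lemmas()} - {term}),
        "definitions": [s.definition() for s in synsets],
        "subtypes":    sorted({l.name() for s in synsets for h in s.hyponyms() for l in h.lemmas()}),
        "supertypes":  sorted({l.name() for s in synsets for h in s.hypernyms() for l in h.lemmas()}),
    }

print(expand("river"))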
List of Correct Answers: for each of the questions presented in my list of questions, I provide here a list of correct answers. This list is very important for automatic evaluation.
Category: Arabic QA. Developer/Maintainer: Yassine Benajiba.
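A minimal sketch of the automatic evaluation this answer list enables: a system answer counts as correct if, after normalisation, it matches one of the gold answers for that question. The data structures are illustrative, not the format of the distributed file.

def normalise(s):
    return " ".join(s.lower().split())

def accuracy(system_answers, gold_answers):
    """system_answers: question id -> answer; gold_answers: question id -> list of correct answers."""
    correct = sum(
        any(normalise(answer) == normalise(g) for g in gold_answers[qid])
        for qid, answer in system_answers.items()
    )
    return correct / len(system_answers)

print(accuracy({"q1": "Rabat"}, {"q1": ["Rabat", "Rabat, Morocco"]}))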
Arabic language rules (in Arabic): somebody mailed me this PPS file, which summarizes all the Arabic grammar rules; unfortunately there is no English version. I would have translated it myself, because it is really worth it, but the file contains 812 slides!
Category: Arabic language. Developer/Maintainer: Yassine Benajiba.
Arabic Wikipedia XML corpus (R30 version): I have selected the 30 most frequent categories of the Arabic Wikipedia XML corpus gathered by Ludovic Denoyer and Patrick Gallinari in order to provide a testbed for the single-label categorization task in the Arabic language. The gold standard is provided, as well as the tokenized and untokenized versions of this corpus.
Category: Clustering and Categorization (Arabic language). Developer/Maintainer: David Pinto.
The CICLing-2002 clustering corpus: this is a pre-processed version of 48 scientific abstracts from the CICLing 2002 conference (computational linguistics), which may be used to manually verify the results obtained in the clustering task of narrow-domain short texts.

(We would like to thank Dr. Alexander Gelbukh for allowing us to make this collection available.)
Category: Clustering and Categorization (English language, narrow-domain short texts). Developer/Maintainer: David Pinto.
The single-label hep-ex clustering corpus: this corpus is a pre-processed version of the collection of scientific abstracts named hep-ex, compiled by the University of Jaén, Spain [REF].

(We would like to thank Dr. Alfonso Ureña López and Dr. Arturo Montejo Ráez for making this interesting text collection available.)
Category: Clustering and Categorization (English language, narrow-domain short texts). Developer/Maintainer: David Pinto.
The KnCr clustering corpus: this is a new narrow-domain short-text corpus in the medicine domain, constructed by downloading the latest sample of documents provided in MEDLINE and selecting only those related to the "Cancer" domain.
Category: Clustering and Categorization (English language, narrow-domain short texts). Developer/Maintainer: David Pinto.
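A minimal sketch of the clustering task these narrow-domain short-text corpora target, using TF-IDF vectors and k-means from scikit-learn. The input file name and the number of clusters are assumptions; the resulting labels would be compared against the provided gold standard.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

with open("abstracts.txt", encoding="utf-8") as f:   # placeholder: one abstract per line
    texts = [line.strip() for line in f if line.strip()]

vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(vectors)
for label, text in zip(labels, texts):
    print(label, text[:60])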
Blogs clustering corpora: this is a set of corpora made up of discussion lines extracted from two blog websites: Boing Boing and Slashdot.

The categories (discussion lines), the blogs and the gold standard are provided.
Category: Clustering and Categorization (English language, short texts, blogs). Developer/Maintainer: Dani Pérez / David Pinto.
CLiPA corpus: this is a corpus composed of a set of 5 original text fragments (written in English) which have been plagiarised by multiple people and by machine translators (including versions in Spanish and Italian). The corpus has been designed for the development (and testing) of Cross-Lingual Plagiarism Analysis applications.
Category: Plagiarism Analysis (English/Spanish/Italian languages, text fragments). Developer/Maintainer: Alberto Barrón Cedeño.
Blogs Analysis corpus: the corpus comprises 8 sets. Every set contains 2,400 documents automatically retrieved from LiveJournal and Wikipedia. The corpus is organised as follows: (i) the [mfs] versions contain the documents labelled with POS tags and the most frequent sense according to WordNet; (ii) the [xml] versions contain the sets converted into the Senseval-2 XML format. The corpus has been designed for analysing humour features in the blogosphere.
Category: Humour Recognition (English language, blogs and articles). Developer/Maintainer: Antonio Reyes.
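A minimal sketch of the most-frequent-sense labelling described for the [mfs] versions: NLTK's WordNet lists a lemma's synsets by frequency, so the first synset for the tagged part of speech is taken as the MFS. The POS tagger and the example sentence are illustrative, not the corpus pipeline (requires the punkt, averaged_perceptron_tagger and wordnet NLTK data).

import nltk
from nltk.corpus import wordnet as wn

PENN_TO_WN = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}

def mfs_label(sentence):
    """Yield (word, POS tag, most frequent WordNet sense) triples."""
    for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
        wn_pos = PENN_TO_WN.get(pos[0])
        synsets = wn.synsets(word, pos=wn_pos) if wn_pos else []
        yield word, pos, (synsets[0].name() if synsets else None)  # first synset = MFS

print(list(mfs_label("The band played a funny song")))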
PAN-PC-09 corpus: this corpus contains documents into which artificial plagiarism has been inserted automatically. The corpus can be used to evaluate two kinds of plagiarism detection task: (i) external plagiarism detection; and (ii) intrinsic plagiarism detection.
Category: Automatic Plagiarism Detection. Developer/Maintainer: Martin Potthast, Benno Stein, Andreas Eiselt, Alberto Barrón-Cedeño, Paolo Rosso.
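As a simple illustration of the external detection setting, here is a sketch that flags suspicious/source document pairs by character n-gram containment. This is a baseline illustration only, not the PAN evaluation protocol.

def char_ngrams(text, n=5):
    """Set of character n-grams after lowercasing and whitespace normalisation."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def containment(suspicious, source, n=5):
    """Share of the suspicious document's n-grams also found in the source."""
    s, t = char_ngrams(suspicious, n), char_ngrams(source, n)
    return len(s & t) / len(s) if s else 0.0

print(containment("the quick brown fox jumps over the dog",
                  "a quick brown fox jumped over the lazy dog"))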
Co-derivatives corpus: this corpus has been generated for the analysis of co-derivatives, text reuse and plagiarism (simulated, of course). It is composed of more than 20,000 documents from Wikipedia in German, English, Hindi and Spanish (around 5,000 documents per language). For each language, some of the most frequently consulted Wikipedia articles were taken as pivots and ten revisions of each were downloaded; these revisions compose the sets of co-derivatives. The corpus has three versions: (i) original (articles without further manipulation); (ii) clean (articles after case folding and punctuation removal); and (iii) stopword-free (articles after case folding, punctuation removal and stopword removal).
Category: Co-derivatives and text similarity analysis (German, English, Hindi and Spanish; articles). Developer/Maintainer: Alberto Barrón Cedeño / Andreas Eiselt.
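A minimal sketch reproducing the corpus's three preprocessing levels (original, clean, stopword-free). The stopword list used here is NLTK's, chosen only as an illustration; the corpus comes with its versions already built.

import string
from nltk.corpus import stopwords   # requires nltk.download("stopwords")

def clean(text):
    """Case folding and punctuation removal (the 'clean' version)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def stopword_free(text, lang="english"):
    """Additionally drop stopwords (the 'stopword-free' version)."""
    stops = set(stopwords.words(lang))
    return " ".join(w for w in clean(text).split() if w not in stops)

article = "The quick brown fox, as everyone knows, jumps over the lazy dog."
print(clean(article))
print(stopword_free(article))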
CL-PL-09 corpus: the corpus includes texts in Dutch, English, French, German, Polish, and Spanish. It is divided into two sections: (i) comparable, including texts on the same topic extracted from Wikipedia; and (ii) parallel, including texts extracted from the JRC-Acquis corpus. In both cases, documents in the six languages are included (whether parallel or just on the same topic). The objective is to address two of the most common cross-language plagiarism detection tasks: detection of exact translations and detection of related documents.
Category: Cross-Language Plagiarism Detection. Developer/Maintainer: Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, Paolo Rosso.
Drug Target Corpus: it consists of positive and negative drug-target abstracts from DrugBank and PubMed. It was created from abstracts published from 1995 to 2003.
Category: Biomedicine. Developer/Maintainer: Roxana Danger.
English-Spanish dictionary of weighted morphological forms: it contains an exhaustive list of word forms weighted according to the distributions of the corresponding grammar classes in reference corpora.
Category: Cross-Language Information Retrieval, Cross-Language Plagiarism Detection, Machine Translation. Developer/Maintainer: Grigori Sidorov, Alberto Barrón-Cedeño.
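A minimal sketch of how a weighted dictionary of translation forms can feed a cross-language similarity score: each English word contributes the weight of any of its listed Spanish forms found in the other text. The toy dictionary and weights below are purely illustrative; the distributed resource has its own format and coverage.

TOY_DICT = {   # illustrative entries only, not taken from the resource
    "house": {"casa": 0.9, "vivienda": 0.1},
    "white": {"blanco": 0.5, "blanca": 0.5},
}

def cl_similarity(english_text, spanish_text):
    """Sum of translation weights matched in the Spanish text, per English word."""
    spanish_words = set(spanish_text.lower().split())
    score = sum(
        weight
        for word in english_text.lower().split()
        for form, weight in TOY_DICT.get(word, {}).items()
        if form in spanish_words
    )
    return score / max(len(english_text.split()), 1)

print(cl_similarity("the white house", "la casa blanca"))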
Opinion analysis corpus: it contains 3,000 opinions in the tourism domain. These opinions have been obtained from the TripAdvisor blog.
Category: Opinion Analysis, Sentiment Analysis. Developer/Maintainer: Enrique Vallés Balaguer, Paolo Rosso.

Twitter Hashtags Corpus: the corpus contains 50,000 texts automatically retrieved from Twitter. It is divided into 5 data sets; each one contains 10,000 texts. Different hashtags were considered in order to collect the texts: #humor, #irony, #politics, #technology. For further information, please send an email to Antonio Reyes.
Category: Humor Recognition, Irony Detection, Sentiment Analysis. Developer/Maintainer: Antonio Reyes.

Twitter Data Sets: the corpus contains 40,000 tweets automatically retrieved from Twitter. It is divided into 4 sets; each one contains 10,000 texts. Different hashtags were considered in order to collect the texts: #irony, #education, #humor, #politics. Two more data sets are included; they contain tweets regarding the hashtag #Toyota. Each of these sets has 250 tweets, divided into positive and negative attitude. For further information, please send an email to Antonio Reyes.
Category: Irony Detection. Developer/Maintainer: Antonio Reyes.

Features Inventory: this file contains the elements used to represent all the dimensions of the Signatures feature.
Category: Irony Detection. Developer/Maintainer: Antonio Reyes.

Ironic Quotes: this file contains about 1,000 ironic quotes manually retrieved from the Web.
Category: Irony Detection. Developer/Maintainer: Antonio Reyes.

Amazon Data Sets: this file contains comments regarding four different products. These comments contain ironic, funny, satiric and sarcastic content.
Category: Figurative Language Processing, Humor Recognition, Irony Detection. Developer/Maintainer: Antonio Reyes.

PAN-PC-10 corpus: this corpus contains documents into which artificial and simulated plagiarism has been inserted. The corpus can be used to evaluate two kinds of plagiarism detection task: (i) external plagiarism detection; and (ii) intrinsic plagiarism detection.
Category: Automatic Plagiarism Detection. Developer/Maintainer: Martin Potthast, Alberto Barrón-Cedeño, Andreas Eiselt, Benno Stein, Paolo Rosso.

SPADE corpus: this toy corpus is composed of a set of source codes written in Python, together with source codes manually translated from the Python sources into C and Java. Those translations represent a partial re-use of the Python sources. The corpus has been designed for the development (and testing) of Cross-Lingual Source Code Re-use/Plagiarism Analysis applications.
Category: Source Code Re-Use Detection. Developer/Maintainer: Enrique Flores, Alberto Barrón-Cedeño, Paolo Rosso, Lidia Moreno.
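A minimal sketch of the kind of comparison this corpus supports: language-independent character 3-gram cosine similarity between a Python source file and a translated C/Java counterpart. The file paths are placeholders, and this is only a baseline illustration of cross-lingual source code re-use analysis.

from collections import Counter
from math import sqrt

def ngram_profile(code, n=3):
    """Character n-gram frequency profile of a source file, whitespace-collapsed."""
    code = " ".join(code.split())
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))

def cosine(a, b):
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

with open("original.py", encoding="utf-8") as f:        # placeholder paths
    source = f.read()
with open("suspicious.java", encoding="utf-8") as f:
    suspicious = f.read()
print(cosine(ngram_profile(source), ngram_profile(suspicious)))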

CL!TR PAN@FIRE: the corpus contains a set of potential source documents D, written in English, and a set of suspicious documents S, written in Hindi. In the corpus you will find plain text files encoded in UTF-8. The source documents are taken from the English Wikipedia and include wiki markup.
Category: Cross-Language Plagiarism Detection. Developer/Maintainer: Alberto Barrón-Cedeño, Paolo Rosso, Sobha Lalitha, Paul Clough, Mark Stevenson.