Research Interests: |
| Corpora: |
| Title | Link | Year | Related paper |
PAN Plagiarism Corpus PAN-PC-09: This corpus contains documents into which artificial plagiarism has been inserted automatically. It can be used to evaluate two kinds of plagiarism detection tasks: (i) External plagiarism detection: given a set of suspicious documents and a set of source documents, the task is to find all plagiarized text passages in the suspicious documents and the corresponding text passages in the source documents. (ii) Intrinsic plagiarism detection: given a set of suspicious documents, the task is to identify all plagiarized text passages, e.g., by detecting breaches in writing style. Comparing a suspicious document with other documents is not allowed in this task. |
2009 |
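The external detection task described above can be illustrated with a very small sketch. It uses shared word n-grams as evidence of reuse, which is a simple baseline chosen here for illustration and not the method used to build or evaluate the corpus. The function names, the choice of `n=3` and the sample texts are all assumptions.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(suspicious, sources, n=3):
    """Map each source id to the n-grams it shares with the suspicious text.

    Sources that share no n-gram with the suspicious text are omitted.
    """
    suspicious_grams = word_ngrams(suspicious, n)
    hits = {}
    for source_id, source_text in sources.items():
        common = suspicious_grams & word_ngrams(source_text, n)
        if common:
            hits[source_id] = common
    return hits


# Hypothetical toy documents, not taken from PAN-PC-09.
sources = {
    "src-a": "the quick brown fox jumps over the lazy dog",
    "src-b": "nothing in common with the other document",
}
hits = shared_ngrams("we saw the quick brown fox today", sources)
```

A real detector would also have to map the shared n-grams back to character offsets in both documents, since the task asks for passages and not just document pairs.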
Co-derivatives corpus: This corpus was generated for the analysis of co-derivatives, text reuse and (simulated) plagiarism. It is composed of more than 20,000 Wikipedia documents in German, English, Hindi and Spanish (around 5,000 documents per language). For each language, some of the most frequently consulted Wikipedia articles were taken as pivots, and ten revisions of each pivot were downloaded; these compose its set of co-derivatives. The corpus has three versions: Original, Clean and Stopwords free. The first contains the Wikipedia articles without further manipulation. The second contains the articles after case folding and removal of punctuation marks. The last contains the articles after case folding and removal of both punctuation marks and stopwords. |
2009 |
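The Clean and Stopwords free versions of the Co-derivatives corpus could be derived from the Original text roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the stopword list is a tiny English placeholder (the lists actually used for each language are not given here), and `string.punctuation` covers ASCII punctuation only.

```python
import string

# Placeholder stopword list for illustration only (assumption).
STOPWORDS = {"the", "a", "of", "and", "in", "is"}

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """'Clean' version: case folding plus removal of punctuation marks."""
    folded = text.casefold()
    without_punct = folded.translate(_PUNCT_TABLE)
    # Collapse any whitespace runs left behind by the removed characters.
    return " ".join(without_punct.split())


def stopwords_free(text, stopwords=STOPWORDS):
    """'Stopwords free' version: the clean text with stopwords dropped."""
    return " ".join(w for w in clean(text).split() if w not in stopwords)


original = "The Corpus, in 2009!"
cleaned = clean(original)
reduced = stopwords_free(original)
```

For German, Hindi and Spanish text, a Unicode-aware punctuation filter and language-specific stopword lists would be needed in place of the placeholders used here.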
| Developments: |
| Title | Year | Documentation | |
TERMEXT - another TERM EXtraction Tool: This tool is an adaptation of the C-value/NC-value algorithm. The adaptations are described in the paper "An Improved Term Recognition Method for Spanish". The method was designed to process documents written in Spanish; however, any language can be processed once the corresponding linguistic rules are added to a text file. The documentation is currently in Spanish, but the source code (in Python) includes extensive documentation written in English. This prototype was developed in the Natural Language Group at UNAM. An online version of the prototype (not the one I implemented) is also available. |
2008 |
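For readers unfamiliar with the algorithm that TERMEXT adapts, the sketch below computes the standard C-value score in its commonly cited simplified form: log2 of the term length times its frequency, with the mean frequency of the longer candidates containing it subtracted when the term is nested. It is not TERMEXT's code and does not include the Spanish-specific adaptations or the NC-value step. The function names and the toy frequencies are assumptions.

```python
import math


def is_nested(short, long):
    """True if the word tuple `short` occurs contiguously inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_values(term_freqs):
    """Score candidate terms with the standard (unadapted) C-value.

    term_freqs maps each candidate term, given as a tuple of words,
    to its corpus frequency.
    """
    scores = {}
    for term, freq in term_freqs.items():
        # Frequencies of longer candidates that contain this term.
        container_freqs = [
            other_freq
            for other, other_freq in term_freqs.items()
            if len(other) > len(term) and is_nested(term, other)
        ]
        if container_freqs:
            # Nested term: subtract the mean frequency of its containers.
            freq = freq - sum(container_freqs) / len(container_freqs)
        scores[term] = math.log2(len(term)) * freq
    return scores


# Hypothetical candidate frequencies for illustration.
scores = c_values({
    ("soft", "contact", "lens"): 3,
    ("contact", "lens"): 5,
})
```

Note that with this formula every single-word candidate scores zero, because log2(1) = 0; this sketch makes no attempt to handle that case.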
| Final Degree Projects: |
| Title: | Grade: | Year: | Download |
| Detección automática de plagio en texto (Automatic plagiarism detection in text) | M.Sc | 2008 | |
| Extracción automática de términos en contextos definitorios (Automatic term extraction in definitional contexts) | M.Sc | 2007 | |
| Sistema de avisos a través de página web en el servidor de correo electrónico del Instituto de Ingeniería (Web-page notification system on the e-mail server of the Instituto de Ingeniería) | B.Eng. | 2004 | |