Research Interests:

  • Information Extraction
  • Plagiarism Detection (References collection)
  • Text similarity analysis
Downloads:

    Corpora:
    Title Link Year Related paper
    PAN Plagiarism Corpus PAN-PC-09
    This corpus contains documents into which artificial plagiarism has been inserted automatically. It can be used to evaluate two kinds of plagiarism detection tasks: (i) External plagiarism detection: given a set of suspicious documents and a set of source documents, the task is to find all plagiarized text passages in the suspicious documents and the corresponding passages in the source documents. (ii) Intrinsic plagiarism detection: given a set of suspicious documents, the task is to identify all plagiarized text passages, e.g., by detecting breaches in writing style. Comparing a suspicious document with other documents is not allowed in this task.
    2009
    Co-derivatives corpus
    This corpus was generated for the analysis of co-derivatives, text reuse and (simulated) plagiarism. It is composed of more than 20,000 Wikipedia documents in German, English, Hindi and Spanish (around 5,000 documents per language). For each language, some of the most frequently consulted Wikipedia articles were taken as pivots, and ten revisions of each were downloaded; these revisions make up the set of co-derivatives. The corpus comes in three versions: Original, Clean and Stopword-free. The first contains the Wikipedia articles without further manipulation. The second contains the articles after case folding and removal of punctuation marks. The last contains the articles after case folding and removal of both punctuation marks and stopwords.
    2009
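The three versions of the Co-derivatives corpus described above could be derived from a raw article roughly as follows. This is a minimal sketch, not the procedure actually used to build the corpus: the function names, the tiny English stopword list, and the use of ASCII punctuation only are all assumptions made for illustration (a multilingual corpus would need per-language stopword lists and a wider punctuation set).

```python
import string

# Tiny illustrative English stopword list (an assumption; the lists actually
# used for the corpus are not given in this page).
STOPWORDS = {"the", "of", "and", "a", "in", "is", "to"}


def clean(text):
    """'Clean' version: case folding plus removal of punctuation marks."""
    folded = text.casefold()
    # Map every ASCII punctuation mark to a space, then collapse whitespace.
    # Non-ASCII marks (e.g. Spanish inverted marks) are NOT handled here.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(folded.translate(table).split())


def stopword_free(text, stopwords=STOPWORDS):
    """'Stopword-free' version: the clean text with stopwords dropped."""
    return " ".join(tok for tok in clean(text).split() if tok not in stopwords)


def corpus_versions(text):
    """Return the three versions of one document, keyed by version name."""
    return {
        "original": text,
        "clean": clean(text),
        "stopword_free": stopword_free(text),
    }


if __name__ == "__main__":
    for name, version in corpus_versions("The History of Spain, in brief.").items():
        print(f"{name}: {version}")
```

Keeping the stopword list as a parameter means the same function can be reused for each of the four languages once suitable lists are supplied.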
    Developments:
    Title Year Documentation
    TERMEXT - another TERM EXtraction Tool
    This tool is an adaptation of the C-value/NC-value algorithm. The adaptations are described in the paper "An Improved Term Recognition Method for Spanish".
    The method was designed to process documents written in Spanish. However, any other language can be processed once the corresponding linguistic rules are added to a text file.
    The documentation is currently available in Spanish only. However, the source code (in Python) is extensively documented in English.
    This prototype was developed in the Natural Language Group at UNAM. An on-line version of the prototype (not the one I implemented) is available here.
    2008
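For readers unfamiliar with the C-value measure that TERMEXT builds on, the sketch below computes the standard C-value as it is commonly stated in the term extraction literature. It is not TERMEXT's code and does not include the Spanish-specific adaptations or the NC-value step. The function names and the input format (a dictionary from candidate terms, given as tuples of words, to corpus frequencies) are assumptions made for illustration.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq):
    """Standard C-value for each candidate term.

    `freq` maps a candidate term (a tuple of words) to its corpus frequency.
    For a term a that is not nested in any longer candidate:
        C(a) = log2(|a|) * f(a)
    For a term nested in the set T_a of longer candidates:
        C(a) = log2(|a|) * (f(a) - sum(f(b) for b in T_a) / |T_a|)
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of all longer candidates that contain this term.
        nesting = [freq[other] for other in freq if contains(other, term)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        # log2(|a|) is 0 for single words, so this form only ranks multi-word terms.
        scores[term] = math.log2(len(term)) * f
    return scores


if __name__ == "__main__":
    candidates = {
        ("soft", "contact", "lens"): 4,
        ("contact", "lens"): 10,
    }
    for term, score in c_values(candidates).items():
        print(" ".join(term), round(score, 3))
```

The nested term is discounted by the average frequency of the longer candidates that contain it, so a phrase that almost always appears inside a longer term scores low on its own.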
    Final Degree Projects:
    Title: Degree: Year: Download
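    Detección automática de plagio en texto (Automatic plagiarism detection in text) M.Sc. 2008
    Extracción automática de términos en contextos definitorios (Automatic term extraction in definitional contexts) M.Sc. 2007
    Sistema de avisos a través de página web en el servidor de correo electrónico del Instituto de Ingeniería (Web-based notification system on the e-mail server of the Instituto de Ingeniería) B.Eng. 2004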