
Welcome to the Tutorial: Experimenting IR/NLP with Terrier



Introduction

The research areas of Information Retrieval (IR), Natural Language Processing (NLP), Text Mining, etc. are linked by a common input: text. To do research in these areas, we need to handle text efficiently. Quite often we need to compare texts, sometimes across languages. There are several ways to compare text, e.g. Vector Space Models, Probabilistic Models and Language Models. Research in these areas faces two main challenges: i) data sets are growing enormously in size, and ii) the field moves quickly, so new baselines keep appearing. We cannot afford to implement all of these baselines ourselves. An Information Retrieval library provides them ready-made, so it is important to use one for your research objectives. This saves both time and effort, and it gives us (and the reviewers) confidence in the implementation of the baselines.

There are quite a few IR libraries to choose from, e.g. Lucene, Lemur, Xapian, Terrier and Sphinx, to name some mainstream options. All of them provide basic IR functionality, so you can choose one depending on your preferred programming language. Terrier has some advantages over the others: it implements a wide range of IR models, whereas most of the others offer only a few; it is efficient in terms of both speed and memory; and its API is very easy to extend for customised use.

This tutorial aims to share developer knowledge of the Terrier API. It explains the basic needs that IR/NLP experiments place on an IR system and how easily Terrier satisfies them. The aim is to help participants use Terrier in their experiments irrespective of their programming-language background. The tutorial includes extensive hands-on sessions. Various parts of the API will be explored through case studies, such as how to fetch term- or document-related statistics from the index, how to create a term-document matrix of the corpus, and how to customise indexing and retrieval. At the end, an application case study (IRIS - Chat system) will cover how the API can be useful for the IR part of an application. The whole idea is to make participants comfortable with the Terrier API.
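
To give a feel for what a term-document matrix is, here is a minimal sketch in plain Python. It does not use the Terrier API: the function names (`tokenize`, `term_document_matrix`, `cosine`) and the three toy documents are made up for illustration, and tokenisation is a naive lowercase split on whitespace. It also shows a simple Vector Space Model comparison, the cosine similarity between two document columns.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokeniser (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()


def term_document_matrix(docs):
    # Count term frequencies per document.
    counts = [Counter(tokenize(doc)) for doc in docs]
    # Vocabulary: every distinct term, sorted so row order is deterministic.
    vocab = sorted({term for c in counts for term in c})
    # One row per term, one column per document; each cell is a raw term frequency.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def cosine(matrix, i, j):
    # Cosine similarity between document columns i and j.
    dot = sum(row[i] * row[j] for row in matrix)
    norm_i = math.sqrt(sum(row[i] ** 2 for row in matrix))
    norm_j = math.sqrt(sum(row[j] ** 2 for row in matrix))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


# Toy corpus, hypothetical and only for illustration.
docs = [
    "terrier indexes text",
    "terrier retrieves text",
    "cats chase mice",
]
vocab, matrix = term_document_matrix(docs)

for term, row in zip(vocab, matrix):
    print(term, row)
print(cosine(matrix, 0, 1))  # first two documents share two of three terms
print(cosine(matrix, 0, 2))  # no shared terms
```

On a real corpus the same counts would come from an index rather than from re-reading the documents, which is what the hands-on sessions cover.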


Material

Time and Venue

Session 1: Monday, 15 April from 3pm to 5pm in room Potential 1 @ 13N
Session 2: Monday, 22 April from 3pm to 5pm in room Potential 1 @ 13N


Requirements