Introduction


This edition of CL!TR focuses on journalistic text re-use. News agencies are a prolific source of text on the Web and a valuable source of text in multiple languages. News stories generated by different authors, whether written independently or derived from one another, typically exist as separate entities, and consequently there is a need to link them.


Linking news stories covering the same events but written in different languages offers a number of benefits. For example, in a multilingual environment such as India, where the same news story is covered in multiple languages, a reader might want to refer to the local-language version of a story. News stories covering the same event(s), published in different languages, may also be rich sources of both parallel and comparable text: parallel fragments such as direct quotes or translation equivalents, and comparable fragments such as paraphrases. Identifying similar news stories written in multiple languages therefore yields a valuable multilingual resource. For Indian languages in particular, language resources for NLP and IR tasks are limited; identifying comparable and parallel documents on the Web would offer a potential (and abundant) source for deriving bilingual dictionaries and training statistical MT systems (Munteanu & Marcu, 2005; Barker & Gaizauskas, 2012).


In 2012, the aim is to identify the same story written in multiple languages (a problem of cross-language news story detection). The task will involve identifying and linking news stories covering the same event but published in different languages. In coming editions of CL!NSS the aim will be to extract equivalent text fragments (parallel and comparable) and, finally, to identify cases of potential co-derivation between documents (a common scenario in journalism, as content is shared between news agencies and newspapers). The latter task has been extensively studied in monolingual settings, but not as deeply in cross-language ones. We divide the problem of CL!NSS into three distinct tasks (a rough sketch of these as interfaces follows the list):


1. Story detection: given a story in one language, find the news report covering the same story but written in a different language.


2. Fragment detection: given a pair of similar (comparable) news reports, extract parallel text fragments (e.g. sentences or phrases).


3. Story/fragment classification (derived or non-derived): in some cases a pair of news reports (or fragments) are co-derived, i.e. one of them has been written using the other as a source.
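
The following is a minimal sketch of these three sub-tasks expressed as hypothetical Python interfaces; the function names and type aliases are illustrative assumptions and are not part of any CL!NSS specification.

    # Hypothetical interfaces for the three CL!NSS sub-tasks (all names are assumptions).
    from typing import List, Tuple

    Document = str   # a full news article in one language
    Fragment = str   # a sentence- or phrase-level span of an article

    def detect_story(query: Document, candidates: List[Document]) -> List[int]:
        """Task 1: indices of candidate articles (in another language) covering the same story."""
        raise NotImplementedError

    def detect_fragments(a: Document, b: Document) -> List[Tuple[Fragment, Fragment]]:
        """Task 2: aligned parallel/comparable fragment pairs from a comparable article pair."""
        raise NotImplementedError

    def classify_derivation(a: Fragment, b: Fragment) -> bool:
        """Task 3: whether two similar fragments are co-derived (one written from the other)."""
        raise NotImplementedError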


Definitions


A news story communicates to its readers information about an event or series of events (an event being something that happens at a specific time and location). For example, a news story might follow events in Syria. An article/report is a single text, published on a specific date in a newspaper or online, which reports the story (or part of the story) from a particular aspect/viewpoint for a particular audience (e.g. written in a specific language). A news story will often consist of a collection of articles. Some stories are ‘one-off’ (describing events which occur on only one day); others are ‘running’, where events across more than one day are reported. Another example is Wimbledon, an event which occurs annually; a particular article might describe the outcome of the final match of the tournament.


As previously stated, locating similar news stories has a number of potential uses. However, a key issue is deciding when two news articles are similar. One would assume that the more similar the news articles, the more comparable they are and consequently the more useful, e.g. as a source of comparable text. To provide a basis for judging similar news stories, we adopt the scheme devised by Barker & Gaizauskas (2012). This scheme is well-suited to the CL!NSS task and is applicable to both monolingual and cross-language comparison of news stories. The scheme is based on identifying the content and structure of news articles in terms of three elements: the Focal event, Background events and the News event.


Focal event: the main event or events which provide the focus for the news story. The focal event is a very specific level of information: it is very often the most recent event in an unfolding news story, and it provides a particular angle or perspective for the report. For example, in "Nagaland Congress seeks NPF backing for Pranab", the focal event is the Nagaland Congress seeking the NPF's backing for Pranab in the Indian Presidential election.


Background event: an event that plays a supporting role in the text, providing context for the focal events. It may include: related events leading up to the focal events; examples of similar past events; and definitions, explanations or descriptions of things, people and/or places which play a role in the focal events.


News event: a group of related events, broader than and including the focal event, which may be reported over time in different news text instalments. This is related to the concept of a "real-world event": all the news stories related to a particular event taking place in the world share the same News event. For example, all the news articles related to the current Presidential election in India, from the early articles on possible candidates, through the controversies raised along the way, to the final stories on the completion of the election and its results, fall under the same News event.
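
To make the distinction between the three event types concrete, here is a minimal sketch of how such annotations could be recorded for an article; the field names are assumptions and do not reproduce Barker & Gaizauskas's annotation format.

    # A sketch of per-article annotations following the focal/background/news event
    # distinction; field names are assumptions, not an official annotation schema.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ArticleAnnotation:
        language: str                  # e.g. "en", "hi", "gu"
        news_event: str                # broad real-world event, e.g. "Indian Presidential election 2012"
        focal_event: str               # the specific event the article is about
        background_events: List[str] = field(default_factory=list)

    def same_focal_event(a: ArticleAnnotation, b: ArticleAnnotation) -> bool:
        """Articles are linked in CL!NSS when they describe the same focal event."""
        return a.focal_event == b.focal_event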


Fig. 1 summarises the proposed tasks for CL!NSS and highlights the different forms of similarity that may exist between news stories. Let A and B be a pair of news reports written in different languages. A and B can loosely be on the same theme/topic (or category). This is typically the goal of IR systems: to identify documents which are relevant to a given query (where relevance typically reflects topicality). Information about the news story, such as its category (Entertainment, Sports, News etc.), together with the date of publication, is sometimes available in the metadata.




Figure 1. Summary of tasks in CL!NSS and the relationship between a pair of news articles Pair(A, B)



The second level in Fig. 1 identifies the case where two news reports essentially describe the same focal events, i.e. they could be the same news report produced in multiple languages: the focus of the news stories is similar and essentially the same events are reported in each article. The third level signifies the situation in which fragments of the texts are clearly the same (e.g. they are translation equivalents), although they may be subject to forms of paraphrasing or the use of colloquial language. This is usually referred to as shared content, where the granularity of the content can be at sentence or sub-sentence level. The final level in Fig. 1 represents the situation in which the similar text fragments are actually derived from each other (e.g. one text fragment is copied from the other, or both come from a common source).
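
As an illustration of the third level (shared content at sentence level), the sketch below scores cross-language sentence pairs by bilingual word overlap; the tiny English-Hindi lexicon and the threshold are placeholders, and a real system would use a full dictionary or an alignment/MT model.

    # A sketch of sentence-level shared-content detection between an English and a
    # Hindi article; the lexicon and threshold below are illustrative placeholders.
    from typing import Dict, List, Tuple

    BILINGUAL_LEXICON: Dict[str, str] = {   # hypothetical English -> Hindi entries
        "bail": "बेल",
        "jail": "जेल",
        "court": "कोर्ट",
    }

    def overlap_score(en_sentence: str, hi_sentence: str) -> float:
        """Fraction of translatable English words whose translation appears in the Hindi sentence."""
        en_tokens = en_sentence.lower().split()
        hi_tokens = set(hi_sentence.split())
        translatable = [t for t in en_tokens if t in BILINGUAL_LEXICON]
        if not translatable:
            return 0.0
        hits = sum(1 for t in translatable if BILINGUAL_LEXICON[t] in hi_tokens)
        return hits / len(translatable)

    def shared_fragments(en_sentences: List[str], hi_sentences: List[str],
                         threshold: float = 0.5) -> List[Tuple[str, str]]:
        """Return sentence pairs whose bilingual word overlap meets the threshold."""
        return [(e, h) for e in en_sentences for h in hi_sentences
                if overlap_score(e, h) >= threshold]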


Task Description




The focus of CL!NSS this year is to evaluate the identification of news stories with the same focal event in a cross-language environment. Given a source collection, S, containing news stories in Indian languages, Li, and a target collection, T, containing news stories in English, Lt, the task is to link each news story in T to its corresponding version in S for each Li. Two news stories are considered the same if they describe the same focal event. For example, “Housing minister Somanna attacked with slipper” and “BJP leaders condemn attack on Somanna” are two news stories with two different focal events: the former describes the main news event whilst the latter describes the consequences of that event. The framework of this year’s task is shown in Fig. 2. The languages included in S will be Hindi and Gujarati (and possibly Marathi).


Figure 2. Framework of the CL!NSS task for 2012

The task is similar to a (cross-language) copy detection task in which the query is an entire document and “similar” documents must be found in a set of known documents. The task is not trivial because similar stories may exist with varying degrees of overlap (e.g. a story written in English and used as the query text may be a subset of a longer story written in a different language, and vice versa). Table 1 provides an example of a relevant and a non-relevant English-Hindi text pair. Although both source articles share the same news event as the target, the focal event is the same as the target's for source article 1 (relevant) but differs for source article 2 (non-relevant).

Target Article: Aarushi case: Court rejects Nupur Talwar's bail plea
  (http://timesofindia.indiatimes.com/city/delhi/aarushi-murder-case-court-rejects-nupur-talwars-bail-plea/articleshow/12963223.cms)

Source Article 1 (same focal event as target: Positive/Relevant):
  नूपुर को नहीं मिली बेल, जेल में ही रहेंगी
  (Nupur is not granted bail; she will stay in jail)
  (http://navbharattimes.indiatimes.com/articleshow/12963227.cms)

Source Article 2 (different focal event from target: Negative/Non-relevant):
  नूपुर को बेल या रहेंगी जेल में ही, फैसला आज
  (Will Nupur get bail or stay in jail: decision today)
  (http://navbharattimes.indiatimes.com/articleshow/12949018.cms)

Table 1. Example English-Hindi text pairs: both source articles describe the same news event as the target article, but only Source Article 1 shares its focal event.
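
To illustrate the document-level linking step, the following is a minimal sketch of a dictionary-translation baseline, assuming a bilingual lexicon is available: each English target story is converted into a bag of translated terms and every source-language story is ranked by cosine similarity. The helper names and the lexicon are assumptions, not part of the task infrastructure.

    # A sketch of ranking source-language stories for one English target story using a
    # dictionary-translation + cosine-similarity baseline; the lexicon is a placeholder.
    from collections import Counter
    from math import sqrt
    from typing import Dict, List

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two bags of terms."""
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def link_story(target_en: str, sources: List[str],
                   lexicon: Dict[str, str], top_k: int = 10) -> List[int]:
        """Return indices of the top_k source-language stories for one English target story."""
        translated = Counter(lexicon[w] for w in target_en.lower().split() if w in lexicon)
        scores = [cosine(translated, Counter(s.split())) for s in sources]
        return sorted(range(len(sources)), key=lambda i: scores[i], reverse=True)[:top_k]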


References


  1. Dragos Munteanu and Daniel Marcu (2005). Improving Machine Translation Performance by Exploiting Comparable Corpora. Computational Linguistics, 31(4), pp. 477-504, December.
  2. Emma Barker and Robert Gaizauskas (2012). Assessing the Comparability of News Texts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12).

Contact: clinss@dsic.upv.es
