MEDCo Annotation Project
Demand for information extraction and text mining has been increasing rapidly in the biological and medical sciences, due to the information overload caused by fast advances in genome-related fields. For automatically extracting useful information from texts written by scientists, co-reference resolution is one of the critical components.
This project annotates co-reference information in MedLine abstracts (GENIA Collection) and in full biology papers from the same domain. It produces an important resource for co-reference resolution in the biomedical domain, and benefits information extraction and other text-mining applications.
This is a joint project of the Institute for Infocomm Research (I2R) team, Singapore, and Tsujii Laboratory, Tokyo University. It consists of two phases: abstract annotation (Feb 2003 - Aug 2006) and full paper annotation (Dec 2006 – Nov 2007). Tsujii Lab provides the funding support and the biology validation of the linguistic annotation done by the I2R team. Dr. Tateisi Yuka from Tsujii's Lab coordinated the biology validation of the abstract annotation with 5 biology Master's and PhD students from Tokyo U. Dr. Jin-Dong Kim from Tsujii's Lab coordinated the second phase on full paper annotation.

The following members of the I2R team are involved in the linguistic annotation.

Members
Project Manager:
Su Jian
Annotation Scheme Designer:
Hong Hua Qing
With inputs from Su Jian, Yang Xiao Feng and Zhou Guo Dong
Annotators:
Chong Lai Khar
Fan Zhen Zhen
Yeo Poh Khim
Hong Hua Qing
Ong Peishan Jasmine
Heng Wei Chu
Programming Support:
Zhang Jie
Chen Bin
Yang Xiao Feng
So far we have annotated 1,999 abstracts, comprising 16,819 sentences and 460k words. There are 45,982 markables, of which 32,464 are anaphoric and 13,518 are discourse-new. Four types of co-reference relations are annotated: identity (IDENT), pronominal (PRON), appositive (APPOS), and relative (RELAT). Inter-annotator agreement on 15 abstracts is 0.83 in terms of Krippendorff's Alpha. This indicates that the corpus is of high quality for research on coreference resolution, as the value is higher than 0.67, the threshold for a corpus to be useful according to (Passonneau, R. 2004).
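As a rough illustration of the agreement measure, the sketch below computes Krippendorff's Alpha for nominal labels with the standard coincidence-matrix formulation. It is a simplification and not the project's procedure: it assumes every annotator assigns one relation label per item, whereas agreement on coreference annotation (Passonneau, 2004) compares coreference chains with a set-based distance. The function name, the toy labels and the annotator names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal labels.

    `units` is a list of lists: each inner list holds the labels that the
    annotators assigned to one item (assumed format, for illustration only).
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by different annotators, weighted by 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (both left unnormalised by n,
    # which cancels in the ratio).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Toy data: two hypothetical annotators labelling the same ten links.
annotator_a = ["IDENT", "PRON", "IDENT", "IDENT", "IDENT",
               "IDENT", "IDENT", "IDENT", "PRON", "IDENT"]
annotator_b = ["PRON", "PRON", "PRON", "IDENT", "IDENT",
               "PRON", "IDENT", "IDENT", "IDENT", "IDENT"]

alpha = krippendorff_alpha_nominal([list(p) for p in zip(annotator_a, annotator_b)])
print(f"alpha = {alpha:.3f}, above 0.67 threshold: {alpha > 0.67}")
```

The toy annotators disagree on four of ten items, so the resulting value falls far below the 0.67 threshold, in contrast to the 0.83 reported for the corpus.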
Using part of the annotations, a coreference resolver was built (Xiaofeng Yang et al., 2004a; Xiaofeng Yang et al., 2004b) in another project, Information Extraction on Biology Literature (IEBL).
During the abstract annotation, we also explored the annotation of part/whole relations; it turned out that more study is needed. A further effort in the IEBL project, on other-anaphora resolution, has been published in (Bin Chen et al., 2008).
The full paper annotation in the MEDCo project covers 24 full papers. The I2R team annotated another 19 full papers after the MEDCo project. The 43 full papers contain 2,835 sentences and 243,664 words in total. There are 20,196 markables, of which 14,736 are anaphoric and 5,460 are discourse-new. Two full articles, consisting of 8,769 words, were used to calculate inter-annotator agreement, which is 0.807 in terms of Krippendorff's Alpha. This is slightly lower than the value obtained for abstracts, which appears reasonable, as annotating full papers is much more difficult than annotating abstracts. It is still much higher than the 0.67 usefulness threshold.
Sample Files
File 1 (Abstract): [xml file]    File 2 (Full Paper): [xml file]

(Safari users: please use 'view source' to see the XML code)
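For readers who want to reproduce counts such as the anaphoric and discourse-new totals from an annotation file, a minimal sketch follows. The schema is an assumption made for illustration and has not been checked against the sample files: it supposes inline `coref` elements with an `id`, and with `type` and `ref` attributes on markables that point back to an antecedent. The embedded sample text and the function name are hypothetical as well.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical inline annotation; element and attribute names are assumed.
SAMPLE = """<abstract>
<sentence><coref id="C1">The <coref id="C2">IL-2</coref> gene</coref> is expressed in T cells.</sentence>
<sentence><coref id="C3" type="PRON" ref="C1">It</coref> is regulated by <coref id="C4">NF-kappa B</coref>, <coref id="C5" type="APPOS" ref="C4">a transcription factor</coref>.</sentence>
</abstract>"""


def summarize_markables(xml_text):
    """Count markables, treating those with a `ref` attribute as anaphoric."""
    root = ET.fromstring(xml_text)
    by_type = Counter()
    discourse_new = 0
    # iter() walks the whole tree, so nested markables are counted too.
    for markable in root.iter("coref"):
        if markable.get("ref") is None:
            discourse_new += 1  # no antecedent link: first mention
        else:
            by_type[markable.get("type", "UNKNOWN")] += 1
    anaphoric = sum(by_type.values())
    return {
        "markables": anaphoric + discourse_new,
        "anaphoric": anaphoric,
        "discourse_new": discourse_new,
        "by_type": dict(by_type),
    }


print(summarize_markables(SAMPLE))
```

If the released files use standoff markables, as MMAX2 projects typically do, the same counting logic would apply after adapting the element and attribute lookups.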

 

Most annotations here were produced with MMAX2 (Christoph Müller, Michael Strube, 2006), which is available here.

The abstract portion of the corpus has been released here. The coreference links of genes or proteins from this portion were further refined by the BioNLP Shared Task 2011 organizers for the supporting task, the Protein/Gene Coreference Task. The full paper portion will be released as well.

REFERENCES

Su, Jian, Yang, Xiaofeng, Hong, Huaqing, Tateisi, Yuka, Tsujii, Jun'ichi. Coreference Resolution in Biomedical Texts: a Machine Learning Approach. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Dagstuhl Seminar Proceedings 08131 - Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives, 2008.

Bin Chen, Xiaofeng Yang, Jian Su, Chew Lim Tan, Other-Anaphora Resolution in Biomedical Texts with Automatic Mined Patterns, Proceedings of the 22nd International Conference on Computational Linguistics (CoLing 2008), pages 121-128, Manchester, August 2008.

Christoph Müller, Michael Strube (2006): Multi-Level Annotation of Linguistic Data with MMAX2. In: Sabine Braun, Kurt Kohn, Joybrato Mukherjee (Eds.): Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods. Frankfurt: Peter Lang, pp. 197-214. (English Corpus Linguistics, Vol. 3).

Passonneau, R. (2004). Computing reliability for coreference annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Lisbon.

Xiaofeng Yang, Jian Su, Guodong Zhou and Chew Lim Tan. An NP-Cluster Based Approach to Coreference Resolution. Proceedings of the 20th International Conference on Computational Linguistics (COLING'2004), pp. 226-232, Aug 23-27, 2004, Geneva, Switzerland.

Xiaofeng Yang, Guodong Zhou, Jian Su and Chew Lim Tan. Improving Noun Phrase Coreference Resolution by Matching Strings. Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP'2004), pp. 226-333, March 22-24, 2004, Sanya, China.