Projects
MEDCo Annotation Project
Demand for information extraction and text mining has been increasing rapidly in the biological and medical sciences because of the information overload caused by fast advances in genome-related fields. Co-reference resolution is one of the critical components for automatically extracting useful information from texts written by scientists.
This project annotates co-reference information in MedLine abstracts (the GENIA portion). It will provide an important resource for co-reference resolution in the biomedical domain and will benefit information extraction and other text-mining applications.

So far we have annotated 1,999 abstracts, comprising 16,819 sentences and 460k words. The whole corpus contains 45,982 markables, of which 32,464 are anaphoric and 13,518 are discourse-new. Four types of co-reference relations are annotated: identity (IDENT), pronominal (PRON), appositive (APPOS), and relative (RELAT).

This is a joint project of the Institute for Infocomm Research (I2R) team, Singapore, and Tsujii Laboratory, Tokyo University. Tsujii Lab provides funding support and biology validation of the linguistic annotation done by the I2R team. Dr. Tateisi Yuka from Tsujii Lab coordinated the biology validation with 5 biology Master's and PhD students from Tokyo U. The following personnel are involved in the linguistic annotation.

Personnel
Project Manager:
Su Jian
Annotation Scheme Designer:
Hong Hua Qing
Annotators:
Fan Zhen Zhen
Yeo Poh Khim
Hong Hua Qing
Chong Lai Khar
Ong Peishan Jasmine
Heng Wei Chu
Consultants:
Su Jian
Zhou Guo Dong
Yang Xiao Feng
Programming Support:
Zhang Jie
Yang Xiao Feng
Annotation Scheme
Sample Files
File 1:   [xml file]

 

Corpus release information will be announced on this website. You can also email Dr. Su Jian so that we can inform you once the corpus is ready for release.