 |
MEDCo Annotation Project
 |
|
 |
Demand for information extraction and text mining has been increasing rapidly
in the biological and medical sciences because of the information overload caused by fast
advances in genome-related fields. Co-reference resolution is one of the critical components
for automatically extracting useful information from texts written by scientists.
 |
|
 |
This project annotates co-reference information in MedLine abstracts
(the GENIA Collection) and in full biology papers in the same domain. It will provide an
important resource for co-reference resolution in the biomedical domain and will benefit
information extraction and other text-mining applications.
 |
|
 |
This is a joint project of the Institute for Infocomm Research (I2R) team,
Singapore, and Tsujii Laboratory, Tokyo University. It has two phases: abstract
annotation (Feb 2003 - Aug 2006) and full paper annotation (Dec 2006 - Nov 2007). Tsujii
Lab provides the funding support and the biology validation of the linguistic annotation
done by the I2R team. Dr. Tateisi Yuka from Tsujii Lab coordinated the biology validation
of the abstract annotation with 5 biology Master's and PhD students from Tokyo University.
Dr. Jin-Dong Kim from Tsujii Lab coordinated the second phase, full paper annotation.
|
 |
|
 |
The following members of the I2R team are involved in the linguistic annotation.
|
 |
|
 |
 |
Members

Project Manager: Su Jian
Annotation Scheme Designer: Hong Hua Qing, with inputs from Su Jian, Yang Xiao Feng and Zhou Guo Dong
Annotators: Chong Lai Khar, Fan Zhen Zhen, Yeo Poh Khim, Hong Hua Qing, Ong Peishan Jasmine, Heng Wei Chu
Programming Support: Zhang Jie, Chen Bin, Yang Xiao Feng
|
 |
|
 |
So far we have annotated 1,999 abstracts, comprising 16,819 sentences and 460k words.
There are 45,982 markables, of which 32,464 are anaphoric and
13,518 are discourse-new. Four types of co-reference relations are annotated:
identity (IDENT), pronominal (PRON), appositive (APPOS) and relative (RELAT).
Inter-annotator agreement on 15 abstracts is 0.83 in terms of Krippendorff's Alpha.
This indicates that the corpus is of high quality for coreference resolution
research, as it is higher than 0.67, the threshold for a corpus to be useful
according to (Passoneau, R. 2004).
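
As a rough illustration of how an agreement figure of this kind is computed, the sketch below implements Krippendorff's Alpha for nominal labels in Python. It assumes a simple match/mismatch distance between labels; the distance metric and the unit of comparison actually used for this corpus are not described on this page, and the function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    `units` is a list of lists: one inner list per annotated unit, holding
    the labels given by the annotators who rated that unit (missing ratings
    are simply left out). Units with fewer than two ratings are ignored.
    """
    # Coincidence matrix: each ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in the unit.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Hypothetical toy data: two annotators labelling four markables.
toy_units = [
    ["IDENT", "IDENT"],
    ["PRON", "PRON"],
    ["IDENT", "PRON"],
    ["IDENT", "IDENT"],
]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 3))  # prints 0.533
print(alpha > 0.67)     # prints False: below the usefulness threshold cited above
```

On real annotation data the inner lists would hold each annotator's decision for the same markable, and the resulting value would be compared against the 0.67 threshold in the same way.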
|
 |
|
 |
Using part of the annotations, a coreference resolver was built
(Xiao Feng Yang et al., 2004a), (Xiao Feng Yang et al., 2004b) in
another project, Information Extraction on Biology Literature (IEBL).
|
 |
|
 |
During the abstract annotation, we explored the annotation of part/whole
relations and found that more study is needed. A further effort in the
IEBL project, on other-anaphora resolution, has been published
in (Chen Bin et al., 2008).
|
 |
|
 |
The full paper annotation covers 43 full papers, with 2,835 sentences and 243,664 words.
There are 20,196 markables, of which 14,736 are anaphoric and 5,460 are discourse-new. Two full articles totalling 8,769 words were used to calculate inter-annotator agreement, which is 0.807 in terms of Krippendorff's Alpha. This is slightly lower than the figure for abstracts, which appears reasonable because annotating full papers is much more difficult than annotating abstracts. It is still well above the 0.67 usefulness threshold.
 |
|
 |
 |
Sample Files
 |
 |
File 1 (Abstract): [xml file]
File 2 (Full Paper): [xml file] (Safari users: please use 'view source' to see the XML code)
 |
The corpus release information will be announced on this website. You can also email
Dr. Su Jian so that we can inform you once
the corpus is ready for release.
|
 |
REFERENCES
|
 |
Passoneau, R. (2004). Computing reliability for coreference annotation. In Proceedings
of the International Conference on Language Resources and Evaluation (LREC), Lisbon.
|
 |
Xiaofeng Yang, Jian Su, Guodong Zhou and Chew Lim Tan. A NP-Cluster Based Approach
to Coreference Resolution. Proceedings of the 20th International Conference on
Computational Linguistics (COLING'2004), pages 226-232, Aug 23-27, 2004, Geneva, Switzerland.
|
 |
Xiaofeng Yang, Guodong Zhou, Jian Su and Chew Lim Tan. Improving Noun Phrase Coreference
Resolution by Matching Strings. Proceedings of the 1st International Joint Conference on Natural
Language Processing (IJCNLP'2004), pages 226-333, March 22-24, 2004, Sanya, China.
|
 |
Chen Bin, Xiaofeng Yang, Jian Su and Chew Lim Tan. Other-Anaphora Resolution in Biomedical
Texts with Automatic Mined Patterns. Proceedings of the 22nd International Conference on
Computational Linguistics (CoLing 2008), pages 121-128, Manchester, August 2008.
|
|