Automatic Reference Resolution in Astronomy Articles
Yunhyong Kim, Humanities Advanced Technology Information Institute, University of Glasgow, UK
Authors: Yunhyong Kim (HATII, University of Glasgow) and Bonnie Webber (Department of Informatics, University of Edinburgh)
Scientific data comes in many different forms: it may reside in a database, in a scientific article, or even in memos, notes or emails sent on the spur of the moment. To make use of any such piece of scientific communication, it is important to put it into context by linking it to other related information in a comprehensible network. This leads us to consider methods for creating links; one such method lies in the world of citations. In academic research, the citation index of a paper or article, defined as the number of times the article is cited by other articles, plays a role in determining the impact of the research. Given no other information, a fast way to become acquainted with the core work in a chosen field is to look at the articles that lie in the intersection of the articles cited by several papers within the area. Linking articles by way of citations can therefore be quite useful.
In fact, publishers of on-line journals (e.g. [3], [8]) have already implemented active hyperlinks to the articles cited within the articles that appear in their journals, and services such as CiteSeer (e.g. [4]) list other papers which cite and are cited by the retrieved article. However, this is mostly confined to linking explicit citations (see [2] and [6] for information on automatic citation linking) and remains mostly in the realm of research articles. Where the context of a citation is as important as the fact that a work has been cited (as in [7]), it is as important to identify and link subsequent implicit citations in the form of anaphoric pronouns as it is to identify and link explicit citations. Few attempts have been made to do this, either in the scientific literature or in any other form of digital documentation.
This paper presents results on the automatic detection and linking of implicit pronominal references to citations within articles in Astronomy. The benefits of detecting and linking implicit references anaphoric to citations are manifold. It would enable the retrieval not only of the articles cited within a paper but also of the sentences which relate the content of the cited articles, thereby providing context for the citation. It would help to measure the salience of articles, based on the intuition that articles referenced many times are more likely to be important and closely related to the current article than those which are not. It would help the summarisation of an article (cf. [9]). Finally, it would facilitate information mining, because keywords of the topic tend to appear within reference sentences (see [7]).
Although the work relates to scientific articles, one can also envision applying the detection method outlined in this paper to a wider domain of texts, in order to mine noun phrases which constitute informal references to scientific work. This may provide a means of distinguishing, for example, emails or letters with scientific topics from those without, leading to information which would normally escape the scientific community.
Note that the work in the paper focusses on the pronoun "they" because it is the most common form of implicit citation in articles on astronomy. The approach relies on the categories of the verbs following the pronoun and the distance to previous citations. The verb categories (extracted using the part-of-speech tagger CANDC ([5]) and a home-made chunker) were statistically modelled using MaxEnt ([1], [10]).
The accuracy achieved (best performance of 96.09% on development test data, 92.41% on fresh data, 84.81% on a small sample of Biology articles, and 94.93% on the same Biology articles with a reduced classifier) suggests that the method is effective and encourages further investigation of the approach.
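As an illustration only, the general shape of the approach (a verb category for the verb following "they", a distance to the previous citation, and a maximum entropy model over those features) might be sketched as below. This is a minimal sketch under stated assumptions, not the system evaluated here: the verb categories, the distance buckets, the toy training examples and all function names are hypothetical, and the binary maximum entropy model is fitted with plain gradient ascent in pure Python rather than with the toolkit cited in [10].

```python
import math

# Hypothetical verb categories; the actual categories used in the work
# are not listed in this abstract.
VERB_CATEGORY = {
    "find": "report", "show": "report", "argue": "report", "suggest": "report",
    "orbit": "physical", "rotate": "physical", "collide": "physical",
}


def features(verb, sentences_since_citation):
    """Map one occurrence of "they" to a list of binary indicator features."""
    category = VERB_CATEGORY.get(verb, "other")
    # Hypothetical distance buckets (in sentences) to the previous citation.
    if sentences_since_citation <= 1:
        distance = "near"
    elif sentences_since_citation <= 3:
        distance = "mid"
    else:
        distance = "far"
    return ["verb=" + category, "dist=" + distance]


def predict(weights, feats):
    """Probability that this "they" refers back to a citation."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=200, rate=0.5):
    """Fit a binary maximum entropy (logistic regression) model.

    Uses stochastic gradient ascent on the log-likelihood: each active
    feature weight moves by rate * (observed label - predicted probability).
    """
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            p = predict(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + rate * (label - p)
    return weights


# Invented toy data: (verb after "they", sentences since last citation, label),
# where label 1 means the pronoun is anaphoric to a citation.
EXAMPLES = [
    (features("find", 1), 1),
    (features("show", 0), 1),
    (features("argue", 2), 1),
    (features("orbit", 5), 0),
    (features("rotate", 4), 0),
    (features("collide", 6), 0),
]

weights = train(EXAMPLES)
```

On this toy data, `predict(weights, features("suggest", 1))` comes out above 0.5 and `predict(weights, features("orbit", 5))` below 0.5, since the reporting-verb and near-citation features are only ever seen with positive examples. Real feature sets and training data would of course be far richer.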
References
[1] Berger, Adam and Della Pietra, Stephen and Della Pietra, Vincent, 1996, "A maximum entropy approach to natural language processing", Computational Linguistics, Vol 22, Number 1, 39-71.
[2] Bergmark, Donna, 2000, "Automatic Extraction of Reference Linking Information from Online Documents", Cornell Computer Science Department, Technical Report TR 2000-1821.
[3] Blackwell Publishing, http://www.blackwellpublishing.com/
[4] CiteSeer, http://citeseer.ist.psu.edu/
[5] Curran, James and Clark, Stephen, 2003, "Investigating GIS and Smoothing for Maximum Entropy Taggers", Proceedings, Annual Meeting, European Chapter of the Assoc. of Computational Linguistics, 91-98.
[6] Hitchcock, S and Carr, L and Harris, S and Hey, J M N and Hall, W, 1997, "Citation Linking: Improving access to online journals", Second ACM International Conference on Digital Libraries, 115-122.
[7] Nakov, P., Shwartz, A., Hearst, M., 2004, "Citances: Citation Sentences for Semantic Analysis of Bioscience Text", Workshop on Search and Discovery in Bioinformatics at SIGIR'04, Sheffield, UK.
[8] PubMed, http://www.ncbi.nih.gov/entrez/
[9] Teufel, Simone and Moens, Marc, 2002, "Summarising Scientific Articles - Experiments with Relevance and Rhetorical Status", Computational Linguistics, Vol 28, Number 4, http://www.cl.cam.ac.uk/users/sht25/publications.html
[10] Le, Zhang, 2004, Maximum Entropy Toolkit for Python and C++, http://www.nlplab.cn/zhangle/maxent_toolkit.htm
keywords: data, information, astronomy, citation, reference, context, linking, detection