Automated Genre Classification for Ingest and Appraisal Metadata

Yunhyong Kim


Authors: Yunhyong Kim and Seamus Ross
Affiliation: Digital Curation Centre & HATII, University of Glasgow


Metadata are vital for the effective management and reuse of scientific information. Manually finding, examining, and integrating every piece of information relevant to a scientific research topic is a time-consuming and expensive activity. Having information such as the type, author, content, and subject of an object, as well as references to it, readily available would facilitate a focused research network. It follows that the automatic extraction of such metadata would be highly useful for the scientific community. Within the digital preservation and curation community, there is a consensus that persistent, cost-contained, manageable, and accessible digital collections depend on the automation of appraisal, selection, and ingest of digital materials ([5],[8]). Automating the extraction of metadata would be an invaluable step.

In ERPANET's Packaged Object Ingest Project (POIP) ([3]) it was indicated that, even where it proved possible to use auto-extraction tools (see [7], and more recently [6]) to acquire the technical metadata, the ingest process remained labour intensive due to a lack of automatic extraction tools for descriptive, structural, or semantic information. Although new tools have emerged (e.g. [2]), they tend to be applicable only to structured documents (e.g. HTML or XML) and do not provide a means of extracting a sufficiently rich level of metadata (e.g. content summary).

In this paper we demonstrate genre classification as a valuable first step in automatic semantic metadata extraction. For the research community, genre classification enables information retrieval and mining within designated genres. For ingest, genre classification would lead to better performance in further metadata extraction by enabling a concentrated examination of specific genres based on known properties and behaviours. Initially, we aim to recognise sixty genres or, at the least, to recognise groups of genres that together comprise the sixty genres. We propose a multi-faceted approach built on looking at the documents from five perspectives: as an object exhibiting a specific visual format, as a linear layout of strings with a characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. Here we describe some experiments in which image processing was combined with language models; they are meant to be indicative of the promise underlying this multi-faceted approach. The method employs Maximum Entropy and Naive Bayes models to statistically examine features extracted from black and white images and plain texts of PDF documents. There have been other studies in genre classification (e.g. [1]), but the present work distinguishes itself by attempting to tackle a wider range of genres using a multi-faceted approach on PDF files.
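To make the statistical step concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, applied to word counts from plain text. It is an illustration under stated assumptions, not the implementation used in the experiments: the genre labels, the toy training strings, and the function names `train_naive_bayes` and `classify` are all hypothetical, and the image-derived features and the Maximum Entropy model are not covered.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Collect per-genre word counts from (genre, text) pairs."""
    word_counts = defaultdict(Counter)   # genre -> word -> count
    doc_counts = Counter()               # genre -> number of training documents
    vocabulary = set()
    for genre, text in labelled_docs:
        tokens = text.lower().split()
        word_counts[genre].update(tokens)
        doc_counts[genre] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the genre with the highest log-posterior under add-one smoothing."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_genre, best_score = None, -math.inf
    for genre in doc_counts:
        score = math.log(doc_counts[genre] / total_docs)  # log prior
        denom = sum(word_counts[genre].values()) + len(vocabulary)
        for token in text.lower().split():
            if token in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[genre][token] + 1) / denom)
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre


# Hypothetical toy training data; not taken from the paper's corpus.
TRAINING = [
    ("scientific article", "abstract introduction method results references"),
    ("scientific article", "abstract experiment results discussion references"),
    ("business letter", "dear sir regards sincerely yours"),
    ("business letter", "dear madam enclosed invoice sincerely"),
]

model = train_naive_bayes(TRAINING)
print(classify("abstract results references", model))
print(classify("dear sir sincerely", model))
```

In a fuller pipeline, the token counts would be replaced or supplemented by whatever features are extracted from the page images and text of each PDF; the scoring logic would stay the same.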


Bibliography
[1] Boese E.S.: 'Stereotyping the web: genre classification of web documents', Master's thesis, Colorado State University (2005).
[2] DC-dot, Dublin Core metadata editor, http://www.ukoln.ac.uk/metadata/dcdot/
[3] ERPANET: Packaged Object Ingest Project, http://www.erpanet.org/events/2003/rome/presentations/ross_rusbridge_pres.pdf
[4] Han H., Giles L., Manavoglu E., Zha H., Zhang Z. and Fox E. A.: 'Automatic Document Metadata Extraction using Support Vector Machines', Proc. 3rd ACM/IEEE-CS conf. Digital libraries (2000) 37-48.
[5] Hedstrom M., Ross S., Ashley K., Christensen-Dalsgaard B., Duff W., Gladney H., Huc C., Kenney A. R., Moore R., and Neuhold E.: 'Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation', Report of the European Union DELOS and US National Science Foundation Workgroup on Digital Preservation and Archiving (2003) http://delos-noe.iei.pi.cnr.it/activities/internationalforum/Joint-WGs/digitalarchiving/Digitalarchiving.pdf.
[6] National Archives UK, DROID (Digital Object Identification), http://www.nationalarchives.gov.uk/aboutapps/pronom/droid.htm
[7] National Library of New Zealand, Metadata Extraction Tool, http://www.natlib.govt.nz/en/whatsnew/4initiatives.html#extraction
[8] Ross S. and Hedstrom M.: 'Preservation Research and Sustainable Digital Libraries', International Journal of Digital Libraries (Springer) (2005) DOI: 10.1007/s00799-004-0099-3.

keywords: metadata, genre classification, digital object, digital library, PDF, Maximum Entropy, Naive Bayes