Automatic Metadata Extraction From Museum Specimen Labels

P. Bryan Heidorn, Qin Wei

Abstract


This paper describes the information properties of museum
specimen labels and machine learning tools to automatically extract Darwin Core
(DwC) and other metadata from these labels processed through Optical Character
Recognition (OCR). The DwC is a metadata profile describing the core set of access
points for search and retrieval of natural history collections and observation
databases. Using the HERBIS Learning System (HLS) we extract 74 independent elements
from these labels. The automated text extraction tools are provided as a web service
so that users can reference digital images of specimens and receive back an extended
Darwin Core XML representation of the content of the label. This automated
extraction task is made more difficult by the high variability of museum label
formats, OCR errors, and the open-class nature of some elements. In this paper we
introduce our overall system architecture and variability-robust solutions,
including the application of Hidden Markov and Naïve Bayes machine learning models,
data cleaning, use of field element identifiers, and specialist learning models. The
techniques developed here could be adapted to any metadata extraction situation with
noisy text and weakly ordered elements.
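
As a rough illustration of the Naïve Bayes component mentioned above, the following minimal sketch assigns a metadata element to a short run of label tokens. It is not the HLS implementation: the training fragments, the three element names, and the functions `train` and `classify` are hypothetical, and add-one smoothing is assumed so that tokens never seen in training (for example, OCR misreadings) do not zero out a class.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens from a label fragment, element name).
# The element names are illustrative and are not the paper's 74 elements.
TRAINING = [
    (["Coll.", "J.", "Smith"], "collector"),
    (["Collected", "by", "A.", "Jones"], "collector"),
    (["Quercus", "alba", "L."], "scientificName"),
    (["Acer", "rubrum", "L."], "scientificName"),
    (["Champaign", "Co.", "Illinois"], "locality"),
    (["New", "Haven", "Co.", "Connecticut"], "locality"),
]

def train(examples):
    """Count fragments per element and token occurrences per element."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        for tok in tokens:
            word_counts[label][tok.lower()] += 1
            vocab.add(tok.lower())
    return label_counts, word_counts, vocab

def classify(model, tokens):
    """Return the element with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # class prior
        # Add-one smoothing: unseen tokens get a small nonzero probability.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok.lower()] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
print(classify(model, ["Coll.", "B.", "Brown"]))
print(classify(model, ["Cook", "Co.", "Illinois"]))
```

A bag-of-words classifier like this ignores token order, which suits weakly ordered label elements; a Hidden Markov model would additionally use the sequence in which elements tend to appear.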
