Accessing Natural History

Discoveries in Data Cleaning, Structuring, and Retrieval

 
 

Cultural heritage institutions harbour a vast treasure of information, but it is often confined to the walls of the archive, museum, or library. This thesis is about improving access to cultural heritage collections through digitisation and enrichment.

This thesis investigates three themes that improve information access in a digital collection from the Dutch National Museum for Natural History, Naturalis: data cleaning, information structuring, and object retrieval.


Two methods for automatic cleanup of databases are presented: a data-driven method and a knowledge-driven method. Both detect a large number of inconsistencies in the data, but the experiments show that they detect different types of errors and are thus complementary.


Next, an automatic ontology construction method is presented. This method takes domain information that is implicit in the Naturalis database and makes it explicit by linking it to the online encyclopaedia Wikipedia.


Finally, a system for data retrieval, Mira, is presented, in which three different types of domain knowledge are used in three different stages of the retrieval process. First, knowledge from external resources and rules is used to interpret queries and reformulate them more precisely. Then, the same types of knowledge are used to expand queries with synonyms to increase recall. To rank results by relevance, knowledge from the domain ontologies and query analysis is used. Mira provides a significant improvement in data access, as it decreases the number of unanswered queries.
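As a rough illustration of the synonym expansion stage only, the sketch below expands a query using a small hand-made synonym table. The table, its entries, and the function name `expand_query` are hypothetical and are not taken from the thesis; a real system would draw synonyms from external resources or a domain ontology.

```python
# Hypothetical synonym table with illustrative entries only.
# These are assumptions for the example, not data from the thesis.
SYNONYMS = {
    "frog": ["anura"],
    "netherlands": ["holland"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the lowercased query terms plus any known synonyms.

    Order is preserved and duplicates are dropped, so each term is
    followed directly by its synonyms.
    """
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

Matching documents against any of the expanded terms, rather than only the original ones, is what raises recall; terms without an entry in the table are passed through unchanged.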

 

Ph.D. Thesis, Tilburg University, 2010

ISBN: 978-90-8559-027-9

197 pages

