Institute of Information Management (IIM)

Project lead

Prof. Dr. Philipp Schaer

Institute of Information Science (IWS)

Smart Harvesting II

Logo of the German Research Foundation (Image: DFG)

In the DFG-funded Smart Harvesting II project, we develop software-based solutions for the collection and processing of semi-structured web data, e.g., tables of contents of scientific journals or conference volumes for literature databases such as dblp or sowiport.

Because such raw data is highly heterogeneous, entering it manually is very labor- and time-intensive. Where technical support already exists, it takes the form of specialized programs called wrappers, which must be created and maintained by expert software developers. Part of our project is therefore to develop low-maintenance wrappers that non-programmers, e.g., librarians or documentalists, can operate easily and adapt to frequently redesigned, dynamic web applications.
For this purpose, we rely on the open-source query language OXPath, an extension of XPath that allows the interaction with a website to be imitated declaratively and data to be extracted selectively along the way. Initial experience from a workshop with librarians and from exercise groups with students has shown that basic knowledge of XML and XPath is sufficient to get started with creating and maintaining OXPath wrappers. The data obtained can be used in a variety of ways. In Smart Harvesting II, for example, additional Internet sources are integrated into database monitoring or used to clean up and prepare the data, e.g., by applying Named Entity Recognition to additional fields such as short biographies in order to find author names that are as complete as possible and thereby improve author disambiguation.
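
To make the idea of a declarative wrapper more concrete, the following minimal sketch uses plain XPath through Python's standard library, not OXPath itself, so it covers only the extraction part and none of the simulated page interaction. The markup, the class names, the WRAPPER layout, and the extract_records function are all hypothetical and chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table-of-contents markup (not taken from a real site).
TOC_XHTML = """
<html>
  <body>
    <ul class="toc">
      <li class="entry">
        <span class="title">Wrapper Maintenance in Practice</span>
        <span class="author">A. Example</span>
        <span class="author">B. Sample</span>
        <span class="pages">1-12</span>
      </li>
      <li class="entry">
        <span class="title">Harvesting Conference Volumes</span>
        <span class="author">C. Placeholder</span>
        <span class="pages">13-20</span>
      </li>
    </ul>
  </body>
</html>
""".strip()

# The wrapper is pure data: one XPath for the records and one per field.
# Adapting to a redesigned page means editing these strings, not the code.
WRAPPER = {
    "record": ".//li[@class='entry']",
    "fields": {
        "title": "./span[@class='title']",
        "authors": "./span[@class='author']",
        "pages": "./span[@class='pages']",
    },
}


def extract_records(markup, wrapper):
    """Apply the wrapper's XPath expressions and return one dict per record."""
    root = ET.fromstring(markup)
    records = []
    for node in root.findall(wrapper["record"]):
        record = {}
        for name, path in wrapper["fields"].items():
            # Every field is collected as a list, since a record may have
            # several matches (e.g., multiple authors).
            record[name] = [(el.text or "").strip() for el in node.findall(path)]
        records.append(record)
    return records


records = extract_records(TOC_XHTML, WRAPPER)
for rec in records:
    print(rec)
```

Keeping the path expressions separate from the extraction logic mirrors the low-maintenance goal described above: someone with basic XML and XPath knowledge could adjust the wrapper without touching the program. Real web pages are rarely well-formed XML, so an actual wrapper would need an HTML-tolerant engine.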

At a Glance

Category          Description
Research project  Smart Harvesting II
Project lead      Prof. Dr. Philipp Schaer
Faculty           Information Science and Communication Studies
Institutes        Institute of Information Management
                  Institute of Information Science
Persons involved  Mandy Neumann
Partners          dblp (Dr. Michael Ley) (http://dblp.uni-trier.de)
                  GESIS (Jun.-Prof. Dr. Brigitte Mathiak) (http://www.gesis.org)
Sponsors          German Research Foundation (DFG), funding programme "Electronic Publications"
Duration          2016-2019
