AUTOMATION OF SUBJECT INDEXING USING METHODS FROM ARTIFICIAL INTELLIGENCE

ZBW collects, processes and indexes economics literature from all over the world. To make these resources findable, ZBW annotates them with high-quality metadata, which serve as the data basis for the ZBW search portal EconBiz and can also be reused by other parties.

The number of publications, and of digital publications in particular, is rising in economics as in other fields. At the same time, current developments in computer science and information science open up new technological options, which we adopt and integrate into our indexing strategy in order to maintain the coverage and quality of our metadata.

Automating various workflows in the indexing process and integrating them seamlessly with intellectual cataloguing and subject indexing is a permanent task at ZBW. Accordingly, we focus on the question of how the research results obtained at ZBW can be made available systematically and sustainably as working tools in our production environment. We continuously evaluate automated procedures, develop them further in the context of our research activities, and transfer them into production. This is done in cooperation with the parties involved within ZBW, and we maintain an ongoing exchange with national and international partners from research and from other information infrastructure institutions that face similar issues and challenges.

Automation of Subject Indexing (AutoSE)

In a research-based project (AutoIndex, until 2018), ZBW developed a prototype solution for subject indexing based on open source machine learning software. The prototype combined several methods and thereby achieved higher performance. In 2019 the activities towards the automation of subject indexing were officially converted from a project into a permanent task and given a new name: AutoSE. The initial mission of AutoSE was to design and implement a suitable software architecture for deploying the results of our applied research as a production service for subject indexing.

Since 2020 AutoSE has used the open source toolkit Annif, developed by the National Library of Finland (NLF), as a framework for combining various state-of-the-art models. These include stwfsa, which was designed by ZBW to yield optimal results in combination with our vocabulary. Annif as the core component is complemented by mechanisms that we developed for hyperparameter optimisation, quality control and integration with the metadata flows at ZBW. The AutoSE team contributes to the dissemination and development of Annif by organizing Annif tutorials together with NLF and by regularly checking whether components and optimisation mechanisms developed at ZBW can be integrated into Annif for reuse by other institutions.
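The idea of combining several models can be illustrated with a minimal sketch. It is not the AutoSE or Annif implementation: the function names, the model names and the weighted-average rule are assumptions chosen for illustration only. Each model is assumed to return a score between 0 and 1 per descriptor, and the combined score is the weighted mean over all models.

```python
from collections import defaultdict


def combine_suggestions(model_scores, weights):
    """Weighted average of per-descriptor scores across several models.

    model_scores: {model_name: {descriptor: score}}  (hypothetical structure)
    weights:      {model_name: weight}               (hypothetical, e.g. tuned on validation data)
    A descriptor that a model does not suggest counts as score 0 for that model.
    """
    total_weight = sum(weights[name] for name in model_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    combined = defaultdict(float)
    for name, scores in model_scores.items():
        for descriptor, score in scores.items():
            combined[descriptor] += weights[name] * score
    return {descriptor: value / total_weight for descriptor, value in combined.items()}


def top_k(scores, k):
    """Return the k best (descriptor, score) pairs, highest score first."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
```

In a sketch like this, the weights are exactly the kind of hyperparameter that would be tuned on held-out, intellectually indexed records.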

The production AutoSE service was launched in spring 2021. Every hour, the service checks the EconBiz database for new resources, takes "short texts" (title, author keywords) as input, and generates from them descriptors from the Standardthesaurus Wirtschaft (STW) that aim to summarize the resource adequately. Various rule-based postprocessing routines safeguard and increase the quality of the generated metadata. The quality-tested descriptors are written back directly into the EconBiz database and are also provided via an API to the "Digitaler Assistent" (DA-3), the system that assists domain experts at ZBW in their subject indexing.
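The flow from short text to quality-checked descriptors could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual service: all function names are hypothetical, the suggestion model is passed in as a plain callable, and the three rules shown (minimum score, maximum number of descriptors, blocklist) are invented examples, since the text does not say which rules the service applies.

```python
def build_short_text(title, author_keywords):
    """Join title and author keywords into one input string (assumed format)."""
    parts = [title.strip()] + [keyword.strip() for keyword in author_keywords]
    return " ".join(part for part in parts if part)


def postprocess(suggestions, min_score=0.3, max_descriptors=5, blocked=frozenset()):
    """Apply illustrative rules to a list of (descriptor, score) pairs.

    The rules here are assumptions for the sketch:
    drop low scores, drop blocked descriptors, keep only the best few.
    """
    kept = [(d, s) for d, s in suggestions if s >= min_score and d not in blocked]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [descriptor for descriptor, _ in kept[:max_descriptors]]


def index_record(title, author_keywords, suggest, **rules):
    """Run one record through the sketch pipeline.

    suggest: any callable mapping a text to (descriptor, score) pairs;
    it stands in for the subject indexing model and is hypothetical.
    """
    text = build_short_text(title, author_keywords)
    return postprocess(suggest(text), **rules)
```

A scheduler would call something like `index_record` once per new resource found during the hourly check and then write the result back to the database.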

When developing such a service it is essential to work closely with the subject indexing experts of the institution. A central paradigm in machine learning is the human in the loop: an intelligent cooperation between humans and machines in order to solve problems. We use intellectually annotated data for training, and our vocabulary STW and its mappings are curated intellectually. In addition, the subject indexers at ZBW continually assess AutoSE suggestions during their daily work via the DA-3 system, and we use their feedback as a basis for the continuous improvement of our current solutions.

Research and Development in the context of AutoSE

In parallel, we continually conduct experiments in the context of our applied research in order to improve our methods. In addition to the methods that we already use, we evaluate state-of-the-art results from subfields of Artificial Intelligence such as Deep Learning, for example transformer models, which are especially promising for multilingual subject indexing. Besides generating descriptors automatically, neural networks can be used to optimize the way in which individual models are combined, and to estimate the expected subject indexing quality at the document level. The AutoSE team has already implemented such an approach, qualle, which is currently in use in the production AutoSE service. With such estimates, documents can be routed directly to the (automated or intellectual) subject indexing method that is most appropriate for the resource in question.
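Routing by estimated quality can be sketched as a simple threshold decision. The sketch does not use or reproduce qualle: the quality estimate is assumed to be a number between 0 and 1 that some estimator has already produced, and the function names and the threshold value of 0.7 are hypothetical.

```python
def route_document(predicted_quality, threshold=0.7):
    """Decide which indexing route a document takes.

    predicted_quality: estimated indexing quality in [0, 1] (assumed scale).
    threshold: hypothetical cut-off; in practice it would be chosen
    from evaluation data rather than fixed by hand.
    """
    if not 0.0 <= predicted_quality <= 1.0:
        raise ValueError("predicted_quality must lie in [0, 1]")
    return "automated" if predicted_quality >= threshold else "intellectual"


def split_by_route(estimates, threshold=0.7):
    """Group document ids by route, given {doc_id: predicted_quality}."""
    routes = {"automated": [], "intellectual": []}
    for doc_id, quality in estimates.items():
        routes[route_document(quality, threshold)].append(doc_id)
    return routes
```

Raising the threshold sends more documents to intellectual indexing and fewer to the automated route, so it acts as a trade-off between coverage and expected quality.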

Additional topics that lend themselves to automation include the extraction of structural elements from electronic full texts (keywords, abstracts, tables of contents) in order to facilitate cataloguing and subject indexing, and the extraction of frequent terms in the context of automated subject indexing for the continuous development of our vocabulary STW. In both areas, first results were obtained in two Master's theses written in the context of AutoIndex, and these topics will be explored further in future automation activities at ZBW.

Publications

For presentations and publications concerning AutoSE see the publication list of Anna Kasprzik.

Publications on the subject (including publications from the forerunner project AutoIndex) can also be found in the ZBW Publication Archive. Please search for the keywords "Automatic Subject Indexing".