Automation of Subject Indexing using Methods from Artificial Intelligence

The ZBW collects, processes and indexes economics literature from all over the world. To make these resources findable, the ZBW annotates them with high-quality metadata, which serves as the data basis for the ZBW search portal EconBiz and can also be reused by other parties.

The number of publications, and of digital publications in particular, is rising in economics as in other fields. At the same time, new technological options are becoming available, which we take up and evaluate. Suitable methods are refined in our research activities and turned into usable indexing tools, so that the scope and quality of our metadata can be maintained.

Automation of Subject Indexing (AutoSE)

In the AutoIndex project, a prototype for subject indexing based on open source machine learning software was developed at the ZBW. In 2019, the automation of subject indexing was officially made a permanent task: AutoSE. Its additional mission was to implement a software architecture that deploys selected machine learning methods from our applied research as a production service for subject indexing.

Since 2020, AutoSE has been using the open source toolkit Annif, developed by the National Library of Finland (NLF), as a framework for combining various state-of-the-art models, including stwfsa, which was designed by the ZBW. Annif, as the core component, is complemented by mechanisms we developed for hyperparameter optimisation, quality control and integration with the metadata flows at the ZBW. The team contributes to the dissemination and development of Annif by organising Annif tutorials together with the NLF and by regularly checking whether components implemented at the ZBW can be integrated into Annif for reuse by other institutions.
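
As a rough illustration of what combining several models can mean, the sketch below takes a weighted average of the per-descriptor scores produced by different models. It is a simplified sketch under stated assumptions, not the AutoSE or Annif implementation: the function name combine_scores, the model names, the weights and the descriptor labels are all hypothetical.

```python
from collections import defaultdict


def combine_scores(model_outputs, weights=None):
    """Combine per-descriptor scores from several models by weighted average.

    model_outputs: dict mapping model name -> {descriptor label: score in [0, 1]}
    weights: dict mapping model name -> non-negative weight (default: equal weights)
    Returns a list of (descriptor, combined score), best first.
    """
    if weights is None:
        weights = {name: 1.0 for name in model_outputs}
    total = sum(weights[name] for name in model_outputs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    combined = defaultdict(float)
    for name, scores in model_outputs.items():
        for descriptor, score in scores.items():
            # A descriptor that a model does not suggest contributes 0 for that model.
            combined[descriptor] += weights[name] * score / total
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of two models for one document (labels are illustrative).
outputs = {
    "lexical": {"Inflation": 0.8, "Monetary policy": 0.4},
    "statistical": {"Inflation": 0.6, "Labour market": 0.2},
}
for descriptor, score in combine_scores(outputs):
    print(f"{descriptor}: {score:.2f}")
```

In this form the weights are exactly the kind of parameter that hyperparameter optimisation could tune against intellectually indexed data.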

The AutoSE service has been in production since July 2021. Every hour, the service checks the EconBiz database for new resources and assigns suitable descriptors from the Standard Thesaurus for Economics (STW) based on text drawn from the metadata. The quality-tested descriptors are written back into the database directly and are also provided via an API to the "Digitaler Assistent" (DA-3) to assist domain experts in their subject indexing.
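
The processing cycle described above can be sketched as a single pass over a batch of new resources. This is a minimal sketch under stated assumptions, not the AutoSE code: the Resource class, run_indexing_pass, the score threshold standing in for quality control, and the keyword_stub model are all hypothetical, and the hourly scheduling and the database access are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

Suggestion = Tuple[str, float]  # (descriptor label, confidence score)


@dataclass
class Resource:
    resource_id: str
    text: str  # text drawn from the metadata, e.g. the title
    descriptors: List[str] = field(default_factory=list)


def run_indexing_pass(
    new_resources: Iterable[Resource],
    suggest: Callable[[str], List[Suggestion]],
    min_score: float = 0.5,
    max_descriptors: int = 5,
) -> List[Resource]:
    """Annotate one batch of new resources and return those that received descriptors."""
    annotated = []
    for resource in new_resources:
        ranked = sorted(suggest(resource.text), key=lambda pair: pair[1], reverse=True)
        # Hypothetical quality gate: keep only sufficiently confident suggestions.
        accepted = [label for label, score in ranked if score >= min_score]
        accepted = accepted[:max_descriptors]
        if accepted:
            resource.descriptors = accepted
            annotated.append(resource)
    return annotated


def keyword_stub(text: str) -> List[Suggestion]:
    # Hypothetical stand-in for a real suggestion model.
    if "inflation" in text.lower():
        return [("Inflation", 0.9), ("Labour market", 0.3)]
    return []


batch = [
    Resource("r1", "Inflation expectations in the euro area"),
    Resource("r2", "Untitled working paper"),
]
for resource in run_indexing_pass(batch, keyword_stub):
    print(resource.resource_id, resource.descriptors)
```

Passing the suggestion function in as a parameter keeps the pass independent of any particular model, so a stub can be swapped for a real backend without changing the loop.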

It is essential for us to collaborate closely with the subject indexing experts. In machine learning, a central paradigm is the human in the loop: a cooperation between humans and machines in order to solve problems. We use intellectually annotated data for training, our vocabulary STW and its mappings are curated intellectually, and the subject indexers continually assess AutoSE suggestions during their daily work via the DA-3 system, which we use as a tool for quality assurance.

Research and Development in the context of AutoSE

In parallel, we continually run experiments as part of our applied research to improve our methods. In addition to the methods already in use, we evaluate state-of-the-art approaches from Artificial Intelligence such as Deep Learning, for example transformer models, which are especially promising for multilingual subject indexing. Besides generating descriptors automatically, neural networks can be used to optimise how individual models are combined and to estimate the expected subject indexing quality at the document level, so that each document can be routed directly to the most appropriate subject indexing method, automated or intellectual. We have already implemented such a quality estimation approach, qualle, which is currently in use in the production service.
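
The routing idea can be illustrated with a small sketch: each document gets a predicted quality score, and documents below a threshold are sent to intellectual indexing. This is a simplified sketch under stated assumptions, not the qualle implementation: the function names, the threshold value and the length-based estimator are hypothetical.

```python
def route_document(predicted_quality, threshold=0.7):
    """Choose an indexing route from a predicted quality score in [0, 1]."""
    return "automated" if predicted_quality >= threshold else "intellectual"


def split_by_route(documents, estimate_quality, threshold=0.7):
    """Group documents by the route that the quality estimate selects."""
    routes = {"automated": [], "intellectual": []}
    for document in documents:
        route = route_document(estimate_quality(document), threshold)
        routes[route].append(document)
    return routes


def length_based_estimate(document):
    # Hypothetical stand-in for a trained quality estimator: longer text
    # is assumed to give the models more to work with, capped at 1.0.
    return min(len(document.split()) / 10, 1.0)


documents = [
    "Monetary policy transmission and inflation expectations in small open economies",
    "Annual report",
]
print(split_by_route(documents, length_based_estimate))
```

The threshold controls the trade-off: a higher value sends more documents to the subject indexers, a lower value accepts more automated results.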

Publications

For presentations and publications concerning AutoSE see the publication list of Anna Kasprzik.

Publications on the subject (including publications from the forerunner project AutoIndex) can also be found in the ZBW Publication Archive. Please search for the keywords "Automatic Subject Indexing".