ZBW RESEARCH COLLOQUIUM

Nils Witt
Understanding the Influence of Hyperparameters on Text Embeddings for Text Classification Tasks
15 September 2017, 10:00, ZBW Kiel, Room B-024

Abstract:
Many applications in the natural language processing domain require the tuning of machine learning algorithms, which involves the adaptation of hyperparameters. We performed experiments in which we systematically varied the hyperparameter settings of text embedding algorithms to gain insights into the influence and interrelation of hyperparameters on model performance in a text classification task using text embedding features. For some parameters (e.g., the size of the context window) we could not find an influence on accuracy, while others (e.g., the dimensionality of the embeddings) strongly influence the results but have a range in which the results are nearly optimal. These insights help researchers and practitioners find sensible hyperparameter configurations for research projects based on text embeddings, which reduces the parameter search space and the amount of (manual and automatic) optimization time.
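The kind of sweep described above could look like the following sketch, assuming gensim's Word2Vec and scikit-learn (neither tool is named in the abstract) and a toy corpus:

```python
# Sketch: vary embedding dimensionality and context window, then measure
# classification accuracy on averaged word vectors. Tools and toy data are
# illustrative assumptions; the abstract does not name an implementation.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

docs = [["cheap", "flight", "deal"], ["stock", "market", "rally"],
        ["holiday", "travel", "tips"], ["bond", "yields", "fall"]] * 10
labels = [0, 1, 0, 1] * 10

def doc_vector(model, tokens):
    """Average the word vectors of a document (zero vector if no hits)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

for size in (25, 50, 100):        # embedding dimensionality
    for window in (2, 5, 10):     # context window size
        model = Word2Vec(docs, vector_size=size, window=window,
                         min_count=1, epochs=20, seed=1)
        X = np.array([doc_vector(model, d) for d in docs])
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X, labels, cv=3).mean()
        print(f"size={size:3d} window={window:2d} accuracy={acc:.3f}")
```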

Lukas Galke
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical Information Retrieval - Full paper
Reranking-based Recommender System with Deep Learning - Short paper
20 September 2017, 10:15, ZBW Kiel, Room B-024

Abstract full paper:
We assess the suitability of word embeddings for practical information retrieval scenarios. We assume that users issue ad-hoc short queries and that we return the first twenty retrieved documents after applying a Boolean matching operation between the query and the documents. We compare the performance of several techniques that leverage word embeddings in the retrieval model to compute the similarity between the query and the documents, namely word centroid similarity, paragraph vectors, Word Mover's distance, as well as our novel inverse document frequency (IDF) re-weighted word centroid similarity. We evaluate the performance using the ranking metrics mean average precision, mean reciprocal rank, and normalized discounted cumulative gain. Additionally, we inspect the retrieval models' sensitivity to document length by using either only the title or the full text of the documents for the retrieval task. We conclude that word centroid similarity is the best competitor to state-of-the-art retrieval models. It can be further improved by re-weighting the word frequencies with IDF before aggregating the respective word vectors of the embedding. The proposed cosine similarity of IDF re-weighted word vectors is competitive with the TF-IDF baseline and even outperforms it in the news domain by 15% (relative).
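A minimal sketch of the IDF re-weighted word centroid similarity: each word vector is scaled by its inverse document frequency before averaging, and documents are ranked by cosine similarity to the query centroid. The toy embedding and data below are illustrative, not from the paper:

```python
# Sketch of IDF re-weighted word centroid similarity (toy data; the random
# vectors stand in for a trained word embedding).
import math
import numpy as np

rng = np.random.default_rng(0)
vocab = ["economy", "inflation", "market", "football", "match"]
embedding = {w: rng.normal(size=50) for w in vocab}
docs = [["economy", "inflation"], ["market", "economy"], ["football", "match"]]

def idf(term):
    """Smoothed inverse document frequency of a term in the collection."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def idf_centroid(tokens):
    """IDF-weighted average of the word vectors of the given tokens."""
    vecs = [idf(t) * embedding[t] for t in tokens if t in embedding]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

query = idf_centroid(["inflation", "market"])
scores = sorted(((cosine(query, idf_centroid(d)), d) for d in docs), reverse=True)
for score, doc in scores:
    print(f"{score:+.3f}  {doc}")
```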

Abstract short paper:
An enormous volume of scientific content is published every year, exceeding by far what a scientist can read in her entire life. To address this problem, we have developed and empirically evaluated a recommender system for scientific papers based on Twitter postings. In this paper, we improve on this previous work with a reranking approach using deep learning: after a list of top-k recommendations is computed, we rerank the results with a neural network to improve the output of the existing recommender system. We present the design of the deep reranking approach and a preliminary evaluation. Our results show that in most cases the recommendations can be improved using our deep learning reranking approach.
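Structurally, the reranking step amounts to something like the sketch below; the scoring network is a random-weight placeholder, since the paper's actual architecture and features are not described in this abstract:

```python
# Sketch of the reranking step: take the top-k items from a base
# recommender and reorder them by the score of a neural network.
# The tiny MLP below has untrained, random weights purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=4)

def mlp_score(x):
    """One hidden ReLU layer; a stand-in for a trained scoring model."""
    return float(np.maximum(x @ W1, 0) @ W2)

def rerank(candidates, features, k=10):
    """Reorder the top-k candidates by neural score; keep the tail as-is."""
    top = sorted(candidates[:k], key=lambda c: mlp_score(features[c]), reverse=True)
    return top + candidates[k:]

candidates = [f"paper_{i}" for i in range(20)]             # base ranking
features = {c: rng.normal(size=8) for c in candidates}     # item features
print(rerank(candidates, features)[:5])
```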

Past Talks

Martin Toepfer
Recent Advances in Automatic Subject Indexing
8 June 2017, 15:30, ZBW Kiel, Room B-024
Automatic subject indexing is a key technology for digital libraries. In this talk, we look at recent advances in this field, in particular the choice of system architecture with respect to challenging data characteristics, illustrated by experiments on titles and author keywords in the economics domain.

Alexander Herwix, Universität Köln
Design Science Research - Where we are now and where we can go
7 April 2017, 11:00, ZBW Kiel, Room B-024
The talk begins with a short introduction to Design Science Research.
Abstract:
One core idea of scientific enquiry is the generation of reliable abstract knowledge about phenomena that is applicable in a variety of contexts. Recognizing the inherent relationship between scientific enquiry and the design of artifacts, Simon (1996) laid the foundations for a new approach to science that makes this relationship explicit: "the sciences of the artificial". Since then, much progress has been made, and design science research (DSR) has emerged as a recognized form of scientific enquiry (e.g., Hevner and Chatterjee 2010; Hevner et al. 2004; March and Smith 1995; Nunamaker Jr et al. 1990). This talk presents an emerging perspective and vision for DSR that is being developed in the course of the author's dissertation project. A novel framework for the analysis and design of generative systems builds the foundation for this endeavor. In the first half, the talk gives a short overview of the current state of DSR in information systems (IS) and highlights open challenges. For example, DSR is an inherently boundary-spanning research activity and thus faces the challenge of having to integrate and distribute insights between multiple knowledge bases. In the second half, a novel approach to DSR aimed at overcoming these challenges is presented in the form of a framework and vision for a generative information system for DSR.

References:
Hevner, A. R., and Chatterjee, S. 2010. "Design Research in Information Systems." Springer.
Hevner, A. R., March, S. T., Park, J., and Ram, S. 2004. "Design Science in Information Systems Research," MIS Quarterly (28:1), pp. 75-105.
March, S. T., and Smith, G. F. 1995. "Design and Natural Science Research on Information Technology," Decision Support Systems (15), pp. 251-266.
Nunamaker Jr, J. F., Chen, M., and Purdin, T. D. 1990. "Systems Development in Information Systems Research," Journal of Management Information Systems (7:3), pp. 89-106.
Simon, H. A. 1996. "The Sciences of the Artificial." Cambridge, MA: MIT Press.

M.Sc. Johann Schaible, GESIS, Köln
TermPicker: Recommending Vocabulary Terms for Reuse When Modeling Linked Open Data

Public part of the doctoral defense (Dr. rer. nat.)
17 February 2017, 14:00, LMS2 Ü/2; Ludewig-Meyn-Str. 2, 24118 Kiel

Agnes Mainka, Heinrich-Heine-Universität Düsseldorf
Smart, smarter, smartest: An empirical investigation of informational world cities
19 January 2017, 14:30, Room 550, Hamburg
Contemporary and future cities are often labelled “smart cities,” “digital cities,” “ubiquitous cities,” “knowledge cities,” or “creative cities.” “Informational city” serves as an umbrella term that unifies these divergent trends of information-related city research. This is an interdisciplinary endeavour, incorporating computer science and information science on the one side and urban studies, city planning, architecture, city economics, and city sociology on the other. In my talk, I will present a conceptual framework for research on informational cities as well as results from empirical studies on cities all over the world. The framework consists of six building blocks: information- and knowledge-related infrastructures, the labour market, the corporate structure, soft locational factors, political will, and cityness. The evaluated case studies concentrate on the factors infrastructure, political will, and cityness.

Falk Böschen
A Comparison of Approaches for Unsupervised Extraction of Text from Scholarly Figures

15 December 2016, 15:30, Room B-024
So far, there has not been a comparative evaluation of different approaches for text extraction from scholarly figures. In order to fill this gap, we have defined a generic pipeline for text extraction that abstracts from the existing approaches as documented in the literature. In this paper, we use this generic pipeline to systematically evaluate and compare 32 configurations for text extraction over four datasets of scholarly figures of different origin and characteristics. In total, our experiments have been run over more than 400 manually labeled figures. The experimental results show that the approach BS-4OS results in the best F-measure of 0.67 for the Text Location Detection and the best average Levenshtein Distance of 4.71 between the recognized text and the gold standard on all four datasets using the Ocropy OCR engine.
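For reference, the Levenshtein (edit) distance used as the recognition metric above can be computed with the standard dynamic program below; this is a generic implementation, not the authors' evaluation code:

```python
# Standard dynamic-programming Levenshtein distance between two strings,
# as used to compare recognized text against a gold standard.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein("figure 3: results", "figure 8: resu1ts"))  # -> 2
```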

Sarah Ben Slama
Visualization of Linked Open Data

(seminar talk)
22 September 2016, 15:30, Room B-024

Kaltrina Nuredini
Enriching the knowledge of altmetrics studies by exploring social media metrics for Economic and Business Studies journals

5 September 2016, 15:30, Room B-024
We present a case study of articles published in 30 journals from Economics and Business Studies (EBS), using social media metrics from Altmetric.com. Our results confirm that altmetric information is significantly more present for recent articles. The three most used altmetric sources for EBS journals are Mendeley, Twitter, and News. Low but positive correlations (r=0.2991) are identified between citation counts and Altmetric scores at the article level, and they increase at the journal level (r=0.614). However, articles from highly cited journals neither receive high online attention nor are they better represented on social media.
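The reported figures correspond to a correlation computation like the following sketch (toy numbers, not the study's data; the abstract does not state which correlation coefficient was used, so Pearson's r is shown as an assumption):

```python
# Sketch: correlation between citation counts and Altmetric scores.
# Toy numbers, not the study's data; Pearson's r is an assumption.
import numpy as np

citations = np.array([12, 3, 45, 7, 30, 2, 18, 9])   # per article
altmetric = np.array([ 4, 1, 20, 2,  9, 1,  5, 3])   # Altmetric scores
r = np.corrcoef(citations, altmetric)[0, 1]
print(f"Pearson r = {r:.4f}")
```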

Abdallah Salama
tba

10 August 2016, 11:00-12:30, Room E-205

Tim Dopke
Aspect Based Summaries and Identification of Salient Arguments in German Language Product Reviews

Christian Zirkelbach
ExplorViz - Live Trace Visualization for System and Program Comprehension in Large Software Landscapes

(20-minute talk and discussion)
27 July 2016, 15:30, Room B-024
In many enterprises, the number of deployed applications is constantly increasing. These applications - often several hundred - form large software landscapes. The comprehension of such landscapes is frequently impeded by, for instance, architectural erosion, personnel turnover, or changing requirements. Furthermore, events such as performance anomalies can often only be understood in correlation with the states of the applications. Therefore, an efficient and effective way to comprehend such software landscapes in combination with the details of each application is required. To face this challenge, we created ExplorViz - a live trace visualization approach to support system and program comprehension in large software landscapes. It features two perspectives: a landscape-level perspective using UML elements and an application-level perspective following the 3D software city metaphor. Our software is capable of monitoring, analyzing, and processing a huge number of method calls in large software landscapes. Furthermore, we utilize different display and interaction concepts for the software city metaphor beyond classical 2D displays and 2D pointing devices. We conducted several lab and controlled experiments to verify our approach. ExplorViz is available as open-source software on www.explorviz.net. Additionally, we provide extensive experimental packages of our evaluations to facilitate the verifiability and reproducibility of our results.

Prof. Dr. Sanja Bauk
Modeling smart network for enhancing occupational safety at the seaport
15 June 2016
The presentation considers problems arising from the lack of contemporary ICT solutions in developing seaports that operate in transitional environments. Within this context, an affordable smart wireless network model for enhancing port workers' occupational safety and health is proposed at the logical level. The results of experiments with the Matlab and Opnet simulation tools on the physical and link layers of the communication channels are briefly presented, along with directions for further research in the field.

Jesper Zedlitz
Family Research on the First World War: A Sisyphean Task on 31,000 Pages
12 May 2016
Does the indexing of historical sources really work with volunteer helpers alone? The talk presents how, within 2.5 years and with more than 700 volunteers, the German casualty lists of the First World War - at 31,000 pages an extensive and until now barely usable source - were indexed with the help of a novel online tool. Besides the technology of the online tool for the structured transcription of printed historical sources, a look is taken at user behaviour within the crowdsourcing project.

Prof. Norbert Luttenberger
Hierarchies with Holes - on different approaches to trade classification (including the Deutsche Reichsstatistik)
25 June 2015
This talk first presents three different commodity classifications that are used for statistical, customs, and scientific purposes. Among them is the Standard International Trade Classification Rev. 4, which was developed by the United Nations Statistics Division and, among other things, forms the basis for various scientific purposes. It is shown how this classification can be transformed in several steps from its current form (an Excel spreadsheet) into an OWL-based ontology. Above all, conceptual questions are discussed; experiences with the tools Protégé and Pellet are also covered.
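The first step of such a conversion could look like the sketch below, which turns flat classification rows (code, label, parent code) into an RDF hierarchy with rdflib. SKOS is used here for brevity, whereas the talk builds a full OWL ontology with Protégé and Pellet; the namespace and rows are placeholders:

```python
# Sketch: flat classification rows -> RDF concept hierarchy via rdflib.
# SKOS instead of full OWL for brevity; namespace and rows are placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

SITC = Namespace("http://example.org/sitc4/")   # placeholder namespace
rows = [("0",  "Food and live animals", None),  # (code, label, parent)
        ("00", "Live animals",          "0"),
        ("01", "Meat and meat preparations", "0")]

g = Graph()
g.bind("skos", SKOS)
for code, label, parent in rows:
    concept = SITC[code]
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.notation, Literal(code)))
    g.add((concept, SKOS.prefLabel, Literal(label, lang="en")))
    if parent is not None:
        g.add((concept, SKOS.broader, SITC[parent]))

print(g.serialize(format="turtle"))
```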

Falk Böschen
Multi-oriented Text Extraction from Information Graphics
3 September 2015
Existing research on analyzing information graphics assumes that perfect text detection and extraction are available. However, text extraction from information graphics is far from solved. To fill this gap, we propose a novel processing pipeline for multi-oriented text extraction from infographics. The pipeline applies a combination of data mining and computer vision techniques to identify text elements, cluster them into text lines, compute their orientation, and uses a state-of-the-art open-source OCR engine to perform the text recognition. We evaluate our method on 121 infographics extracted from an open access corpus of scientific publications. The results show that our approach is effective and significantly outperforms a state-of-the-art baseline.
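One stage of such a pipeline, clustering detected text elements into lines and estimating their orientation, could be sketched as follows (DBSCAN and the toy centroids are illustrative assumptions, not the talk's exact method):

```python
# Sketch: cluster text-element centroids into text lines with DBSCAN and
# estimate each line's orientation by a linear fit. Illustrative only.
import numpy as np
from sklearn.cluster import DBSCAN

# (x, y) centroids of hypothetical detected characters: two text lines
centroids = np.array([[10, 100], [22, 101], [35, 99], [48, 100],
                      [12, 200], [25, 202], [37, 199]])
labels = DBSCAN(eps=30, min_samples=2).fit_predict(centroids)

for line_id in sorted(set(labels)):
    line = centroids[labels == line_id]
    slope = np.polyfit(line[:, 0], line[:, 1], 1)[0]  # fit y = m*x + b
    angle = np.degrees(np.arctan(slope))
    print(f"line {line_id}: {len(line)} elements, orientation {angle:+.1f} deg")
```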

Arne Klemenz
Semantification of Web Query Interfaces based on a Schema.org Extension to Improve Access to Web Databases
This presentation addresses how to gain visibility of information from web databases that form part of the so-called Deep Web. Information from web databases could provide great value in satisfying changing information needs. Shared vocabularies like Schema.org have been published and contribute to the machine readability of information provided on the web, but most of this information still remains hidden in web databases. Therefore, this presentation addresses the limitations of Schema.org's schemata and proposes an extension to Schema.org to fill the existing gaps. The proposed extension will improve the accessibility of content provided in web databases and contribute to the innovation of web search by lifting the web query interfaces of web databases to the level of machine-understandable Semantic Web interfaces.

Falk Böschen
Optical Character Recognition on Information Graphics: Towards a better Retrieval of Infographics
Information graphics, or infographics for short, are widely used, from social networks and mass media to research publications and presentations. Despite this popularity, searching for infographics is still limited today. Like other images, they are only indexed by annotations and keywords generated from surrounding text. But the infographics themselves contain valuable textual information that is not necessarily present in the caption or surrounding text. Existing research on infographics assumes perfect OCR results, although common systems cannot deal with the huge variety of text (size, orientation, font, color) in these graphics. In this talk we present an automated approach to improve optical character recognition on infographics using the open-source software Tesseract. Our algorithm preprocesses the graphic, binarizes it, detects text elements and their orientations using data mining and computer vision techniques, and delivers optimized sub-images to Tesseract. Further improvements as well as post-processing are planned. In addition, we plan to use domain-specific knowledge, if available, to enhance the recognition; for example, we plan to extend Tesseract's dictionaries with the STW.
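The final recognize step could be sketched as below with OpenCV and pytesseract; the detection of text elements and orientations that precedes it in the talk's approach is omitted, and the file path is a placeholder:

```python
# Sketch: binarize an infographic (Otsu threshold) and hand the result to
# Tesseract via pytesseract. Only the preprocess-and-recognize step; the
# talk's text-element and orientation detection is omitted here.
import cv2
import pytesseract

img = cv2.imread("infographic.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(pytesseract.image_to_string(binary))
```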

Chifumi Nishioka
User Profiling from Social Media considering Temporal Decay and its Application for Item Recommendation
18 September 2015
Information overload is a well-known problem on the web. One way to ease this challenge is the use of recommender systems, which suggest items to each user based on a user profile. Nowadays, many people publish their own thoughts and ideas on social media platforms on a daily basis. Thus, social media retain a huge amount of user data, and many studies have dealt with user profiling from social media. In this talk, we discuss the design of recommender systems along three dimensions: profiling method (for users and documents), temporal decay, and document content. Regarding the profiling method, we try to find the best method to distill features of users and candidate documents. In the dimension of temporal decay, three decay functions are compared: no decay, a sliding window, and exponential decay. In existing studies, each extracted user interest has been scored by its frequency of appearance. However, a user who repeatedly mentioned "Back to the Future" a few years ago, for example, may have little interest in it at present. In the third dimension, we evaluate whether document profiling from not only bibliographic information but also full texts helps to improve the performance of the recommender system. We conduct the empirical study in the scenario of recommending EconBiz open access documents based on a user's tweets.
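The three decay variants can be written as weighting functions over the age of a posting, roughly as in this sketch (the window and half-life parameters are illustrative, not values from the study):

```python
# Sketch of the three temporal decay variants: each maps the age of a
# posting (in days) to a weight for the interests extracted from it.
# Window and half-life values are illustrative, not from the study.
def no_decay(age_days: float) -> float:
    return 1.0

def sliding_window(age_days: float, window: float = 365.0) -> float:
    """Count only postings inside the window; ignore older ones."""
    return 1.0 if age_days <= window else 0.0

def exponential_decay(age_days: float, half_life: float = 180.0) -> float:
    """Halve an interest's weight every `half_life` days."""
    return 0.5 ** (age_days / half_life)

for age in (0, 90, 365, 1000):
    print(f"{age:4d} days: {no_decay(age):.2f} "
          f"{sliding_window(age):.2f} {exponential_decay(age):.3f}")
```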

Marcel Hebing
Implementing the Data Portal 'DDI on Rails' as Part of a Metadata-Driven Infrastructure for Panel Surveys.
30 October 2015
In social and economic research, a significant amount of work is based on secondary data (data that were not collected by the researcher who analyses them). In secondary analysis, the researcher depends on the data producers in two respects: data quality and usability. Data of good quality adequately represent a particular aspect of reality. Usability originates notably from comprehensible documentation, enabling the researcher to interpret the data correctly. Based on the Generic Longitudinal Business Process Model (GLBPM), the thesis designs a reference architecture for panel studies, highlighting the idea of a metadata-driven process to improve both data quality and usability. The principles of the reference architecture are applied to the implementation of "DDI on Rails", a web application supporting data dissemination and discovery.

Kaltrina Nuredini
Social media and its usage in enhancing the search and provision of literature in the research process
Change is happening: human beings have reshaped their social communication in today's world. Services for social interaction that are low-cost and easy to use continually prove successful. Over the last years, social media tools have been introduced, and their usage has become massive, both in personal life as a platform for interaction and, increasingly, at work as an entirely new way of collaborating and problem-solving. Within these platforms, a sea of information is collected through e-mails, wall posts, blogs, wiki pages, videos, etc., leading to virtual social spaces filled with new insights and opportunities. In this sense, social media has interesting implications for different fields of study and attracts a great deal of interest from researchers, whose research process it increasingly affects. Yet even though researchers are familiar with these tools, there is still a gap in understanding how these tools could be used in a digestible way when searching for literature.

In the vast majority of cases, researchers use online resources such as research profiling services, which have social aspects and link to research papers, and which can easily enhance the discoverability of their work. On the other hand, researchers also use digital libraries to access literature for their studies. If social media were used in digital libraries, would it help researchers improve the search and provision of literature in their research process? The aim of this research is to answer this question and, moreover, to determine the effective use of social media in the research process. The plan proceeds in several stages. In the first stage, altmetrics are used: the thirty most well-known journals in economics are selected, and the articles in these journals are analyzed at the journal level with the help of the academic social network Mendeley. The Mendeley API offers document details for each article in a specified journal, including various metrics; in this case, three are selected: user discipline, academic status, and country. These data are analyzed, and the results will be implemented in EconBiz. The idea behind using this methodology as a first step is to display these three dimensions of influence to researchers, which might support them while searching for literature.