Monthly Archives: August 2019

SEMANTiCS 2019 Interview: Katja Hose

Today’s post features an interview with our DBpedia Day keynote speaker Katja Hose, a Professor of Computer Science at Aalborg University, Denmark. In this interview, Katja talks about increasing the reliability of knowledge graph access as well as her expectations for SEMANTiCS 2019.

Prior to joining Aalborg University, Katja was a postdoc at the Max Planck Institute for Informatics in Saarbrücken. She received her doctoral degree in Computer Science from Ilmenau University of Technology in Germany.

Can you tell us something about your research focus?

The most important focus of my research has been querying the Web of Data, in particular, efficient query processing over distributed knowledge graphs and Linked Data. This includes indexing, source selection, and efficient query execution. Unfortunately, it happens all too often that the services needed to access remote knowledge graphs are temporarily not available, for instance, because a software component crashed. Hence, we are currently developing a decentralized architecture for knowledge sharing that will make access to knowledge graphs a reliable service, which I believe is the key to a wider acceptance and usage of this technology.
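
For readers less familiar with federated querying, here is a minimal sketch (not code from the interview) of the kind of query this research is about: a single SPARQL query that joins DBpedia data with the remote Wikidata endpoint via the SERVICE keyword, sent from Scala with Java's standard HTTP client. The query itself is only illustrative, and if either endpoint is temporarily down the request simply fails, which is exactly the reliability problem described above.

  import java.net.{URI, URLEncoder}
  import java.net.http.{HttpClient, HttpRequest, HttpResponse}

  object FederatedQuerySketch extends App {
    // People born in Aalborg (from DBpedia), with their occupation fetched remotely from Wikidata.
    val query =
      """SELECT ?person ?occupation WHERE {
        |  ?person <http://dbpedia.org/ontology/birthPlace> <http://dbpedia.org/resource/Aalborg> ;
        |          <http://www.w3.org/2002/07/owl#sameAs> ?wd .
        |  FILTER(STRSTARTS(STR(?wd), "http://www.wikidata.org/entity/"))
        |  SERVICE <https://query.wikidata.org/sparql> {
        |    ?wd <http://www.wikidata.org/prop/direct/P106> ?occupation .
        |  }
        |} LIMIT 10""".stripMargin

    val request = HttpRequest.newBuilder(URI.create("https://dbpedia.org/sparql"))
      .header("Content-Type", "application/x-www-form-urlencoded")
      .header("Accept", "application/sparql-results+json")
      .POST(HttpRequest.BodyPublishers.ofString("query=" + URLEncoder.encode(query, "UTF-8")))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }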

How do you personally contribute to the advancement of semantic technologies?

I contribute by doing research, advancing the state of the art, and applying semantic technologies to practical use cases.  The most important achievements so far have been our works on indexing and federated query processing, and we have only recently published our first work on a decentralized architecture for sharing and querying semantic data. I have also been using semantic technologies in other contexts, such as data warehousing, fact-checking, sustainability assessment, and rule mining over knowledge bases.

Overall, I believe the greatest ideas and advancements come when trying to apply semantic technologies to real-world use cases and problems, and that is what I will keep on doing.

Which trends and challenges do you see for linked data and the semantic web?

The goal and the idea behind Linked Data and the Semantic Web is the second-best invention after the Internet. But unlike the Internet, Linked Data and the Semantic Web are only slowly being adopted by a broader community and by industry.

I think part of the reason is that, from a company’s point of view, there are few incentives and little added benefit in broadly sharing achievements. Some companies are simply reluctant to openly share their results and experiences in the hope of retaining an advantage over their competitors. I believe that if these success stories were shared more openly, and this is the trend we are witnessing right now, more companies would see the potential for their own problems and find new exciting use cases.

Another particular challenge, which we will have to overcome, is that it is currently still far too difficult to obtain and maintain an overview of what data is available, and to formulate a query as a non-expert in SPARQL and the particular domain… and of course, there is the challenge that accessing these datasets is not always reliable.

As artificial intelligence becomes more and more important, what is your vision of AI?

AI and machine learning are indeed becoming more and more important. I do believe that these technologies will bring us a huge step ahead. The process has already begun. But we also need to be aware that we are currently in the middle of a big hype where everybody wants to use AI and machine learning – although many people actually do not truly understand what it is and if it is actually the best solution to their problems. It reminds me a bit of the old saying “if the only tool you have is a hammer, then every problem looks like a nail”. Only time will tell us which problems truly require machine learning, and I am very curious to find out which solutions will prevail.

However, the current state of the art is still very far away from the AI systems that we all know from Science Fiction. Existing systems operate like black boxes on well-defined problems and lack true intelligence and understanding of the meaning of the data. I believe that the key to making these systems trustworthy and truly intelligent will be their ability to explain their decisions and their interpretation of the data in a transparent way.

What are your expectations about Semantics 2019 in Karlsruhe?

First and foremost, I am looking forward to meeting a broad range of people interested in semantic technologies. In particular, I would like to get in touch with industry-based research.

The End

We would like to thank Katja Hose for her insights and are happy to have her as one of our keynote speakers.

Visit SEMANTiCS 2019 in Karlsruhe, Sep 9-12 and get your tickets for our community meeting here. We are looking forward to meeting you during DBpedia Day.

Yours DBpedia Association

SEMANTiCS Interview: Dan Weitzner

As the upcoming 14th DBpedia Community Meeting, co-located with SEMANTiCS 2019 in Karlsruhe, Sep 9-12, is drawing nearer, we would like to take this opportunity to introduce you to our DBpedia keynote speakers.

Today’s post features an interview with Dan Weitzner from WPSemantix who talks about timbr-DBpedia, which we blogged about recently, as well as future trends and challenges of linked data and the semantic web.

Dan Weitzner is co-founder and Vice President of Research and Development of WPSemantix. He obtained his Bachelor of Science in Computer Science from Florida Atlantic University. In collaboration with DBpedia, he and his colleagues at WPSemantix launched timbr, the first SQL Semantic Knowledge Graph that integrates Wikipedia and Wikidata Knowledge into SQL engines.


1. Can you tell us something about your research focus?

WPSemantix bridges the worlds of standard databases and the Semantic Web by creating ontologies accessible in standard SQL. 

Our platform, timbr, is a virtual knowledge graph that maps existing data sources to abstract concepts, accessible directly in all the popular Business Intelligence (BI) tools and also natively integrated into Apache Spark, R, Python, Java and Scala.

timbr enables reasoning and inference for complex analytics without the need for costly Extract-Transform-Load (ETL) processes to graph databases.
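
As a rough illustration of the Spark integration mentioned above, here is a hypothetical sketch of loading a timbr concept into Spark through the generic JDBC data source. The connection string and the concept name ("person") are placeholders invented for this example, not timbr's documented interface.

  import org.apache.spark.sql.SparkSession

  object TimbrSparkSketch extends App {
    val spark = SparkSession.builder()
      .appName("timbr-sketch")
      .master("local[*]")
      .getOrCreate()

    val people = spark.read
      .format("jdbc")
      .option("url", "jdbc:<timbr-endpoint>")   // placeholder connection string
      .option("dbtable", "person")              // hypothetical concept exposed as a table
      .load()

    people.filter("birth_year > 1950").show()   // from here on it is ordinary Spark SQL
    spark.stop()
  }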

2. How do you personally contribute to the advancement of semantic technologies?

We believe we have lowered the fundamental barriers to the adoption of semantic technologies for large organizations that want to benefit from knowledge graph capabilities without, first, requiring fundamental changes to their database infrastructure and, second, requiring expensive organizational changes or significant personnel retraining.

Additionally, we implemented the W3C Semantic Web principles to enable inference and inheritance between concepts in SQL, and to allow seamless integration of existing ontologies from OWL. As a result, users across organizations can do complex analytics with the same tools they currently use to access and query their databases, and can run sophisticated queries over big data without requiring highly technical expertise.

timbr-DBpedia is one example of what can be achieved with our technology. This joint effort with the DBpedia Association allows semantic SQL querying of the DBpedia knowledge graph, and the semantic integration of DBpedia knowledge into data warehouses and data lakes. Finally, timbr-DBpedia allows organizations to benefit from enriching their data with DBpedia knowledge, combining it with machine learning and/or accessing it directly from their favourite BI tools.
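
To give a flavour of what such a semantic SQL query might look like, here is a purely illustrative Scala/JDBC sketch. The connection string, the concept name ("scientist") and the column names are assumptions made for this example, not timbr-DBpedia's documented schema.

  import java.sql.DriverManager

  object TimbrDbpediaSketch extends App {
    val conn = DriverManager.getConnection("jdbc:<timbr-dbpedia-endpoint>", "user", "password")
    val stmt = conn.createStatement()

    // Querying an abstract concept such as "scientist" would, via inheritance,
    // also cover its subtypes (physicists, chemists, ...) declared in the ontology.
    val rs = stmt.executeQuery(
      "SELECT name, birth_place FROM scientist WHERE birth_place = 'Berlin' LIMIT 20")
    while (rs.next())
      println(s"${rs.getString("name")} was born in ${rs.getString("birth_place")}")

    conn.close()
  }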

3. Which trends and challenges do you see for linked data and the semantic web?

Currently, the use of semantic technologies for data exploration and data integration is a significant trend followed by data-driven communities. It allows companies to leverage relationship-rich data to gain meaningful insights.

One of the big difficulties for the average developer and business intelligence analyst is the challenge of learning semantic technologies. Another is creating ontologies that are flexible and easy to maintain. We aim to solve both challenges with timbr.

4. Which application areas for semantic technologies do you perceive as most promising?

I think semantic technologies will bloom in applications that require data integration and contextualization for machine learning models.

Ontology-based integration seems very promising: it enables accurate interpretation of data from multiple sources through the explicit definition of terms and relationships, particularly in big data systems, where ontologies could bring consistency, expressivity and abstraction capabilities to massive volumes of data.

5. As artificial intelligence becomes more and more important, what is your vision of AI?

I envision knowledge-based business intelligence and contextualized machine learning models. This will be the bedrock of cognitive computing as any analysis will be semantically enriched with human knowledge and statistical models.

This will bring analysts and data scientists to the next level of AI.

6. What are your expectations about Semantics 2019 in Karlsruhe?

I want to share our vision with the semantic community and I would also like to learn about the challenges, vision and expectations of companies and organizations dealing with semantic technologies. I will present “timbr-DBpedia – Exploration and Query of DBpedia in SQL”.

The End

Visit SEMANTiCS 2019 in Karlsruhe, Sep 9-12 and find out more about timbr-DBpedia and all the other new developments at DBpedia. Get your tickets for our community meeting here. We are looking forward to meeting you during DBpedia Day.

Yours DBpedia Association

RDF2NL: Generating Texts from RDF Data

RDF2NL is featured in the following guest post by Diego Moussalem (Dice Research Group & Portuguese DBpedia Chapter).

Hi DBpedians,

During the DBpedia Day in Leipzig, I gave a talk about how to use the facts contained in the DBpedia Knowledge Graph for generating coherent sentences and texts.

We essentially rely on Natural Language Generation (NLG) techniques for accomplishing this task. NLG is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Despite community agreement on the actual text and speech output of these systems, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). A wide range of inputs has been used for NLG systems, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), and semantic representations (Theune et al., 2001).

Why not generate text from knowledge graphs?

The generation of natural language from the Semantic Web was introduced some years ago (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014; Staykova, 2014). However, it has recently gained substantial attention, and several challenges have been proposed to investigate the quality of texts automatically generated from RDF (Colin et al., 2016). Moreover, RDF has demonstrated a promising ability to support the creation of NLG benchmarks (Gardent et al., 2017). Still, English is the only language that has been widely targeted. Thus, we proposed RDF2NL, which can generate texts in languages other than English by relying on different language versions of SimpleNLG.

What is RDF2NL?

While the exciting avenue of using deep learning techniques in NLG approaches (Gatt and Krahmer, 2017) is open to this task, and deep learning has already shown promising results for RDF data (Sleimi and Gardent, 2016), the morphological richness of some languages led us to develop a rule-based approach. This was to ensure that we could identify the challenges imposed by each language from the Semantic Web perspective before applying Machine Learning (ML) algorithms. RDF2NL is able to generate either a single sentence or a summary of a given resource. It is based on LD2NL by Ngonga Ngomo et al. and uses the Brazilian Portuguese, Spanish, French, German and Italian adaptations of SimpleNLG for the realization task.
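
To make the realization step concrete, here is a minimal sketch (not RDF2NL's actual code) that uses the English SimpleNLG API from Scala; RDF2NL plugs in the language-specific adaptations instead. It assumes the simplenlg library is on the classpath and that content selection has already mapped a triple's predicate to a verb.

  import simplenlg.features.{Feature, Tense}
  import simplenlg.framework.NLGFactory
  import simplenlg.lexicon.Lexicon
  import simplenlg.realiser.english.Realiser

  object RealisationSketch extends App {
    val lexicon    = Lexicon.getDefaultLexicon()
    val nlgFactory = new NLGFactory(lexicon)
    val realiser   = new Realiser(lexicon)

    // Suppose content selection produced the triple
    // (dbr:Albert_Einstein, dbo:award, dbr:Nobel_Prize_in_Physics)
    // and the predicate dbo:award was verbalised as "receive".
    val clause = nlgFactory.createClause()
    clause.setSubject("Albert Einstein")
    clause.setVerb("receive")
    clause.setObject("the Nobel Prize in Physics")
    clause.setFeature(Feature.TENSE, Tense.PAST)

    println(realiser.realiseSentence(clause))   // "Albert Einstein received the Nobel Prize in Physics."
  }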

An example of an RDF2NL application:

We envisioned a promising application of RDF2PT, which aims to support the automatic creation of benchmarking datasets for Named Entity Recognition (NER) and Entity Linking (EL) tasks. In Brazilian Portuguese, there is a lack of gold standard datasets for these tasks, which makes the investigation of these problems difficult for the scientific community. Our aim was to create Brazilian Portuguese silver standard datasets that can be uploaded into GERBIL for easy evaluation. To this end, we implemented RDF2PT (the Portuguese version of RDF2NL) in BENGAL, an approach for automatically generating NER benchmarks based on RDF triples and knowledge graphs. This application has already resulted in promising datasets, which we have used to investigate the capability of multilingual entity linking systems to recognize and disambiguate entities in Brazilian Portuguese texts. You can find some results below:
NER – http://gerbil.aksw.org/gerbil/experiment?id=201801050043
NED – http://gerbil.aksw.org/gerbil/experiment?id=201801110012

More application scenarios

  • Summarize or Explain KBs to non-experts
  • Create news automatically (automated journalism)
  • Summarize medical records
  • Generate technical manuals
  • Support the training of other NLP tasks
  • Generate product descriptions (e.g. eBay)

Deep Learning into RDF2NL

After devising our rule-based approach, we realized that RDF2NL is really good at selecting adequate content from the RDF triples, but the fluency of its generated texts remains a challenge. Therefore, we decided to move forward and work with neural network models to improve the fluency of the texts, as such models have already shown promising results in machine translation. We first focused on the generation of referring expressions, an essential part of generating texts: it decides how the NLG model will present the information about a given entity. For example, the referring expressions of the entity Barack Obama can be “the former president of the USA”, “Obama”, “Barack”, “he” and so on. Since then, we have been working on combining different NLG sub-tasks into single neural models to improve the fluency of our texts.
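
As a toy illustration of the referring-expression task (a hand-written rule, not the neural model discussed above): use the full name on the first mention of an entity and a pronoun afterwards.

  object ReferringExpressionToy extends App {
    // First mention: full name; later mentions: a pronoun. Neural models learn this choice from data.
    def refer(entity: String, pronoun: String, alreadyMentioned: Set[String]): String =
      if (alreadyMentioned.contains(entity)) pronoun else entity

    val entity = "Barack Obama"
    var mentioned = Set.empty[String]

    println(refer(entity, "he", mentioned))   // "Barack Obama"
    mentioned += entity
    println(refer(entity, "he", mentioned))   // "he"
  }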

GSoC on it – Stay tuned!  

Apart from trying to improve the fluency of our models, we previously relied on different language versions of SimpleNLG for the realization task. Nowadays, we have been investigating the generation of multiple languages with a single neural model. Our student has been working hard to provide nice results, and we are basically at the end of our GSoC project. So stay tuned to learn the outcome of this exciting project.

Many thanks to Diego for his contribution. If you want to write a guest post, share your results on the DBpedia Blog, and thus give your work more visibility and outreach, just ping us via dbpedia@infai.org.

Yours

DBpedia Association

DBpedia Live Restart – Getting Things Done

Part VI of the DBpedia Growth Hack series (View all)

DBpedia Live is a long-term core project of DBpedia that immediately extracts fresh triples from all changed Wikipedia articles. After a long hiatus, fresh and live-updated data is available once again, thanks to our former co-worker Lena Schindler, whose work we feature in this blog post. Before we dive into Lena’s report, let’s have a look at some general info about DBpedia Live:

Live Enterprise Version

OpenLink Software provides a scalable, dedicated, live Virtuoso instance built on Lena’s remastering. Kingsley Idehen announced the dedicated business service in our new DBpedia forum.
On the Databus, we collect publicly shared and business-ready dedicated services in the same place where you can download the data: Databus allows you to download the data, build a service, and offer that service, all in one place. Data uploaders can also see who builds something with their data.

Remastering the DBpedia Live Module

Contribution by Lena Schindler

After developing the DBpedia REST API as part of a student project in 2018, I worked as a student research assistant for DBpedia. My task was to analyze and patch severe issues in the DBpedia Live instance. I will briefly describe the purpose of DBpedia Live, the reasons it went out of service, how I fixed them, and finally, the changes needed to support multi-language abstract extraction.


Overview

The DBpedia Extraction Framework is Scala-based software with numerous features that have evolved around extracting knowledge (as RDF) from wikis. One part is the DBpedia Live module in the “live-deployed” branch, which is intended to provide a continuously updated version of DBpedia by processing Wikipedia pages on demand, immediately after they have been modified by a user. The backbone of this module is a queue that is filled with recently edited Wikipedia pages, combined with a relational database, called the Live Cache, that handles the diff between two consecutive versions of a page. The module that fills the queue, called the Feeder, needs some kind of connection to a wiki instance that reports changes to wiki pages. The processing then takes place in four steps (a code sketch follows the list):

  1. A wiki page is taken out of the queue. 
  2. Triples are extracted from the page, with a given set of extractors. 
  3. The new triples from the page are compared to the old triples from the Live Cache.
  4. The triple sets that have been deleted and added are published as text files, and the Cache is updated. 
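
The sketch below compresses that cycle into a few lines of Scala, with made-up types and in-memory stand-ins for the queue, the extractors and the Live Cache; the real module is considerably more involved.

  import scala.collection.mutable

  case class Triple(s: String, p: String, o: String)

  object LiveCycleSketch extends App {
    val queue     = mutable.Queue[String]()               // recently edited page titles
    val liveCache = mutable.Map[String, Set[Triple]]()    // last extracted triples per page

    // Stand-in for the real extractors: pretend every page yields a single triple.
    def extract(page: String): Set[Triple] =
      Set(Triple(s"dbr:$page", "dbo:wikiPageLength", System.currentTimeMillis.toString))

    def publish(added: Set[Triple], removed: Set[Triple]): Unit =
      println(s"publishing change set: +${added.size} / -${removed.size}")

    def processNext(): Unit = {
      val page       = queue.dequeue()                    // 1. take a page from the queue
      val newTriples = extract(page)                      // 2. run the extractors on it
      val oldTriples = liveCache.getOrElse(page, Set.empty[Triple])
      publish(newTriples -- oldTriples, oldTriples -- newTriples)   // 3.+4. diff and publish ...
      liveCache(page) = newTriples                        //        ... then update the cache
    }

    queue.enqueue("Leipzig")
    processNext()
  }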

Background

DBpedia Live has been out of service since May 2018, due to the termination of the Wikimedia RCStream service, upon which the old DBpedia Live Feeder module relied. This socket-based service provided information about changes to an existing Wikimedia instance and was replaced by the EventStreams service, which runs over a single HTTP connection using chunked transfer encoding and follows the Server-Sent Events (SSE) protocol. It provides a stream of events, each of which contains the title, id, language, author, and time of a page edit across all Wikimedia instances.

Fix

Starting in September 2018, my first task was to implement a new Feeder for DBpedia Live based on the new Wikimedia EventStreams service. For the Java world, the Akka framework provides an implementation of an SSE client. Akka is a toolkit developed by Lightbend that simplifies the construction of concurrent and distributed JVM applications, with both Java and Scala APIs. The Akka SSE client and the Akka Streams module are used in the new EventStreamsFeeder (Akka Helper) to extract and process the data stream. I decided to use Scala instead of Java because it is a more natural fit for Akka.

After I was able to process events, I had the problem that frequent interruptions in the upstream connection were causing the processing stream to fail. Luckily, Akka provides a fallback mechanism with back-off, similar to the binary exponential back-off of the Ethernet protocol, which I could use to restart the stream (called a “graph” in Akka terminology).
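
The following is a minimal sketch of that restart-with-back-off pattern (hypothetical names, Akka 2.5-era API); the placeholder source stands in for the real SSE stream of Wikimedia change events.

  import akka.actor.ActorSystem
  import akka.stream.ActorMaterializer
  import akka.stream.scaladsl.{RestartSource, Sink, Source}
  import scala.concurrent.duration._

  object FeederSketch extends App {
    implicit val system: ActorSystem = ActorSystem("feeder-sketch")
    implicit val mat: ActorMaterializer = ActorMaterializer()

    // Placeholder for the real SSE source of Wikimedia change events.
    def changeEvents: Source[String, _] = Source(List("pageA edited", "pageB edited"))

    // The stream ("graph") is rebuilt whenever it fails or completes, with exponential
    // back-off plus jitter, so a dropped upstream connection no longer kills the Feeder.
    val resilient = RestartSource.withBackoff(
      minBackoff = 3.seconds,
      maxBackoff = 5.minutes,
      randomFactor = 0.2
    ) { () => changeEvents }

    resilient.runWith(Sink.foreach(println))
  }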

Another problem was that, in many cases, a page was changed many times within a short time interval, and because events were processed quickly, each change was processed separately, stressing the Live instance with unnecessary load. A simple “thread sleep” reduced the number of change sets being published every hour from thousands to a few hundred.

Multi-language abstracts

The next task was to prepare the Live module for the extraction of abstracts (typically the first paragraph of a page, or the text before the table of contents). The extractors used for this task were re-implemented in 2017. It turned out to be, first, a configuration issue and, second, a candidate for long debugging sessions fixing issues in the dependencies between the “live” and “core” modules. Then, in order to allow the extraction of abstracts in multiple languages, the “live” module needed many small changes spread across the code base, and care had to be taken not to slow down extraction in the single-language case compared to the performance before the change. Deployment was delayed by an issue with the remote management unit of the production server, but was accomplished by May 2019.

Summary

I also collected my knowledge of the Live module in detailed documentation, addressed to developers who want to contribute to the code. This includes an explanation of the architecture as well as installation instructions. After 400 hours of work, DBpedia Live is alive and kicking, and now supports multi-language abstract extraction. Being responsible for many aspects of Software Engineering, like development, documentation, and deployment, I was able to learn a lot about DBpedia and the Semantic Web, hone new skills in database development and administration, and expand my programming experience using Scala and Akka. 

“Thanks a lot to the whole DBpedia Team who always provided a warm and supportive environment!”

Thank you, Lena. It is people like you who help DBpedia improve and develop further, and who help make data networks a reality.

Follow DBpedia on LinkedIn, Twitter or Facebook and stop by the DBpedia Forum to check out the latest discussions.

Yours DBpedia Association