Monthly Archives: October 2018

The Release Circle – A Glimpse behind the Scenes

As you already know, the mode of releases changed with the new DBpedia strategy, too. The DBpedia release process follows a three-step approach: Extraction, ID-Management and Fusion. Releases are currently published on a monthly basis. In this post, we give you insight into what the individual steps of the release process comprise and what our developers actually do when preparing a DBpedia release.

Extraction  – Step one of the Release

The good news is: our new release mode is taking shape and has noticeably picked up speed. The 2018-08 Release and, additionally, the 2018.09.12 and 2018.10.16 Releases are now finally available in our LTS repository.

The 2018-08 Release was generated from the Wikipedia datasets extracted in early August and currently comprises 136 languages. The extraction release contains the raw extracted data generated by the DBpedia extraction-framework. Post-processing steps, such as data deduplication or URI normalization, are omitted here and moved to later parts of the release process. Thus, we can provide direct, transparent access to the generated data at every step. Until we manage two releases per month, our data is mostly based on the second set of Wikipedia datasets of the previous month. In line with that, the 2018.09.12 release is based on late August data and the recent 2018.10.16 Release is based on Wikipedia datasets extracted on September 20th. All three comprise 136 languages, and the list of datasets has been stable since the 2018-08 release.

Our releases are now ready for parsing and external use. Additionally, there will be a new Wikidata-based release this week.

ID-Management – Step two of the Release

For a complete “new DBpedia” release the DBpedia ID-Management and Fusion of the data have to be added to the process. The Databus ID Management is a process to unify the different IRIs that identify the same entities but were coined by different data providers. Taking datasets with overlapping domains of interest from multiple data providers, the set of IRIs denoting the entities in the source datasets is determined heuristically (e.g. excluding RDF/OWL types/classes).

Afterwards, each of these selected IRIs is assigned a numeric primary key, the ‘Singleton ID’. The core of the ID Management process happens in the next step: Based on the large set of high-confidence owl:sameAs assertions in the input data, the connected components induced by the corresponding sameAs-graph are computed. In other words: The groups of all entities from the input datasets that are (transitively) reachable from one another are determined. We dubbed these groups the sameAs-clusters. For each sameAs-cluster we pick one member as representative, which determines the ‘Cluster ID’ or ‘Global Identifier’ for all cluster members.

Apart from being an essential preparatory step for the Fusion, these Global Identifiers serve a purpose in their own right as unified Linked Data identifiers for groups of Linked Data entities that should be viewed as equivalent or ‘the same thing’.
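To make the clustering step more tangible, here is a minimal sketch in plain Python. It is not our Spark workflow: it uses a simple union-find structure on a handful of made-up example IRIs, assigns Singleton IDs in order of first appearance, and picks the member with the smallest Singleton ID as representative. Both of these rules, as well as all function names, are assumptions for illustration only.

```python
def assign_singleton_ids(iris):
    """Give every distinct IRI a numeric key (order of first appearance; an assumption)."""
    ids = {}
    for iri in iris:
        if iri not in ids:
            ids[iri] = len(ids)
    return ids


def same_as_clusters(same_as_pairs):
    """Compute connected components of the sameAs-graph with union-find.

    Returns (singleton_ids, cluster_ids). The cluster ID of an IRI is the
    smallest Singleton ID in its component (an assumed representative rule).
    """
    iris = [iri for pair in same_as_pairs for iri in pair]
    singleton = assign_singleton_ids(iris)
    parent = {sid: sid for sid in singleton.values()}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(singleton[a]), find(singleton[b])
        if root_a != root_b:
            # Attach the larger root below the smaller one, so the root of
            # every component is always its smallest Singleton ID.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    cluster = {iri: find(sid) for iri, sid in singleton.items()}
    return singleton, cluster


# Hypothetical sameAs assertions from three made-up data providers.
pairs = [
    ("http://a.example.org/Leipzig", "http://b.example.org/Q1"),
    ("http://b.example.org/Q1", "http://c.example.org/Leipzig"),
    ("http://a.example.org/Berlin", "http://b.example.org/Q2"),
]
singleton_ids, cluster_ids = same_as_clusters(pairs)
```

In this toy input, the three Leipzig IRIs end up in one sameAs-cluster and the two Berlin IRIs in another, although the first and third Leipzig IRI are only linked transitively.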

A processing workflow based on Apache Spark to perform the process described above for large quantities of RDF input data is already in place and has been run successfully for a large set of DBpedia inputs consisting of:

 

Fusion – Step three of the Release

Based on the extraction and the ID-Management, the Data Fusion is the last step of the DBpedia release cycle. With the goal of improving data quality and data coverage, the process uses the DBpedia global IRI clusters to fuse and enrich the source datasets. The fused data contains all resources of the input datasets. The fusion process first decides, per property, how many values are selected (owl:FunctionalProperty determination). For these functional properties, the value is then selected according to a preference that depends on the source dataset a value originates from. For example, values from En-DBpedia are preferred over values from DE-DBpedia.

The enrichment improves the coverage of entity properties and values for resources that are only contained in the source data. Furthermore, we create provenance data to keep track of the origin of each triple. This provenance data is also used for the HTTP-based resource view at http://global.dbpedia.org.
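The value selection and provenance tracking can be pictured with a small Python sketch. It is a simplified illustration under stated assumptions, not the actual fusion code: the source ranking, the set of functional properties, the example values and the rule of keeping every distinct value for non-functional properties are all made up for this example.

```python
from collections import defaultdict

# Assumed preference order of source datasets (most preferred first).
SOURCE_PREFERENCE = ["en", "de"]
# Assumed set of properties treated as owl:FunctionalProperty.
FUNCTIONAL_PROPERTIES = {"birthDate"}


def fuse(triples):
    """Fuse (cluster_id, property, value, source) tuples.

    Functional properties keep one value, taken from the most preferred
    source. Other properties keep every distinct value (an assumption).
    Each fused triple carries the list of sources it came from.
    """
    rank = {source: i for i, source in enumerate(SOURCE_PREFERENCE)}
    grouped = defaultdict(list)
    for cluster_id, prop, value, source in triples:
        grouped[(cluster_id, prop)].append((value, source))

    fused = []
    for (cluster_id, prop), candidates in grouped.items():
        if prop in FUNCTIONAL_PROPERTIES:
            # Unknown sources rank behind all known ones.
            value, source = min(
                candidates, key=lambda vs: rank.get(vs[1], len(rank))
            )
            fused.append((cluster_id, prop, value, [source]))
        else:
            sources_by_value = defaultdict(list)
            for value, source in candidates:
                sources_by_value[value].append(source)
            for value, sources in sources_by_value.items():
                fused.append((cluster_id, prop, value, sources))
    return fused


# Made-up input for a single cluster with ID 0.
example = [
    (0, "birthDate", "1900-01-01", "de"),
    (0, "birthDate", "1900-01-02", "en"),
    (0, "label", "Leipzig", "en"),
    (0, "label", "Leipzig", "de"),
    (0, "label", "Leipzsch", "de"),
]
fused_triples = fuse(example)
```

Here the conflicting birthDate values are resolved in favour of the "en" source, while both labels survive and each fused triple remembers which sources contributed it.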

At the moment, the fused and enriched data is available for the generic and mapping-based extractions. More datasets are still in progress. The DBpedia-fusion data is being uploaded to http://downloads.dbpedia.org/repo/dev/fusion/

 

Please note we are still in the midst of the beta testing for our data release tool, so in case you do come across any errors, reporting them to us is much appreciated to fuel the testing process.

Further information regarding the release progress can be found here: http://dev.dbpedia.org/

Next steps

We will add more releases to the repository on a monthly basis, aiming for a bi-weekly release mode as soon as possible. Any mistakes or errors you find and report in this data between releases can be fixed for the upcoming release.

Currently, the generated metadata in the DataID-file is not stable. It still fluctuates, needs to be improved and will change in the near future.

This blog post was written with the help of our DBpedia developers Robert Bielinski, Markus Ackermann and Marvin Hofer, who were responsible for the work done with respect to the DBpedia releases. We would like to thank them for their great work.

Yours DBpedia Association

Retrospective: GSoC 2018

With all the beta-testing, the evaluations of the community survey part I and part II and the preparations for the Semantics 2018, we almost lost sight of telling you about the final results of GSoC 2018. Below, we present a short recap of this year’s students and projects that made it to the finishing line of GSoC 2018.

 

Et Voilà

We started out with six students that committed to GSoC projects. However, in the course of the summer, some dropped out or did not pass the midterm evaluation. In the end, we had three finalists that made it through the program.

Meet Bharat Suri

… who worked on “Complex Embeddings for OOV Entities”. The aim of this project was to enhance the DBpedia Knowledge Base by enabling the model to learn from the corpus and generate embeddings for different entities, such as classes, instances and properties.  His code is available in his GitHub repository. Tommaso Soru, Thiago Galery and Peng Xu supported Bharat throughout the summer as his DBpedia mentors.

Meet Victor Fernandez

… who worked on a “Web application to detect incorrect mappings across DBpedia’s in different languages”. The aim of his project was to create a web application and API to aid in automatically detecting inaccurate DBpedia mappings. The mappings for each language are often not aligned, causing inconsistencies in the quality of the RDF generated. The final code of this project is available in Victor’s repository on GitHub. He was mentored by Mariano Rico and Nandana Mihindukulasooriya.

Meet Aman Mehta

… whose project aimed at building a model which allows users to query DBpedia directly using natural language without the need to have any previous experience in SPARQL. His task was to train a Sequence-2-Sequence Neural Network model to translate any Natural Language Query (NLQ) into the corresponding sentence encoding a SPARQL query. See the results of this project in Aman’s GitHub repository. Tommaso Soru and Ricardo Usbeck were his DBpedia mentors during the summer.

Finally, these projects will contribute to the overall development of DBpedia. We are very satisfied with the contributions and results our students produced. Furthermore, we would like to sincerely thank all students and mentors for their effort. We hope to stay in touch and see a few faces again next year.

A special thanks goes out to all mentors and students whose projects did not make it through.

GSoC Mentor Summit

Now it is the mentors’ turn to take part in this year’s GSoC mentor summit, October 12th till 14th. This year, Mariano Rico and Thiago Galery will represent DBpedia at the event. Their task is to engage in a lively discussion about this year’s program, about lessons learned, and about the highlights and drawbacks they experienced during the summer. Hopefully, they will return with new ideas from the exchange with mentors from other open source projects. In turn, we hope to improve our part of the program for students and mentors.

Sit tight, follow us on Twitter and we will update you about the event soon.

Yours DBpedia Association

DBpedia Chapters – Survey Evaluation – Episode Two

Welcome back to part two of the evaluation of the surveys we conducted with the DBpedia chapters.

Survey Evaluation – Episode Two

The second survey focused on technical matters. We asked the chapters about the usage of DBpedia services and tools, technical problems and challenges and potential reasons to overcome them.  Have a look below.

Again, only nine out of 21 DBpedia chapters participated in this survey. And again, that means the results only represent roughly 43% of the DBpedia chapter population.

The good news is, all chapters maintain a local DBpedia endpoint. Yay! More than 55% of the chapters perform their own extraction. The rest of them apply a hybrid approach, reusing some datasets from DBpedia releases and additionally extracting some on their own.

Datasets, Services and Applications

In terms of frequency of dataset updates, the situation is as follows: 44.4% of the chapters update them once a year. The answers of the remaining ones differ in equal shares, depending on various factors. See the graph below.

When it comes to the maintenance of links to local datasets, most of the chapters do not have additional ones. However, some do maintain links to, for example, Greek WordNet, the National Library of Greece Authority record, Geonames.jp and the Japanese WordNet. Furthermore, some of the chapters even host other datasets of local interest, but mostly in a separate endpoint, so they keep separate graphs.

Apart from hosting their own endpoint, most chapters maintain one or more additional services such as Spotlight, LodLive or LodView.

Moreover,  the chapters have additional applications they developed on top of DBpedia data and services.

In addition, they gave us some reasons why they were not able to deploy DBpedia-related services. See their replies below.

DBpedia Chapter set-up

Lastly, we asked the technical heads of the chapters what the hardest task for setting up their chapter had been.  The answers, again, vary as the starting position of each chapter differed. Read a few of their replies below.

The hardest technical task for setting up the chapter was:

  • to keep virtuoso up to date
  • the chapter specific setup of DBpedia plugin in Virtuoso
  • the Extraction Framework
  • configuring Virtuoso for serving data using server’s FQDN and Nginx proxying
  • setting up the Extraction Framework, especially for abstracts
  • correctly setting up the extraction process and the DBpedia facet browser
  • fixing internationalization issues, and updating the endpoint
  • keeping the extraction framework working and up to date
  • updating the server to the specific requirements for further compilation – we work on Debian

 

Final words

With all the data and results we gathered, we will get together with our chapter coordinator to develop a strategy for addressing the technical as well as organizational issues the surveys revealed. In doing so, we hope to facilitate a better exchange between the chapters and with us, the DBpedia Association. Moreover, we intend to minimize barriers for setting up and maintaining a DBpedia chapter so that our chapter community may thrive and prosper.

In the meantime, spread your work and share it with the community. Do not forget to follow and tag us on Twitter ( @dbpedia ). You may also want to subscribe to our newsletter.

We will keep you posted about any updates and news.

Yours

DBpedia Association