
New Prototype: Databus Collection Feature

We are thrilled to announce that our Databus Collection Feature for the DBpedia Databus has been developed and is now available as a prototype. It simplifies bundling your data and using it in your applications.

A new Databus Collection Feature? How come, and how does it work? Read below and find out how using the DBpedia Databus becomes easier by the day and with each new tool.

Motivation

With more and more data being uploaded to the Databus, we started to develop test applications using that data. The SPARQL endpoint offers a central hub for accessing all metadata of the datasets uploaded to the Databus, provided you know how to write SPARQL queries. Since this metadata includes the download links of the data files, it was already possible to pass a SPARQL query to an application, download the actual data and then use it for whatever purpose the app had.
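
To give you an idea of what such an application does, here is a minimal sketch in Python that queries the Databus SPARQL endpoint for download links and fetches the files. Please note that the endpoint URL, the example artifact URI and the metadata properties (dataid:artifact, dcat:distribution, dcat:downloadURL) are assumptions based on the DataID/DCAT vocabularies used on the Databus; adjust them to the metadata you are actually interested in.

# Minimal sketch (not an official client): query the Databus SPARQL endpoint
# for download URLs of one artifact and fetch the files.
# Endpoint URL, artifact URI and property names are assumptions.
import requests

DATABUS_SPARQL = "https://databus.dbpedia.org/repo/sparql"  # assumed endpoint

QUERY = """
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dcat:   <http://www.w3.org/ns/dcat#>

SELECT DISTINCT ?file WHERE {
  ?dataset dataid:artifact <https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects> ;
           dcat:distribution ?dist .
  ?dist dcat:downloadURL ?file .
}
LIMIT 5
"""

response = requests.get(
    DATABUS_SPARQL,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    url = binding["file"]["value"]
    print("downloading", url)
    data = requests.get(url).content  # the actual data file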

The Databus Collection Editor

The DBpedia Databus now provides an editor for collections. A collection is basically a labelled SPARQL query that is retrievable via URI. Hence, with the collection editor you can group Databus groups and artifacts into a bundle and publish your selection using your Databus account. It is now a breeze to select the data you need, share the exact selection with others and/or use it in existing or self-made applications.

If you are not familiar with SPARQL and data queries, you can think of the feature as a shopping cart for data: You create a new cart, put data in it and tell your friends or applications where to find it. Quite neat, right?
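
To stay with the shopping cart picture: once a collection is published, an application only needs the collection URI. The sketch below assumes that the collection URI can be resolved to its underlying SPARQL query via content negotiation (the Accept value text/sparql is an assumption, as is the endpoint URL); the query is then executed against the Databus endpoint to obtain the selected files.

# Hedged sketch: resolve a collection URI to its SPARQL query and run it.
# The "text/sparql" content negotiation and the endpoint URL are assumptions.
import requests

COLLECTION_URI = "https://databus.dbpedia.org/<user>/collections/<collection-id>"  # placeholder
DATABUS_SPARQL = "https://databus.dbpedia.org/repo/sparql"  # assumed endpoint

# 1. Fetch the SPARQL query behind the collection.
query = requests.get(COLLECTION_URI, headers={"Accept": "text/sparql"}).text

# 2. Execute it to obtain the download URLs of the selected files.
result = requests.get(
    DATABUS_SPARQL,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
).json()

files = [b["file"]["value"] for b in result["results"]["bindings"]]
print("collection resolves to", len(files), "files")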

In the following section, we will cover the user interface of the collection editor.

The Editor UI

Firstly, you can find the collection editor by going to the DBpedia Databus and following the Collections link at the top, or you can get there directly by clicking here.

What you will see is the following:

General Collection Info

Secondly, since you do not have any collections yet, the editor has already created an empty collection named “Unnamed” for you. On the right side, next to the label and description, you will find a pen icon. By clicking the icon or the label itself you can edit its content. The collection is not published yet, so the Collection URI is blank.

Whenever you are not logged in or the collection has not been published yet, the editor will also notify you that your changes are only saved in your local browser cache and NOT remotely on our server. Keep that in mind when clearing your cache. Publishing the collection, however, is easy: simply log into (or create) your Databus account and hit the publish button in the action bar. This will open up a modal where you can pick your unique collection id and hit publish again. That’s it!

The Collection Info section will now show the collection URI. Following the link will take you to the HTML representation of your collection that will be visible to others. Hitting the Edit button in the action bar will bring you back to the editor.

Collection Hierarchy

Let’s have a look at the core piece of the collection editor: the hierarchy view. A collection can be a bundle of different Databus groups and artifacts, but is not limited to that. If you know how to write a SPARQL query, you can easily extend your collection with more powerful selections. To reflect this, the hierarchy is split into two nodes:

  • Generated Queries: Contains all queries that are generated from your selection in the UI
  • Custom Queries: Contains all custom-written SPARQL queries

Both hierarchy nodes have a “+” icon. Clicking this button lets you add generated or custom queries, respectively.

Custom Queries

If you hit the “+” icon on the Custom Queries node, a new node called “Custom Query” will appear in the hierarchy. You can remove a custom query by clicking the trashcan icon in the hierarchy. If you click the node, it will take you to a SPARQL input field where you can edit the query.

To make your collection more understandable for others, you can even document the query by adding a label and description.

Writing Your Own Custom Queries

A collection query is a SPARQL query of the form:

SELECT DISTINCT ?file WHERE {
    {
        [SUBQUERY]
    }
    UNION
    {
        [SUBQUERY]
    }
    UNION
    ...
    UNION
    {
        [SUBQUERY]
    }
}

All selections made by generated and custom queries will be combined via UNION into a single result set with a single column called “file”. Thus, it is important that your custom query binds its results to a variable called “file” as well.
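
As a hedged illustration of that contract, the following sketch assembles such a collection query from two made-up subqueries. The property names and the artifact URI in them are purely illustrative; the only real requirement is that every subquery binds its result to the variable ?file (PREFIX declarations are omitted for brevity).

# Sketch: assemble a collection query from custom subqueries, roughly
# mirroring the template above. The subquery contents are made-up examples;
# what matters is that each subquery binds the variable ?file.

TEMPLATE = """SELECT DISTINCT ?file WHERE {{
{unions}
}}"""

def build_collection_query(subqueries):
    """Join the subqueries with UNION, following the collection query form."""
    blocks = ["    {\n        %s\n    }" % q.strip() for q in subqueries]
    return TEMPLATE.format(unions="\n    UNION\n".join(blocks))

subqueries = [
    # hypothetical generated query: all files of one artifact
    """?dataset dataid:artifact <https://databus.dbpedia.org/example/group/artifact> ;
                dcat:distribution/dcat:downloadURL ?file .""",
    # hypothetical custom query: only files whose URL contains "_en"
    """?dist dcat:downloadURL ?file .
       FILTER(CONTAINS(STR(?file), "_en"))""",
]

print(build_collection_query(subqueries))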

Generated Queries

Clicking the “+” icon on the Generated Queries node will take you to a search field. Make use of the indexed search on the Databus to find and add the groups and artifacts you need. If you want to refine your search, don’t worry: you can do that in the next step!

Once the artifact or group has been added to your collection, the Add to Collection button will turn green. Once you are done, you can go back to the editor with the Back to Hierarchy button.

Your hierarchy will now contain several new nodes.

Group Facets, Artifact Facets and Overrides

Groups and artifacts that have been added to the collection will show up as nodes in the hierarchy. Clicking a node will open a filter where you can refine your dataset selection. Setting a filter on a group node will apply it to all of its artifact nodes unless you override that setting in an artifact node manually. The filter set in the group node is shown in the artifact facets in dark grey; any overrides in the artifact facets are highlighted in green.

Group Nodes

A group node provides a list of filters that will be applied to all artifacts of that group.

Artifact Nodes

Artifact nodes then actually select the data files, which will be visible in the faceted view. The facets are generated dynamically from the available variants declared in the metadata.

Example: here we selected the latest version of the Databus dump as N-Triples. This collection is already in use: the collection URI is passed to the new generic lookup application, which then provides the search function for the Databus website. If you are interested in how to configure the lookup application, you can go here: https://github.com/dbpedia/lookup-application. A separate blog post about the lookup will follow within the next few weeks.

Use Cases

The DBpedia Databus Collections are useful in many ways.

  • You can share a specific dataset with your community or colleagues.
  • You can re-use datasets others have created.
  • You can plug collections into Databus-ready applications and avoid spending time on the download and setup process.
  • You can point to a specific piece of data (e.g. for testing) with a single URI in your publications.
  • You can help others to create data queries more easily.

We hope you enjoy the Databus Collection Feature and we would love to hear your feedback! You can leave your thoughts and suggestions in the new DBpedia Forum. Feedback of any kind is highly appreciated, since we want to improve the prototype as fast and as user-driven as possible! Cheers!

A big thanks goes to DBpedia developer Jan Forberg who finalized the Databus Collection Feature and compiled this text.

Yours

DBpedia Association

The Release Circle – A Glimpse behind the Scenes

As you already know, with the new DBpedia strategy our mode of publishing releases has changed. The new DBpedia release process follows a three-step approach, starting with the Extraction, followed by the ID-Management and finally the Fusion, which finalizes the release. Our DBpedia releases are currently published on a monthly basis. In this post, we give you insight into the individual steps of the release process and into what our developers actually do when preparing a DBpedia release.

Extraction – Step one of the Release

The good news is, our new release mode is taking shape and has noticeably picked up speed. Finally, the 2018-08 release and, additionally, the 2018.09.12 and 2018.10.16 releases are now available in our LTS repository.

The 2018-08 release was generated on the basis of the Wikipedia datasets extracted in early August and currently comprises 136 languages. The extraction release contains the raw extracted data generated by the DBpedia extraction-framework. The post-processing steps, such as data-deduplication or URI-normalization, are omitted and moved to later parts of the release process. Thus, we can provide direct, transparent access to the generated data at every step. Until we manage two releases per month, our data is mostly based on the second Wikipedia datasets of the previous month. In line with that, the 2018.09.12 release is based on late August data and the recent 2018.10.16 release is based on Wikipedia datasets extracted on September 20th. They all comprise 136 languages and contain a stable list of datasets since the 2018-08 release.

Our releases are now ready for parsing and external use. Additionally, there will be a new Wikidata-based release this week.

ID-Management – Step two of the Release

For a complete “new DBpedia” release, the DBpedia ID-Management and the Fusion of the data have to be added to the process. The Databus ID-Management is a process that unifies the various IRIs, coined by different data providers, that identify the same entities. Taking datasets with overlapping domains of interest from multiple data providers, the sets of IRIs denoting the entities in the source datasets are determined heuristically (e.g. excluding RDF/OWL types/classes).

Afterwards, these selected IRIs are assigned a numeric primary key, the ‘Singleton ID’. The core of the ID-Management process happens in the next step: based on the large set of high-confidence owl:sameAs assertions in the input data, the connected components induced by the corresponding sameAs-graph are computed. In other words: the groups of all entities from the input datasets that are (transitively) reachable from one another are determined. We dubbed these groups the sameAs-clusters. For each sameAs-cluster we pick one member as representative, which determines the ‘Cluster ID’ or ‘Global Identifier’ for all cluster members.
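
To make the clustering idea more concrete, here is a small toy sketch in plain Python (the production workflow runs on Apache Spark, as described below) that derives sameAs-clusters from owl:sameAs pairs with a union-find structure and picks one member per cluster as its representative; the example IRIs are illustrative.

# Toy illustration of sameAs-clustering: connected components over owl:sameAs
# edges, computed with union-find. The example pairs are illustrative only.
from collections import defaultdict

same_as_pairs = [
    ("http://dbpedia.org/resource/Berlin", "http://de.dbpedia.org/resource/Berlin"),
    ("http://de.dbpedia.org/resource/Berlin", "http://www.wikidata.org/entity/Q64"),
    ("http://dbpedia.org/resource/Leipzig", "http://www.wikidata.org/entity/Q2079"),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:                 # walk up with path halving
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in same_as_pairs:
    union(a, b)

clusters = defaultdict(set)
for iri in list(parent):
    clusters[find(iri)].add(iri)

# One representative per sameAs-cluster determines the Global Identifier.
for members in clusters.values():
    representative = min(members)         # deterministic but arbitrary pick
    print(representative, "<-", sorted(members))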

Apart from being an essential preparatory step for the Fusion, these Global Identifiers serve a purpose in their own right, as unified Linked Data identifiers for groups of Linked Data entities that should be viewed as equivalent or ‘the same thing’.

A processing workflow based on Apache Spark that performs the process described above for large quantities of RDF input data is already in place and has been run successfully for a large set of DBpedia inputs.

 

Fusion – Step three of the Release

Based on the extraction and the ID-Management, the Data Fusion forms the last step of the DBpedia release cycle. With the goal of improving data quality and data coverage, the process uses the DBpedia global IRI clusters to fuse and enrich the source datasets. The fused data contains all resources of the input datasets. The fusion process is based on a functional-property decision (owl:FunctionalProperty determination) that determines how many values are selected for a property. Further, the value selection for these functional properties is based on a preference ranking of the source datasets; for example, values from the English DBpedia are preferred over those from the German DBpedia.
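
The following is a hedged sketch of that selection logic (not the actual fusion code): for a property determined to be functional, exactly one value is kept, chosen by a preference ranking over the source datasets, while for non-functional properties all source values are merged. The source labels, the preference order and the example values are illustrative assumptions.

# Sketch of fused value selection. Source labels, preference order and the
# example values are illustrative, not the real fusion configuration.

SOURCE_PREFERENCE = ["en-dbpedia", "de-dbpedia", "fr-dbpedia"]  # most preferred first

def fuse_values(values_by_source, functional):
    """values_by_source maps a source dataset to its values for one subject/property."""
    if not functional:
        # non-functional property: keep the union of all source values
        return sorted({v for vals in values_by_source.values() for v in vals})
    # functional property: keep the value from the most preferred source that has one
    for source in SOURCE_PREFERENCE:
        if values_by_source.get(source):
            return [values_by_source[source][0]]
    return []

population = {"de-dbpedia": ["3600000"], "en-dbpedia": ["3700000"]}
print(fuse_values(population, functional=True))   # English value wins

labels = {"en-dbpedia": ["Berlin"], "de-dbpedia": ["Berlin (Stadt)"]}
print(fuse_values(labels, functional=False))      # both labels are kept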

The enrichment improves the coverage of entity properties and values for resources only contained in the source data. Furthermore, we create provenance data to keep track of the origin of each triple. This provenance data is also used for the HTTP-based resource view at http://global.dbpedia.org.

At the moment, the fused and enriched data is available for the generic and mapping-based extractions. More datasets are still in progress. The DBpedia-fusion data is being uploaded to http://downloads.dbpedia.org/repo/dev/fusion/

Please note that we are still in the midst of beta testing our data release tool, so in case you do come across any errors, reporting them to us is much appreciated, as it fuels the testing process.

Further information regarding the release progress can be found here: http://dev.dbpedia.org/

Next steps

We will add more releases to the repository on a monthly basis, aiming for a bi-weekly release mode as soon as possible. In between the intervals, any mistakes or errors you find and report in this data can be fixed for the upcoming release. Currently, the generated metadata in the DataID file is not stable; it still needs to be improved and will change in the near future. We are now working on the next release and will inform you as soon as it is published.

Yours DBpedia Association

This blog post was written with the help of our DBpedia developers Robert Bielinski, Markus Ackermann and Marvin Hofer, who were responsible for the work done with respect to the DBpedia releases. We would like to thank them for their great work.

 

Beta-Test Updates

While everyone at the DBpedia Association was preparing for the SEMANTiCS Conference in Vienna, we also managed to reach an important milestone regarding the beta-test for our data release tool.

First and foremost, 3500 files have already been published with the plugin. These files will be part of the new DBpedia release and are available in our LTS repository.

Secondly, the documentation of the testing has been brought into good shape. Feel free to drop by and check it out.
Thirdly, we reached our first interoperability goal. The current metadata is sufficient to produce RSS 1.0 feeds. See here for further information. We also defined a loose roadmap on top of the readme, where interoperability with DCAT and DCAT-AP has high priority.

 

Now we have some time to support you and work one-on-one, and also to prepare the configurations to help you set up the data releases. Lastly, we have already received data from DNB and SUMO, so we will start to look into these more closely.

Thanks to all the beta-testers for your nice work.

We keep you posted.

Yours

DBpedia Association

Keep using DBpedia!

Just recently, DBpedia Association member and hosting specialist OpenLink released the DBpedia Usage Report, a periodic report on the DBpedia SPARQL endpoint and the associated Linked Data deployment.

The report not only gives some historical insight into DBpedia’s usage, i.e. the number of visits and hits per day, but especially shows statistics collected between October 2016 and December 2017. The report covers more than a year of logs from the DBpedia web service operated by OpenLink Software at http://dbpedia.org/sparql/.

Before we highlight a few aspects of DBpedia’s usage, we would like to thank OpenLink for the continuous hosting of the DBpedia endpoint and the creation of this report.

The graph shows the average number of hits/requests per day that were made to the DBpedia service during each of the releases.

The graph shows the average number of unique visits per day made to the DBpedia service during each of the dataset releases.

As the usage tables in the report show, there has been a massive increase in the number of hits coinciding with the DBpedia 2015-10 release on April 1st, 2016.

This boost can be attributed to an intensive promotion of DBpedia via community meetings, communication with various partners in the Linked Data community and a social media presence, all aimed at increasing backlinks.

Since then, not only have the numbers of hits increased, but DBpedia has also provided better data quality. We are constantly working on improving the accessibility, data quality and stability of the SPARQL endpoint. Kudos to OpenLink for maintaining the technical baseline for DBpedia.

The report also provides a usage overview of the past year.

The full report is available here.

 

Subscribe to the DBpedia Newsletter, check our DBpedia Website and follow us on Twitter, Facebook, and LinkedIn for the latest news.

Thanks for reading and keep using DBpedia!

Yours DBpedia Association

 

New DBpedia Release – 2016-10

We are happy to announce the new DBpedia Release.

This release is based on updated Wikipedia dumps dating from October 2016.

You can download the new DBpedia datasets in N3 / Turtle serialisation from http://wiki.dbpedia.org/downloads-2016-10 or directly from http://downloads.dbpedia.org/2016-10/.

This release took us longer than expected. We had to deal with multiple issues and included new data. Most notable is the addition of the NIF annotation datasets for each language, recording the whole wiki text, its basic structure (sections, titles, paragraphs, etc.) and the included text links. We hope that researchers and developers working on NLP-related tasks will find this addition most rewarding. The DBpedia Open Text Extraction Challenge (next deadline Mon 17 July for SEMANTiCS 2017) was introduced to instigate new fact extraction based on these datasets.

We want to thank everyone who has contributed to this release by adding mappings, new datasets, extractors or issue reports, helping us to increase the coverage and correctness of the released data. We also thank the European Commission and the ALIGNED H2020 project for funding and general support.

Want to read more about the new release? Find further details below.

Statistics

Altogether the DBpedia 2016-10 release consists of 13 billion (2016-04: 11.5 billion) pieces of information (RDF triples) out of which 1.7 billion (2016-04: 1.6 billion) were extracted from the English edition of Wikipedia, 6.6 billion (2016-04: 6 billion) were extracted from other language editions and 4.8 billion (2016-04: 4 billion) from Wikipedia Commons and Wikidata.

In addition, adding the large NIF datasets for each language edition (see details below) increased the number of triples further by over 9 billion, bringing the overall count up to 23 billion triples.

Changes

  • The NLP Interchange Format (NIF) aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. To extend the versatility of DBpedia, furthering many NLP-related tasks, we decided to extract the complete human-readable text of any Wikipedia page (‘nif_context’), annotated with NIF tags. For this first iteration, we restricted the extent of the annotations to the structural text elements directly inferable from the HTML (‘nif_page_structure’). In addition, all contained text links are recorded in a dedicated dataset (‘nif_text_links’); a query sketch over these datasets follows after this list.
    The DBpedia Association started the Open Extraction Challenge on the basis of these datasets. We aim to spur knowledge extraction from Wikipedia article texts in order to dramatically broaden and deepen the amount of structured DBpedia/Wikipedia data and to provide a platform for benchmarking various extraction tools with this effort.
    If you want to participate with your own NLP extraction engine, the next deadline for SEMANTiCS 2017 is July 17th.
    We included an example of these structures in section five of the download page of this release.
  • A considerable amount of work has been done to streamline the extraction process of DBpedia, converting many of the extraction tasks into an ETL setting (using Spark). We are working in concert with the Semantic Web Company to further enhance these results by introducing a workflow management environment to increase the frequency of our releases.
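
As promised above, here is a hedged sketch of how the new NIF datasets could be queried once you have loaded them into a SPARQL endpoint of your own. The endpoint URL is a placeholder, and the exact modelling in the released files may differ; the nif: and itsrdf: terms are taken from the NIF Core and ITS RDF vocabularies.

# Hedged sketch: read the full page text ('nif_context') and its text links
# ('nif_text_links') from a local endpoint that has the NIF datasets loaded.
# The endpoint URL is a placeholder; the property usage is an assumption.
import requests

ENDPOINT = "http://localhost:8890/sparql"  # placeholder for your own endpoint

QUERY = """
PREFIX nif:    <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX itsrdf: <http://www.w3.org/2005/11/its/rdf#>

SELECT ?text ?anchor ?target WHERE {
  ?context a nif:Context ;
           nif:isString ?text .
  OPTIONAL {
    ?link nif:referenceContext ?context ;
          nif:anchorOf ?anchor ;
          itsrdf:taIdentRef ?target .
  }
}
LIMIT 10
"""

rows = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
).json()["results"]["bindings"]

for row in rows:
    print(row.get("anchor", {}).get("value"), "->", row.get("target", {}).get("value"))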

In case you missed it, here is what we changed in the previous release (2016-04):

  • We added a new extractor for citation data that provides two files:
    • citation links: linking resources to citations
    • citation data: trying to get additional data from citations. This is a quite interesting dataset, but we need help to clean it up.
  • In addition to datasets normalised to the English DBpedia (en-uris), we also provide normalised datasets based on the DBpedia Wikidata (DBw) datasets (wkd-uris). These sorted datasets will be the foundation for the upcoming fusion process with Wikidata. The DBw-based URIs will be the only ones provided from the following releases on.
  • We now filter out triples from the Raw Infobox Extractor that are already mapped. E.g. no more “<x> dbo:birthPlace <z>” and “<x> dbp:birthPlace|dbp:placeOfBirth|… <z>” in the same resource. These triples are now moved to the “infobox-properties-mapped” datasets and not loaded on the main endpoint. See issue 22 for more details.
  • Major improvements in our citation extraction. See here for more details.
  • We incorporated the statistical distribution approach of Heiko Paulheim for creating type statements automatically and provide them as additional datasets (instance_types_sdtyped_dbo).

 

Upcoming Changes

  • DBpedia Fusion: We finally started working again on fusing DBpedia language editions. Johannes Frey is taking the lead in this project. The next release will feature intermediate results.
  • Id Management: Closely pertaining to the DBpedia Fusion project is our effort to introduce our own Id/IRI management, to become independent of Wikimedia-created IRIs. This will not entail changing our domain or entity naming regime, but it provides the possibility of adding entities of any source or scope.
  • RML Integration: Wouter Maroy has already provided the necessary groundwork for switching the mappings wiki to an RML-based approach on GitHub. Wouter started working exclusively on implementing the Git-based wiki and the conversion of existing mappings last week. We are looking forward to the results of this process.
  • Further development of the Spark integration and workflow-based DBpedia extraction, to increase the release frequency.

 

New Datasets

  • New languages extracted from Wikipedia:

South Azerbaijani (azb), Upper Sorbian (hsb), Limburgan (li), Minangkabau (min), Western Mari (mrj), Oriya (or), Ossetian (os)

  • SDTypes: We extended the coverage of the automatically created type statements (instance_types_sdtyped_dbo) to English, German and Dutch.
  • Extensions: In the extension folder (2016-10/ext) we provide two new datasets (both are to be considered in an experimental state):
    • DBpedia World Facts: This dataset is authored by the DBpedia Association itself. It lists all countries, all currencies in use and (most) languages spoken in the world, as well as how these concepts relate to each other (spoken in, primary language, etc.) and useful properties like ISO codes (ontology diagram). This dataset extends the very useful LEXVO dataset with facts from DBpedia and the CIA Factbook. Please report any errors or suggestions regarding this dataset to Markus.
    • JRC-Alternative-Names: This resource is a link-based complementary repository of spelling variants for person and organisation names. The data is multilingual and contains up to hundreds of variations per entity. It was extracted from the analysis of news reports by the Europe Media Monitor (EMM), as available on JRC-Names.

Community

The DBpedia community added new classes and properties to the DBpedia ontology via the mappings wiki. The DBpedia 2016-10 ontology encompasses:

  • 760 classes
  • 1,105 object properties
  • 1,622 datatype properties
  • 132 specialised datatype properties
  • 414 owl:equivalentClass and 220 owl:equivalentProperty mappings to external vocabularies

The editor community of the mappings wiki also defined many new mappings from Wikipedia templates to DBpedia classes. For the DBpedia 2016-10 extraction, we used a total of 5887 template mappings (DBpedia 2015-10: 5800 mappings). The top language, gauged by the number of mappings, is Dutch (648 mappings), followed by the English community (606 mappings).


Credits to

  • Markus Freudenberg (University of Leipzig / DBpedia Association) for taking over the whole release process and creating the revamped download & statistics pages.
  • Dimitris Kontokostas (University of Leipzig / DBpedia Association) for conveying his considerable knowledge of the extraction and release process.
  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • The whole DBpedia Internationalization Committee for pushing the DBpedia internationalization forward.
  • Václav Zeman and the whole LHD team (University of Prague) for their contribution of additional DBpedia types.
  • Alan Meehan (TCD) for performing a big external link cleanup.
  • Aldo Gangemi (LIPN University, France & ISTC-CNR, Italy) for providing the links from DOLCE to DBpedia ontology.
  • SpringerNature for offering a co-internship to a bright student and developing a closer relation to DBpedia on multiple issues, as well as Links to their SciGraph subjects.
  • Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software) for loading the new data set into the Virtuoso instance that provides 5-Star Linked Open Data publication and SPARQL Query Services.
  • OpenLink Software (http://www.openlinksw.com/) collectively for providing the SPARQL Query Services and Linked Open Data publishing infrastructure for DBpedia in addition to their continuous infrastructure support.
  • Ruben Verborgh from Ghent University – imec for publishing the dataset as Triple Pattern Fragments, and imec for sponsoring DBpedia’s Triple Pattern Fragments server.
  • Ali Ismayilov (University of Bonn) for extending and cleaning of the DBpedia Wikidata dataset.
  • All the GSoC students and mentors who have directly or indirectly worked on the DBpedia release.
  • Special thanks to members of the DBpedia Association, the AKSW and the Department for Business Information Systems of the University of Leipzig.

The work on the DBpedia 2016-10 release was financially supported by the European Commission through the project ALIGNED – quality-centric software and data engineering.

More information about DBpedia can be found at http://dbpedia.org as well as in the new overview article about the project, available at http://wiki.dbpedia.org/Publications.

Have fun with the new DBpedia 2016-10 release!