All posts by Dimitris Kontokostas

DBpedia Version 2015-04 released

Dear all,

we are happy to announce the release of DBpedia 2015-04 (also known as: 2015 A). The new release is based on updated Wikipedia dumps dating from February/March 2015 and features an enlarged DBpedia ontology with more infobox to ontology mappings, leading to richer and cleaner data.

http://wiki.dbpedia.org/Downloads2015-04

The English version of the DBpedia knowledge base currently describes 5.9M things, out of which 4.3M resources have abstracts, 452K have geo coordinates and 1.45M have depictions. In total, 4 million resources are classified in a consistent ontology, including 2.06M persons, 682K places (including 455K populated places), 376K creative works (including 92K music albums, 90K films and 17K video games), 188K organizations (including 51K companies and 33K educational institutions), 278K species and 5K diseases. The total number of resources in English DBpedia is 15.3M, which, besides the 5.9M resources, includes 1.2M SKOS concepts (categories), 6.83M redirect pages, 256K disambiguation pages and 1.13M intermediate nodes.
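As a quick sanity check, the resource counts quoted above do add up to the stated total. The sketch below only sums the published figures (in thousands); the dictionary keys and variable names are ours and are not part of any DBpedia tooling.

```python
# Resource counts for English DBpedia 2015-04, in thousands,
# copied from the figures quoted in the announcement.
counts_k = {
    "described_resources": 5900,
    "skos_concepts": 1200,
    "redirect_pages": 6830,
    "disambiguation_pages": 256,
    "intermediate_nodes": 1130,
}

total_k = sum(counts_k.values())
total_m = round(total_k / 1000, 1)  # millions, rounded to one decimal

print(f"{total_k}K resources, about {total_m}M")
```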

We provide localized versions of DBpedia in 128 languages. All these versions together describe 38.3 million things, out of which 23.8 million are localized descriptions of things that also exist in the English version of DBpedia. The full DBpedia data set features 38 million labels and abstracts in 128 different languages, 25.2 million links to images and 29.8 million links to external web pages; 80.9 million links to Wikipedia categories, and 41.2 million links to YAGO categories. DBpedia is connected with other Linked Datasets by around 50 million RDF links.

In addition we provide DBpedia datasets for Wikimedia Commons and Wikidata.

Altogether, the DBpedia 2015-04 release consists of 6.9 billion pieces of information (RDF triples), out of which 737 million were extracted from the English edition of Wikipedia, 3.76 billion were extracted from other language editions and 2.4 billion from DBpedia Commons and Wikidata.

Thorough statistics can be found on the DBpedia website and general information on the DBpedia datasets here.

From this release on, we will try to provide two releases per year, one in April and the next in October. The 2015-04 release was delayed by 3 months, but we will try to keep to the schedule and publish the 2015-10 release at the end of October or in early November.

Our plan for the next release is to remove the URI encoding of English DBpedia (dbpedia.org) and switch to IRIs only. This will simplify the release process and align the English dataset with all other DBpedia language datasets. We know that this will probably break some links to DBpedia, but we feel it is the only way to move forward. If you have any reasons against this action, please let us know now.
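To illustrate the difference between a percent-encoded URI and the corresponding IRI, here is a minimal sketch using Python's standard library. The resource name is only an example, and the function `uri_to_iri` is our own illustration, not part of the extraction framework. Note that it decodes every percent-escape, whereas a real migration would need to keep some reserved characters encoded.

```python
from urllib.parse import unquote


def uri_to_iri(uri: str) -> str:
    """Decode percent-escapes so that non-ASCII characters appear literally.

    Simplification: every escape is decoded, including reserved characters
    that a real URI-to-IRI conversion would have to leave encoded.
    """
    return unquote(uri, encoding="utf-8", errors="strict")


encoded = "http://dbpedia.org/resource/Bj%C3%B6rk"  # illustrative resource
print(uri_to_iri(encoded))
```

Clients that store the encoded form would need a similar mapping to keep their links working after the switch.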

A complete list of changes in this release can be found on GitHub.

Starting with this release, we adjusted the folder structure of the download page, giving us more flexibility to offer more datasets in the near future:

http://downloads.dbpedia.org/2015-04/

Enlarged Ontology

The DBpedia community added new classes and properties to the DBpedia ontology via the mappings wiki. The DBpedia 2015 ontology encompasses

  • 735 classes (DBpedia 2014: 685)
  • 1,098 object properties (DBpedia 2014: 1,079)
  • 1,583 datatype properties (DBpedia 2014: 1,600)
  • 132 specialized datatype properties (DBpedia 2014: 116)
  • 408 owl:equivalentClass and 200 owl:equivalentProperty mappings to external vocabularies

Additional Infobox to Ontology Mappings

The editor community of the mappings wiki also defined many new mappings from Wikipedia templates to DBpedia classes. There are six new languages with mappings: Arabic, Bulgarian, Armenian, Romanian, Swedish and Ukrainian.

For the DBpedia 2015 extraction, we used a total of 4317 template mappings (DBpedia 2014: 3814 mappings).

Extended Type System to cover Articles without Infobox

Until the DBpedia 3.8 release, a concept was only assigned a type (like person or place) if the corresponding Wikipedia article contained an infobox indicating this type. Starting from the 3.9 release, we provide type statements for articles without an infobox; these are inferred from the link structure within the DBpedia knowledge base using the algorithm described in Paulheim/Bizer 2014. For the new release, an improved version of the algorithm was run to produce type information for 400,000 things that were formerly not typed. A similar algorithm (presented in the same paper) was used to identify and remove potentially wrong statements from the knowledge base.

In addition, this release includes four new type datasets, which are not loaded into the online SPARQL endpoint: 1) LHD datasets for English, German and Dutch and 2) DBTax for English.

Both LHD and DBTax use a typing system that goes beyond the DBpedia ontology. For each, we provide a subset mapped to the DBpedia ontology (dbo) and a full version with all types (ext).

New and updated RDF Links into External Data Sources

We updated the following RDF link sets pointing at other Linked Data sources: Freebase, Wikidata, Geonames and GADM.

Accessing the DBpedia 2015-04 Release

You can download the new DBpedia datasets in RDF format from http://wiki.dbpedia.org/Downloads or

http://downloads.dbpedia.org/2015-04/

Additional external dataset contributions

From this release onward, we will provide additional datasets related to DBpedia. For 2015-04, we provide a PageRank dataset for English and German, contributed by HPI.

http://downloads.dbpedia.org/2015-04/ext/

As usual, the new dataset is also published in 5-Star Linked Open Data form and accessible via the SPARQL Query Service endpoint at http://dbpedia.org/sparql and Triple Pattern Fragments service at http://fragments.dbpedia.org/.
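As a rough sketch of how the SPARQL endpoint can be addressed over HTTP, the snippet below builds a GET request URL with Python's standard library. The endpoint URL is the one given above; the example query, the resource it asks about, and the `format` parameter (which we assume the endpoint accepts for selecting JSON results) are illustrative assumptions. No request is actually sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named in the announcement

# Illustrative query: list up to ten types of an example resource.
QUERY = """
SELECT ?type WHERE {
  <http://dbpedia.org/resource/Berlin> a ?type .
} LIMIT 10
"""


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    # "query" is the standard SPARQL Protocol parameter; "format" is an
    # assumption about what this particular endpoint understands.
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"


url = build_query_url(QUERY)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client.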

Credits

Lots of thanks to

  • Markus Freudenberg (University of Leipzig) for taking over the whole release process
  • Dimitris Kontokostas for conveying his considerable knowledge of the extraction and release process.
  • Volha Bryl and Daniel Fleischhacker (University of Mannheim) for their work on the previous release and their continuous support in this release.
  • Alexandru Todor (University of Berlin) for contributing time and computing resources for the abstract extraction.
  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • The whole DBpedia Internationalization Committee for pushing the DBpedia internationalization forward.
  • Heiko Paulheim (University of Mannheim) for re-running his algorithm to generate additional type statements for formerly untyped resources and to identify and remove wrong statements.
  • Václav Zeman and the whole LHD team (University of Prague) for their contribution of additional DBpedia types
  • Marco Fossati (FBK) for contributing the DBTax types
  • Petar Ristoski (University of Mannheim) for generating the updated links pointing at the GADM database of Global Administrative Areas. Petar will also generate an updated release of DBpedia as Tables soon.
  • Aldo Gangemi (LIPN University, France & ISTC-CNR, Italy) for providing the links from DOLCE to DBpedia ontology.
  • Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software) for loading the new data set into the Virtuoso instance that provides 5-Star Linked Open Data publication and SPARQL Query Services.
  • OpenLink Software (http://www.openlinksw.com/) altogether for providing the SPARQL Query Services and Linked Open Data publishing infrastructure for DBpedia in addition to their continuous infrastructure support.
  • Ruben Verborgh from Ghent University – iMinds for publishing the dataset as Triple Pattern Fragments, and iMinds for sponsoring DBpedia’s Triple Pattern Fragments server.
  • Magnus Knuth (HPI) for providing a pagerank dataset for English and German
  • Ali Ismayilov (University of Bonn) for implementing the DBpedia Wikidata dataset.
  • Vladimir Alexiev (Ontotext) for leading a successful mapping and ontology clean up effort.
  • Nono314, as well as other community members, for contributing a lot of improvements and bug fixes to the extraction framework.
  • All the GSoC students and mentors working directly or indirectly on the DBpedia release

The work on the DBpedia 2015-04 release was financially supported by the European Commission through the project ALIGNED – quality-centric, software and data engineering (http://aligned-project.eu/).

More information about DBpedia is found at http://dbpedia.org as well as in the new overview article about the project available at http://wiki.dbpedia.org/Publications.

Have fun with the new DBpedia 2015-04 release!

Cheers,

Markus Freudenberg, Dimitris Kontokostas, Sebastian Hellmann

Call for Ideas and Mentors for GSoC 2014 DBpedia + Spotlight joint proposal (please contribute within the next days)

We started to draft a document for submission at Google Summer of Code 2014:
http://dbpedia.org/gsoc2014

We are still in need of ideas and mentors. If you have any improvements to DBpedia or DBpedia Spotlight that you would like to have done, please submit them in the ideas section now. Note that accepted GSoC students will receive about 5000 USD for three months of work, which can help you to estimate the effort and size of proposed ideas. It is also OK to extend or amend existing ideas (as long as you don’t hijack them). Please edit here:
https://docs.google.com/document/d/13YcM-LCs_W3-0u-s24atrbbkCHZbnlLIK3eyFLd7DsI/edit?pli=1

Becoming a mentor is also a very good way to get involved with DBpedia. As a mentor you will also be able to vote on proposals after Google accepts our project. Note that it is also OK, if you are a researcher and have a suitable student, to submit an idea and become a mentor. After acceptance by Google, the student then has to apply for the idea and get accepted.

Please take some time this week to add your ideas and apply as a mentor, if applicable. Feel free to improve the introduction as well and comment on the rest of the document.

Information on GSoC in general can be found here:
http://www.google-melange.com/gsoc/homepage/google/gsoc2014

Thank you for your help,
Sebastian and Dimitris

Making sense out of the Wikipedia categories (GSoC2013)

(Part of our DBpedia+spotlight @ GSoC mini blog series)

Mentor: Marco Fossati @hjfocs <fossati[at]spaziodati.eu>
Student: Kasun Perera <kkasunperera[at]gmail.com>

The latest version of the DBpedia ontology has 529 classes. It is not well balanced and shows a lack of coverage in terms of encyclopedic knowledge representation.

Furthermore, the current typing approach involves a costly manual mapping effort and heavily depends on the presence of infoboxes in Wikipedia articles.

Hence, a large number of DBpedia instances is either un-typed, due to a missing mapping or a missing infobox, or has a too generic or too specialized type, due to the nature of the ontology.

The goal of this project is to identify a set of meaningful Wikipedia categories that can be used to extend the type coverage of DBpedia instances.

How we used the Wikipedia category system

Wikipedia categories are organized in some kind of really messy hierarchy, which is of little use from an ontological point of view.

We investigated how to process this chaotic world.

Here’s what we have done

We have identified a set of meaningful categories by combining the following approaches:

  1. Algorithmic, programmatically traversing the whole Wikipedia category system.

Wow! This was really the hardest part. Kasun did a great job! Special thanks to the category guru Christian Consonni for shedding light in the darkness of such a weird world.

  2. Linguistic, identifying conceptual categories with NLP techniques.

We got inspired by the YAGO guys.

  3. Multilingual, leveraging interlanguage links.

Kudos to Aleksander Pohl for the idea.

  4. Post-mortem, cleaning out stuff that was still not relevant.

No resurrection without Freebase!
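To give a flavour of the algorithmic step, here is a minimal sketch of a breadth-first walk over a category graph that visits every category once, even if the graph contains cycles. This is not Kasun's actual code: the toy graph, the category names and the function `traverse` are hypothetical and only illustrate the general technique.

```python
from collections import deque

# Toy category graph: parent category -> subcategories (hypothetical data).
subcategories = {
    "People": ["Musicians", "Scientists"],
    "Musicians": ["Icelandic musicians"],
    "Scientists": ["Physicists"],
    "Physicists": ["Scientists"],  # a cycle, which a messy hierarchy may contain
    "Icelandic musicians": [],
}


def traverse(root):
    """Breadth-first walk that visits each reachable category exactly once."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        category = queue.popleft()
        order.append(category)
        for child in subcategories.get(category, []):
            if child not in seen:  # guards against cycles and shared children
                seen.add(child)
                queue.append(child)
    return order


print(traverse("People"))
```

The `seen` set is what keeps the walk finite on a graph that is not a clean tree.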

Outcomes

We found a total of 3751 candidate categories that can be used to type the instances.

We produced a dataset in the following format:

<Wikipedia_article_page> rdf:type <article_category>
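Statements in this shape are easy to group by category. The sketch below assumes that each line literally follows the pattern shown above (angle-bracketed names separated by `rdf:type`, with an optional trailing dot); the exact syntax of the dump may differ, and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed line pattern: <article> rdf:type <category>, optional trailing dot.
LINE = re.compile(r"^<([^>]+)>\s+rdf:type\s+<([^>]+)>\s*\.?\s*$")

# Hypothetical sample lines, not taken from the real dump.
sample = [
    "<Albert_Einstein> rdf:type <Physicists>",
    "<Marie_Curie> rdf:type <Physicists>",
    "<Björk> rdf:type <Singers>",
]


def group_by_category(lines):
    """Map each category to the list of articles typed with it."""
    by_category = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # silently skip lines that do not fit the assumed pattern
            article, category = match.groups()
            by_category[category].append(article)
    return dict(by_category)


print(group_by_category(sample))
```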

You can access the full dump here. This has not been validated by humans yet.

If you feel like having a look at it, please tell us what you think about it.

Take a look at Kasun’s progress page for more details.

DBpedia+Spotlight accepted @ Google Summer of Code 2013

Google Summer of Code (GSoC) is a global program that offers post-secondary student developers (ages 18 and older, BSc, MSc, PhD) stipends to write code for various open source software projects. Since its inception in 2005, the program has brought together over 6,000 successful student participants and over 3,000 mentors from over 100 countries worldwide, all for the love of code.

DBpedia participated successfully in last year’s GSoC as DBpedia Spotlight. We were allotted 4 students (out of a total of 37 applications) and managed to improve DBpedia Spotlight’s runtime performance and accuracy and to add extra functionality. We are thrilled to announce that we were accepted again for GSoC 2013. We are participating with all DBpedia-family products this time (that is, DBpedia, DBpedia Spotlight and DBpedia Wiktionary) and we hope to have the same luck again.

This year we have brand new and exciting ideas, so if you know energetic students (BSc, MSc, PhD) interested in working with DBpedia, text processing, and semantics, please encourage them to apply!

If you are a student, the application period starts in 2 weeks (deadline May 3rd). Judging from last year’s competition, writing a good application can be a really hard task, so you should start preparing now. We have already created a dedicated mailing list and a few warm-up tasks (to get you familiar with our technologies), and we will of course always be available for any questions.

So go ahead, choose your idea, write your application and impress us;)

http://www.google-melange.com/gsoc/org/google/gsoc2013/dbpediaspotlight

On behalf of the DBpedia GSoC team,

Dimitris Kontokostas