Category Archives: Google Summer of Code

Call for Ideas and Mentors for GSoC 2014 DBpedia + Spotlight joint proposal (please contribute within the next days)

We started to draft a document for submission at Google Summer of Code 2014:

We are still in need of ideas and mentors.  If you have any improvements on DBpedia or DBpedia Spotlight that you would like to have done, please submit it in the ideas section now. Note that accepted GSoC students will receive about 5000 USD for a three months, which can help you to estimate the effort and size of proposed ideas. It is also ok to extend/amend existing ideas (as long as you don’t hi-jack them). Please edit here:

Becoming a mentor is also a very good way to get involved with DBpedia. As a mentor you will also be able to vote on proposals, after Google accepts our project. Note that it is also ok, if you are a researcher and have a suitable student to submit an idea and become mentor. After acceptance by Google the student then has to apply for the idea and get accepted.

Please take some time this week to add your ideas and apply as a mentor, if applicable. Feel free to improve the introduction as well and comment on the rest of the document.

Information on GSoC in general can be found here:

Thank you for your help,
Sebastian and Dimitris

Making sense out of the Wikipedia categories (GSoC2013)

(Part of our DBpedia+spotlight @ GSoC mini blog series)

Mentor: Marco Fossati @hjfocs <fossati[at]>
Student: Kasun Perera <kkasunperera[at]>

The latest version of the DBpedia ontology has 529 classes. It is not well balanced and shows a lack of coverage in terms of encyclopedic knowledge representation.

Furthermore, the current typing approach involves a costly manual mapping effort and heavily depends on the presence of infoboxes in Wikipedia articles.

Hence, a large number of DBpedia instances is either un-typed, due to a missing mapping or a missing infobox, or has a too generic or too specialized type, due to the nature of the ontology.

The goal of this project is to identify a set of senseful Wikipedia categories that can be used to extend the coverage of DBpedia instances.

How we used the Wikipedia category system

Wikipedia categories are organized in some kind of really messy hierarchy, which is of little use from an ontological point of view.

We investigated how to process this chaotic world.

Here’s what we have done

We have identified a set of meaningful categories by combining the following approaches:

  1. Algorithmic, programmatically traversing the whole Wikipedia category system.

Wow! This was really the hardest part. Kasun made a great job! Special thanks to the category guru Christian Consonni for shedding light in the darkness of such a weird world.

  1. Linguistic, identifying conceptual categories with NLP techniques.

We got inspired by the YAGO guys.

  1. Multilingual, leveraging interlanguage links.

Kudos to Aleksander Pohl for the idea.

  1. Post-mortem, cleaning out stuff that was still not relevant

No resurrection without Freebase!


We found out a total amount of 3751 candidates that can be used to type the instances.

We produced a dataset in the following format:

<Wikipedia_article_page> rdf:type <article_category>

You can access the full dump here. This has not been validated by humans yet.

If you feel like having a look at it, please tell us what do you think about.

Take a look at the Kasun’s progress page for more details.