Category Archives: Google Summer of Code

GSoC 2015 is gone, long live GSoC 2016

The deadline for mentoring organizations to submit their applications for the 2016 Google Summer of Code is approaching quickly. As DBpedia is again planning to be a vital part of the Mentoring Summit, we would like to take this opportunity to give you a little recap of the projects mentored by DBpedia members during the past GSoC, in November 2015.

Dimitris Kontokostas, Marco Fossati, Thiago Galery, Joachim Daiber and Ruben Verborgh, members of the DBpedia community, mentored 8 great students from around the world. Following are some of the projects they completed.

Fact Extraction from Wikipedia Text by Emilio Dorigatti

DBpedia is fairly mature when dealing with Wikipedia semi-structured content like infoboxes, links and categories. However, unstructured content (typically text) plays the most crucial role, due to the amount of knowledge it can deliver, and few efforts have been carried out to extract structured data out of it. Marco and Emilio built a fact extractor, which understands the semantics of a sentence thanks to Natural Language Processing (NLP) techniques. If you feel playful, you can download the produced datasets. For more details, check out this blog post. P.S.: the project has been cited by Python Weekly and Python Trending. Mentor: Marco Fossati (SpazioDati)

Better context vectors for disambiguation by Philipp Dowling

Better Context Vectors aimed to improve the representation of context used by DBpedia Spotlight by incorporating novel methods from distributional semantics. We investigated the benefits of replacing a word-count based method with one that uses a model based on word2vec. Our student, Philipp Dowling, implemented the model reader based on a preprocessed version of Wikipedia (leading to a few commits to the awesome library gensim) and the integration with the main DBpedia Spotlight pipeline. Additionally, we integrated a method for estimating weights for the different model components that contribute to disambiguating entities. Mentors: Thiago Galery (Analytyca), Joachim Daiber (Amsterdam Univ.), David Przybilla (Idio)
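To give a feel for the idea, here is a minimal sketch (not the Spotlight implementation): it averages the word vectors of the surrounding context and picks the candidate entity whose vector is closest by cosine similarity. The toy vectors, the entity names and the function names are all made up for illustration; a real setup would load vectors trained with word2vec.

```python
import math

# Toy 3-dimensional embeddings (assumption); real vectors would come from a word2vec model.
WORD_VECTORS = {
    "river": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.1],
    "money": [0.0, 0.9, 0.2],
    "loan": [0.1, 0.8, 0.3],
}

# Hypothetical entity vectors living in the same space as the words.
ENTITY_VECTORS = {
    "Bank_(geography)": [0.85, 0.15, 0.05],
    "Bank_(finance)": [0.05, 0.85, 0.25],
}


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def disambiguate(context_words, candidates):
    """Return the candidate entity closest to the averaged context vector."""
    known = [WORD_VECTORS[w] for w in context_words if w in WORD_VECTORS]
    if not known:
        return None  # no usable context, so no decision
    context_vector = average(known)
    return max(candidates, key=lambda e: cosine(context_vector, ENTITY_VECTORS[e]))
```

Calling `disambiguate(["river", "water"], list(ENTITY_VECTORS))` selects the geographic sense, because the averaged context vector points in the same direction as that entity's vector.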

 

Wikipedia Stats Extractor by Naveen Madhire

Wikipedia Stats Extractor aimed to create a reusable tool to extract raw statistics for Named Entity Linking out of a Wikipedia dump. Naveen built the project on top of Apache Spark and Json-wikipedia, which makes the code more maintainable and faster than its previous alternative (pignlproc). Furthermore, Wikipedia Stats Extractor provides an interface that makes it easier to process Wikipedia dumps for purposes other than Entity Linking. Extra changes were made to the way surface form stats are extracted, and lots of noise was removed, both of which should in principle help Entity Linking.
Special regards to Diego Ceccarelli, who gave us great insight into how Json-wikipedia works. Mentors: Thiago Galery (Analytyca), Joachim Daiber (Amsterdam Univ.), David Przybilla (Idio)
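As a rough illustration of what "surface form stats" means, the sketch below counts how often each anchor text links to each article in wiki markup and turns the counts into P(entity | surface form). It is plain Python rather than Spark, and the regular expression and function names are our own assumptions, not the tool's code.

```python
import re
from collections import Counter, defaultdict

# Matches [[Target]] and [[Target|anchor text]] wiki links (simplified assumption).
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def surface_form_counts(texts):
    """Count, per surface form, how often it links to each target article."""
    counts = defaultdict(Counter)
    for text in texts:
        for target, anchor in WIKILINK.findall(text):
            surface = (anchor or target).strip()  # no anchor: the title is the surface form
            entity = target.strip().replace(" ", "_")
            counts[surface][entity] += 1
    return counts


def entity_given_surface(counts, surface):
    """Relative frequency of each entity for one surface form."""
    per_entity = counts.get(surface, Counter())
    total = sum(per_entity.values())
    return {entity: n / total for entity, n in per_entity.items()} if total else {}
```

With such counts, an ambiguous surface form like "Berlin" ends up with a distribution over the articles it was used to link to, which is the kind of raw statistic an entity linker can build on.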

 

DBpedia Live extensions by Andre Pereira

DBpedia Live provides near real-time knowledge extraction from Wikipedia. As Wikipedia scales, we needed to move our caching infrastructure from MySQL to MongoDB. This was the first task of Andre’s project. The second task was the implementation of a UI displaying the current status of DBpedia Live along with some admin utils. Mentors: Dimitris Kontokostas (AKSW/KILT), Magnus Knuth (HPI)

 

Adding live-ness to the Triple Pattern Fragments server by Pablo Estrada

DBpedia currently has a highly available Triple Pattern Fragments interface that offloads part of the query processing from the server to the clients. For this GSoC, Pablo developed a new feature for this server so it automatically keeps itself up to date with new data coming from DBpedia Live. We do this by periodically checking for updates and adding them to an auxiliary database. Pablo developed smart update and smart querying algorithms to manage and serve the live data efficiently. We are excited to let the project out in the wild, and see how it performs in real-life use cases. Mentors: Ruben Verborgh (Ghent Univ. – iMinds) and Dimitris Kontokostas (AKSW/KILT)
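The general shape of such a setup can be sketched as follows. This is only an illustration under our own assumptions, not Pablo's algorithms: a static snapshot is kept untouched, additions and deletions go to small auxiliary sets, and a triple pattern query is answered over the combination. The class and method names are hypothetical, and the periodic polling of DBpedia Live is left out.

```python
class LiveTripleStore:
    """Static base data plus auxiliary 'added' and 'removed' sets (illustrative only)."""

    def __init__(self, base_triples):
        self.base = set(base_triples)  # static snapshot, never modified
        self.added = set()             # triples added since the snapshot
        self.removed = set()           # base triples deleted since the snapshot

    def apply_update(self, added=(), removed=()):
        """Record one batch of live changes in the auxiliary sets."""
        for triple in removed:
            self.added.discard(triple)
            if triple in self.base:
                self.removed.add(triple)
        for triple in added:
            self.removed.discard(triple)
            if triple not in self.base:
                self.added.add(triple)

    def match(self, s=None, p=None, o=None):
        """Answer a triple pattern; None acts as a wildcard."""
        pattern = (s, p, o)

        def fits(triple):
            return all(want is None or want == got for want, got in zip(pattern, triple))

        current = (self.base - self.removed) | self.added
        return sorted(t for t in current if fits(t))
```

Keeping the changes separate means the large snapshot never has to be rebuilt; the price is that every query has to take the auxiliary sets into account.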

Registration for mentors @ GSoC 2016 starts next month, and DBpedia will of course try to participate again. If you want to become a mentor or just have a cool idea that seems suitable, don’t hesitate to ping us via the DBpedia discussion or developer mailing lists.

Stay tuned!

Your DBpedia Association

DBpedia Spotlight V0.7 released

DBpedia Spotlight is an entity linking tool for connecting free text to DBpedia through the recognition and disambiguation of entities and concepts from the DBpedia KB.

We are happy to announce Version 0.7 of DBpedia Spotlight, which is also the first official release of the probabilistic/statistical implementation.

More information about DBpedia Spotlight V0.7, as well as updated evaluation results, can be found in this paper:

Joachim Daiber, Max Jakob, Chris Hokamp, Pablo N. Mendes: Improving Efficiency and Accuracy in Multilingual Entity Extraction. ISEM2013.

The changes to the statistical implementation include:

  • smaller and faster models through quantization of counts, optimization of search and some pruning
  • better handling of case
  • various fixes in Spotlight and PigNLProc
  • models can now be created without requiring a Hadoop and Pig installation
  • UIMA support by @mvnural
  • support for confidence value
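As an aside on the first point: the list above does not spell out the quantization scheme, so the following is only a sketch of one common approach, assumed here for illustration. Counts are mapped to logarithmically spaced buckets, so a small integer can stand in for a large count at the cost of a few percent of precision. The growth factor and the function names are our own choices.

```python
import math

BASE = 1.1  # assumed growth factor between neighbouring buckets


def quantize(count):
    """Map a non-negative count to a small integer bucket id."""
    if count <= 0:
        return 0  # bucket 0 is reserved for "never seen"
    return 1 + round(math.log(count, BASE))


def dequantize(bucket):
    """Recover an approximate count from a bucket id."""
    if bucket == 0:
        return 0
    return round(BASE ** (bucket - 1))
```

With this factor, a count of one million still fits in a single byte, and the value recovered by `dequantize` stays within a few percent of the original.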

See the release notes at [1] and the updated demos at [4].

Models for Spotlight 0.7 can be found here [2].

Additionally, we now provide the raw Wikipedia counts, which we hope will prove useful for research and development of new models [3].

A big thank you to all developers who made contributions to this version (with special thanks to Faveeo and Idio). Huge thanks to Jo for his leadership and continued support to the community.

Cheers,
Pablo Mendes,

on behalf of Joachim Daiber and the DBpedia Spotlight developer community.

[1] – https://github.com/dbpedia-spotlight/dbpedia-spotlight/releases/tag/release-0.7

[2] – http://spotlight.sztaki.hu/downloads/

[3] – http://spotlight.sztaki.hu/downloads/raw

[4] – http://dbpedia-spotlight.github.io/demo/

(This message is an adaptation of Joachim Daiber’s message to the DBpedia Spotlight list. Edited to suit this broader community and give credit to him.)

Call for Ideas and Mentors for GSoC 2014 DBpedia + Spotlight joint proposal (please contribute within the next days)

We started to draft a document for submission at Google Summer of Code 2014:
http://dbpedia.org/gsoc2014

We are still in need of ideas and mentors. If you have any improvements on DBpedia or DBpedia Spotlight that you would like to have done, please submit them in the ideas section now. Note that accepted GSoC students will receive about 5000 USD for three months, which can help you to estimate the effort and size of proposed ideas. It is also ok to extend/amend existing ideas (as long as you don’t hi-jack them). Please edit here:
https://docs.google.com/document/d/13YcM-LCs_W3-0u-s24atrbbkCHZbnlLIK3eyFLd7DsI/edit?pli=1

Becoming a mentor is also a very good way to get involved with DBpedia. As a mentor you will also be able to vote on proposals after Google accepts our project. Note that it is also fine, if you are a researcher and have a suitable student, to submit an idea and become a mentor. After acceptance by Google, the student then has to apply for the idea and get accepted.

Please take some time this week to add your ideas and apply as a mentor, if applicable. Feel free to improve the introduction as well and comment on the rest of the document.

Information on GSoC in general can be found here:
http://www.google-melange.com/gsoc/homepage/google/gsoc2014

Thank you for your help,
Sebastian and Dimitris

Making sense out of the Wikipedia categories (GSoC2013)

(Part of our DBpedia+spotlight @ GSoC mini blog series)

Mentor: Marco Fossati @hjfocs <fossati[at]spaziodati.eu>
Student: Kasun Perera <kkasunperera[at]gmail.com>

The latest version of the DBpedia ontology has 529 classes. It is not well balanced and shows a lack of coverage in terms of encyclopedic knowledge representation.

Furthermore, the current typing approach involves a costly manual mapping effort and heavily depends on the presence of infoboxes in Wikipedia articles.

Hence, a large number of DBpedia instances is either un-typed, due to a missing mapping or a missing infobox, or has a too generic or too specialized type, due to the nature of the ontology.

The goal of this project is to identify a set of meaningful Wikipedia categories that can be used to extend the coverage of DBpedia instances.

How we used the Wikipedia category system

Wikipedia categories are organized in some kind of really messy hierarchy, which is of little use from an ontological point of view.

We investigated how to process this chaotic world.

Here’s what we have done

We have identified a set of meaningful categories by combining the following approaches:

  1. Algorithmic, programmatically traversing the whole Wikipedia category system.

Wow! This was really the hardest part. Kasun did a great job! Special thanks to the category guru Christian Consonni for shedding light in the darkness of such a weird world.

  2. Linguistic, identifying conceptual categories with NLP techniques.

We got inspired by the YAGO guys.

  3. Multilingual, leveraging interlanguage links.

Kudos to Aleksander Pohl for the idea.

  4. Post-mortem, cleaning out stuff that was still not relevant.

No resurrection without Freebase!
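To give an idea of what the algorithmic step involves, here is a minimal sketch of a breadth-first walk over a category graph. It is not Kasun's code: the input format (a dict from category to its subcategories), the function name and the optional depth limit are assumptions made for the example. The important detail is the visited check, since the category system contains cycles.

```python
from collections import deque


def traverse_categories(subcategories, root, max_depth=None):
    """Breadth-first walk returning {category: depth}, visiting each category once."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if max_depth is not None and depth[category] >= max_depth:
            continue  # do not expand below the depth limit
        for child in subcategories.get(category, ()):
            if child not in depth:  # guards against cycles and repeated visits
                depth[child] = depth[category] + 1
                queue.append(child)
    return depth
```

The returned depths can then feed the later filtering steps, for example to ignore categories that sit too close to the root to be useful as types.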

Outcomes

We found a total of 3751 candidate categories that can be used to type the instances.

We produced a dataset in the following format:

<Wikipedia_article_page> rdf:type <article_category>

You can access the full dump here. This has not been validated by humans yet.

If you feel like having a look at it, please tell us what you think.

Take a look at Kasun’s progress page for more details.
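If you want to play with the data, a sketch like the following groups the categories by article. It assumes the dump is serialized as N-Triples with full IRIs, which the format shown above suggests but does not guarantee; the function name is our own, and lines that do not look like an rdf:type statement are simply skipped.

```python
import re
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# One N-Triples statement with three IRIs (assumed serialization of the dump).
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def types_by_article(lines):
    """Collect, per article IRI, the set of category IRIs it is typed with."""
    types = defaultdict(set)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == RDF_TYPE:
            types[match.group(1)].add(match.group(3))
    return types
```

Passing an open file object works as well, since the function only iterates over lines.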