DBpedia participated for a fourth time in the Google Summer of Code program. This was quite a competitive year (like every year): more than forty students applied for a DBpedia project. In the end, eight great students from all around the world were selected and will work on their projects during the summer. Here’s a detailed list of the projects:
A Hybrid Classifier/Rule-based Event Extractor for DBpedia Proposal by Vincent Bohlen
In modern times the amount of information published on the internet is growing to an immeasurable extent. Humans are no longer able to gather all the available information by hand and depend more and more on machines to collect relevant information automatically. This is why automatic information extraction, and especially automatic event extraction, is important. In this project I will implement a system for event extraction that uses two approaches: classification and rule-based event extraction. The underlying data for both approaches will be identical. I will gather Wikipedia articles and perform a variety of NLP tasks on the extracted texts. First I will annotate the named entities in the text using named entity recognition performed by DBpedia Spotlight. Additionally I will annotate the text with frame semantics using FrameNet frames. I will then feed the collected information (frames, entities and entity types) to both methods to decide whether the collection is an event or not. Mentor: Marco Fossati (SpazioDati)
Automatic mappings extraction by Aditya Nambiar
DBpedia currently maintains a mapping from Wikipedia infobox properties to the DBpedia ontology, since several similar templates exist to describe the same type of infobox. The aim of the project is to enrich the existing mapping and, where possible, correct wrong mappings using Wikidata.
Several Wikipedia pages use Wikidata values directly in their infoboxes. Hence, by using the mapping between Wikidata properties and DBpedia ontology classes along with the infobox data across several such pages, we can collect several such mappings. The first phase of the project revolves around taking various such Wikipedia templates, finding their usages across Wikipedia pages and extracting as many mappings as possible.
In the second half of the project we use machine learning techniques to take care of any accidental or outlier usage of Wikidata mappings in Wikipedia. At the end of the project we will have a correct set of mappings which we can use to enrich the existing mapping. Mentor: Markus Freudenberg (AKSW/KILT)
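The two phases can be illustrated with a minimal sketch. It assumes hypothetical observations of the form (template, infobox property, Wikidata property), a small illustrative Wikidata-to-DBpedia table, and a simple majority-share threshold as a stand-in for the machine learning step mentioned above. The function name and data layout are assumptions for illustration only, not part of the DBpedia extraction framework.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (template, infobox property, Wikidata property
# whose value the infobox pulls in). Real data would come from template usages.
OBSERVATIONS = [
    ("Infobox person", "birth_date", "P569"),
    ("Infobox person", "birth_date", "P569"),
    ("Infobox person", "birth_date", "P569"),
    ("Infobox person", "birth_date", "P570"),  # accidental usage (outlier)
    ("Infobox person", "birth_place", "P19"),
    ("Infobox person", "birth_place", "P19"),
]

# Illustrative Wikidata property -> DBpedia ontology property table.
WIKIDATA_TO_DBO = {
    "P569": "dbo:birthDate",
    "P570": "dbo:deathDate",
    "P19": "dbo:birthPlace",
}


def extract_mappings(observations, wikidata_to_dbo, min_share=0.6):
    """Phase 1: count candidate mappings. Phase 2 (simplified): keep a
    candidate only if it wins at least `min_share` of the observations."""
    votes = defaultdict(Counter)
    for template, prop, wd_prop in observations:
        if wd_prop in wikidata_to_dbo:
            votes[(template, prop)][wikidata_to_dbo[wd_prop]] += 1

    mappings = {}
    for key, counter in votes.items():
        target, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_share:
            mappings[key] = target
    return mappings


if __name__ == "__main__":
    for key, target in extract_mappings(OBSERVATIONS, WIKIDATA_TO_DBO).items():
        print(key, "->", target)
```

A fixed threshold is the simplest possible outlier filter; a learned model, as the project proposes, could also take into account features such as template popularity or value types.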
Combining DBpedia and Topic Modelling by wojtuch
DBpedia, a crowd- and open-sourced community project extracting the content from Wikipedia, stores this information in a huge RDF graph. DBpedia Spotlight is a tool which returns the DBpedia resources that are mentioned in a document.
Using DBpedia Spotlight to extract named entities from Wikipedia articles and then applying a topic modelling algorithm (e.g. LDA) with URIs of DBpedia resources as features would result in a model capable of describing documents by the proportions of the topics covering them. Because the topics are themselves represented by DBpedia URIs, this approach could result in a novel RDF hierarchy and ontology, with insights for further analysis of the emerging subgraphs.
The first application scenario for this project would be to use the inference engine in DBpedia Spotlight as an additional step after a document has been annotated, in order to predict its topic coverage. Mentor: Alexandru Todor (FU Berlin)
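To make the idea of "URIs as features" concrete, here is a small sketch of the input a topic model such as LDA would consume: a document-by-URI count matrix. The annotated documents are made-up examples standing in for DBpedia Spotlight output, and the function name is hypothetical. The topic modelling step itself is deliberately left out.

```python
from collections import Counter

# Made-up annotations: for each document, the DBpedia resource URIs
# (here in prefixed form) that an annotator such as Spotlight returned.
ANNOTATED_DOCS = {
    "doc1": ["dbr:Berlin", "dbr:Germany", "dbr:Berlin"],
    "doc2": ["dbr:Latent_Dirichlet_allocation", "dbr:Germany"],
}


def uri_count_matrix(annotated_docs):
    """Build a bag-of-URIs representation: one count vector per document,
    with one column per distinct URI (columns sorted alphabetically)."""
    vocabulary = sorted({uri for uris in annotated_docs.values() for uri in uris})
    rows = {}
    for doc_id, uris in annotated_docs.items():
        counts = Counter(uris)
        rows[doc_id] = [counts[uri] for uri in vocabulary]
    return vocabulary, rows


if __name__ == "__main__":
    vocabulary, rows = uri_count_matrix(ANNOTATED_DOCS)
    print(vocabulary)
    for doc_id, row in rows.items():
        print(doc_id, row)
```

Each row can then be passed to any LDA implementation in place of the usual word counts, so that the learned topics are distributions over DBpedia resources rather than over words.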
DBpedia Lookup Improvements by Kunal.Jha
DBpedia is one of the most extensive and most widely used knowledge bases, available in over 125 languages. DBpedia Lookup is a web service that allows users to obtain DBpedia URIs for a given label (keywords/anchor texts). The service provides two different types of search APIs, namely Keyword Search and Prefix Search. The lookup service currently returns the query results in XML (default) and JSON formats and works for the English language only. It is based on a Lucene index providing a weighted label lookup, which combines string similarity with a relevance ranking in order to find the most relevant matches for a given label. As a part of GSoC 2016, I propose to implement improvements intended to make the system more efficient and versatile. Mentor: Axel Ngonga (AKSW)
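The notion of a weighted label lookup can be illustrated with a toy sketch. It is not how the Lucene-based service is implemented: the in-memory index, the reference counts used as a relevance signal, the `alpha` weight and the function name are all assumptions made for the example, and string similarity is approximated with Python's `difflib`.

```python
from difflib import SequenceMatcher

# Toy index: (URI, label, relevance signal). The counts are invented;
# a real index might use something like the number of inbound links.
INDEX = [
    ("dbr:Berlin", "Berlin", 1000),
    ("dbr:Bern", "Bern", 500),
    ("dbr:Berlin,_New_Hampshire", "Berlin, New Hampshire", 50),
]


def weighted_lookup(query, index, alpha=0.7, limit=2):
    """Rank entries by alpha * string similarity + (1 - alpha) * relevance,
    where relevance is the entry's count divided by the largest count."""
    max_count = max(count for _, _, count in index)
    scored = []
    for uri, label, count in index:
        similarity = SequenceMatcher(None, query.lower(), label.lower()).ratio()
        relevance = count / max_count
        scored.append((uri, alpha * similarity + (1 - alpha) * relevance))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


if __name__ == "__main__":
    for uri, score in weighted_lookup("berlin", INDEX):
        print(f"{uri}\t{score:.3f}")
```

Blending the two signals means an exact label match on a rarely referenced resource does not automatically outrank a slightly less similar but far more prominent one; the balance is controlled by `alpha`.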
Inferring infobox template class mappings from Wikipedia + Wikidata by Peng_Xu
This project aims to find mappings between the classes in the DBpedia ontology (e.g. dbo:Person, dbo:City) and the infobox templates used on Wikipedia pages, using machine learning. Mentor: Nilesh Chakraborty (University of Bonn)
Integrating RML in the DBpedia extraction framework by wmaroy
This project is about integrating RML in the DBpedia extraction framework. DBpedia is derived from Wikipedia infoboxes using the extraction framework and mappings defined in wikitext syntax. A next step would be to replace the wikitext-defined mappings with RML. To accomplish this, adjustments will have to be made to the extraction framework. Mentor: Dimitris Kontokostas (AKSW/KILT)
The List Extractor by FedBai
The project focuses on the extraction of relevant but hidden data which lies inside lists in Wikipedia pages. The information is unstructured and thus cannot easily be used to form semantic statements and be integrated into the DBpedia ontology. Hence, the main task consists of creating a tool which takes one or more Wikipedia pages containing lists as input and then constructs appropriate mappings to be inserted into a DBpedia dataset. The extractor must prove to work well on a given domain and be extensible towards a general solution. Mentor: Marco Fossati (SpazioDati)
The Table Extractor by s.papalini
Wikipedia is full of data hidden in tables. The aim of this project is to explore ways of taking advantage of all the data presented in tables on Wiki pages, in order to populate the different versions of DBpedia with new data of interest. The Table Extractor has to be the engine of this data “revolution”: its final purpose is to extract the semi-structured data from all those tables now scattered across most Wiki pages. Mentor: Marco Fossati (SpazioDati)
At the beginning of September 2016 you will receive news about the successful Google Summer of Code 2016 student projects. Stay tuned and follow us on Facebook, Twitter or visit our website for the latest news.
Your DBpedia Association