
DBpedia @ Google Summer of Code 2016

DBpedia participated for the fourth time in the Google Summer of Code program. This was a quite competitive year (like every year), with more than forty students applying for a DBpedia project. In the end, 8 great students from all around the world were selected and will work on their projects during the summer. Here’s a detailed list of the projects:

A Hybrid Classifier/Rule-based Event Extractor for DBpedia by Vincent Bohlen

In modern times the amount of information published on the internet is growing to an immeasurable extent. Humans are no longer able to gather all the available information by hand but depend more and more on machines collecting relevant information automatically. This is why automatic information extraction, and especially automatic event extraction, is important. In this project I will implement a system for event extraction using classification and rule-based event extraction. The underlying data for both approaches will be identical. I will gather Wikipedia articles and perform a variety of NLP tasks on the extracted texts. First I will annotate the named entities in the text using named entity recognition performed by DBpedia Spotlight. Additionally I will annotate the text with frame semantics using FrameNet frames. I will then use the collected information, i.e. frames, entities and entity types, with the two aforementioned methods to decide whether the collection constitutes an event or not. Mentor: Marco Fossati (SpazioDati)
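As an illustration of the entity-annotation step, the minimal sketch below sends a text snippet to a public DBpedia Spotlight endpoint and collects the linked DBpedia resources. The endpoint URL, parameters and response fields follow Spotlight’s public REST API; the deployment and the confidence threshold are assumptions and may differ from the setup the project will eventually use.

```python
import requests

# Public DBpedia Spotlight annotation endpoint (illustrative; the exact URL
# depends on the Spotlight deployment used in the project).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    """Return (surface form, DBpedia URI, ontology types) for entities found in `text`."""
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    # Each annotated resource carries its DBpedia URI, surface form and ontology types.
    return [
        (r["@surfaceForm"], r["@URI"], r.get("@types", ""))
        for r in response.json().get("Resources", [])
    ]

if __name__ == "__main__":
    for surface, uri, types in annotate("Berlin hosted the 1936 Summer Olympics."):
        print(surface, uri, types)
```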

Automatic mappings extraction by Aditya Nambiar

DBpedia currently maintains a mapping between Wikipedia infobox properties and the DBpedia ontology, since several similar templates exist to describe the same type of infobox. The aim of the project is to enrich the existing mappings and possibly correct incorrect mappings using Wikidata.

Several Wikipedia pages use Wikidata values directly in their infoboxes. Hence, by combining the mapping between Wikidata properties and DBpedia ontology classes with the infobox data found across many such pages, we can collect a large number of new mappings. The first phase of the project revolves around taking various such Wikipedia templates, finding their usages across Wikipedia pages and extracting as many mappings as possible.

In the second half of the project we use machine learning techniques to take care of any accidental or outlier usage of Wikidata mappings in Wikipedia. At the end of the project we will be able to obtain a correct set of mappings which we can use to enrich the existing ones. Mentor: Markus Freudenberg (AKSW/KILT)

Combining DBpedia and Topic Modelling by wojtuch

DBpedia, a crowd- and open-sourced community project extracting content from Wikipedia, stores this information in a huge RDF graph. DBpedia Spotlight is a tool which delivers the DBpedia resources that are mentioned in a document.

Using DBpedia Spotlight to extract named entities from Wikipedia articles and then applying a topic modelling algorithm (e.g. LDA) with the URIs of DBpedia resources as features would result in a model capable of describing documents by the proportions of the topics covering them. And because the topics themselves are represented by DBpedia URIs, this approach could result in a novel RDF hierarchy and ontology, with insights for further analysis of the emerging subgraphs.
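To make the modelling step concrete, here is a minimal sketch of the idea, assuming the DBpedia URIs per article have already been extracted with Spotlight. gensim’s LdaModel merely stands in for whatever topic-modelling implementation the project ends up using, and the tiny input below is purely illustrative.

```python
from gensim import corpora, models

# Illustrative input: each "document" is the bag of DBpedia resource URIs
# that Spotlight annotated in one Wikipedia article (toy data here).
documents = [
    ["dbr:Berlin", "dbr:Germany", "dbr:1936_Summer_Olympics"],
    ["dbr:Python_(programming_language)", "dbr:Machine_learning"],
    ["dbr:Germany", "dbr:Machine_learning", "dbr:Berlin"],
]

# Map each URI to an integer id and build bag-of-URIs vectors per article.
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train LDA; topics are distributions over DBpedia URIs instead of plain words.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

for bow in corpus:
    print(lda.get_document_topics(bow))  # topic proportions describing each article
print(lda.show_topic(0, topn=3))          # the URIs that characterise topic 0
```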

The direct implication and first application scenario for this project would be utilizing the inference engine in DBpedia Spotlight as an additional step after a document has been annotated, predicting its topic coverage. Mentor: Alexandru Todor (FU Berlin)

DBpedia Lookup Improvements by Kunal.Jha

DBpedia is one of the most extensive and most widely used knowledge bases, covering over 125 languages. DBpedia Lookup is a web service that allows users to obtain DBpedia URIs for a given label (keywords/anchor texts). The service provides two different types of search APIs, namely Keyword Search and Prefix Search. The lookup service currently returns the query results in XML (default) and JSON formats and works on the English language. It is based on a Lucene index providing a weighted label lookup, which combines string similarity with a relevance ranking in order to find the most relevant matches for a given label. As part of GSoC 2016, I propose to implement improvements with the intention of making the system more efficient and versatile. Mentor: Axel Ngonga (AKSW)
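For reference, the sketch below queries the existing Lookup KeywordSearch API from Python; the endpoint, parameter names and JSON fields are taken from the service’s current documentation and are only assumptions as far as the improved service is concerned.

```python
import requests

# DBpedia Lookup KeywordSearch endpoint as currently documented (illustrative;
# URL, parameters and response fields may change as the service is improved).
LOOKUP_URL = "http://lookup.dbpedia.org/api/search/KeywordSearch"

def keyword_search(label, max_hits=5):
    """Return (label, URI) pairs of DBpedia resources matching the given label."""
    response = requests.get(
        LOOKUP_URL,
        params={"QueryString": label, "MaxHits": max_hits},
        headers={"Accept": "application/json"},  # ask for JSON instead of the XML default
        timeout=30,
    )
    response.raise_for_status()
    return [(result["label"], result["uri"]) for result in response.json()["results"]]

if __name__ == "__main__":
    for label, uri in keyword_search("Berlin"):
        print(label, uri)
```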

Inferring infobox template class mappings from Wikipedia + Wikidata by Peng_Xu

This project aims at finding mappings between the classes (e.g. dbo:Person, dbo:City) in the DBpedia ontology and infobox templates on pages of Wikipedia resources using machine learning. Mentor: Nilesh Chakraborty (University of Bonn)

Integrating RML in the DBpedia extraction framework by wmaroy

This project is about integrating RML in the DBpedia extraction framework. DBpedia is derived from Wikipedia infoboxes using the extraction framework and mappings defined in wikitext syntax. A next step is to replace the wikitext-defined mappings with RML. To accomplish this, adjustments will have to be made to the extraction framework. Mentor: Dimitris Kontokostas (AKSW/KILT)

The List Extractor by FedBai

The project focuses on the extraction of relevant but hidden data which lies inside lists in Wikipedia pages. This information is unstructured and thus cannot easily be used to form semantic statements and be integrated into the DBpedia ontology. Hence, the main task consists in creating a tool which can take one or more Wikipedia pages containing lists as input and then construct appropriate mappings to be inserted into a DBpedia dataset. The extractor must prove to work well on a given domain and show that it can be extended towards generalization. Mentor: Marco Fossati (SpazioDati)

The Table Extractor by s.papalini

Wikipedia is full of data hidden in tables. The aim of this project is to explore the possibilities of taking advantage of all the data represented as tables in Wiki pages, in order to populate the different versions of DBpedia with new data of interest. The Table Extractor is to be the engine of this data “revolution”: it will extract the semi-structured data from all those tables now scattered across most of the Wiki pages. Mentor: Marco Fossati (SpazioDati)

At the beginning of September 2016 you will receive news about the successful Google Summer of Code 2016 student projects. Stay tuned and follow us on Facebook and Twitter, or visit our website for the latest news.

 
Your DBpedia Association

New DBpedia Release 2015-10

We proudly present our new 2015-10 DBpedia release, which is available now via http://dbpedia.org/sparql. Go and check it out!
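One quick way to check it out programmatically: the short Python sketch below sends an illustrative SPARQL query to the public endpoint and prints the results. The query itself is just an example, not part of the release.

```python
import requests

# The public DBpedia SPARQL endpoint; the query below is only an example.
ENDPOINT = "http://dbpedia.org/sparql"
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
}
ORDER BY DESC(?population)
LIMIT 5
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```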

This DBpedia release is based on updated Wikipedia dumps dating from October 2015, featuring a significantly expanded base of information as well as richer and cleaner data based on the DBpedia ontology.

So, what did we do?

The DBpedia community added new classes and properties to the DBpedia ontology via the mappings wiki. The DBpedia 2015-10 ontology encompasses

  • 739 classes (DBpedia 2015-04: 735)
  • 1,099 properties with reference values (a/k/a object properties) (DBpedia 2015-04: 1,098)
  • 1,596 properties with typed literal values (a/k/a datatype properties) (DBpedia 2015-04: 1,583)
  • 132 specialized datatype properties (DBpedia 2015-04: 132)
  • 407 owl:equivalentClass and 222 owl:equivalentProperty mappings to external vocabularies (DBpedia 2015-04: 408 and 200, respectively)

The editor community of the mappings wiki also defined many new mappings from Wikipedia templates to DBpedia classes. For the DBpedia 2015-10 extraction we used a total of 5,553 template mappings (DBpedia 2015-04: 4,317 mappings). For the first time, the top language, gauged by number of mappings, is Dutch (606 mappings), surpassing the English community (600 mappings).

And what are the (breaking) changes?

  • English DBpedia switched from URIs to IRIs.
  • The instance-types dataset is now split into two files:
    • “instance-types” contains only direct types.
    • “instance-types-transitive” contains transitive types.
  • The “mappingbased-properties” file is now split into three (3) files:
    • “geo-coordinates-mappingbased”
    • “mappingbased-literals” contains mapping-based statements with literal values.
    • “mappingbased-objects”
  • We added a new extractor for citation data.
  • All datasets are available in .ttl and .tql serializations.
  • We are providing DBpedia as a Docker image.
  • From now on, we provide extensive dataset metadata by adding DataIDs for all extracted languages to the respective language directories.
  • In addition, we revamped the dataset table on the download page. It is created dynamically based on the DataIDs of all languages. Likewise, the tables on the statistics page are now based on files providing information about all mapping languages.
  • From now on, we also include the original Wikipedia dump files (‘pages_articles.xml.bz2’) alongside the extracted datasets.
  • A complete changelog can always be found in the git log.

And what about the numbers?

Altogether the new DBpedia 2015-10 release consists of 8.8 billion (2015-04: 6.9 billion) pieces of information (RDF triples) out of which 1.1 billion (2015-04: 737 million) were extracted from the English edition of Wikipedia, 4.4 billion (2015-04: 3.8 billion) were extracted from other language editions, and 3.2 billion (2015-04: 2.4 billion) came from DBpedia Commons and Wikidata. In general we observed a significant growth in raw infobox and mapping-based statements of close to 10%. Thorough statistics are available via the Statistics page.

And what’s up next?

We will be working on moving away from the mappings wiki, but we will have at least one more mapping sprint. Moreover, we have some cool ideas for GSoC this year. Additional mentors are more than welcome. :-)

And who is to blame for the new release?

We want to thank all editors that contributed to the DBpedia ontology mappings via the Mappings Wiki, all the GSoC students and mentors working directly or indirectly on the DBpedia release and the whole DBpedia Internationalization Committee for pushing the DBpedia internationalization forward.

Special thanks go to Markus Freudenberg and Dimitris Kontokostas (University of Leipzig), Volha Bryl (University of Mannheim / Springer), Heiko Paulheim (University of Mannheim), Václav Zeman and the whole LHD team (University of Prague), Marco Fossati (FBK), Alan Meehan (TCD), Aldo Gangemi (LIPN University, France & ISTC-CNR, Italy), Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software), OpenLink Software (http://www.openlinksw.com/), Ruben Verborgh from Ghent University – iMinds, Ali Ismayilov (University of Bonn), Vladimir Alexiev (Ontotext) and members of the DBpedia Association, the AKSW and the Department for Business Information Systems of the University of Leipzig for their commitment in putting tremendous time and effort into getting this done.

The work on the DBpedia 2015-10 release was financially supported by the European Commission through the project ALIGNED – quality-centric software and data engineering (http://aligned-project.eu/).

 

Detailed information about the new release is available here. For more information about DBpedia, please visit our website or follow us on Facebook!

Have fun and all the best!

Yours

DBpedia Association