DBpedia 3.6 released

Hi all, 

we are happy to announce the release of DBpedia 3.6. The new release is based on Wikipedia dumps dating from October/November 2010. 

 The new DBpedia dataset describes more than 3.5 million things, of which 1.67 million are classified in a consistent ontology, including 364,000 persons, 462,000 places, 99,000 music albums, 54,000 films, 16,500 video games, 148,000 organizations, 148,000 species and 5,200 diseases.  The DBpedia dataset features labels and abstracts for 3.5 million things in up to 97 different languages; 1,850,000 links to images and 5,900,000 links to external web pages; 6,500,000 external links into other RDF datasets, and 632,000 Wikipedia categories.  

The dataset consists of 672 million pieces of information (RDF triples), of which 286 million were extracted from the English edition of Wikipedia and 386 million were extracted from other language editions and links to external datasets.

Along with the release of the new datasets, we are happy to announce the initial release of the DBpedia MappingTool (http://mappings.dbpedia.org/index.php/MappingTool): a graphical user interface to support the community in creating and editing mappings as well as the ontology.  

The new release provides the following improvements and changes compared to the DBpedia 3.5.1 release:  

1. Improved DBpedia Ontology as well as improved Infobox mappings using http://mappings.dbpedia.org/  

Mappings are now also available in languages other than English. These improvements are largely due to collective work by the community. There are 13.8 million RDF statements based on mappings (compared with 11.1 million in version 3.5.1). All of this data is in the /ontology/ namespace and is of much higher quality than the raw Infobox data in the /property/ namespace.

Statistics of the mappings wiki on the date of release 3.6:  

Mappings:     

  • English: 315 Infobox mappings (covers 1124 templates including redirects)     
  • Greek: 137 Infobox mappings (covers 192 templates including redirects)     
  • Hungarian: 111 Infobox mappings (covers 151 templates including redirects)     
  • Croatian: 36 Infobox mappings (covers 67 templates including redirects)     
  • German: 9 Infobox mappings
  • Slovenian: 4 Infobox mappings

Ontology:     

  • 272 classes

Properties:     

  • 629 object properties     
  • 706 datatype properties (they are all in the /datatype/ namespace)  

2.  Some commonly used property names changed  

Please see http://dbpedia.org/ChangeLog and http://dbpedia.org/Datasets/Properties to find out which relations changed, and update your applications accordingly!

3. New Datatypes for increased quality in mapping-based properties  

  • xsd:positiveInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger, xsd:negativeInteger 
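As a reminder of what these four datatypes permit, here is a minimal Python sketch of their value ranges as defined by XML Schema. The function name `is_valid_integer` is ours, chosen for illustration; it is not part of the extraction framework.

```python
# Value ranges of the four integer datatypes, per XML Schema.
# Note that zero is allowed by both "non-" types and by neither strict type.
XSD_INTEGER_RANGES = {
    "xsd:positiveInteger": lambda n: n > 0,
    "xsd:nonNegativeInteger": lambda n: n >= 0,
    "xsd:nonPositiveInteger": lambda n: n <= 0,
    "xsd:negativeInteger": lambda n: n < 0,
}


def is_valid_integer(value, datatype):
    """Return True if the integer `value` lies in the range of `datatype`."""
    return XSD_INTEGER_RANGES[datatype](value)
```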

4. Improved parsing coverage 

  • Parsing of lists of elements in Infobox property values that improves the completeness of extracted facts
  • Method to deal with missing repeated links in Infoboxes that do appear somewhere else on the page.
  • Flag templates are parsed.
  • Various improvements on internationalization.  

5. Improved recognition of  

  • Wikipedia language codes.
  • Wikipedia namespace identifiers.
  • Category hierarchies.  

6. Disambiguation links for acronyms (pages whose title is entirely upper-case) are now extracted, for example Kilobyte and Knowledge_base for “KB”. A wikilink on such a page is extracted under the following conditions:  

  • Wikilinks consisting of multiple words: the starting letters of the words appear in the correct order (with possible gaps) and cover all letters of the acronym.
  • Wikilinks consisting of a single word: the case-insensitive longest common subsequence of the word and the acronym is equal to the acronym.
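The two rules can be sketched in Python as follows. This is an illustration under our own assumptions, not the extraction framework's code: the function names are hypothetical, word initials are compared case-insensitively (needed for Knowledge_base to match “KB”), and the "possible gaps" are read as words that may be skipped. The single-word rule uses the fact that the longest common subsequence equals the acronym exactly when the acronym is a subsequence of the word.

```python
def _is_subsequence(needle, haystack):
    """True if the characters of `needle` occur in `haystack` in order."""
    remaining = iter(haystack)
    # `ch in remaining` advances the iterator, so order is enforced.
    return all(ch in remaining for ch in needle)


def matches_acronym(link_title, acronym):
    """Apply the multi-word or single-word rule to one wikilink title."""
    words = link_title.replace("_", " ").split()
    target = acronym.lower()
    if len(words) > 1:
        # Multi-word rule: the word initials, with gaps allowed,
        # must cover every letter of the acronym in order.
        initials = "".join(word[0] for word in words).lower()
        return _is_subsequence(target, initials)
    if len(words) == 1:
        # Single-word rule: LCS(word, acronym) == acronym, which is the
        # same as the acronym being a subsequence of the word.
        return _is_subsequence(target, words[0].lower())
    return False
```

With these definitions, `matches_acronym("Kilobyte", "KB")` and `matches_acronym("Knowledge_base", "KB")` both hold, while `matches_acronym("Kilogram", "KB")` does not.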

7. New ‘Geo-Related’ Extractor

  • Relates articles to the resources of countries whose labels appear in the names of the articles’ categories.
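The idea can be illustrated with a small Python sketch. It is only a guess at the general approach, not the extractor's implementation: the function name, the label-to-URI dictionary, and the choice of whole-word, case-insensitive matching are all our assumptions.

```python
import re


def related_countries(category_names, country_labels):
    """Return the country resources whose label occurs in any category name.

    `country_labels` maps a label such as "Germany" to a resource URI.
    """
    related = set()
    for label, resource in country_labels.items():
        # Whole-word match, so that "Niger" does not match "Nigeria".
        pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
        for name in category_names:
            # Underscores count as word characters, so turn them into spaces.
            if pattern.search(name.replace("_", " ")):
                related.add(resource)
                break
    return sorted(related)
```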

8. Encoding (bugfixes) 

  • The new datasets support the complete range of Unicode code points (up to 0x10ffff). Escapes for code points that fit in 16 bits start with ‘\u’; escapes for code points larger than 16 bits start with ‘\U’.
  • Commas and ampersands are no longer encoded in URIs. Please see http://dbpedia.org/URIencoding for an explanation of the DBpedia URI encoding scheme.  
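To make the two escape forms concrete, here is a minimal Python sketch. It assumes the N-Triples convention of four hex digits after a lowercase u and eight after an uppercase U, each preceded by a backslash. The function name is ours, and the sketch deliberately ignores the other characters (quotes, backslashes, control characters) that a real serializer also has to escape.

```python
def escape_non_ascii(text):
    """Escape every non-ASCII character of `text` N-Triples style."""
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:
            pieces.append(ch)  # plain ASCII is left untouched
        elif code_point <= 0xFFFF:
            pieces.append("\\u%04X" % code_point)  # fits in 16 bits
        else:
            pieces.append("\\U%08X" % code_point)  # up to 0x10FFFF
    return "".join(pieces)
```

For instance, the "ü" in "Zürich" (code point 0xFC) takes the short form, while a character outside the Basic Multilingual Plane such as U+1D11E takes the long form.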

9. Extended Datasets 

  • Thanks to Johannes Hoffart (Max-Planck-Institut für Informatik) for contributing links to YAGO2.
  • Freebase links have been updated. They now refer to mids (http://wiki.freebase.com/wiki/Machine_ID) because guids have been deprecated.  

You can download the new DBpedia dataset from http://dbpedia.org/Downloads36 

As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql 
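For those who want to try the endpoint programmatically, the following Python sketch builds a request URL using the standard `query` parameter of the SPARQL protocol. The query itself is only an example; it assumes a class `Film` in the /ontology/ namespace. The sketch does not send the request.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Example query (our assumption): ten resources typed as films
# in the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?film WHERE { ?film a dbo:Film } LIMIT 10
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL that carries `query` in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(QUERY)
```

The resulting URL can be fetched with any HTTP client; the desired result format is normally requested through the Accept header.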

Lots of thanks to:  

  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • Max Jakob (Freie Universität Berlin, Germany) for improving the DBpedia extraction framework and for extracting the new datasets.
  • Robert Isele and Anja Jentzsch (both Freie Universität Berlin, Germany) for helping Max with their expertise on the extraction framework.
  • Paul Kreis (Freie Universität Berlin, Germany) for analyzing the DBpedia data of the previous release and suggesting ways to increase quality and quantity. Some results of his work were implemented in this release.
  • Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece), Jimmy O’Regan (Eolaistriu Technologies, Ireland), José Paulo Leal (University of Porto, Portugal) for providing patches to improve the extraction framework.
  • Claus Stadler (Universität Leipzig, Germany) for implementing the Geo-Related extractor and extracting its data.
  • Jens Lehmann and Sören Auer (both Universität Leipzig, Germany) for providing the new dataset via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the dataset into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint, and OpenLink Software (http://www.openlinksw.com/) as a whole for providing the server infrastructure for DBpedia.  

The work on the new release was financially supported by  

  • Neofonie GmbH, a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications (http://www.neofonie.de/).
  • The European Commission through the project LOD2 – Creating Knowledge out of Linked Data (http://lod2.eu/).
  • Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com/). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com/).

More information about DBpedia is found at http://dbpedia.org/About  

Have fun with the new dataset!  

The whole DBpedia team also congratulates Wikipedia on its 10th birthday, which was this weekend!  

Cheers,  

Chris Bizer