We are happy to announce the release of DBpedia 3.8.
The most important improvements of the new release compared to DBpedia 3.7 are:
1. The new release is based on updated Wikipedia dumps dating from late May / early June 2012.
2. The DBpedia ontology has been enlarged and the number of infobox-to-ontology mappings has risen.
3. The DBpedia internationalization has progressed and we now provide localized versions of DBpedia in even more languages.
The English version of the DBpedia knowledge base currently describes 3.77 million things, out of which 2.35 million are classified in a consistent ontology, including 764,000 persons, 573,000 places (including 387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), 192,000 organizations (including 45,000 companies and 42,000 educational institutions), 202,000 species and 5,500 diseases.
We provide localized versions of DBpedia in 111 languages. All these versions together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 10.3 million unique things in 111 different languages; 8.0 million links to images and 24.4 million HTML links to external web pages; 27.2 million data links into external RDF data sets, 55.8 million links to Wikipedia categories, and 8.2 million YAGO categories. The dataset consists of 1.89 billion pieces of information (RDF triples) out of which 400 million were extracted from the English edition of Wikipedia, 1.46 billion were extracted from other language editions, and about 27 million are data links into external RDF data sets.
The main changes between DBpedia 3.7 and 3.8 are described below. For additional, more detailed information please refer to the change log.
1. Enlarged Ontology
The DBpedia community added many new classes and properties on the mappings wiki. The DBpedia 3.8 ontology encompasses
- 359 classes (DBpedia 3.7: 319)
- 800 object properties (DBpedia 3.7: 750)
- 859 datatype properties (DBpedia 3.7: 791)
- 116 specialized datatype properties (DBpedia 3.7: 102)
- 45 owl:equivalentClass and 31 owl:equivalentProperty mappings to
2. Additional Infobox to Ontology Mappings
The editors of the mappings wiki also defined many new mappings from Wikipedia templates to DBpedia classes. For the DBpedia 3.8 extraction, we used 2347 mappings, among them
- Polish: 382 mappings
- English: 345 mappings
- German: 211 mappings
- Portuguese: 207 mappings
- Greek: 180 mappings
- Slovenian: 170 mappings
- Korean: 146 mappings
- Hungarian: 111 mappings
- Spanish: 107 mappings
- Turkish: 91 mappings
- Czech: 66 mappings
- Bulgarian: 61 mappings
- Catalan: 52 mappings
- Arabic: 51 mappings
3. New local DBpedia Chapters
We are also happy to see the number of local DBpedia chapters in different countries rising. Since the DBpedia 3.7 release we have welcomed the French, Italian and Japanese chapters. In addition, we expect the Dutch DBpedia chapter to go online in the coming months (in cooperation with http://bibliotheek.nl/). The DBpedia chapters provide local SPARQL endpoints and dereferenceable URIs for the DBpedia data in their corresponding language. The DBpedia Internationalization page provides an overview of the current state of the DBpedia internationalization effort.
4. New and updated RDF Links into External Data Sources
We have added new RDF links pointing at resources in the following Linked Data sources: Amsterdam Museum, BBC Wildlife Finder, CORDIS, DBTune, Eurostat (Linked Statistics), GADM, LinkedGeoData, OpenEI (Open Energy Info). In addition, we have updated many of the existing RDF links pointing at other Linked Data sources.
5. New Wiktionary2RDF Extractor
We developed a DBpedia extractor that is configurable for any Wiktionary edition. It generates a comprehensive ontology about languages for use as a semantic lexical resource in linguistics. The data currently includes language, part of speech, senses with definitions, synonyms, taxonomies (hyponyms, hyperonyms, synonyms, antonyms) and translations for each lexical word. Furthermore, it is hosted as Linked Data and can serve as a central linking hub for LOD in linguistics. Currently available languages are English, German, French and Russian. In the coming weeks we plan to add Vietnamese and Arabic. The goal is to allow the addition of languages just by configuration, without the need for programming skills, enabling collaboration as in the Mappings Wiki. For more information visit http://wiktionary.dbpedia.org/
6. Improvements to the Data Extraction Framework
- In addition to N-Triples and N-Quads, the framework can now write triple files in Turtle format
- Extraction steps that looked for links between different Wikipedia editions were replaced by more powerful post-processing scripts
- Preparation time and effort for abstract extraction is minimized, extraction time is reduced to a few milliseconds per page
- To save file system space, the framework can compress DBpedia triple files while writing and decompress Wikipedia XML dump files while reading
- Using some bit twiddling, we can now load all ~200 million inter-language links into a few GB of RAM and analyze them
- Users can download the ontology and mappings from the mappings wiki and store them in files, which avoids downloading them for each extraction (a step that takes a lot of time and makes extraction results less reproducible)
- We now use IRIs for all languages except English, which uses URIs for backwards compatibility
- We now resolve redirects in all datasets where the object URIs are DBpedia resources
- We check that extracted dates are valid (e.g. February never has 30 days) and that their format is valid according to their XML Schema type, e.g. xsd:gYearMonth
- We improved the removal of HTML character references from the abstracts
- When extracting raw infobox properties, we make sure that the predicate URI can be used in RDF/XML by appending an underscore if necessary
- Page IDs and Revision IDs datasets now use the DBpedia resource as subject URI, not the Wikipedia page URL
- We use foaf:isPrimaryTopicOf instead of foaf:page for the link from DBpedia resource to Wikipedia page
- New inter-language link datasets for all languages
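The date checks mentioned in the list above can be illustrated with a small sketch. This is not the extraction framework's own code; the function names and the simplified xsd:gYearMonth pattern (no timezone suffix) are assumptions made for illustration only.

```python
import datetime
import re

# Simplified lexical form of xsd:gYearMonth: optional minus sign,
# at least four year digits, a hyphen, and a month from 01 to 12.
# Assumption: the optional timezone part is left out to keep this short.
G_YEAR_MONTH = re.compile(r"-?\d{4,}-(0[1-9]|1[0-2])")


def is_valid_date(year, month, day):
    """Return True if the calendar date exists (rejects e.g. 30 February).

    Note: datetime.date only supports years 1 to 9999.
    """
    try:
        datetime.date(year, month, day)
        return True
    except ValueError:
        return False


def is_valid_g_year_month(text):
    """Return True if text matches the simplified xsd:gYearMonth form."""
    return G_YEAR_MONTH.fullmatch(text) is not None
```

With these hypothetical helpers, a value such as 2012-02-30 or the string "2012-13" would be rejected before it is written out as a typed literal.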
Accessing the DBpedia 3.8 Release
You can download the new DBpedia dataset from http://dbpedia.org/Downloads38.
As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql
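As a minimal sketch of how the SPARQL endpoint can be queried from a script, the following uses only the Python standard library. The query (counting resources typed as Person in the DBpedia ontology namespace) and the helper names build_request and run_query are illustrative assumptions, not part of the release; the result you get depends on the data currently loaded on the endpoint.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL given in the announcement

# Illustrative query: count resources typed as Person in the DBpedia ontology.
QUERY = """
SELECT (COUNT(?s) AS ?n)
WHERE { ?s a <http://dbpedia.org/ontology/Person> }
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the decoded JSON result (needs network access)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=30) as resp:
        return json.load(resp)
```

Calling run_query(QUERY) should return a dictionary in the SPARQL JSON results format, with the bindings under the "results" key.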
Lots of thanks to
- Jona Christopher Sahnwaldt (Freie Universität Berlin, Germany) for improving the DBpedia extraction framework and for extracting the DBpedia 3.8 data sets.
- Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece) for implementing the language generalizations to the extraction framework.
- Uli Zellbeck and Anja Jentzsch (Freie Universität Berlin, Germany) for generating the new and updated RDF links to external datasets using the Silk interlinking framework.
- Jonas Brekle (Universität Leipzig, Germany) and Sebastian Hellmann (Universität Leipzig, Germany) for their work on the new Wiktionary2RDF extractor.
- All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
- The whole Internationalization Committee for pushing the DBpedia internationalization forward.
- Kingsley Idehen and Patrick van Kleef (both OpenLink Software) for loading the dataset into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
The work on the DBpedia 3.8 release was financially supported by the European Commission through the projects LOD2 - Creating Knowledge out of Linked Data (http://lod2.eu/, improvements to the extraction framework) and LATC - LOD Around the Clock (http://latc-project.eu/, creation of external RDF links).
More information about DBpedia can be found at http://dbpedia.org/About
Have fun with the new DBpedia release!