Category Archives: Dataset releases

DBpedia 3.6 released

Hi all, 

We are happy to announce the release of DBpedia 3.6. The new release is based on Wikipedia dumps dating from October/November 2010. 

 The new DBpedia dataset describes more than 3.5 million things, of which 1.67 million are classified in a consistent ontology, including 364,000 persons, 462,000 places, 99,000 music albums, 54,000 films, 16,500 video games, 148,000 organizations, 148,000 species and 5,200 diseases.  The DBpedia dataset features labels and abstracts for 3.5 million things in up to 97 different languages; 1,850,000 links to images and 5,900,000 links to external web pages; 6,500,000 external links into other RDF datasets, and 632,000 Wikipedia categories.  

The dataset consists of 672 million pieces of information (RDF triples) out of which 286 million were extracted from the English edition of Wikipedia and 386 million were extracted from other language editions and links to external datasets.  

Along with the release of the new datasets, we are happy to announce the initial release of the DBpedia MappingTool (http://mappings.dbpedia.org/index.php/MappingTool): a graphical user interface to support the community in creating and editing mappings as well as the ontology.  

The new release provides the following improvements and changes compared to the DBpedia 3.5.1 release:  

1. Improved DBpedia Ontology as well as improved Infobox mappings using http://mappings.dbpedia.org/  

Furthermore, there are now also mappings in languages other than English. These improvements are largely due to collective work by the community. There are 13.8 million RDF statements based on mappings (11.1 million in version 3.5.1). All this data is in the /ontology/ namespace. Note that this data is of much higher quality than the Raw Infobox data in the /property/ namespace.  

Statistics of the mappings wiki on the date of release 3.6:  

Mappings:     

  • English: 315 Infobox mappings (covers 1124 templates including redirects)     
  • Greek: 137 Infobox mappings (covers 192 templates including redirects)     
  • Hungarian: 111 Infobox mappings (covers 151 templates including redirects)     
  • Croatian: 36 Infobox mappings (covers 67 templates including redirects)     
  • German: 9 Infobox mappings
  • Slovenian: 4 Infobox mappings

Ontology:     

  • 272 classes

Properties:     

  • 629 object properties     
  • 706 datatype properties (they are all in the /datatype/ namespace)  

2. Some commonly used property names changed  

Please see http://dbpedia.org/ChangeLog and http://dbpedia.org/Datasets/Properties to learn which relations changed, and update your applications accordingly!  

3. New Datatypes for increased quality in mapping-based properties  

  • xsd:positiveInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger, xsd:negativeInteger 
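The value spaces of these four types are fixed by the XML Schema specification, so checking a parsed integer against a declared type is straightforward. A minimal Python sketch (the names are illustrative, not part of the DBpedia extraction framework):

```python
# Value-space checks for the four XSD integer types listed above,
# following the XML Schema datatypes specification.
XSD_RANGES = {
    "xsd:positiveInteger":    lambda n: n > 0,
    "xsd:nonNegativeInteger": lambda n: n >= 0,
    "xsd:nonPositiveInteger": lambda n: n <= 0,
    "xsd:negativeInteger":    lambda n: n < 0,
}

def conforms(value, datatype):
    """Return True if an integer lies in the datatype's value space."""
    return XSD_RANGES[datatype](value)
```

Note that 0 conforms to both xsd:nonNegativeInteger and xsd:nonPositiveInteger, which is why a declared per-property datatype (rather than guessing from the value) is needed.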

4. Improved parsing coverage 

  • Parsing of lists of elements in Infobox property values, which improves the completeness of extracted facts.
  • A method to recover links that are missing in Infoboxes because they already appear elsewhere on the page.
  • Flag templates are parsed.
  • Various improvements on internationalization.  

5. Improved recognition of  

  • Wikipedia language codes.
  • Wikipedia namespace identifiers.
  • Category hierarchies.  

6. Disambiguation links for acronyms (all-uppercase titles) are now extracted (for example, Kilobyte and Knowledge_base for “KB”):  

  • Wikilinks consisting of multiple words are extracted if the starting letters of the words appear in the correct order (with possible gaps) and cover all acronym letters.
  • Wikilinks consisting of a single word are extracted if the case-insensitive longest common subsequence with the acronym is equal to the acronym. 
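Both rules reduce to subsequence tests. A rough Python sketch of the two heuristics (function names are illustrative, not the actual extractor code):

```python
def is_subsequence(needle, haystack):
    """True if all characters of needle appear in haystack, in order."""
    it = iter(haystack)
    return all(ch in it for ch in needle)

def matches_acronym(title, acronym):
    """Sketch of the two disambiguation heuristics described above."""
    words = title.replace("_", " ").split()
    acronym = acronym.lower()
    if len(words) > 1:
        # Multi-word link: word initials, in order (with possible gaps),
        # must cover all acronym letters.
        initials = "".join(w[0].lower() for w in words)
        return is_subsequence(acronym, initials)
    # Single word: the acronym must be a case-insensitive subsequence,
    # i.e. the longest common subsequence equals the acronym itself.
    return is_subsequence(acronym, title.lower())
```

For “KB”, both Knowledge_base (initials k, b) and Kilobyte (k…b as a subsequence) match, while Terabyte does not.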

7. New ‘Geo-Related’ Extractor

  • Relates articles to country resources whose labels appear in the names of the articles’ categories.

8. Encoding (bugfixes) 

  • The new datasets support the complete range of Unicode code points (up to 0x10FFFF). Escape sequences for 16-bit code points start with ‘\u’; code points larger than 16 bits start with ‘\U’.
  • Commas and ampersands are no longer encoded in URIs. Please see http://dbpedia.org/URIencoding for an explanation of the DBpedia URI encoding scheme.  
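For illustration, a minimal Python sketch of this escaping scheme (illustrative only; details such as hex-digit case in the actual dumps may differ):

```python
def escape_ntriples(text):
    """Escape non-ASCII characters in the style described above:
    \\uXXXX for code points that fit in 16 bits, \\UXXXXXXXX beyond."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)          # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)
```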

9. Extended Datasets 

  • Thanks to Johannes Hoffart (Max-Planck-Institut für Informatik) for contributing links to YAGO2.
  • Freebase links have been updated. They now refer to mids (http://wiki.freebase.com/wiki/Machine_ID) because guids have been deprecated.  

You can download the new DBpedia dataset from http://dbpedia.org/Downloads36 

As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql 
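The endpoint can be queried over plain HTTP GET. A minimal Python sketch (the 'format' result parameter is a common Virtuoso convention assumed here, not taken from the DBpedia documentation; actually executing the query requires network access):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://dbpedia.org/sparql"

def build_request_url(query):
    # Virtuoso accepts the desired result format as a 'format'
    # query parameter; this is a Virtuoso convention, not part of
    # the SPARQL protocol itself.
    return ENDPOINT + "?" + urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    })

query = "SELECT ?film WHERE { ?film a <http://dbpedia.org/ontology/Film> } LIMIT 10"
url = build_request_url(query)
# When online:
# results = json.load(urlopen(url))
# for binding in results["results"]["bindings"]:
#     print(binding["film"]["value"])
```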

Lots of thanks to:  

  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • Max Jakob (Freie Universität Berlin, Germany) for improving the DBpedia extraction framework and for extracting the new datasets.
  • Robert Isele and Anja Jentzsch (both Freie Universität Berlin, Germany) for helping Max with their expertise on the extraction framework.
  • Paul Kreis (Freie Universität Berlin, Germany) for analyzing the DBpedia data of the previous release and suggesting ways to increase quality and quantity. Some results of his work were implemented in this release.
  • Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece), Jimmy O’Regan (Eolaistriu Technologies, Ireland), José Paulo Leal (University of Porto, Portugal) for providing patches to improve the extraction framework.
  • Claus Stadler (Universität Leipzig, Germany) for implementing the Geo-Related extractor and extracting its data.
  • Jens Lehmann and Sören Auer (both Universität Leipzig, Germany) for providing the new dataset via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the dataset into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.  

The work on the new release was financially supported by  

  • Neofonie GmbH, a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications (http://www.neofonie.de/).
  • The European Commission through the project LOD2 – Creating Knowledge out of Linked Data (http://lod2.eu/).
  • Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com/). Vulcan Inc. creates and advances a variety of world-class endeavors and high-impact initiatives that change and improve the way we live, learn, and do business (http://www.vulcan.com/).

More information about DBpedia is found at http://dbpedia.org/About  

Have fun with the new dataset!  

The whole DBpedia team also congratulates Wikipedia on its 10th birthday, which was this weekend!  

Cheers,  

Chris Bizer

DBpedia 3.5.1 available on Amazon EC2

As Amazon Web Services is widely used for cloud computing, we have started to provide current snapshots of the DBpedia dataset for this environment.

We provide the DBpedia dataset for Amazon Web Services in two ways:

1. Source files that can be mounted:  http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2319

2. A Virtuoso SPARQL store that can be instantiated: http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtAWSDBpedia351C 

DBpedia 3.5.1 released

Hi all,

We are happy to announce the release of DBpedia 3.5.1.

This is primarily a bugfix release, which is based on Wikipedia dumps dating from March 2010. Thanks to the great community feedback about the previous DBpedia release, we were able to resolve the reported issues as well as to improve the template-to-ontology mappings.

The new release provides the following improvements and changes compared to the DBpedia 3.5 release:

  1. Some abstracts contained unwanted WikiText markup. The detection of infoboxes and tables has been improved, so that even most pages with syntax errors have clean abstracts now.
  2. In 3.5 there was an issue detecting interlanguage links, which led to some non-English statements having the wrong subject. This has been fixed.
  3. Image references to dummy images (e.g. http://en.wikipedia.org/wiki/Image:Replace_this_image.svg) have been removed.
  4. DBpedia 3.5.1 now uses stricter IRI validation. Care has been taken to discard only those URIs from Wikipedia that are clearly invalid.
  5. Recognition of disambiguation pages has been improved, increasing the size of that dataset from 247,000 to 769,000 triples.
  6. More geographic coordinates are extracted now, increasing their number from 1,200,000 to 1,500,000 in the English version.
  7. For this release, all Freebase links have been regenerated from the most recent Freebase dump.

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads351. As usual, the data set is also available as Linked Data and via the DBpedia SPARQL endpoint.

Lots of thanks to:

  • Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the knowledge base via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the knowledge base into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.

The whole DBpedia team is very thankful to three companies which enabled us to do all this by supporting and sponsoring the DBpedia project:

  • Neofonie GmbH (http://www.neofonie.de), a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications.
  • Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com).
  • OpenLink Software (http://www.openlinksw.com). OpenLink Software develops the Virtuoso Universal Server, an innovative enterprise grade server that cost-effectively delivers an unrivaled platform for Data Access, Integration and Management.

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base!

Cheers,

Robert Isele and Anja Jentzsch

DBpedia 3.5 released

Hi all,

We are happy to announce the release of DBpedia 3.5. The new release is based on Wikipedia dumps dating from March 2010. Compared to the 3.4 release, we were able to increase the quality of the DBpedia knowledge base by employing a new data extraction framework which applies various data cleansing heuristics as well as by extending the infobox-to-ontology mappings that guide the data extraction process.

The new DBpedia knowledge base describes more than 3.4 million things, out of which 1.47 million are classified in a consistent ontology, including 312,000 persons, 413,000 places, 94,000 music albums, 49,000 films, 15,000 video games, 140,000 organizations, 146,000 species and 4,600 diseases. The DBpedia data set features labels and abstracts for these 3.2 million things in up to 92 different languages; 1,460,000 links to images and 5,543,000 links to external web pages; 4,887,000 external links into other RDF datasets, 565,000 Wikipedia categories, and 75,000 YAGO categories. The DBpedia knowledge base altogether consists of over 1 billion pieces of information (RDF triples) out of which 257 million were extracted from the English edition of Wikipedia and 766 million were extracted from other language editions.

The new release provides the following improvements and changes compared to the DBpedia 3.4 release:

  1. The DBpedia extraction framework has been completely rewritten in Scala. The new framework dramatically reduces the extraction time of a single Wikipedia article from over 200 to about 13 milliseconds. All features of the previous PHP framework have been ported. In addition, the new framework can extract data from Wikipedia tables based on table-to-ontology mappings and is able to extract multiple infoboxes out of a single Wikipedia article. The data from each infobox is represented as a separate RDF resource. All resources that are extracted from a single page can be connected using custom RDF properties which are also defined in the mappings. A lot of work also went into the value parsers and the DBpedia 3.5 dataset should therefore be much cleaner than its predecessors. In addition, units of measurement are normalized to their respective SI unit, which makes querying DBpedia easier.
  2. The mapping language that is used to map Wikipedia infoboxes to the DBpedia Ontology has been redesigned. The documentation of the new mapping language is found at http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/trunk/extraction/core/doc/mapping%20language/
  3. In order to enable the DBpedia user community to extend and refine the infobox-to-ontology mappings, the mappings can be edited on the newly created wiki hosted at http://mappings.dbpedia.org. At the moment, 303 template mappings are defined, which cover (including redirects) 1055 templates. On the wiki, the DBpedia Ontology can be edited by the community as well. At the moment, the ontology consists of 259 classes and about 1,200 properties. 
  4. The ontology properties extracted from infoboxes are now split into two data sets (For details see: http://wiki.dbpedia.org/Datasets):  1. The Ontology Infobox Properties dataset contains the properties as they are defined in the ontology (e.g. length). The range of a property is either an xsd schema type or a dimension of measurement, in which case the value is normalized to the respective SI unit. 2. The Ontology Infobox Properties (Specific) dataset contains properties which have been specialized for a specific class using a specific unit. e.g. the property height is specialized on the class Person using the unit centimeters instead of meters.
  5. The framework now resolves template redirects, making it possible to cover all redirects to an infobox on Wikipedia with a single mapping. 
  6. Three new extractors have been implemented:  1. PageIdExtractor, which extracts the Wikipedia page ID of each page. 2. RevisionExtractor, which extracts the latest revision of a page. 3. PNDExtractor, which extracts PND (Personennamendatei) identifiers.
  7. The data set now provides labels, abstracts, page links and infobox data in 92 different languages, which have been extracted from recent Wikipedia dumps as of March 2010.
  8. In addition to the N-Triples datasets, N-Quads datasets are provided which add a provenance URI to each statement. The provenance URI denotes the origin of the extracted triple in Wikipedia (for details see: http://wiki.dbpedia.org/Datasets).

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads35. As usual, the data set is also available as Linked Data and via the DBpedia SPARQL endpoint.
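For illustration, the extra context term of an N-Quads line can be pulled out with a few lines of Python. The example line is hypothetical (the real provenance URI scheme may differ), and the simplified parser must only be fed genuine quads whose fourth term is an IRI in angle brackets:

```python
def provenance_of(nquad_line):
    """Return the context (provenance) URI of one N-Quads line.
    Simplified sketch: takes the last angle-bracketed term, so it
    must only be given quads; lines ending in a literal return None."""
    body = nquad_line.rstrip().rstrip(".").rstrip()
    if not body.endswith(">"):
        return None
    return body[body.rfind("<") + 1:-1]

# Hypothetical N-Quads line:
quad = ('<http://dbpedia.org/resource/Berlin> '
        '<http://www.w3.org/2000/01/rdf-schema#label> '
        '"Berlin"@en '
        '<http://en.wikipedia.org/wiki/Berlin> .')
print(provenance_of(quad))  # http://en.wikipedia.org/wiki/Berlin
```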

Lots of thanks to:

  • Robert Isele, Anja Jentzsch, Christopher Sahnwaldt, and Paul Kreis (all Freie Universität Berlin) for reimplementing the DBpedia extraction framework in Scala, for extending the infobox-to-ontology mappings and for extracting the new DBpedia 3.5 knowledge base. 
  • Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the knowledge base via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the knowledge base into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.

The whole DBpedia team is very thankful to three companies which enabled us to do all this by supporting and sponsoring the DBpedia project:

  1. Neofonie GmbH (http://www.neofonie.de/index.jsp), a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications.
  2. Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com/).
  3. OpenLink Software (http://www.openlinksw.com/). OpenLink Software develops the Virtuoso Universal Server, an innovative enterprise grade server that cost-effectively delivers an unrivaled platform for Data Access, Integration and Management.

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base!

Cheers

Chris Bizer

DBpedia 3.4 released

We are happy to announce the release of DBpedia 3.4. The new release is based on Wikipedia dumps dating from September 2009.

The new DBpedia data set describes more than 2.9 million things, including 282,000 persons, 339,000 places, 88,000 music albums, 44,000 films, 15,000 video games, 119,000 organizations, 130,000 species and 4,400 diseases. The DBpedia data set now features labels and abstracts for these things in 91 different languages; 807,000 links to images and 3,840,000 links to external web pages; 4,878,100 external links into other RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories. The data set consists of 479 million pieces of information (RDF triples) out of which 190 million were extracted from the English edition of Wikipedia and 289 million were extracted from other language editions.

The new release provides the following improvements and changes compared to the DBpedia 3.3 release:

  1. the data set has been extracted from more recent Wikipedia dumps.
  2. the data set now provides labels, abstracts and infobox data in 91 different languages.
  3. we provide two different versions of the DBpedia Infobox Ontology (loose and strict) in order to meet different application requirements. Please refer to http://wiki.dbpedia.org/Datasets#h18-11 for details.
  4. as Wikipedia has moved to dual-licensing, we also dual-license DBpedia. The DBpedia 3.4 data set is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License.
  5. the mapping-based infobox data extractor has been improved and now normalizes units of measurement.
  6. various bug fixes and improvements throughout the code base. Please refer to the change log for the complete list: http://wiki.dbpedia.org/Changelog

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads34. As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql.

Lots of thanks to

  • Anja Jentzsch, Christopher Sahnwaldt, Robert Isele, and Paul Kreis (all Freie Universität Berlin) for improving the DBpedia extraction framework and for extracting the new data set.
  • Jens Lehmann and Sören Auer (Universität Leipzig) for providing the new data set via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev for loading the new data set into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
  • Neofonie GmbH, Berlin, (http://www.neofonie.de/index.jsp) for supporting the DBpedia project by paying Christopher Sahnwaldt.

The next steps for the DBpedia project will be to

  1. synchronize Wikipedia and DBpedia by deploying the DBpedia live extraction which updates the DBpedia knowledge base immediately when a Wikipedia article changes.
  2. enable the DBpedia user community to edit and maintain the DBpedia ontology and the infobox mappings that are used by the extraction framework in a public Wiki.
  3. increase the quality of the extracted data by improving and fine-tuning the extraction code.

All this will hopefully happen soon.

Have fun with the new data set!

Cheers

Chris Bizer

DBpedia 3.3 released

We are pleased to announce the release of DBpedia 3.3. This release is based on Wikipedia dumps of May 2009.

The new release includes the following improvements over DBpedia 3.2:

1. more accurate abstract extraction
2. labels and abstracts in 80 languages
3. several infobox extraction bugfixes
4. new links to Dailymed, Diseasome, Drugbank, Sider, TCM
5. updated OpenCyc links

You can find the datasets here, and the RDF files here. The dataset is available to be queried at our SPARQL endpoint.

After eight long months without a DBpedia release (due to a lack of Wikipedia dumps), today’s release will bring us up to speed again, and we will release DBpedia datasets much more often in the future.

DBpedia now part of Amazon Public Data Sets

Kingsley announced on Tuesday that the first data sets from the LOD community, including DBpedia, have been uploaded to Amazon’s public data set hosting facility. Thus you can now do the following:

  1. Download DBpedia data from Amazon’s hosting facility at no cost to your own data center and then build your own personal or service specific edition of DBpedia
  2. Download to an EC2 AMI and build yourself using Virtuoso or any other Quad / Triple Store
  3. Use the DBpedia EC2 AMI which we provide (which will produce a rendition in 1.5 hrs)

We especially thank our colleagues and new Linked Data supporters at both Amazon Web Services and Infochimps.org for their assistance in getting this very taxing process in motion.

DBpedia version 3.2 released including the new DBpedia Ontology

We are happy to announce the release of DBpedia version 3.2.

The new knowledge base has been extracted from the October 2008 Wikipedia dumps. Compared to the last release, the new knowledge base provides three major improvements:

1. DBpedia Ontology

DBpedia now features a shallow, cross-domain ontology, which has been manually created based on the most commonly used infoboxes within Wikipedia. The ontology currently covers over 170 classes which form a subsumption hierarchy and have 940 properties. The ontology is instantiated by a new infobox data extraction method which is based on hand-generated mappings of Wikipedia infoboxes to the DBpedia ontology. The mappings define fine-grained rules on how to parse infobox values. The mappings also compensate for weaknesses in the Wikipedia infobox system, such as having different infoboxes for the same class (currently 350 Wikipedia templates are mapped to 170 ontology classes), using different property names for the same property (currently 2350 template properties are mapped to 940 ontology properties), and not having clearly defined datatypes for properties. Therefore, the instance data within the infobox ontology is much cleaner and better structured than the infobox data within the DBpedia infobox dataset, which is generated using the old infobox extraction code. The DBpedia Ontology currently contains about 882,000 instances.
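To illustrate the idea (not the actual mapping syntax), such hand-generated mappings can be thought of as a lookup table that funnels several infobox property names into one ontology property, together with a hint on how to parse the raw wiki value. All names below are hypothetical:

```python
# Hypothetical, greatly simplified view of a hand-generated mapping:
# several infobox keys map to one ontology property plus a parsing hint.
MAPPINGS = {
    "birth_date":     ("birthDate",  "date"),
    "date_of_birth":  ("birthDate",  "date"),
    "birthplace":     ("birthPlace", "link"),
    "place_of_birth": ("birthPlace", "link"),
}

def map_property(template_key):
    """Look up the ontology property for an infobox key, or None."""
    return MAPPINGS.get(template_key.strip().lower())
```

This is how differing property names in different templates end up as a single, consistently typed ontology property.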

More information about the ontology is found at http://wiki.dbpedia.org/Ontology

2. RDF Links to Freebase

Freebase is an open-license database which provides data about millions of things from various domains. Freebase has recently released a Linked Data interface to its content. As there is a big overlap between DBpedia and Freebase, we have added 2.4 million RDF links to DBpedia pointing at the corresponding things in Freebase. These links can be used to smush and fuse data about a thing from DBpedia and Freebase.

For more information about the Freebase links see
http://blog.dbpedia.org/2008/11/15/dbpedia-is-now-interlinked-with-freebase-links-to-opencyc-updated/

3. Cleaner Abstracts

Within the old DBpedia dataset, the abstracts for different languages sometimes contained Wikipedia markup and other strange characters. For the 3.2 release, we have improved DBpedia’s abstract extraction code, which results in much cleaner abstracts that can safely be displayed in user interfaces.

Access the new DBpedia knowledge base 

The new DBpedia release can be downloaded from:

http://wiki.dbpedia.org/Downloads32

and is also available via the DBpedia SPARQL endpoint at

http://dbpedia.org/sparql

and via DBpedia’s Linked Data interface. Example URIs:

http://dbpedia.org/resource/Berlin
http://dbpedia.org/page/Oliver_Stone

Lots of thanks to everybody who contributed to the DBpedia 3.2 release!

Especially:

1. Georgi Kobilarov (Freie Universität Berlin) who designed and implemented the new infobox extraction framework.
2. Anja Jentzsch (Freie Universität Berlin) who contributed to implementing the new extraction framework and wrote the infobox-to-ontology class mappings.
3. Paul Kreis (Freie Universität Berlin) who improved the datatype extraction code.
4. Andreas Schultz (Freie Universität Berlin) for generating the Freebase to DBpedia RDF links.
5. Everybody at OpenLink Software for hosting DBpedia on a Virtuoso server and for providing the statistics about the new DBpedia knowledge base.

Have fun with the new DBpedia knowledge base!

DBpedia 3.1 breaks 100 million triples barrier

Today, we released DBpedia 3.1. As always in the past years, the size of Wikipedia increased a lot over the past months. The new extraction contains 116.7 million triples, marking an increase of 27% over the previous version.

Apart from the more recent Wikipedia dumps we used, some notable improvements are a much better YAGO mapping, providing a more complete (more classes assigned to instances) and accurate (95% accuracy) class hierarchy for DBpedia. The Geo extractor code has been improved and is now run for all 14 languages. URI validation has switched to the PEAR validation class.

Downloads | ChangeLog

DBpedia 3.0 Release

We announce the availability of the DBpedia 3.0 final release.

Downloads are available at http://wiki.dbpedia.org/Downloads. For a list of changes since DBpedia 2.0, see the Changelog. Most notably, multi-language support was improved, new linked data sets added, and extraction code improved. Compared to the 3.0 release candidate, a number of extraction framework and data set bugs reported at our sourceforge.net bug tracker were fixed.

Overall, the combined download size of all provided NT and CSV files is 5.0 GB (uncompressed: 48.1 GB). The available data sets contain 92M triples (excluding 126M triples for internal Wikipedia links). DBpedia’s coverage grows to 2.4M entities for the English edition in this release, thanks to the hard-working Wikipedia contributors.

The extraction was performed on a server of the AKSW research group. I would like to thank Jörg Schüppel, Sören Auer, Chris Bizer, Richard Cyganiak, Georgi Kobilarov, the OpenLink team, and many other contributors for their DBpedia support.