All posts by ChrisBizer

DBpedia Version 2014 released

Hi all,

we are happy to announce the release of DBpedia 2014.

The most important improvements of the new release compared to DBpedia 3.9 are:

1. the new release is based on updated Wikipedia dumps dating from April/May 2014 (the 3.9 release was based on dumps from March/April 2013), leading to an overall increase in the number of things described in the English edition from 4.26 million to 4.58 million.

2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen, leading to richer and cleaner data.

The English version of the DBpedia knowledge base currently describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology (http://wiki.dbpedia.org/Ontology2014), including 1,445,000 persons, 735,000 places (including 478,000 populated places), 411,000 creative works (including 123,000 music albums, 87,000 films and 19,000 video games), 241,000 organizations (including 58,000 companies and 49,000 educational institutions), 251,000 species and 6,000 diseases.

We provide localized versions of DBpedia in 125 languages. All these versions together describe 38.3 million things, out of which 23.8 million are localized descriptions of things that also exist in the English version of DBpedia. The full DBpedia data set features 38 million labels and abstracts in 125 different languages, 25.2 million links to images and 29.8 million links to external web pages; 80.9 million links to Wikipedia categories, and 41.2 million links to YAGO categories. DBpedia is connected with other Linked Datasets by around 50 million RDF links.

Altogether the DBpedia 2014 release consists of 3 billion pieces of information (RDF triples), out of which 580 million were extracted from the English edition of Wikipedia and 2.46 billion from other language editions.

Detailed statistics about the DBpedia data sets in 28 popular languages are provided on the Dataset Statistics page (http://wiki.dbpedia.org/Datasets2014/DatasetStatistics).

The main changes between DBpedia 3.9 and 2014 are described below. For additional, more detailed information please refer to the DBpedia Change Log (http://wiki.dbpedia.org/Changelog).

 1. Enlarged Ontology

The DBpedia community added new classes and properties to the DBpedia ontology via the mappings wiki. The DBpedia 2014 ontology encompasses

  • 685 classes (DBpedia 3.9: 529)
  • 1,079 object properties (DBpedia 3.9: 927)
  • 1,600 datatype properties (DBpedia 3.9: 1,290)
  • 116 specialized datatype properties (DBpedia 3.9: 116)
  • 47 owl:equivalentClass and 35 owl:equivalentProperty mappings to http://schema.org

2. Additional Infobox to Ontology Mappings

The editor community of the mappings wiki also defined many new mappings from Wikipedia templates to DBpedia classes. For the DBpedia 2014 extraction, we used 4,339 mappings (DBpedia 3.9: 3,177 mappings), which are distributed as follows over the languages covered in the release.

  • English: 586 mappings
  • Dutch: 469 mappings
  • Serbian: 450 mappings
  • Polish: 383 mappings
  • German: 295 mappings
  • Greek: 281 mappings
  • French: 221 mappings
  • Portuguese: 211 mappings
  • Slovenian: 170 mappings
  • Korean: 148 mappings
  • Spanish: 137 mappings
  • Italian: 125 mappings
  • Belarusian: 125 mappings
  • Hungarian: 111 mappings
  • Turkish: 91 mappings
  • Japanese: 81 mappings
  • Czech: 66 mappings
  • Bulgarian: 61 mappings
  • Indonesian: 59 mappings
  • Catalan: 52 mappings
  • Arabic: 52 mappings
  • Russian: 48 mappings
  • Basque: 37 mappings
  • Croatian: 36 mappings
  • Irish: 17 mappings
  • Wiki-Commons: 12 mappings
  • Welsh: 7 mappings
  • Bengali: 6 mappings
  • Slovak: 2 mappings

3. Extended Type System to cover Articles without Infobox

Until the DBpedia 3.8 release, a concept was only assigned a type (like person or place) if the corresponding Wikipedia article contained an infobox indicating this type. Starting with the 3.9 release, we also provide type statements for articles without an infobox; these are inferred from the link structure within the DBpedia knowledge base using the algorithm described in Paulheim/Bizer 2014 (http://www.heikopaulheim.com/documents/ijswis_2014.pdf). For the new release, an improved version of the algorithm was run to produce type information for 400,000 things that were formerly untyped. A similar algorithm (presented in the same paper) was used to identify and remove potentially wrong statements from the knowledge base.
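
The idea behind this link-based typing can be illustrated with a toy sketch: each incoming property "votes" for the types its objects usually have, and types whose averaged score passes a threshold are assigned. All numbers and property names below are made up for illustration; the actual algorithm in the paper works with weighted distributions over the whole knowledge base.

```python
# Toy sketch of link-based type inference. For each property, the
# (made-up) observed distribution of object types:
type_dist = {
    "dbo:birthPlace": {"dbo:Place": 0.95, "dbo:PopulatedPlace": 0.80},
    "dbo:team":       {"dbo:SportsTeam": 0.90, "dbo:Organisation": 0.85},
}

def infer_types(incoming_properties, threshold=0.5):
    """Average the per-property type votes; keep types above the threshold."""
    votes = {}
    for prop in incoming_properties:
        for t, p in type_dist.get(prop, {}).items():
            votes.setdefault(t, []).append(p)
    n = len(incoming_properties)
    return {t: sum(ps) / n for t, ps in votes.items() if sum(ps) / n >= threshold}

# An untyped resource that is the object of two dbo:birthPlace links
# gets typed as a place:
print(infer_types(["dbo:birthPlace", "dbo:birthPlace"]))
```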

 4. New and updated RDF Links into External Data Sources

We updated the following RDF link sets pointing at other Linked Data sources: Freebase, Wikidata, GeoNames and GADM. For an overview of all data sets that are interlinked from DBpedia, please refer to http://wiki.dbpedia.org/Interlinking.

Accessing the DBpedia 2014 Release 

You can download the new DBpedia datasets in RDF format from http://wiki.dbpedia.org/Downloads. In addition, we also provide some of the core DBpedia data in tabular form (CSV and JSON formats) at http://wiki.dbpedia.org/DBpediaAsTables.

 As usual, the new dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql.
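
For example, a class count like the ones cited above can be obtained from the endpoint with a SPARQL GET request using the protocol's standard `query` and `format` parameters. The sketch below only constructs the request URL; actually sending it requires network access:

```python
# Build a SPARQL protocol GET request against the public DBpedia endpoint.
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

query = """
SELECT (COUNT(?person) AS ?n) WHERE {
  ?person a <http://dbpedia.org/ontology/Person> .
}
"""

# format=application/sparql-results+json asks for JSON result bindings.
url = ENDPOINT + "?" + urlencode({
    "query": query,
    "format": "application/sparql-results+json",
})
print(url[:60])
```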

Credits

 Lots of thanks to

  1. Daniel Fleischhacker (University of Mannheim) and Volha Bryl (University of Mannheim) for improving the DBpedia extraction framework, for extracting the DBpedia 2014 data sets for all 125 languages, for generating the updated RDF links to external data sets, and for generating the statistics about the new release.
  2. All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  3.  The whole DBpedia Internationalization Committee for pushing the DBpedia internationalization forward.
  4. Dimitris Kontokostas (University of Leipzig) for improving the DBpedia extraction framework and loading the new release onto the DBpedia download server in Leipzig.
  5. Heiko Paulheim (University of Mannheim) for re-running his algorithm to generate additional type statements for formerly untyped resources and to identify and remove wrong statements.
  6. Petar Ristoski (University of Mannheim) for generating the updated links pointing at the GADM database of Global Administrative Areas. Petar will also generate an updated release of DBpedia as Tables soon.
  7. Aldo Gangemi (LIPN University, France & ISTC-CNR, Italy) for providing the links from DOLCE to DBpedia ontology.
  8.  Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software) for loading the new data set into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.
  9.  OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
  10. Michael Moore (University of Waterloo, as an intern at the University of Mannheim) for implementing the anchor text extractor and contributing to the statistics scripts.
  11. Ali Ismayilov (University of Bonn) for implementing Wikidata extraction, on which the interlanguage link generation was based.
  12. Gaurav Vaidya (University of Colorado Boulder) for implementing and running Wikimedia Commons extraction.
  13. Andrea Di Menna, Jona Christopher Sahnwaldt, Julien Cojan, Julien Plu, Nilesh Chakraborty and others who contributed improvements to the DBpedia extraction framework via the source code repository on GitHub.
  14.  All GSoC mentors and students for working directly or indirectly on this release: https://github.com/dbpedia/extraction-framework/graphs/contributors

 The work on the DBpedia 2014 release was financially supported by the European Commission through the project LOD2 – Creating Knowledge out of Linked Data (http://lod2.eu/).

More information about DBpedia can be found at http://dbpedia.org/About as well as in the new overview article about the project at http://wiki.dbpedia.org/Publications.

Have fun with the new DBpedia 2014 release!

Cheers,

Daniel Fleischhacker, Volha Bryl, and Christian Bizer


DBpedia Spotlight V0.7 released

DBpedia Spotlight is an entity linking tool for connecting free text to DBpedia through the recognition and disambiguation of entities and concepts from the DBpedia KB.

We are happy to announce Version 0.7 of DBpedia Spotlight, which is also the first official release of the probabilistic/statistical implementation.

More information about DBpedia Spotlight V0.7, as well as updated evaluation results, can be found in this paper:

Joachim Daiber, Max Jakob, Chris Hokamp, Pablo N. Mendes: Improving Efficiency and Accuracy in Multilingual Entity Extraction. In Proceedings of I-SEMANTICS 2013.

The changes to the statistical implementation include:

  • smaller and faster models through quantization of counts, optimization of search and some pruning
  • better handling of case
  • various fixes in Spotlight and PigNLProc
  • models can now be created without requiring a Hadoop and Pig installation
  • UIMA support by @mvnural
  • support for a confidence value
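
The new confidence parameter can be passed to the `/rest/annotate` service. A minimal sketch, assuming a local Spotlight installation on port 2222 (host and port are illustrative); only the request is built here:

```python
# Build an annotation request for a DBpedia Spotlight server.
from urllib.parse import urlencode
from urllib.request import Request

params = {
    "text": "Berlin is the capital of Germany.",
    "confidence": 0.5,  # minimum disambiguation confidence (0..1)
}
req = Request(
    "http://localhost:2222/rest/annotate?" + urlencode(params),
    headers={"Accept": "application/json"},  # XML is returned by default
)
print(req.full_url)
```

To execute it, pass `req` to `urllib.request.urlopen` while the server is running.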

See the release notes at [1] and the updated demos at [4].

Models for Spotlight 0.7 can be found here [2].

Additionally, we now provide the raw Wikipedia counts, which we hope will prove useful for research and development of new models [3].

A big thank you to all developers who made contributions to this version (with special thanks to Faveeo and Idio). Huge thanks to Jo for his leadership and continued support to the community.

Cheers,
Pablo Mendes,

on behalf of Joachim Daiber and the DBpedia Spotlight developer community.

[1] – https://github.com/dbpedia-spotlight/dbpedia-spotlight/releases/tag/release-0.7

[2] – http://spotlight.sztaki.hu/downloads/

[3] – http://spotlight.sztaki.hu/downloads/raw

[4] – http://dbpedia-spotlight.github.io/demo/

(This message is an adaptation of Joachim Daiber’s message to the DBpedia Spotlight list. Edited to suit this broader community and give credit to him.)

DBpedia as Tables released

As some potential users of DBpedia might not be familiar with the RDF data model and the SPARQL query language, we provide some of the core DBpedia 3.9 data also in tabular form as Comma-Separated Values (CSV) files, which can easily be processed using standard tools such as spreadsheet applications, relational databases or data mining tools.

For each class in the DBpedia ontology (such as Person, Radio Station, Ice Hockey Player, or Band) we provide a single CSV file which contains all instances of this class. Each instance is described by its URI, an English label and a short abstract, the mapping-based infobox data describing the instance (extracted from the English edition of Wikipedia), and geo-coordinates (if applicable).

Altogether we provide 530 CSV files in the form of a single ZIP file (3 GB compressed, 73.4 GB uncompressed).
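
Because the files are plain CSV, they can be processed with any standard tooling. A minimal sketch using Python's csv module on an inline sample; the column names here are illustrative, not the exact header of the released files:

```python
# Read a per-class CSV file (simulated here with an inline sample).
import csv
import io

sample = io.StringIO(
    '"URI","label","abstract"\n'
    '"http://dbpedia.org/resource/Berlin","Berlin","Berlin is the capital ..."\n'
)
rows = list(csv.DictReader(sample))
print(rows[0]["label"])  # Berlin
```

For the real files, replace the `StringIO` sample with `open(path, newline="")`.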

More information about the file format as well as the download link can be found at DBpedia as Tables.

DBpedia 3.8 released, including enlarged Ontology and additional localized Versions

Hi all,

we are happy to announce the release of DBpedia 3.8.


The most important improvements of the new release compared to DBpedia 3.7 are:

1. the new release is based on updated Wikipedia dumps dating from late May / early June 2012.
2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen.
3. the DBpedia internationalization has progressed and we now provide localized versions of DBpedia in even more languages.

The English version of the DBpedia knowledge base currently describes 3.77 million things, out of which 2.35 million are classified in a consistent ontology, including 764,000 persons, 573,000 places (including 387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), 192,000 organizations (including 45,000 companies and 42,000 educational institutions), 202,000 species and 5,500 diseases.

We provide localized versions of DBpedia in 111 languages. All these versions together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 10.3 million unique things in 111 different languages; 8.0 million links to images and 24.4 million HTML links to external web pages; 27.2 million data links into external RDF data sets, 55.8 million links to Wikipedia categories, and 8.2 million YAGO categories. The dataset consists of 1.89 billion pieces of information (RDF triples) out of which 400 million were extracted from the English edition of Wikipedia, 1.46 billion were extracted from other language editions, and about 27 million are data links into external RDF data sets.

The main changes between DBpedia 3.7 and 3.8 are described below. For additional, more detailed information please refer to the change log.

1. Enlarged Ontology

The DBpedia community added many new classes and properties on the mappings wiki. The DBpedia 3.8 ontology encompasses

  • 359 classes (DBpedia 3.7: 319)
  • 800 object properties (DBpedia 3.7: 750)
  • 859 datatype properties (DBpedia 3.7: 791)
  • 116 specialized datatype properties (DBpedia 3.7: 102)
  • 45 owl:equivalentClass and 31 owl:equivalentProperty mappings to
    http://schema.org


2. Additional Infobox to Ontology Mappings

The editors of the mappings wiki also defined many new mappings from Wikipedia templates to DBpedia classes. For the DBpedia 3.8 extraction, we used 2,347 mappings, among them

  • Polish: 382 mappings
  • English: 345 mappings
  • German: 211 mappings
  • Portuguese: 207 mappings
  • Greek: 180 mappings
  • Slovenian: 170 mappings
  • Korean: 146 mappings
  • Hungarian: 111 mappings
  • Spanish: 107 mappings
  • Turkish: 91 mappings
  • Czech: 66 mappings
  • Bulgarian: 61 mappings
  • Catalan: 52 mappings
  • Arabic: 51 mappings


3. New local DBpedia Chapters


We are also happy to see the number of local DBpedia chapters in different countries rising. Since the 3.7 DBpedia release we welcomed the French, Italian and Japanese chapters. In addition, we expect the Dutch DBpedia chapter to go online during the next months (in cooperation with http://bibliotheek.nl/). The DBpedia chapters provide local SPARQL endpoints and dereferenceable URIs for the DBpedia data in their corresponding language. The DBpedia Internationalization page provides an overview of the current state of the DBpedia internationalization effort.

4. New and updated RDF Links into External Data Sources

We have added new RDF links pointing at resources in the following Linked Data sources: Amsterdam Museum, BBC Wildlife Finder, CORDIS, DBTune, Eurostat (Linked Statistics), GADM, LinkedGeoData, OpenEI (Open Energy Info). In addition, we have updated many of the existing RDF links pointing at other Linked Data sources.


5. New Wiktionary2RDF Extractor

We developed a DBpedia extractor that is configurable for any Wiktionary edition. It generates a comprehensive ontology about languages for use as a semantic lexical resource in linguistics. The data currently includes language, part of speech, senses with definitions, synonyms, taxonomies (hyponyms, hypernyms, synonyms, antonyms) and translations for each lexical word. It is furthermore hosted as Linked Data and can serve as a central linking hub for LOD in linguistics. Currently available languages are English, German, French and Russian. In the next weeks we plan to add Vietnamese and Arabic. The goal is to allow the addition of languages by configuration alone, without the need for programming skills, enabling collaboration as in the Mappings Wiki. For more information, visit http://wiktionary.dbpedia.org/

6. Improvements to the Data Extraction Framework

  • In addition to N-Triples and N-Quads, the framework can now write triple files in Turtle format
  • Extraction steps that looked for links between different Wikipedia editions were replaced by more powerful post-processing scripts
  • Preparation time and effort for abstract extraction is minimized; extraction time is reduced to a few milliseconds per page
  • To save file system space, the framework can compress DBpedia triple files while writing and decompress Wikipedia XML dump files while reading
  • Using some bit twiddling, we can now load all ~200 million inter-language links into a few GB of RAM and analyze them
  • Users can download the ontology and mappings from the mappings wiki and store them in files to avoid downloading them for each extraction, which takes a lot of time and makes extraction results less reproducible
  • We now use IRIs for all languages except English, which uses URIs for backwards compatibility
  • We now resolve redirects in all datasets whose object URIs are DBpedia resources
  • We check that extracted dates are valid (e.g. February never has 30 days) and that their format conforms to the XML Schema type, e.g. xsd:gYearMonth
  • We improved the removal of HTML character references from the abstracts
  • When extracting raw infobox properties, we make sure that the predicate URI can be used in RDF/XML, appending an underscore if necessary
  • The Page IDs and Revision IDs datasets now use the DBpedia resource as subject URI, not the Wikipedia page URL
  • We use foaf:isPrimaryTopicOf instead of foaf:page for the link from a DBpedia resource to its Wikipedia page
  • New inter-language link datasets for all languages
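
The date-validity check mentioned in the list above can be sketched with the standard library, which rejects impossible dates such as February 30:

```python
# Validate an extracted date by attempting to construct a datetime.date.
from datetime import date

def is_valid_date(year, month, day):
    """Return True if year/month/day names a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

print(is_valid_date(2012, 2, 30))  # False: February never has 30 days
print(is_valid_date(2012, 2, 29))  # True: 2012 is a leap year
```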


Accessing the DBpedia 3.8 Release

You can download the new DBpedia dataset from http://dbpedia.org/Downloads38.

As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql.

Credits

Lots of thanks to

  • Jona Christopher Sahnwaldt (Freie Universität Berlin, Germany) for improving the DBpedia extraction framework and for extracting the DBpedia 3.8 data sets.
  • Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece) for implementing the language generalizations to the extraction framework.
  • Uli Zellbeck and Anja Jentzsch (Freie Universität Berlin, Germany) for generating the new and updated RDF links to external datasets using the Silk interlinking framework.
  • Jonas Brekle (Universität Leipzig, Germany) and Sebastian Hellmann (Universität Leipzig, Germany) for their work on the new Wiktionary2RDF extractor.
  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • The whole Internationalization Committee for pushing the DBpedia internationalization forward.
  • Kingsley Idehen and Patrick van Kleef (both OpenLink Software) for loading the dataset into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.

The work on the DBpedia 3.8 release was financially supported by the European Commission through the projects LOD2 – Creating Knowledge out of Linked Data (http://lod2.eu/, improvements to the extraction framework) and LATC – LOD Around the Clock (http://latc-project.eu/, creation of external RDF links).

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia release!

Cheers,

Chris Bizer

DBpedia Spotlight has been selected for Google Summer of Code. Please apply now!

The Google Summer of Code (GSoC) is a global program that offers student developers (BSc, MSc, PhD) stipends to write code for open source software projects. It has had thousands of participants since the first edition in 2005, connecting prospective students with mentors from open source communities such as Debian, KDE, Gnome, Apache Software Foundation, Mozilla, etc.

For the students, it is a great chance to get real-world software development experience. For the open source communities, it is a chance to expand their development community. For everybody else, more source code is created and released for the benefit of all!

We are thrilled to announce that our open source project DBpedia Spotlight has been selected for the Google Summer of Code 2012.

We are now seeking students interested in working with us to enhance operational aspects of DBpedia Spotlight, as well as to engage in research activities in collaboration with our team. If you are an energetic developer, passionate for open source and interested in areas related to DBpedia Spotlight, please get in touch with us!

We have shared a number of project ideas to get you started.

To apply, visit: http://www.google-melange.com/gsoc/org/google/gsoc2012/dbpediaspotlight

If you would like to see DBpedia Spotlight in action, helping you to explore available projects within GSoC 2012, please visit our demonstration page at: http://spotlight.dbpedia.org/gsoc/


DBpedia 3.7 released, including 15 localized Editions

Hi all,

we are happy to announce the release of DBpedia 3.7. The new release is based on Wikipedia dumps dating from late July 2011.

The new DBpedia data set describes more than 3.64 million things, of which 1.83 million are classified in a consistent ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations, 183,000 species and 5,400 diseases.

The DBpedia data set features labels and abstracts for 3.64 million things in up to 97 different languages; 2,724,000 links to images and 6,300,000 links to external web pages; 6,200,000 external links into other RDF datasets, and 740,000 Wikipedia categories. The dataset consists of 1 billion pieces of information (RDF triples) out of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets.

Localized Editions

Until now, we extracted data from non-English Wikipedia pages only if an equivalent English page existed, as we wanted a single URI to identify a resource across all 97 languages. However, many pages in the non-English Wikipedia editions have no English equivalent (especially small towns in different countries, e.g. the Austrian village Endach, or legal and administrative terms that are only relevant for a single country). Relying on English URIs alone therefore meant that DBpedia contained no data for these entities, and many DBpedia users have complained about this shortcoming.

As part of the DBpedia 3.7 release, we now provide 15 localized DBpedia editions for download that contain data from all Wikipedia pages in a specific language. These localized editions cover the following languages: ca, de, el, es, fr, ga, hr, hu, it, nl, pl, pt, ru, sl, tr. The URIs identifying entities in these i18n data sets are constructed directly from the non-English title and a language-specific URI namespace (e.g. http://ru.dbpedia.org/resource/Berlin), so there are now 16 different URIs in DBpedia that refer to Berlin. We also extract the inter-language links from the different Wikipedia editions. Thus, whenever an inter-language link between a non-English Wikipedia page and its English equivalent exists, the resulting owl:sameAs link can be used to relate the localized DBpedia URI to its equivalent in the main (English) DBpedia edition. The localized DBpedia editions are provided for download on the DBpedia download page (http://wiki.dbpedia.org/Downloads37). Note that we do not provide public SPARQL endpoints for the localized editions, nor do the localized URIs dereference. This might change in the future, as more local DBpedia chapters are set up in different countries as part of the DBpedia internationalization effort (http://dbpedia.org/Internationalization).
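
The URI construction rule can be sketched as follows; the exact IRI-encoding rules DBpedia applies are more involved, so treat the quoting shown here as an assumption:

```python
# Build a localized DBpedia URI from a language code and a page title,
# and emit the owl:sameAs link back to the English edition.
from urllib.parse import quote

def localized_uri(lang, title):
    """Place the page title under the language-specific resource namespace."""
    return "http://%s.dbpedia.org/resource/%s" % (lang, quote(title.replace(" ", "_")))

ru = localized_uri("ru", "Berlin")
en = "http://dbpedia.org/resource/Berlin"
print("<%s> owl:sameAs <%s> ." % (ru, en))
```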

Other Changes

Beside the new localized editions, the DBpedia 3.7 release provides the following improvements and changes compared to the last release:

1. Framework

  • Redirects are resolved in a post-processing step, increasing inter-connectivity by 13% (applied to the English data sets)
  • Extractor configuration using the dependency injection principle
  • Simple threaded loading of mappings in server
  • Improved international language parsing support thanks to the members of the Internationalization Committee: http://dbpedia.org/Internationalization
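
The redirect-resolution step at the top of this list can be illustrated with a toy sketch: object URIs that name redirect pages are replaced by their targets, so more triples point at the same canonical resource (the data below is made up):

```python
# Toy redirect resolution over (subject, predicate, object) triples.
redirects = {"dbr:NYC": "dbr:New_York_City"}  # made-up redirect table

def resolve(triple):
    """Replace a redirecting object URI with its target, if known."""
    s, p, o = triple
    return (s, p, redirects.get(o, o))

print(resolve(("dbr:Brooklyn", "dbo:city", "dbr:NYC")))
```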

2. Bugfixes

  • Encode homepage URLs to conform with N-Triples spec
  • Correct reference parsing
  • Recognize MediaWiki parser functions
  • Raw infobox extraction produces more object properties again
  • skos:related for category links starting with “:” and having an anchor text
  • Restrict objects to Main namespace in MappingExtractor
  • Double rounding (e.g. a person’s height should not be 1800.00000001 cm)
  • Start position in abstract extractor
  • Server can handle template names containing a slash
  • Encoding issues in YAGO dumps

3. Ontology

  • 320 ontology classes
  • 750 object properties
  • 893 datatype properties
  • owl:equivalentClass and owl:equivalentProperty mappings to http://schema.org

Note that the ontology is now a directed acyclic graph. Classes can have multiple superclasses, which was important for the mappings to schema.org. A taxonomy can still be constructed by ignoring all superclasses but the one specified first in the list, which is considered the most important.
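
The rule for recovering a taxonomy from the DAG can be sketched as follows (class names are illustrative):

```python
# Recover a tree from a multiple-inheritance class hierarchy by keeping
# only the first (most important) superclass of each class.
superclasses = {
    "Hospital": ["Building", "Organisation"],  # first entry = most important
    "Building": ["ArchitecturalStructure"],
}

def taxonomy_parent(cls):
    parents = superclasses.get(cls, [])
    return parents[0] if parents else None

print(taxonomy_parent("Hospital"))  # Building
```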

4. Mappings

  • Dynamic statistics for infobox mappings showing the overall and individual coverage of the mappings in each language: http://mappings.dbpedia.org/index.php/Mapping_Statistics
  • Improved DBpedia Ontology as well as improved Infobox mappings using http://mappings.dbpedia.org/. These improvements are largely due to collective work by the community before and during the DBpedia Mapping Creation Sprint. For English, there are 17.5 million RDF statements based on mappings (13.8 million in version 3.6) (see also http://dbpedia.org/Downloads37#ontologyinfoboxproperties).
  • ConstantProperty mappings to capture information from the template title (e.g. Infobox_Australian_Road {{TemplateMapping | mapToClass = Road | mappings = {{ConstantMapping | ontologyProperty = country | value = Australia }}}})
  • Language specification for string properties in PropertyMappings (e.g. Infobox_japan_station: {{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name | language = ja}} )
  • Multiplication factor in PropertyMappings (e.g. Infobox_GB_station: {{PropertyMapping | templateProperty = usage0910 | ontologyProperty = passengersPerYear | factor = 1000000}}, because it’s always specified in millions)
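
The factor mechanism in the last example amounts to multiplying the raw template value before it is stored:

```python
# Apply a PropertyMapping's factor to a raw infobox value.
def apply_mapping(raw_value, factor=1.0):
    """Parse the template value and scale it by the mapping's factor."""
    return float(raw_value) * factor

# usage0910 is given in millions of passengers per year:
print(apply_mapping("1.5", factor=1000000))  # 1500000.0
```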

5. RDF Links to External Data Sources

  • New RDF links pointing at resources in the following Linked Data sources: Umbel, EUnis, LinkedMDB, Geospecis
  • Updated RDF links pointing at resources in the following Linked Data sources: Freebase, WordNet, Opencyc, New York Times, Drugbank, Diseasome, Flickrwrapper, Sider, Factbook, DBLP, Eurostat, Dailymed, Revyu

Accessing the new DBpedia Release

You can download the new DBpedia dataset from http://dbpedia.org/Downloads37.

As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint (http://dbpedia.org/sparql).

Credits

Lots of thanks to

  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • Max Jakob (Freie Universität Berlin, Germany) for improving the DBpedia extraction framework and for extracting the new datasets.
  • Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece) for providing language generalizations to the extraction framework.
  • Paul Kreis (Freie Universität Berlin, Germany) for administering the ontology and for delivering the mapping statistics and schema.org mappings.
  • Uli Zellbeck (Freie Universität Berlin, Germany) for providing the links to external datasets using the Silk framework.
  • The whole Internationalization Committee for expanding some DBpedia extractors to a number of languages:
    http://dbpedia.org/Internationalization.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the dataset into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.

The work on the new release was financially supported by:

  • The European Commission through the project LOD2 – Creating Knowledge out of Linked Data (http://lod2.eu/, improvements to the extraction framework).
  • The European Commission through the project LATC – LOD Around the Clock (http://latc-project.eu/, creation of external RDF links).
  • Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com/).

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new data set!

Cheers,

Chris Bizer

DBpedia Spotlight – Text Annotation Toolkit released

We are happy to announce a first release of DBpedia Spotlight – Shedding Light on the Web of Documents. 

The amount of data in the Linked Open Data cloud is steadily increasing. Interlinking text documents with this data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. 

DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. The DBpedia Spotlight architecture is composed of the following modules:

  • Web application, a demonstration client (HTML/Javascript UI) that allows users to enter/paste text into a Web browser and visualize the resulting annotated text.

  • Web Service, a RESTful Web API that exposes the functionality of annotating and/or disambiguating entities in text. The service returns XML, JSON or RDF.

  • Annotation Java / Scala API, exposing the underlying logic that performs the annotation/disambiguation.

  • Indexing Java / Scala API, executing the data processing necessary to enable the annotation/disambiguation algorithms used.

More information about DBpedia Spotlight can be found at: 

http://spotlight.dbpedia.org 

DBpedia Spotlight is provided under the terms of the Apache License, Version 2.0. Part of the code uses LingPipe under the Royalty Free License.

 

The source code can be downloaded from: 

http://sourceforge.net/projects/dbp-spotlight 

The development of DBpedia Spotlight was supported by: 

  • Neofonie GmbH, a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications (http://www.neofonie.de/).

  • The European Commission through the project LOD2 – Creating Knowledge out of Linked Data (http://lod2.eu/). 

Lots of thanks to:

  • Andreas Schultz for his help with the SPARQL endpoint.

  • Paul Kreis for his help with evaluations.

  • Robert Isele and Anja Jentzsch for their help in early stages with the DBpedia extraction framework.

Cheers,

 Pablo N. Mendes, Max Jakob, Andrés García-Silva and Chris Bizer.

DBpedia 3.6 released

Hi all, 

we are happy to announce the release of DBpedia 3.6. The new release is based on Wikipedia dumps dating from October/November 2010. 

 The new DBpedia dataset describes more than 3.5 million things, of which 1.67 million are classified in a consistent ontology, including 364,000 persons, 462,000 places, 99,000 music albums, 54,000 films, 16,500 video games, 148,000 organizations, 148,000 species and 5,200 diseases.  The DBpedia dataset features labels and abstracts for 3.5 million things in up to 97 different languages; 1,850,000 links to images and 5,900,000 links to external web pages; 6,500,000 external links into other RDF datasets, and 632,000 Wikipedia categories.  

The dataset consists of 672 million pieces of information (RDF triples) out of which 286 million were extracted from the English edition of Wikipedia and 386 million were extracted from other language editions and links to external datasets.  

Along with the release of the new datasets, we are happy to announce the initial release of the DBpedia MappingTool (http://mappings.dbpedia.org/index.php/MappingTool): a graphical user interface to support the community in creating and editing mappings as well as the ontology.  

The new release provides the following improvements and changes compared to the DBpedia 3.5.1 release:  

1. Improved DBpedia Ontology as well as improved Infobox mappings using http://mappings.dbpedia.org/  

Furthermore, there are now also mappings in languages other than English. These improvements are largely due to collective work by the community. There are 13.8 million RDF statements based on mappings (11.1 million in version 3.5.1). All this data is in the /ontology/ namespace. Note that this data is of much higher quality than the Raw Infobox data in the /property/ namespace.  

Statistics of the mappings wiki on the date of release 3.6:  

Mappings:     

  • English: 315 Infobox mappings (covers 1124 templates including redirects)     
  • Greek: 137 Infobox mappings (covers 192 templates including redirects)     
  • Hungarian: 111 Infobox mappings (covers 151 templates including redirects)     
  • Croatian: 36 Infobox mappings (covers 67 templates including redirects)     
  • German: 9 Infobox mappings
  • Slovenian: 4 Infobox mappings

Ontology:     

  • 272 classes

Properties:     

  • 629 object properties     
  • 706 datatype properties (they are all in the /datatype/ namespace)  

2. Some commonly used property names have changed

Please see http://dbpedia.org/ChangeLog and http://dbpedia.org/Datasets/Properties to find out which properties have changed, and update your applications accordingly!

3. New Datatypes for increased quality in mapping-based properties  

  • xsd:positiveInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger, xsd:negativeInteger 
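For illustration, a typed literal using one of these datatypes might look as follows in Turtle (the resource and value are a hypothetical example; dbo:populationTotal is a property from the DBpedia ontology):

```turtle
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# A population count cannot be negative, so the mapping-based extractor
# can now type it as xsd:nonNegativeInteger instead of a plain integer.
<http://dbpedia.org/resource/Berlin>
    dbo:populationTotal "3431700"^^xsd:nonNegativeInteger .
```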

4. Improved parsing coverage 

  • Parsing of lists of elements in infobox property values, which improves the completeness of the extracted facts.
  • A method to deal with links that are missing in infoboxes but do appear elsewhere on the page.
  • Flag templates are parsed.
  • Various improvements on internationalization.  

5. Improved recognition of  

  • Wikipedia language codes.
  • Wikipedia namespace identifiers.
  • Category hierarchies.  

6. Disambiguation links for acronyms (all-upper-case titles) are now extracted (for example, Kilobyte and Knowledge_base for “KB”):

  • Wikilinks consisting of multiple words: accepted if the starting letters of the words, in correct order (with possible gaps), cover all acronym letters.
  • Wikilinks consisting of a single word: accepted if the case-insensitive longest common subsequence with the acronym equals the acronym itself.
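Both rules boil down to subsequence tests, which can be sketched in a few lines of Python (illustrative only; the actual extractor is part of the Scala extraction framework, and the function names below are invented for this example):

```python
# Sketch of the two acronym matching rules described above.

def is_subsequence(needle, haystack):
    """True if the elements of `needle` appear in `haystack` in order."""
    it = iter(haystack)
    return all(ch in it for ch in needle)

def matches_multiword(acronym, link_text):
    """Multi-word wikilink: the starting letters of the words must cover
    all acronym letters in correct order (with possible gaps)."""
    initials = [w[0].upper() for w in link_text.split()]
    return is_subsequence(acronym.upper(), initials)

def matches_singleword(acronym, word):
    """Single-word wikilink: the case-insensitive longest common
    subsequence with the acronym equals the acronym, i.e. the acronym
    is a subsequence of the word."""
    return is_subsequence(acronym.upper(), word.upper())
```

With these rules, both "Knowledge base" and "Kilobyte" qualify as disambiguation targets for "KB", while "Kelvin" does not.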

7. New ‘Geo-Related’ Extractor

  • Relates articles to country resources whose labels appear in the names of the articles’ categories.

8. Encoding (bugfixes) 

  • The new datasets support the complete range of Unicode code points (up to 0x10FFFF). Code points that fit in 16 bits are escaped as ‘\u’ sequences; larger code points are escaped as ‘\U’ sequences.
  • Commas and ampersands are no longer encoded in URIs. Please see http://dbpedia.org/URIencoding for an explanation of the DBpedia URI encoding scheme.
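The escaping scheme for non-ASCII code points can be sketched in Python as follows (a minimal illustration; the function name is invented, and details such as the case of the hex digits may differ in the actual dumps):

```python
def escape_ntriples(text):
    """Escape non-ASCII characters: 16-bit code points become \\uXXXX,
    code points beyond 16 bits become \\UXXXXXXXX."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:                # plain ASCII passes through
            out.append(ch)
        elif cp <= 0xFFFF:           # fits in 16 bits -> \uXXXX
            out.append('\\u%04X' % cp)
        else:                        # beyond 16 bits -> \UXXXXXXXX
            out.append('\\U%08X' % cp)
    return ''.join(out)
```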

9. Extended Datasets 

  • Thanks to Johannes Hoffart (Max-Planck-Institut für Informatik) for contributing links to YAGO2.
  • Freebase links have been updated. They now refer to mids (http://wiki.freebase.com/wiki/Machine_ID) because guids have been deprecated.  

You can download the new DBpedia dataset from http://dbpedia.org/Downloads36 

As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql 
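For example, a query along the following lines can be run against the endpoint (an illustrative sketch; dbo:City and dbo:populationTotal are a class and a property from the DBpedia ontology, and the result set depends on the loaded release):

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?city ?population
WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
  FILTER (?population > 1000000)
}
ORDER BY DESC(?population)
LIMIT 10
```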

Lots of thanks to:  

  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • Max Jakob (Freie Universität Berlin, Germany) for improving the DBpedia extraction framework and for extracting the new datasets.
  • Robert Isele and Anja Jentzsch (both Freie Universität Berlin, Germany) for helping Max with their expertise on the extraction framework.
  • Paul Kreis (Freie Universität Berlin, Germany) for analyzing the DBpedia data of the previous release and suggesting ways to increase quality and quantity. Some results of his work were implemented in this release.
  • Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece), Jimmy O’Regan (Eolaistriu Technologies, Ireland), José Paulo Leal (University of Porto, Portugal) for providing patches to improve the extraction framework.
  • Claus Stadler (Universität Leipzig, Germany) for implementing the Geo-Related extractor and extracting its data.
  • Jens Lehmann and Sören Auer (both Universität Leipzig, Germany) for providing the new dataset via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the dataset into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.  

The work on the new release was financially supported by  

  • Neofonie GmbH, a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications (http://www.neofonie.de/).
  • The European Commission through the project LOD2 – Creating Knowledge out of Linked Data (http://lod2.eu/).
  • Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com/). Vulcan Inc. creates and advances a variety of world-class endeavors and high-impact initiatives that change and improve the way we live, learn, and do business (http://www.vulcan.com/).

More information about DBpedia can be found at http://dbpedia.org/About

Have fun with the new dataset!  

The whole DBpedia team also congratulates Wikipedia on its 10th birthday, which was this weekend!

Cheers,  

Chris Bizer

DBpedia 3.5.1 available on Amazon EC2

As Amazon Web Services are increasingly used for cloud computing, we have started to provide current snapshots of the DBpedia dataset for this environment.

We provide the DBpedia dataset for Amazon Web Services in two ways:

1. Source files for mounting: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2319

2. A Virtuoso SPARQL store ready to be instantiated: http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtAWSDBpedia351C

DBpedia 3.5 released

Hi all,

we are happy to announce the release of DBpedia 3.5. The new release is based on Wikipedia dumps dating from March 2010. Compared to the 3.4 release, we were able to increase the quality of the DBpedia knowledge base by employing a new data extraction framework which applies various data cleansing heuristics as well as by extending the infobox-to-ontology mappings that guide the data extraction process.

The new DBpedia knowledge base describes more than 3.4 million things, out of which 1.47 million are classified in a consistent ontology, including 312,000 persons, 413,000 places, 94,000 music albums, 49,000 films, 15,000 video games, 140,000 organizations, 146,000 species and 4,600 diseases. The DBpedia data set features labels and abstracts for these 3.2 million things in up to 92 different languages; 1,460,000 links to images and 5,543,000 links to external web pages; 4,887,000 external links into other RDF datasets, 565,000 Wikipedia categories, and 75,000 YAGO categories. The DBpedia knowledge base altogether consists of over 1 billion pieces of information (RDF triples) out of which 257 million were extracted from the English edition of Wikipedia and 766 million were extracted from other language editions.

The new release provides the following improvements and changes compared to the DBpedia 3.4 release:

  1. The DBpedia extraction framework has been completely rewritten in Scala. The new framework dramatically reduces the extraction time of a single Wikipedia article from over 200 to about 13 milliseconds. All features of the previous PHP framework have been ported. In addition, the new framework can extract data from Wikipedia tables based on table-to-ontology mappings and is able to extract multiple infoboxes out of a single Wikipedia article. The data from each infobox is represented as a separate RDF resource. All resources that are extracted from a single page can be connected using custom RDF properties which are also defined in the mappings. A lot of work also went into the value parsers and the DBpedia 3.5 dataset should therefore be much cleaner than its predecessors. In addition, units of measurement are normalized to their respective SI unit, which makes querying DBpedia easier.
  2. The mapping language that is used to map Wikipedia infoboxes to the DBpedia Ontology has been redesigned. The documentation of the new mapping language is found at http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/trunk/extraction/core/doc/mapping%20language/
  3. In order to enable the DBpedia user community to extend and refine the infobox to ontology mappings, the mappings can be edited on the newly created wiki hosted on http://mappings.dbpedia.org.  At the moment, 303 template mappings are defined, which cover (including redirects) 1055 templates. On the wiki, the DBpedia Ontology can be edited by the community as well. At the moment, the ontology consists of 259 classes and about 1,200 properties. 
  4. The ontology properties extracted from infoboxes are now split into two datasets (for details see http://wiki.dbpedia.org/Datasets): 1. The Ontology Infobox Properties dataset contains the properties as they are defined in the ontology (e.g. length); the range of a property is either an XSD schema type or a dimension of measurement, in which case the value is normalized to the respective SI unit. 2. The Ontology Infobox Properties (Specific) dataset contains properties that have been specialized for a specific class using a specific unit, e.g. the property height is specialized on the class Person to use centimeters instead of meters.
  5. The framework now resolves template redirects, making it possible to cover all redirects to an infobox on Wikipedia with a single mapping. 
  6. Three new extractors have been implemented: 1. PageIdExtractor, which extracts the Wikipedia page ID of each page. 2. RevisionExtractor, which extracts the latest revision of a page. 3. PNDExtractor, which extracts PND (Personennamendatei) identifiers.
  7. The data set now provides labels, abstracts, page links and infobox data in 92 different languages, which have been extracted from recent Wikipedia dumps as of March 2010.
  8. In addition to the N-Triples datasets, N-Quads datasets are provided which add a provenance URI to each statement. The provenance URI denotes the origin of the extracted triple in Wikipedia (for details see http://wiki.dbpedia.org/Datasets).

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads35. As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint.
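In the N-Quads format, the provenance URI occupies the fourth (context) position of each line, so it can be read off with a few lines of Python (the statement below is an invented example; the helper name is hypothetical):

```python
import re

# Subject, predicate, object, then the provenance/context URI, then " .".
QUAD = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s+(<[^>]+>)\s+\.\s*$')

def provenance(nquad_line):
    """Return the provenance URI from the fourth position of an N-Quads line."""
    m = QUAD.match(nquad_line)
    if m is None:
        raise ValueError('not a well-formed N-Quads line')
    return m.group(4).strip('<>')

example = ('<http://dbpedia.org/resource/Berlin> '
           '<http://www.w3.org/2000/01/rdf-schema#label> '
           '"Berlin"@en '
           '<http://en.wikipedia.org/wiki/Berlin> .')
```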

Lots of thanks to:

  • Robert Isele, Anja Jentzsch, Christopher Sahnwaldt, and Paul Kreis (all Freie Universität Berlin) for reimplementing the DBpedia extraction framework in Scala, for extending the infobox-to-ontology mappings and for extracting the new DBpedia 3.5 knowledge base. 
  • Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the knowledge base via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the knowledge base into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.

The whole DBpedia team is very thankful to three companies which enabled us to do all this by supporting and sponsoring the DBpedia project:

  1. Neofonie GmbH (http://www.neofonie.de/index.jsp), a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications.
  2. Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com/).
  3. OpenLink Software (http://www.openlinksw.com/). OpenLink Software develops the Virtuoso Universal Server, an innovative enterprise grade server that cost-effectively delivers an unrivaled platform for Data Access, Integration and Management.

More information about DBpedia can be found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base!

Cheers,

Chris Bizer