DBpedia Homepage | Blog | Sourceforge Page

DBpedia 3.5.1 available on Amazon EC2

August 10, 2010 - 8:49 am by ChrisBizer - No comments »

As the Amazon Web Services are getting used a lot for cloud computing, we have started to provide current snapshots of the DBpedia dataset for this environment.

We provide the DBpedia dataset for Amazon Web Services in two ways:

1. Source files for being mounted:  http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2319

2. Virtuoso SPARQL store for being instanciated: http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtAWSDBpedia351C 

DBpedia 3.5.1 released

April 28, 2010 - 7:18 pm by AnjaJentzsch - No comments »

Hi all,

we are happy to announce the release of DBpedia 3.5.1.

This is primarily a bugfix release, which is based on Wikipedia dumps dating from March 2010. Thanks to the great community feedback about the previous DBpedia release, we were able to resolve the reported issues as well as to improve template to ontology mappings.

The new release provides the following improvements and changes compared to the DBpedia 3.5 release:

  1. Some abstracts contained unwanted WikiText markup. The detection of infoboxes and tables has been improved, so that even most pages with syntax errors have clean abstracts now.
  2. In 3.5 there has been an issue detecting interlanguage links, which led to some non-english statements having the wrong subject. This has been fixed.
  3. Image references to dummy images (e.g. http://en.wikipedia.org/wiki/Image:Replace_this_image.svg) have been removed.
  4. DBpedia 3.5.1 uses a stricter IRI validation now. Care has been taken to only discard URIs from Wikipedia, which are clearly invalid.
  5. Recognition of disambiguation pages has been improved, increasing the size from 247,000 to 769,000 triples.
  6. More geographic coordinates are extracted now, increasing its number from 1,200,000 to 1,500,000 in the english version.
  7. For this release, all Freebase links have been regenerated from the most recent freebase dump.

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads351. As usual, the data set is also available as Linked Data and via the DBpedia SPARQL endpoint.

Lots of thanks to:

  • Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the knowledge base via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the knowledge base into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.

The whole DBpedia team is very thankful to three companies which enabled us to do all this by supporting and sponsoring the DBpedia project:

  • Neofonie GmbH (http://www.neofonie.de), a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications.
  • Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com).
  • OpenLink Software (http://www.openlinksw.com). OpenLink Software develops the Virtuoso Universal Server, an innovative enterprise grade server that cost-effectively delivers an unrivaled platform for Data Access, Integration and Management.

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base!

Cheers,

Robert Isele and Anja Jentzsch

DBpedia 3.5 released

April 12, 2010 - 11:28 am by ChrisBizer - No comments »

Hi all,

we are happy to announce the release of DBpedia 3.5. The new release is based on Wikipedia dumps dating from March 2010. Compared to the 3.4 release, we were able to increase the quality of the DBpedia knowledge base by employing a new data extraction framework which applies various data cleansing heuristics as well as by extending the infobox-to-ontology mappings that guide the data extraction process.

The new DBpedia knowledge base describes more than 3.4 million things, out of which 1.47 million are classified in a consistent ontology, including 312,000 persons, 413,000 places, 94,000 music albums, 49,000 films, 15,000 video games, 140,000 organizations, 146,000 species and 4,600 diseases. The DBpedia data set features labels and abstracts for these 3.2 million things in up to 92 different languages; 1,460,000 links to images and 5,543,000 links to external web pages; 4,887,000 external links into other RDF datasets, 565,000 Wikipedia categories, and 75,000 YAGO categories. The DBpedia knowledge base altogether consists of over 1 billion pieces of information (RDF triples) out of which 257 million were extracted from the English edition of Wikipedia and 766 million were extracted from other language editions.

The new release provides the following improvements and changes compared to the DBpedia 3.4 release:

  1. The DBpedia extraction framework has been completely rewritten in Scala. The new framework dramatically reduces the extraction time of a single Wikipedia article from over 200 to about 13 milliseconds. All features of the previous PHP framework have been ported. In addition, the new framework can extract data from Wikipedia tables based on table-to-ontology mappings and is able to extract multiple infoboxes out of a single Wikipedia article. The data from each infobox is represented as a separate RDF resource. All resources that are extracted from a single page can be connected using custom RDF properties which are also defined in the mappings. A lot of work also went into the value parsers and the DBpedia 3.5 dataset should therefore be much cleaner than its predecessors. In addition, units of measurement are normalized to their respective SI unit, which makes querying DBpedia easier.
  2. The mapping language that is used to map Wikipedia infoboxes to the DBpedia Ontology has been redesigned. The documentation of the new mapping language is found at http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/trunk/extraction/core/doc/mapping%20language/
  3. In order to enable the DBpedia user community to extend and refine the infobox to ontology mappings, the mappings can be edited on the newly created wiki hosted on http://mappings.dbpedia.org.  At the moment, 303 template mappings are defined, which cover (including redirects) 1055 templates. On the wiki, the DBpedia Ontology can be edited by the community as well. At the moment, the ontology consists of 259 classes and about 1,200 properties. 
  4. The ontology properties extracted from infoboxes are now split into two data sets (For details see: http://wiki.dbpedia.org/Datasets):  1. The Ontology Infobox Properties dataset contains the properties as they are defined in the ontology (e.g. length). The range of a property is either an xsd schema type or a dimension of measurement, in which case the value is normalized to the respective SI unit. 2. The Ontology Infobox Properties (Specific) dataset contains properties which have been specialized for a specific class using a specific unit. e.g. the property height is specialized on the class Person using the unit centimeters instead of meters.
  5. The framework now resolves template redirects, making it possible to cover all redirects to an infobox on Wikipedia with a single mapping. 
  6. Three new extractors have been implemented:  1. PageIdExtractor extracting Wikipedia page IDs are extracted for each page. 2. RevisionExtractor extracting the latest revision of a page. 3. PNDExtractor extracting PND (Personnamendatei) identifiers.
  7. The data set now provides labels, abstracts, page links and infobox data in 92 different languages, which have been extracted from recent Wikipedia dumps as of March 2010.
  8. In addition the N-Triples datasets, N-Quads datasets are provided which include a provenance URI to each statement. The provenance URI denotes the origin of the extracted triple in Wikipedia (For details see: http://wiki.dbpedia.org/Datasets).You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads35. As usual, the data set is also available as Linked Data and via the DBpedia SPARQL endpoint.

Lots of thanks to:

  • Robert Isele, Anja Jentzsch, Christopher Sahnwaldt, and Paul Kreis (all Freie Universität Berlin) for reimplementing the DBpedia extraction framework in Scala, for extending the infobox-to-ontology mappings and for extracting the new DBpedia 3.5 knowledge base. 
  • Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the knowledge base via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the knowledge base into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.

The whole DBpedia team is very thankful to three companies which enabled us to do all this by supporting and sponsoring the DBpedia project:

  1. Neofonie GmbH (http://www.neofonie.de/index.jsp), a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications.
  2. Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com/).
  3. OpenLink Software (http://www.openlinksw.com/). OpenLink Software develops the Virtuoso Universal Server, an innovative enterprise grade server that cost-effectively delivers an unrivaled platform for Data Access, Integration and Management.

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base!

Cheers

Chris Bizer

OPEN POSITION: Move to Berlin, work on DBpedia (1 year full-time contract)

March 29, 2010 - 1:54 pm by ChrisBizer - No comments »

Hi all, 

the DBpedia Team at Freie Universität Berlin is looking for a developer/researcher who wants to contribute to the further development of the DBpedia information extraction framework, investigate approaches to annotate free-text with DBpedia URIs and participate in the various Linked Data efforts currently advanced by our team. 

Candidates should have

  • good programming skills in Java, in addition Scala and PHP are helpful.
  • a university degree preferably in computer science or information systems.  Previous knowledge of Semantic Web Technologies (RDF, SPARQL, Linked Data) and experience with information extraction and/or named entity recognition techniques are a plus. 

Contract start date: 15 May 2010
Duration: 1 year
Salary: around 40.000 Euro/year (German BAT IIa) 

You will be part of an innovative and cordial team and enjoy flexible work hours. After the year, chances are high that you will be able to choose between longer-term positions at Freie Universität Berlin and at Neofonie.  Please contact chris@bizer.de via email until 15 April 2010 for additional details and include information about your skills and experience into your mail. 

The whole DBpedia team is very thankful to neofonie GmbH for contributing to the development of the DBpedia project by financing this position. neofonie is a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications. 

Cheers, 

Chris  

 
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
chris@bizer.de

Invitation to contribute to DBpedia by improving the infobox mappings + New Scala-based Extraction Framework

March 12, 2010 - 1:07 pm by ChrisBizer - No comments »

Hi all,

in order to extract high quality data from Wikipedia, the DBpedia extraction framework relies on infobox to ontology mappings which define how Wikipedia infobox templates are mapped to classes of the DBpedia ontology.

Up to now, these mappings were defined only by the DBpedia team and as Wikipedia is huge and contains lots of different infobox templates, we were only able to define mappings for a small subset of all Wikipedia infoboxes and also only managed to map a subset of the properties of these infoboxes.

In order to enable the DBpedia user community to contribute to improving the coverage and the quality of the mappings, we have set up a public wiki at http://mappings.dbpedia.org/index.php/Main_Page which contains:

1.  all mappings that are currently used by the DBpedia extraction framework
2. the definition of the DBpedia ontology and
3. documentation for the DBpedia mapping language as well as step-by-step guides on how to extend and refine mappings and the ontology.

So if you are using DBpedia data and you you were always annoyed that DBpedia did not properly cover the infobox template that is most important to you, you are highly invited to extend the mappings and the ontology in the wiki. Your edits will be used for the next DBpedia release expected to be published in the first week of April.

The process of contributing to the ontology and the mappings is as follows:

1.  You familiarize yourself with the DBpedia mapping language by reading the documentation in the wiki.
2.  In order to prevent random SPAM, the wiki is read-only and new editors need to be confirmed by a member of the DBpedia team (currently Anja Jentzsch does the clearing). Therefore, please create an account in the wiki for yourself. After this, Anja will give you editing rights and you can edit the mappings as well as the ontology.
3. For contributing to the next DBpedia relase, you can edit until Sunday, March 21. After this, we will check the mappings and the ontology definition in the Wiki for consistency and then use both for the next DBpedia release.

So, we are starting kind of a social experiment on if the DBpedia user community is willing to contribute to the improvement of DBpedia and on how the DBpedia ontology develops through community contributions J

Please excuse, that it is currently still rather cumbersome to edit the mappings and the ontology. We are currently working on a visual editor for the mappings as well as a validation service, which will check edits to the mappings and test the new mappings against example pages from Wikipedia. We hope that we will be able to deploy these tools in the next two months, but still wanted to release the wiki as early as possible in order to already allow community contributions to the DBpedia 3.5 release.

If you have questions about the wiki and the mapping language, please ask them on the DBpedia mailing list where Anja and Robert will answer them.

What else is happening around DBpedia?

In order to speed up the data extraction process and to lay a solid foundation for the DBpedia Live extraction, we have ported the DBpedia extraction framework from PHP to Scala/Java. The new framework extracts exactly the same types of data from Wikipedia as the old framework, but processes a single page now in 13 milliseconds instead of the 200 milliseconds. In addition, the new framework can extract data from tables within articles and can handle multiple infobox templates per article. The new framework is available under GPL license in the DBpedia SVN and is documented at http://wiki.dbpedia.org/Documentation.

The whole DBpedia team is very thankful to two companies which enabled us to do all this by sponsoring the DBpedia project:

1. Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com/).
2.  Neofonie GmbH, a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications (http://www.neofonie.de/index.jsp).

Thank you a lot for your support!

I personally would also like to thank:

1.  Anja Jentzsch, Robert Isele, and Christopher Sahnwaldt for all their great work on implementing the new extraction framework and for setting up the mapping wiki.
2.  Andreas Lange and Sidney Bofah for correcting and extending the mappings in the Wiki.

Cheers,

Chris Bizer

Open Knowledge Conference 2010

December 9, 2009 - 12:35 pm by Sören - One comment »

OKCon, now in its fifth year, is the interdisciplinary conference that brings together individuals from across the open knowledge spectrum (such as also DBpedia in particular and Linked Open Data in general) for a day of presentations and workshops.Open knowledge promises significant social and economic benefits in a wide range of areas from governance to science, culture to technology. Opening up access to content and data can radically increase access and reuse, improving transparency, fostering innovation and increasing societal welfare.

In addition to high profile initiatives such as Wikipedia, OpenStreetMap and the Human Genome Project, there is enormous growth among open knowledge projects and communities at all levels. Moreover, in the last year, many governments across the world have begun opening up their data.

And it doesn’t stop there. In academia, open access to both publications and data has been gathering momentum, and similar calls to open up learning materials have been heard in education. Furthermore, this gathering flood of open data and content is the creator and driver of massive technological change. How can we make this data available, how can we connect it together, how can we use it collaborate and share our work?

  • where: London, UK
  • when: Saturday 24th April, 2010
  • www: http://www.okfn.org/okcon/
  • cfp: http://www.okfn.org/okcon/cfp/ (deadline: Jan 31st 2010)
  • hashtag: #okcon2010

DBpedia in ReadWriteWeb’s Top 10 Semantic Web Products of 2009

December 3, 2009 - 6:53 pm by Sören - No comments »

The new year is slowly approaching and people start compiling their top x lists of 2009, with x usually ranging between 10 and 365. ;-)

The popular Web technology blog ReadWriteWeb has chosen x with value 10 and picked DBpedia as one of their top Semantic Web products of 2009. Its actually the only non-commercial community project in the list and in good company with products such as Google’s Search Options and Rich Snippets, Apperture and Data.gov. Other picks, which btw. heavily use or link to DBpedia, include OpenCalais, Freebase, BBC Music and Zemanta.

Read the full article at http://www.readwriteweb.com/archives/top_10_semantic_web_products_of_2009.php

German government proclaims Faceted Wikipedia/DBpedia Search one of the 365 most innovative ideas in Germany

November 20, 2009 - 5:17 pm by ChrisBizer - 4 comments »

The German federal government has proclaimed Faceted Wikipedia Search as one of the 365 most innovative ideas in Germany in the context of the Deutschland – Land der Ideen competition. The competition showcases innovative ideas in areas such as science and technology, business, education, art and ecology. The patron of the competition is the German President Horst Köhler.

Faceted Wikipedia/DBpedia Search allows users to ask complex queries, like “Which Rivers flow into the Rhine and are longer than 50 kilometers?” or “Which Skyscrapers in China have more than 50 floors and have been constructed before the year 2000?” against Wikipedia. The answers to these queries are not generated using key word matching as the answers of search engines like Google or Yahoo, but are generated based on structured information that has been extracted from many different Wikipedia articles. Faceted Wikipedia/DBpedia Search allows users to query Wikipedia like a structured database and thus enables them to truly exploit Wikipedia’s collective intelligence.

Faceted Wikipedia/Dbpedia Search can be tested online at http://dbpedia.neofonie.de/browse/

Please click on the example queries below to see Faceted Wikipedia Search in action:

Faceted Wikipedia/DBpedia Search has been jointly developed by neofonie GmbH, Berlin and the Web-based Systems Group at Freie Universität Berlin. Technically, Faceted Wikipedia/DBpedia Search is based on the DBpedia data extraction framework and neofonie search technology.

 

The DBpedia data extraction framework extracts structured data from Wikipedia, such as the content of infoboxes which summarize relevant facts as a table on the top right-hand side of Wikipedia articles. The extracted data is represented using the Resource Description Framework, a data model for web-based systems. Currently, the framework extracts around 190 million facts from the English editon of Wikipedia and 289 million facts from Wikipedia editions in 90 further languages. The DBpedia data extraction framework is developed by the Web-based Systems group at Freie Universität Berlin and the Agile Knowledge Engineering and Semantic Web group at Universität Leizpig.

 

The neofonie search engine, neofonie search, is employed to execute complex queries over the extracted data. neofonie search aggregates RDF data from DBpedia with full-text data from Wikipedia. The aggregated data is then divided into hierarchical facets, composed of 200 types with 2.9 million values. In addition to providing the search technology and processing power, neofonie is also responsible for the hosting of the Faceted Wikipedia/DBpedia Search on the Amazon Elastic Compute Cloud (Amazon EC2).

As DBpedia covers a wide range of domains and has a high degree of conceptual overlap with various other open-license datasets, an increasing number of data publishers have started to set data-level links from their data sources to DBpedia, making DBpedia one of the cristalization points of the emerging Web of Linked Data. In the future, the links between databases will allow applications like Faceted Wikipedia Search to answer queries based not only on Wikipedia knowledge but based on knowledge from a world wide web of databases.

Faceted Wikipedia Search will be presented as part of the Land der Ideen series on April 12th, 2010 at neofonie, Berlin.

Additional information about the Land der Ideen competition, DBpedia, neofonie and the Web of Data is found at:

DBpedia 3.4 released

November 11, 2009 - 1:52 pm by ChrisBizer - One comment »

We are happy to announce the release of DBpedia 3.4. The new release is based on Wikipedia dumps dating from September 2009.

The new DBpedia data set describes more than 2.9 million things, including 282,000 persons, 339,000 places, 88,000 music albums, 44,000 films, 15,000 video games, 119,000 organizations, 130,000 species and 4400 diseases. The DBpedia data set now features labels and abstracts for these things in 91 different languages; 807,000 links to images and 3,840,000 links to external web pages; 4,878,100 external links into other RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories. The data set consists of 479 million pieces of information (RDF triples) out of which 190 million were extracted from the English edition of Wikipedia and 289 million were extracted from other language editions.

The new release provides the following improvements and changes compared to the DBpedia 3.3 release:

  1. the data set has been extracted from more recent Wikipedia dumps.
  2. the data set now provides labels, abstracts and infobox data in 91 different languages.
  3. we provide two different version of the DBpedia Infobox Ontology (loose and strict) in order to meet different application requirements. Please refer to http://wiki.dbpedia.org/Datasets#h18-11 for details.
  4. as Wikipedia has moved to dual-licensing, we also dual-license DBpedia. The DBpedia 3.4 data set is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License.
  5. the mapping-based infobox data extractor has been improved and now normalizes units of measurement.
  6. various bug fixes and improvements throughout the code base. Please refer to the change log for the complete list http://wiki.dbpedia.org/Changelog

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads34. As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql.

Lots of thanks to

  • Anja Jentzsch, Christopher Sahnwaldt, Robert Isele, and Paul Kreis (all Freie Universität Berlin) for improving the DBpedia extraction framework and for extracting the new data set.
  • Jens Lehmann and Sören Auer (Universität Leipzig) for providing new data set via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev for loading the new data set into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
  • Neofonie GmbH, Berlin, (http://www.neofonie.de/index.jsp) for supporting the DBpedia project by paying Christopher Sahnwaldt.

The next steps for the DBpedia project will be to

  1. synchronize Wikipedia and DBpedia by deploying the DBpedia live extraction which updates the DBpedia knowledge base immediately when a Wikipedia article changes.
  2. enable the DBpedia user community to edit and maintain the DBpedia ontology and the infobox mappings that are used by the extraction framework in a public Wiki.
  3. increase the quality of the extracted data by improving and fine-tuning the extraction code.

All this will hopefully happen soon.

Have fun with the new data set!

Cheers

Chris Bizer

DBpedia Faceted Browser and DBpedia User Script released

September 22, 2009 - 3:04 pm by ChristianBecker - One comment »

We are pleased to announce the release of the DBpedia Faceted Browser by Jona Christopher Sahnwaldt as well as the DBpedia User Script by Anja Jentzsch.

The DBpedia Faceted Browser allows you to explore Wikipedia via a faceted browsing interface. It supports keyword queries and offers relevant facets to narrow down search results, based on the DBpedia Ontology. In this manner, queries such as “recent films about Buenos Aires” can be easily and intuitively posed against DBpedia.
The DBpedia Faceted browser was developed in cooperation with the search engine company Neofonie, which also kindly provided the funding for this project.

The DBpedia User Script is a Greasemonkey script that enhances Wikipedia pages with a link to their corresponding DBpedia page and can be used within Firefox, Safari and Opera with a suitable Greasemonkey plugin.

Links: