
GlobalFactSync and WikidataCon 2019

We will be spending the next three days in Berlin at WikidataCon 2019, the conference for open data enthusiasts. From October 24th to 26th we will present the latest developments and first results of our work in the GlobalFactSyncRE project.

Short Project Intro

Funded by the Wikimedia Foundation, the project started in June 2019 and has two goals:

  • Answer the following questions:
    • How is data edited in Wikipedia and Wikidata?
    • Where does it come from?
    • How can we synchronize it globally?
  • Build an information system to synchronize facts between all Wikipedia language editions, Wikidata, DBpedia, and eventually multiple external sources, while also providing the respective references.

To help Wikipedians maintain their infoboxes, check facts for correctness, and also improve data in Wikidata, we take data from Wikipedia infoboxes in different languages, from Wikidata, and from DBpedia, and fuse it into our PreFusion dataset (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper.
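As a rough, purely illustrative sketch of the idea behind the fusion step (not the actual PreFusion schema; see the FlexiFusion paper for that), the snippet below groups the values that different sources provide for one subject-property pair and keeps their provenance. All names and numbers in it are made up.

    from collections import defaultdict

    # Hypothetical input: (subject, property, value, source) facts as they might be
    # extracted from infoboxes, Wikidata, and DBpedia. Identifiers are illustrative only.
    facts = [
        ("dbr:Berlin", "dbo:populationTotal", "3644826", "dbpedia-en"),
        ("dbr:Berlin", "dbo:populationTotal", "3644826", "wikidata"),
        ("dbr:Berlin", "dbo:populationTotal", "3520031", "dbpedia-fr"),
    ]

    # Group all source values per (subject, property) pair, keeping provenance --
    # roughly what a pre-fused record holds before a preferred value is selected.
    prefused = defaultdict(list)
    for subject, prop, value, source in facts:
        prefused[(subject, prop)].append({"value": value, "source": source})

    for (subject, prop), values in prefused.items():
        print(subject, prop, values)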

Can’t join the conference or want to find out more about GlobalFactSync?

No problem, the poster we are presenting at the conference is currently available here and will soon be available here. Additionally, why not go through our project timeline, follow up on our progress so far and find out what’s coming up next.

In case you have specific questions regarding GlobalFactSync or even some helpful feedback, just ping us via dbpedia@infai.org. We also have our new DBpedia Forum, home to the DBpedia Community, which is just waiting for you to start a discussion around GlobalFactSync. Why not start one now?

For general DBpedia news and updates follow us on Twitter.

…And if you are in Berlin at WikidataCon 2019, stop by our poster and talk to our developers. They are looking forward to a lively exchange with you.

All the best

yours,


DBpedia Association


One Billion derived Knowledge Graphs

… by and for Consumers until 2025

One Billion – what a mission! We are proud to announce that the DBpedia Databus website at https://databus.dbpedia.org and the SPARQL API at https://databus.dbpedia.org/(repo/sparql|yasgui) (docu) are in public beta now!

The system is usable (eat-your-own-dog-food tested), following a “working software over comprehensive documentation” approach. Due to its many components (website, SPARQL endpoints, Keycloak, mods, upload client, download client, and data debugging), we estimate approximately six months in beta to fix bugs, implement all features, and improve the details.
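As a minimal sketch of how the public SPARQL API above can be queried from a script: the snippet assumes the Python SPARQLWrapper library, and the dataid:Dataset class and dct:title property are assumptions based on the DataID vocabulary, not details taken from this post.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Public beta endpoint from the announcement above.
    endpoint = SPARQLWrapper("https://databus.dbpedia.org/repo/sparql")

    # List a few datasets registered on the Databus. The class and property names
    # below come from the DataID vocabulary and are assumptions about the repository
    # schema, not taken from this post.
    endpoint.setQuery("""
        PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
        PREFIX dct:    <http://purl.org/dc/terms/>

        SELECT ?dataset ?title WHERE {
            ?dataset a dataid:Dataset ;
                     dct:title ?title .
        } LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["dataset"]["value"], "-", row["title"]["value"])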

But let’s start from the beginning

The DBpedia Databus is a platform that captures the effort invested by data consumers who needed better data quality (fitness for use) in order to use the data, and lets them give their improvements back to the data source and to other consumers. The DBpedia Databus enables anybody to set up automated DBpedia-style extraction, mapping, and testing for any data they need. Databus incorporates features from DNS, Git, RSS, online forums, and Maven to harness the collective effort of data consumers.

Our vision

Professional consumers of data worldwide have already built stable cleaning and refinement chains for all available datasets, but their efforts are invisible and not reusable. Deep, cleaned data silos exist beyond the reach of publishers and other consumers, trapped locally in pipelines. Data is not oil that flows out of inflexible pipelines. Databus breaks existing pipelines into individual components that together form a decentralized but centrally coordinated data network. In this setup, data can flow back to previous components or to the original sources, or end up being consumed by external components.

One Billion interconnected, quality-controlled Knowledge Graphs until 2025

The Databus provides a platform for re-publishing these cleaned datasets with very little effort (leaving file traffic as the only cost factor), while offering the full benefits of built-in system features such as automated publication, structured querying, automatic ingestion, pluggable automated analysis, data testing via continuous integration, and automated application deployment (software with data). The impact is highly synergistic: just a few thousand professional consumers and research projects can expose millions of cleaned datasets, on par with what has long existed in deep silos and pipelines.

Towards a data consumer network

As we invert the paradigm from a publisher-centric view to a data consumer network, we will open the download valve to enable discovery of and access to massive amounts of data that is cleaner than what the original source publishes. The main DBpedia Knowledge Graph alone has 600k file downloads per year, complemented by downloads at over 20 chapters, e.g. http://es.dbpedia.org, as well as over 8 million daily hits on the main Virtuoso endpoint.

Community extensions from the alpha phase, such as DBkWik and LinkedHypernyms, are being loaded onto the bus and consolidated. We expect their number to reach over 100 by the end of the year. Companies and organisations that have previously uploaded their backlinks here will be able to migrate to the Databus. Other datasets are being cleaned and posted. In two of our research projects, LOD-GEOSS and PLASS, we will re-publish open datasets, clean them, and create collections, which will result in DBpedia-style knowledge graphs for energy systems and supply-chain management.

A new era for decentralized collaboration on data quality

DBpedia was established around producing a queryable knowledge graph derived from Wikipedia content that is able to answer questions like “What do Innsbruck and Leipzig have in common?” A community and consumer network quickly formed around this highly useful data, resulting in a large, well-structured, open knowledge graph that seeded the Linked Open Data Cloud, which is the largest knowledge graph on earth. The main lesson learned after these 13 years is that current data “copy” or “download” processes are inefficient by a magnitude that can only be grasped from a global perspective. Consumers spend tremendous effort fixing errors on the client side. If one unparseable line needs 15 minutes to find and fix, we are talking about 104 days of work for 10,000 downloads. Providers, on the other hand, will never have the resources to fix the last error, as the cost increases exponentially (80/20 rule).
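For reference, the 104-day figure follows directly from the two numbers in that sentence, counting round-the-clock 24-hour days:

    downloads = 10_000
    minutes_per_fix = 15

    total_minutes = downloads * minutes_per_fix   # 150,000 minutes
    total_days = total_minutes / 60 / 24          # ~104 round-the-clock days
    print(round(total_days, 1))                   # 104.2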

One billion knowledge graphs in mind – the progress so far

Discarding faulty data often means that a substitute source has to be found, which takes hours of research and might lead to similar problems. From the dozens of DBpedia Community meetings we have held, we can summarize that for each clean-up procedure, data transformation, linkset, or schema mapping that a consumer creates client-side, dozens of consumers have invested the same effort client-side before them, and none of it reaches the source or other consumers with the same problem. Holding the community meetings showed us just the tip of the iceberg.

As a foundation, we implemented a mappings wiki that allowed consumers to improve data quality centrally. The next advancement was the creation of the SHACL standard, co-edited by our former CTO and board member Dimitris Kontokostas. SHACL allows consumers to specify repeatable tests on graph structures and datatypes, which is an effective way to systematically assess data quality. We established the DBpedia Databus as a central platform to better capture the decentrally created, client-side value contributed by consumers.
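As a small illustration of the kind of repeatable test SHACL enables (a sketch using the rdflib and pyshacl Python libraries; the shape and sample data are invented for this example and are not taken from the DBpedia test suite):

    from rdflib import Graph
    from pyshacl import validate

    # Illustrative shape: every dbo:Person should have exactly one dbo:birthDate
    # typed as xsd:date.
    shapes = Graph().parse(data="""
        @prefix sh:  <http://www.w3.org/ns/shacl#> .
        @prefix dbo: <http://dbpedia.org/ontology/> .
        @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
        @prefix ex:  <http://example.org/shapes#> .

        ex:BirthDateShape a sh:NodeShape ;
            sh:targetClass dbo:Person ;
            sh:property [
                sh:path dbo:birthDate ;
                sh:datatype xsd:date ;
                sh:minCount 1 ;
                sh:maxCount 1 ;
            ] .
    """, format="turtle")

    # Sample data with a violation: the birth date is a plain string, not an xsd:date.
    data = Graph().parse(data="""
        @prefix dbo: <http://dbpedia.org/ontology/> .
        @prefix dbr: <http://dbpedia.org/resource/> .

        dbr:Some_Person a dbo:Person ;
            dbo:birthDate "circa 1900" .
    """, format="turtle")

    conforms, _, report = validate(data, shacl_graph=shapes)
    print(conforms)   # False: the datatype constraint is violated
    print(report)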

It is an open system; therefore, the value that is captured flows right back to everybody.

The full document, “DBpedia’s Databus and strategic initiative to facilitate ‘One Billion derived Knowledge Graphs by and for Consumers’ until 2025”, is available here.

If you have any feedback or questions, please use the DBpedia Forum, the “report issues” button, or dbpedia@infai.org.

Yours,

DBpedia Association