DBpedia Member Features – In the last few weeks, we gave DBpedia members the chance to present special products, tools and applications and share them with the community. We already published several posts in which DBpedia members provided unique insights. This week we will continue with Diffbot. They will present the Diffbot Knowledge Graph and various extraction tools. Have fun while reading!
Diffbot’s mission to “structure the world’s knowledge” began with Automatic Extraction APIs meant to pull structured data from most pages on the public web by leveraging machine learning rather than hand-crafted rules.
More recently, Diffbot has emerged as one of only three Western entities to crawl a vast majority of the web, utilizing our Automatic Extraction APIs to make the world’s largest commercially-available Knowledge Graph.
A Knowledge Graph At The Scale Of The Web
The Diffbot Knowledge Graph is automatically constructed by crawling and extracting data from over 60 billion web pages. It currently represents over 10 billion entities and 1 trillion facts about People, Organizations, Products, Articles, and Events, among others.
Users can access the Knowledge Graph programmatically through an API. Other ways to access the Knowledge Graph include a visual query interface and a range of integrations (e.g., Excel, Google Sheets, Tableau).
Whether you’re consuming Diffbot KG data in a visual “low code” way or programmatically, we’ve continually added features to our powerful query language (Diffbot Query Language, or DQL) to allow users to “query the web like a database.”
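To give a feel for "querying the web like a database", here is a minimal sketch of a programmatic DQL call in Python. The endpoint, parameter names, and response shape are assumptions based on typical usage and should be checked against Diffbot's current documentation; the token is a placeholder.

```python
import requests

# Hypothetical token and illustrative parameters: verify the exact DQL
# endpoint, parameters, and query syntax against Diffbot's documentation.
DIFFBOT_TOKEN = "YOUR_TOKEN"

resp = requests.get(
    "https://kg.diffbot.com/kg/v3/dql",
    params={
        "token": DIFFBOT_TOKEN,
        "type": "query",
        "query": 'type:Organization name:"DBpedia"',  # a simple DQL query
        "size": 5,
    },
    timeout=30,
)
for item in resp.json().get("data", []):
    print(item["entity"].get("name"))
```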
Guilt-Free Public Web Data
Current use cases for Diffbot’s Knowledge Graph and web data extraction products run the gamut and include data enrichment; lead enrichment; market intelligence; global news monitoring; large-scale product data extraction for ecommerce and supply chain; sentiment analysis of articles, discussions, and products; and data for machine learning. For all of the billions of facts in Diffbot’s KG, data provenance is preserved with the original source (a public URL) of each fact.
Entities, Relationships, and Sentiment From Private Text Corpora
The team of researchers at Diffbot has been developing new natural language processing techniques for years to improve their extraction and KG products. In October 2020, Diffbot made this technology commercially-available to all via the Natural Language API.
Our Natural Language API pulls out entities, relationships/facts, categories and sentiment from free-form texts. This allows organizations to turn unstructured texts into structured knowledge graphs.
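A rough sketch of such a call follows; the endpoint, field names, and response shape are assumptions based on typical usage and should be verified against Diffbot's Natural Language API documentation before use.

```python
import requests

# Hypothetical sketch: the endpoint, fields, and response layout are
# assumptions; the token is a placeholder.
resp = requests.post(
    "https://nl.diffbot.com/v1/",
    params={"fields": "entities,facts,sentiment", "token": "YOUR_TOKEN"},
    json={"content": "DBpedia extracts structured data from Wikipedia.",
          "lang": "en"},
    timeout=30,
)
doc = resp.json()
print("document sentiment:", doc.get("sentiment"))
for entity in doc.get("entities", []):
    print(entity.get("name"), entity.get("salience"))
```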
Diffbot and DBpedia
In addition to extracting data from web pages, Diffbot’s Knowledge Graph compiles public web data from many structured sources. One important source of knowledge is DBpedia. Diffbot also contributes to DBpedia by providing access to our extraction and KG services and collaborating with researchers in the DBpedia community. For a recent collaboration between DBpedia and Diffbot, be sure to check out the Diffbot track in DBpedia’s Autumn Hackathon for 2020.
A big thank you to Diffbot, especially Filipe Mesquita for presenting their innovative Knowledge Graph.
DBpedia Member Features – In the last few weeks, we gave DBpedia members the chance to present special products, tools and applications and share them with the community. We already published several posts in which DBpedia members provided unique insights. This week we will continue with FinScience. They will present their latest products, solutions and challenges. Have fun while reading!
A brief presentation of who we are
FinScience is an Italian data-driven fintech company founded in 2017 in Milan by former Google senior managers and alternative-data experts who combined their digital and financial expertise. FinScience thus originates from a merger of the worlds of finance and data science. The company leverages the founders' experience in Data Governance, Data Modeling and Data Platform solutions, further enriched through its technology role in the European consortium SSIX (Horizon 2020 programme), which focused on building social sentiment indices for financial purposes. FinScience applies proprietary AI-based technologies to combine financial data and insights with alternative data in order to generate new investment ideas, ESG scores and non-conventional lists of companies that can be included in investment products by financial operators.
FinScience's data-analysis pipeline is strongly grounded in the DBpedia ontology: in our experience, the greatest value lies in the ability to connect knowledge across different languages, to query automatically extracted structured information and to work with models that are updated rather frequently.
Products and solutions
FinScience retrieves content from the web daily: about 1.5 million web pages are visited every day across roughly 35,000 different domains. The content of these pages is extracted, interpreted and analysed with Natural Language Processing techniques to identify valuable information and sources. Thanks to structured information based on the DBpedia ontology, we can apply our proprietary AI algorithms to suggest the right investment opportunities to our customers. Our products are mainly based on the integration of this purely digital data – we call it "alternative data" – with traditional sources from the worlds of finance and sustainability. We describe these products briefly:
FinScience Platform for traders: it leverages the power of machine learning to help traders monitor specific companies, spot new trends in the financial market and access a high added-value selection of companies and themes.
ESG scoring: we provide an assessment of corporate ESG performance by combining internal data (traditional, self-disclosed data) with external 'alternative' data (stakeholder-generated data) in order to measure the gap between what companies communicate and how stakeholders perceive their sustainability commitments.
Thematic selections of listed companies: we create trend-driven selections oriented towards innovative themes. Our data, together with the analysis of financial specialists, contribute to the selection of a set of listed companies related to trending themes such as the Green New Deal, 5G technology or new medtech applications.
FinScience and DBpedia
As mentioned before, FinScience is strongly grounded in the DBpedia ontology, since we employ DBpedia Spotlight to perform Named Entity Recognition (NER), namely the automatic annotation of entities in a text. The NER task is performed with a two-step procedure. The first step consists in annotating the named entities of a text using DBpedia Spotlight. In particular, Spotlight links a mention in the text (identified by its name and its context within the text) to the DBpedia entity that maximizes the joint probability of occurrence of both. The model is pre-trained on texts extracted from Wikipedia. Note that each entity is represented by a link to a DBpedia page (see, e.g., http://dbpedia.org/page/Eni), a DBpedia type indicating the type of the entity according to this ontology, and other information.
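For illustration, the annotation step can be reproduced against the public DBpedia Spotlight endpoint. A minimal sketch in Python; the example text is ours, not taken from FinScience's pipeline:

```python
import requests

# Annotate free text with DBpedia entities via the public Spotlight API.
resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "Eni is an Italian oil and gas company.",
            "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=30,
)
for res in resp.json().get("Resources", []):
    # Each annotation carries the surface form, the linked DBpedia URI
    # and the DBpedia types of the entity.
    print(res["@surfaceForm"], "->", res["@URI"], res.get("@types"))
```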
Another interesting feature of this approach is that we have a one-to-one mapping between the Italian and English entities (and in general any language supported by DBpedia), allowing us to have a unified representation of an entity across languages. We obtain this kind of information by exploiting DBpedia's Virtuoso endpoint, which allows us to access the DBpedia dataset via SPARQL. By identifying the entities mentioned in online content, we can understand which topics are discussed and thus identify companies and trends that are spreading in the digital ecosystem, as well as analyse how they are related to each other.
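A sketch of this cross-language lookup against the public endpoint, using the SPARQLWrapper package (the entity is just an example):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Map an English DBpedia entity to its Italian counterpart via owl:sameAs.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?it WHERE {
        dbr:Eni owl:sameAs ?it .
        FILTER(STRSTARTS(STR(?it), "http://it.dbpedia.org/"))
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["it"]["value"])
```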
Challenges and next steps
One of the toughest challenges for us is finding an optimal way to update the models used by DBpedia Spotlight. Every day new entities and concepts arise, and we want to recognise them in the news we analyse. And that is not all. In addition to recognising new concepts, we need to be able to track an entity through all the updated versions of the model. In this way, we will not only be able to identify entities, but we will also have evidence of when some concepts were first generated, and we will know how they have changed over time, regardless of the names that have been used to identify them.
We are strongly involved in the DBpedia community and try to contribute our know-how. In particular, FinScience will contribute to infrastructure and Dockerfiles, as well as to finding issues in newly released projects (for instance, wikistats-extractor).
A big thank you to FinScience for presenting their products, challenges and contribution to DBpedia.
DBpedia Member Features – In the coming weeks, we will give DBpedia members the chance to present special products, tools and applications and share them with the community. We will publish several posts in which DBpedia members provide unique insights. This week the Semantic Web Company will present use cases for the PoolParty Semantic Suite. Have fun while reading!
by the Semantic Web Company
About 80 to 90 percent of the information companies generate is extremely diverse and unstructured – stored in text files, e-mails or similar documents – which makes it difficult to search and analyze. Knowledge graphs have become a well-known solution to this problem because they make it possible to extract information from text and link it to other data sources, whether structured or not. However, building a knowledge graph at enterprise scale can be challenging and time-consuming.
PoolParty Semantic Suite is the most complete and secure semantic platform on the global market. It is also the ideal tool to help companies build and manage Enterprise Knowledge Graphs. With PoolParty in place, you will have no problems extracting value from large amounts of heterogeneous data, no matter if it’s stored in a relational database or in text files. The platform provides comprehensive tools for the management of enterprise knowledge graphs along the entire life cycle. Here is a list of the main use cases for the PoolParty Semantic Suite:
Data linking and enrichment
Driven by the Linked Data initiative, increasing amounts of viable datasets about various topics have been published on the Semantic Web. PoolParty allows users to draw on these online resources, among them DBpedia, to enrich a thesaurus with more data easily and quickly.
Search and recommender engines
Rather than a broad search result with many (ir)relevant documents and messages but no valuable input, arrive at enriched, in-depth search results that provide relevant facts and contextualized answers to your specific questions. PoolParty Semantic Suite can be used to implement semantic search and recommendations that are relevant to your users.
Text Mining and Auto Tagging
Manually tagging an entire database is very time-consuming and often leads to inconsistent search results. PoolParty’s graph-based text mining can improve this process, making it faster, more consistent and more precise. This is achieved by using advanced text-mining algorithms and Natural Language Processing to automatically extract relevant entities, terms and other metadata from text and documents, helping drive in-depth text analytics.
Data Integration and Data Fabric
The Semantic Data Fabric is a new solution to data silos that combines two best-of-breed technologies, data catalogs and knowledge graphs, based on Semantic AI. With a semantic data fabric, companies can combine text and documents (unstructured) with data residing in relational databases and data warehouses (structured) to create a comprehensive view of their customers, employees, products, and other vital areas of business.
Taxonomies, Ontologies and Knowledge Graphs That Scale
With release 8.0 of the PoolParty Semantic Suite, users have even more options to conveniently generate, edit, and use knowledge graphs. In addition, the powerful and performant GraphDB by Ontotext has been added as PoolParty’s recommended embedded store and it is shipped as an add-on module. GraphDB is an enterprise-level graph database with state-of-the-art performance, scalability and security. This provides greater robustness to PoolParty and allows you to work with much larger taxonomies effectively.
A big thank you to the Semantic Web Company for presenting use cases for the PoolParty Semantic Suite.
DBpedia Member Features – In the coming weeks, we will give DBpedia members the chance to present special products, tools and applications and share them with the community. We will publish several posts in which DBpedia members provide unique insights. This week TerminusDB will show you how to use TerminusDB’s unique collaborative features to access DBpedia data. Have fun while reading!
by Luke Feeney from TerminusDB
This post introduces TerminusDB as a member of the DBpedia Association – proudly supporting the important work of DBpedia. It will also show you how to use TerminusDB’s unique collaborative features to access DBpedia data.
TerminusDB – an Open Source Knowledge Graph
TerminusDB is an open-source knowledge graph database that provides reliable, private & efficient revision control & collaboration. If you want to collaborate with colleagues or build data-intensive applications, nothing will make you more productive.
TerminusDB provides the full suite of revision control features and TerminusHub allows users to manage access to databases and collaboratively work on shared resources.
TerminusDB offers:
– data storage, sharing, and versioning capabilities
– use by your team or integration into your app
– local work that syncs when you push your changes
– querying, cleaning, and visualization of your data
– powerful version control and collaboration for your enterprise
The TerminusDB project originated in Trinity College Dublin in Ireland in 2015. From its earliest origins, TerminusDB worked with DBpedia through the ALIGNED project, which was a research project funded by Horizon 2020 that focused on building quality-centric software for data management.
While working on this project, and especially on our work building the architecture behind Seshat: The Global History Databank, we needed a solution that could enable collaboration among a highly distributed team on a shared database whose primary function was the curation of high-quality datasets with a very rich structure. While the scale of the data was not particularly large, the complexity was extremely high. Unfortunately, the linked-data and RDF toolchains were severely lacking – we evaluated several tools in an attempt to architect a solution; however, in the end we were forced to build an end-to-end solution ourselves.
Evolution of TerminusDB
In general, we think that computers are fantastic things because they allow you to leverage much more evidence when making decisions than would otherwise be possible. It is possible to write computer programs that automate the ingestion and analysis of unimaginably large quantities of data.
If the data is well chosen, it is almost always the case that computational analysis reveals new and surprising insights simply because it incorporates more evidence than could possibly be captured by a human brain. And because the universe is chaotic and there are combinatorial explosions of possibilities all over the place, evidence is always better than intuition when seeking insight.
As anybody who has grappled with computers and large quantities of data will know, it’s not as simple as that. Computers should be able to do most of this for us. It makes no sense that we are still writing the same simple and tedious data validation and transformation programs over and over ad infinitum. There must be a better way.
This is the problem that we set out to solve with TerminusDB. We identified two indispensable characteristics that were lacking in data management tools:
1. A rich and universally machine-interpretable modelling language. If we want computers to be able to transform data between different representations automatically, they need to be able to describe their data models to one another.
2. Revision control. Revision control technologies have been instrumental in turning software production from a craft into an engineering discipline because they make collaboration and coordination between large groups much more fault tolerant. The need for such capabilities is obvious when dealing with data, where the existence of multiple versions of the same underlying dataset is almost ubiquitous, with only the most primitive tool support.
TerminusDB and DBpedia
Team TerminusDB took part in the DBpedia Autumn Hackathon 2020. As you know, DBpedia is an extract of the structured data from Wikipedia.
You can read all about our DBpedia Autumn Hackathon adventures in this blog post.
Unlike many systems in the graph database world, TerminusDB is committed to open source. We believe in the principles of open source, open data and open science. We welcome all those data people who want to contribute to the general good of the world. This is very much in alignment with the DBpedia Association and community.
DBpedia on TerminusHub
TerminusHub is the collaboration point between TerminusDBs. You can push data to your colleagues and collaborators, pull updates (efficiently – just the diffs), and clone databases that are made available on the Hub (by the TerminusDB team or by others). Think of it as GitHub, but for data.
The DBpedia database is available on TerminusHub. You can clone the full DB in a couple of minutes (depending on your internet connection, of course) and get querying. TerminusDB uses succinct data structures to compress everything, which makes sharing large databases feasible – interested parties can find more technical detail in the white paper: https://github.com/terminusdb/terminusdb/blob/dev/docs/whitepaper/terminusdb.pdf
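Once cloned, the database can be queried with WOQL. The following is a rough sketch with the Python client; the package name, connection parameters, and the database id are assumptions that depend on your TerminusDB/TerminusHub setup and client version:

```python
from terminusdb_client import WOQLClient, WOQLQuery

# Assumed local server, credentials, and database id for a cloned DBpedia;
# adjust for your own setup and client version.
client = WOQLClient("https://127.0.0.1:6363")
client.connect(user="admin", key="root", db="dbpedia")

# List a handful of triples from the cloned database.
result = WOQLQuery().limit(5).triple(
    "v:Subject", "v:Predicate", "v:Object").execute(client)
for binding in result["bindings"]:
    print(binding["Subject"], binding["Predicate"], binding["Object"])
```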
TerminusDB in the DBpedia Association
We will contribute to DBpedia by working to improve the quality of data available, by introducing new datasets that can be integrated with DBpedia, and by participating fully in the community.
We are looking forward to a bright future together.
A big thank you to Luke and TerminusDB for presenting how TerminusDB works and how they would like to work with DBpedia in the future.
DBpedia Member Features – In the coming weeks, we will give DBpedia members the chance to present special products, tools and applications and share them with the community. We will publish several posts in which DBpedia members provide unique insights. This week GNOSS will give an overview of their products and business focus. Have fun while reading!
by Irene Martínez and Susana López from GNOSS
GNOSS (https://www.gnoss.com/) is a Spanish technology manufacturer that has developed its own platform for the construction and exploitation of knowledge graphs. GNOSS technology operates within the set of technologies that converge in semantically interpreted Artificial Intelligence: NLU (Natural Language Understanding); identification, extraction, disambiguation and linking of entities; and the construction of interrogation and knowledge-discovery systems based on inference and on systems that emulate forms of natural reasoning.
What is our business focus
The GNOSS project is positioned in the emerging market for Deep AI (Deep Understanding AI). By Deep AI we mean the convergence of symbolic AI and sub-symbolic AI.
GNOSS is the leading company in Spain in building knowledge ecosystems that can be interpreted and queried (interrogated) by both machines and people. These ecosystems integrate heterogeneous and distributed data, represented with technical vocabularies and ontologies written in machine-interpretable languages (OWL, RDF), and are consolidated and exploited through knowledge graphs.
The technology developed by GNOSS facilitates the construction, within the framework of the aforementioned ecosystems, of intelligent interrogation and search systems, information enrichment and context generation systems, advanced recommendation systems, predictive Business Intelligence systems based on dynamic visualizations and NLP/NLU systems.
GNOSS works in the cloud and is offered as a service. We have a complex and robust technological infrastructure designed to compute intelligent data in a framework that offers the maximum guarantee of security and best practices in technology services.
Products and Solutions
GNOSS Knowledge Graph Builder is a development platform upon which third parties can deploy their web projects, with a complete suite of components to build knowledge graphs and deploy an intelligent, semantically aware web in record time. The platform enables the interrogation of a knowledge graph by both machines and people. The main modules of the platform are 1) Metadata and Knowledge Graph Construction and Management; 2) Discovery, Reasoning and Analysis through Knowledge Graphs; and 3) Semantic Content Management. It also includes configurable features and functions for fast, agile and flexible adaptation and evolution of intelligent digital ecosystems.
Thanks to GNOSS Knowledge Graph Builder and GNOSS Sherlock Services, we have developed a suite of transversal solutions and some sectorial solutions based on the creation and exploitation of Knowledge Graphs.
The transversal solutions are: GNOSS Metadata Management Solution (for the integration of heterogeneous and distributed information into a semantic data layer, consolidating information into a knowledge graph); GNOSS Sherlock NLP-NLU Service (intelligent software services that let machines understand us, based on natural language processing, entity recognition and linking, and dynamic graphic visualizations); GNOSS Search Cloud (which includes intelligent information search, interrogation and retrieval systems; inferences; recommendations; and the generation of significant contexts); and GNOSS Semantic BI & Analytics (expressive and dynamic Business Intelligence based on Smart Data).
We have developed sectorial solutions in Education and University, Tourism, Culture and Museums, Healthcare, Communication and Marketing, Banking, Public Administration, and catalogs and supply-chain support.
What significance does DBpedia have for us
We think that the foundations of the great European Project of Symbolic AI are being created thanks to DBpedia and other Linked Open Data projects, which are turning the internet into a Universal Knowledge Base that works according to the principles and standards of Linked Open Data and the Semantic Web. This knowledge base, as the brain of the internet, would be the basis of the AI of the future. In this context, we consider that DBpedia plays a central role as an open general knowledge base and, therefore, as the core of the European Project of Symbolic AI.
Currently, some projects developed with the GNOSS platform already use DBpedia to access a large amount of structured, ontologically represented information in order to link entities, enrich information and offer contextual information. Two examples are the ‘Augmented Reading’ of the Museo del Prado in the descriptions of the Museum’s artworks, and the graph of related entities in Didactalia.net.
The ‘Augmented Reading’ of the Museo del Prado (see, for instance, ‘The Family of Carlos IV’ by Francisco de Goya) recognizes and extracts the entities contained in the artwork descriptions, thereby providing additional and contextual information about them, so that anyone can read the descriptions without giving up a deeper understanding of them.
In Didactalia.net, for a given educational resource, its Graph of related entities works as a conceptual map of the resource to support the teacher and the student in the teaching-learning process (see for instance this resource about Descartes).
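As a simple illustration of this kind of enrichment (not GNOSS's actual pipeline), contextual information for an artwork entity can be fetched from DBpedia's public endpoint; the entity used here is just an example:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Fetch an English abstract for Goya's painting to enrich a description.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?abstract WHERE {
        dbr:Charles_IV_of_Spain_and_His_Family dbo:abstract ?abstract .
        FILTER(lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["abstract"]["value"][:200], "...")
```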
How do we envision our future work with DBpedia
GNOSS can contribute to DBpedia at different levels, from making suggestions for further development to participating in strategy and commercialization.
We could collaborate with DBpedia by testing DBpedia releases and giving feedback on the use of DBpedia in projects for public and private organizations developed with GNOSS. Based on this, we could make suggestions for future work, drawing on our experience and customer needs in this context.
We could also participate in strategy and commercialization, in order to gain more presence in the sectors in which we work, such as healthcare, education, culture or communication, and to help private companies appreciate and benefit from the great value that DBpedia can offer them.
A big thank you to GNOSS for presenting their product and envisioning how they would like to work with DBpedia in the future.
The following post introduces ImageSnippets and how this tool profits from the use of DBpedia.
ImageSnippets – A Tool for Image Curation
For over two decades, ImageSnippets has been evolving as an ontology- and data-driven framework for image-annotation research. Representing the informal knowledge people have about the context and provenance of images as RDF/linked data is challenging, but it has also been an enlightening and engaging journey – not only in applying formal Semantic Web theory to building image graphs, but also in weaving together our interests with what others have been doing in the fields of semantic annotation and knowledge-graph building over these many years.
DBpedia provides the entities for our RDF descriptions
Since the beginning, we have always made use of DBpedia and other publicly available datasets to provide the entities for use in our RDF descriptions. Though ImageSnippets can be used to build special vocabularies around niche domains, our primary research is around relation-ontology building, and we prefer to avoid creating new entities unless we absolutely cannot find them through any other service.
When we first went live with our basic system in 2013, we began hand-building tens of thousands of triples using terms primarily from DBpedia (the core of the linked data cloud). While there would often be an overlap of terms with other datasets – almost a case of too many choices – we formed a best practice of preferentially using DBpedia terms as often as possible, because they gave us the most utility for reasoning using the SKOS concepts built into the DBpedia service. We have also made extensive use of DBpedia Spotlight for named-entity extraction.
How to combine DBpedia & Wikidata and make it useful for ImageSnippets
But the addition of the Wikidata Query Service over the past 18 months or so has given us a new challenge: how to work with both! Since DBpedia and Wikidata both have class relationships that we can reason from, we found ourselves in a position to examine DBpedia and Wikidata in concert, through the use of mapping techniques between the two datasets.
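The usual bridge between the two datasets is the owl:sameAs links that DBpedia publishes to Wikidata. A minimal lookup sketch follows; the entity is illustrative, not ImageSnippets' actual mapping code:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Resolve a DBpedia entity to its Wikidata counterpart via owl:sameAs.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?wd WHERE {
        dbr:Hotel owl:sameAs ?wd .
        FILTER(CONTAINS(STR(?wd), "wikidata.org"))
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["wd"]["value"])
```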
How it works: ImageSnippets & DBpedia
When an image is saved, we build inference graphs over results from both DBpedia and Wikidata. These graphs can be revealed with simple SPARQL queries at our endpoint, and queries over subclasses, taxons and SKOS concepts can find image results in our custom search tool. We have also recently added a pathfinder utility – highly useful for semantic explainability, as it returns the precise path of connections from an originating source entity to the target entity used in our custom image search.
Sometimes a query will produce very unintuitive results, and the pathfinder tool enables us to quickly locate semantic errors that lead to clear misclassifications (for example, a search for the Wikidata subclass of ‘communication medium’ reveals images of restaurants and hotels because of misclassifications in Wikidata). In this way we can quickly troubleshoot the results of queries, using the images as visual cues to explore the accuracy of the semantic modelling in both datasets.
We are very excited about the new directions that can come from knitting the two knowledge graphs together through our visual interface, and we believe there is great potential for ImageSnippets to serve a more complex role in cleaning and aligning the two datasets, using the images as our guides.
A big thank you to Margaret Warren for providing some insights into her work at ImageSnippets.
During the DBpedia Day in Leipzig, I gave a talk about how to use the facts contained in the DBpedia Knowledge Graph for generating coherent sentences and texts.
We essentially rely on Natural Language Generation (NLG) techniques to accomplish this task. NLG is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Despite community agreement on the actual text and speech output of these systems, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). A wide range of inputs has been used for NLG systems, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014) and semantic representations (Theune et al., 2001).
Why not generate text from Knowledge graphs?
The generation of natural language from the Semantic Web was introduced some years ago (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014; Staykova, 2014). However, it has recently gained substantial attention, and some challenges have been proposed to investigate the quality of automatically generated texts from RDF (Colin et al., 2016). Moreover, RDF has demonstrated a promising ability to support the creation of NLG benchmarks (Gardent et al., 2017). Still, English is the only language that has been widely targeted. Thus, we proposed RDF2NL, which can generate texts in languages other than English by relying on different language versions of SimpleNLG.
What is RDF2NL?
While the exciting avenue of using deep learning techniques in NLG approaches (Gatt and Krahmer, 2017) is open to this task, and deep learning has already shown promising results for RDF data (Sleimi and Gardent, 2016), the morphological richness of some languages led us to develop a rule-based approach. This was to ensure that we could identify the challenges imposed by each language from the Semantic Web perspective before applying Machine Learning (ML) algorithms. RDF2NL is able to generate either a single sentence or a summary of a given resource. It is based on Ngonga Ngomo et al.’s LD2NL and uses the Brazilian Portuguese, Spanish, French, German and Italian adaptations of SimpleNLG for the realization task.
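To make the rule-based idea concrete, here is a deliberately tiny, hypothetical Python sketch of template-based realization. RDF2NL itself is far richer and delegates surface realization to SimpleNLG; the templates and predicates below are illustrative only:

```python
# A toy illustration of rule-based realization: map predicates to
# sentence templates, then fill in entity labels. Not RDF2NL's code.
TEMPLATES = {
    "dbo:birthPlace": "{s} was born in {o}.",
    "dbo:author": "{o} wrote {s}.",
    "dbo:capital": "The capital of {s} is {o}.",
}

def verbalize(subject: str, predicate: str, obj: str) -> str:
    # Fall back to a generic template for unknown predicates.
    template = TEMPLATES.get(predicate, "{s} is related to {o}.")
    return template.format(s=subject, o=obj)

print(verbalize("Ludwig van Beethoven", "dbo:birthPlace", "Bonn"))
# -> Ludwig van Beethoven was born in Bonn.
```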
An example of RDF2NL application:
We envisioned a promising application of RDF2PT, which aims to support the automatic creation of benchmarking datasets for Named Entity Recognition (NER) and Entity Linking (EL) tasks. In Brazilian Portuguese, there is a lack of gold-standard datasets for these tasks, which makes investigating these problems difficult for the scientific community. Our aim was to create Brazilian Portuguese silver-standard datasets that can be uploaded into GERBIL for easy evaluation. To this end, we implemented RDF2PT (the Portuguese version of RDF2NL) in BENGAL, an approach for automatically generating NER benchmarks based on RDF triples and knowledge graphs. This application has already resulted in promising datasets, which we have used to investigate the capability of multilingual entity-linking systems to recognize and disambiguate entities in Brazilian Portuguese texts. Some results can be found below:
NER – http://gerbil.aksw.org/gerbil/experiment?id=201801050043
NED – http://gerbil.aksw.org/gerbil/experiment?id=201801110012
More application scenarios
Summarize or Explain KBs to non-experts
Create news automatically (automated journalism)
Summarize medical records
Generate technical manuals
Support the training of other NLP tasks
Generate product descriptions (eBay)
Deep Learning into RDF2NL
After devising our rule-based approach, we realized that RDF2NL is really good at selecting adequate content from the RDF triples, but the fluency of its generated texts remains a challenge. Therefore, we decided to move forward and work with neural network models to improve the fluency of the texts, as such models have already shown promising results in the generation of translations. We first focused on the generation of referring expressions, an essential part of generating texts: it decides how the NLG model will present the information about a given entity. For example, the referring expressions of the entity Barack Obama can be “the former president of the USA”, “Obama”, “Barack”, “he” and so on. Afterwards, we have been working on combining different NLG sub-tasks into single neural models to improve the fluency of our texts.
GSoC on it – Stay tuned!
Apart from trying to improve the fluency of our models, we previously relied on different language versions of SimpleNLG for the realization task. Nowadays, we have been investigating the generation of multiple languages with a single neural model. Our student has been working hard to provide nice results, and we are basically at the end of our GSoC project. So stay tuned to learn the outcome of this exciting project.
Many thanks to Diego for his contribution. If you want to write a guest post, share your results on the DBpedia Blog, and thus give your work more visibility and outreach, just ping us via firstname.lastname@example.org.
With timbr, WPSemantix and the DBpedia Association launch the first SQL Semantic Knowledge Graph that integrates Wikipedia and Wikidata Knowledge into SQL engines.
In part three of DBpedia’s growth hack blog series, we feature timbr, the latest development at DBpedia in collaboration with WPSemantix. Read on to find out how it works.
timbr – DBpedia SQL Semantic Knowledge Platform
Tel Aviv, Israel and Leipzig, Germany – July 18, 2019 – WP-Semantix (WPS), the “SQL Knowledge Graph Company™”, and the DBpedia Association – Institut für Angewandte Informatik e.V. – announced today the launch of the timbr-DBpedia SQL Semantic Knowledge Platform, a unique version of WPS’ timbr SQL Semantic Knowledge Graph that integrates the timbr-DBpedia ontology, timbr’s ontology explorer/visualizer and timbr’s SQL query service. It provides, for the first time, semantic access to DBpedia knowledge in SQL, thus facilitating the integration of DBpedia knowledge into standard data warehouses and data lakes.
DBpedia is the crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects and to publish it as files on the Databus and via online databases. This structured information resembles an open knowledge graph, which has been available to everyone on the Web for over a decade. Knowledge graphs are a new kind of database developed to store knowledge in a machine-readable form, organized as connected, relationship-rich data. After the publication of DBpedia (in parallel to Freebase) 12 years ago, knowledge graphs have become very successful, and Google uses a similar approach to create the knowledge cards displayed in search results.
Query the world’s knowledge in standard SQL
Amit Weitzner, founder and CEO at WPS commented: “Knowledge graphs use specialized languages, require resource-intensive, dedicated infrastructure and require costly ETL operations. That is, they did until timbr came along. timbr employs SQL – the most widely known database language, to eliminate the technological barriers to entry for using knowledge graphs and to implement Semantic Web principles to provide knowledge graph functionality in SQL. timbr enables modelling of data as connected, context-enriched concepts with inference and graph traversal capabilities while being queryable in standard SQL, to represent knowledge in data warehouses and data lakes. timbr-DBpedia is our first vertical application and we are very excited by the prospects of our cooperation with the DBpedia team to enable the largest user base to query the world’s knowledge in standard SQL.”
Sebastian Hellmann, executive director of the DBpedia Association, commented that timbr will help to explore the power of semantic technologies.
Prof. James Hendler, pioneer and a world-leading authority in Semantic Web technologies and WPS’ advisory board member commented “timbr can be a game-changing solution by enabling the semantic inference capabilities needed in many modelling applications to be done in SQL. This approach will enable many users to get the advantages of semantic AI technologies and data integration without the learning curve of many current systems. By giving more people access to the semantic version of Wikipedia, timbr-DBpedia will definitely contribute to allowing the majority of the market to explore the power of semantic technologies.”
timbr-DBpedia is available as a query service or licensed for use as SaaS or on-premises. See the DBpedia website: wiki.dbpedia.org/timbr.
WP-Semantix Ltd. (wpsemantix.com) is the developer of the timbr SQL semantic knowledge platform, a dynamic abstraction layer over relational and non-relational data, facilitating declaration and powerful exploration of semantically rich ontologies using a standard SQL query interface. timbr is natively accessible in Apache Spark, Python, R and SQL to empower data scientists to perform complex analytics and generate sophisticated ML algorithms. Its JDBC interface provides seamless integration with the most popular business intelligence solutions to make complex analytics accessible to analysts and domain experts across the organization.
WP-Semantix, timbr, “SQL Knowledge Graph”, “SQL Semantic Knowledge Graph” and associated marks and trademarks are registered trademarks of WP Semantix Ltd.
DBpedia is looking forward to this cooperation. Follow us on Twitter for the latest information and stay tuned for part four of our growth hack series. The next post features the GlobalFactSyncRe. Curious? You have to be a little more patient and wait till Thursday, July 25th.
Before getting into the technical details of Chaudron, did you know the term derives from Old French and denotes a large metal cooking pot? The word was used as an alternative form of chawdron, which means entrails. Entrails and cauldron – a combo that seems quite fitting with Halloween coming along.
And now for something completely different
To begin with, Chaudron is a dataset of more than two million triples that complements DBpedia with physical measures. The triples are automatically extracted from Wikipedia infoboxes using pattern-matching and formal-grammar approaches. The dataset adds triples to existing DBpedia resources and includes measures on resources such as chemical elements, railways, people, places, aircraft, dams and many other types.
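As a toy sketch of the pattern-matching idea (not Chaudron's actual grammar), extracting a value/unit pair from an infobox field might look like this; the unit list and field strings are invented for illustration:

```python
import re

# Toy pattern: a number followed by one of a few known units.
MEASURE = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>km/h|km|kg|m|s)\b")

def extract_measure(infobox_value: str):
    """Return (value, unit) if the field contains a recognizable measure."""
    match = MEASURE.search(infobox_value)
    if match:
        return float(match.group("value")), match.group("unit")
    return None

print(extract_measure("height = 324 m"))          # (324.0, 'm')
print(extract_measure("top speed = 574.8 km/h"))  # (574.8, 'km/h')
```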
Guest article by Victor de Boer, Vrije Universiteit Amsterdam, NL, member of NL-DBpedia
Who uses DBpedia anyway?…
This question started a research project for Frank Walraven, an Information Sciences Master student at Vrije Universiteit Amsterdam (VUA). The question came up during one of the meetings of the Dutch DBpedia chapter, of which VUA is a member.
If DBpedia users and their usage are better understood, this can lead to better servicing of those DBpedia users, for example by prioritizing the enrichment or improvement of specific sections of DBpedia. Characterizing the use(r)s of a Linked Open Dataset is an inherently challenging task, because in an open web world it is difficult to tell who is accessing your digital resources.
Frank conducted his MSc project research at the Dutch National Library, using a hybrid approach that combines a data-driven method based on user-log analysis with a short survey to get to know the users of the dataset.
As a scope, Frank selected just the Dutch DBpedia dataset. For the data-driven part of the method, Frank used a complete user log of HTTP requests on the Dutch DBpedia. This log file consisted of over 4.5 million entries and covered both URI lookups and SPARQL endpoint requests. For this research, he included only a subset of the URI lookups.
Step I – Analysis of IP Addresses of DBpedia Users
As a first analysis step, the requests’ origin IPs were categorized. Five classes were identified (A–E), with the vast majority of IP addresses falling into class A: very large networks and bots. Most of the IP addresses in this class could be traced back to search-engine indexing bots such as those from Yahoo or Google. In classes B–E, Frank manually traced the top 30 most frequently encountered IP addresses. He concluded that even there, 60% of the requests came from bots, 10% definitely did not, and the remaining 30% were unclear.
Step II – Identification of Page Requests
The second analysis step in the data-driven method consisted of identifying which types of pages were requested most. To cluster the thousands of DBpedia URI requests, Frank retrieved the ‘categories’ of the pages, which are extracted from Wikipedia category links. An example is the “Android_TV” resource, which has two categories: “Google” and “Android_(operating_system)”. Following skos:broader links, a ‘level 2 category’ could also be found to aggregate to an even higher level of abstraction (a minimal query sketch follows the list below). As not all resources have such categories, this does not give a complete picture, but it does provide some idea of the most popular categories of requested items. After normalizing for categories with large numbers of incoming links (for example, the category “non-endangered animal”), the most popular categories were:
1. Domestic & International movies,
4. Dutch & International municipality information and
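The category lookup behind this clustering can be sketched with a query like the following, shown against the main DBpedia endpoint (the Dutch chapter's endpoint works the same way); the resource is the example from above:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Categories of a resource plus one skos:broader step, mirroring the
# 'level 2 category' aggregation described above.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    SELECT ?cat ?broader WHERE {
        dbr:Android_TV dct:subject ?cat .
        OPTIONAL { ?cat skos:broader ?broader }
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["cat"]["value"], "->", row.get("broader", {}).get("value"))
```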
Additionally, Frank set up a user survey to corroborate this evidence. The survey contained questions about the how and why of the respondents’ use of the Dutch DBpedia, including the categories they were most interested in.
The survey was distributed via the Dutch DBpedia website and via Twitter. However, the endeavour attracted only 5 respondents. This illustrates a difficulty of the problem: users of the DBpedia resource are not necessarily easy to reach through communication channels. The five respondents were all quite closely related to the chapter, but the results were interesting nonetheless. Most of the respondents used the DBpedia SPARQL endpoint. The full results of the survey can be found in Frank’s thesis; in terms of corroboration, the survey revealed that four of the five categories found in the data-driven method were also identified in the top five results from the survey. The fifth category identified in the survey was ‘geography’, which could be matched to the fifth from the data-driven method.
Frank’s research shows that characterizing DBpedia users remains a challenging problem, even with a combination of data-driven and user-driven methods. Yet it is indeed possible to get an indication of the most-used categories on DBpedia. Within the Dutch DBpedia chapter, we are currently considering follow-up research questions based on Frank’s research. For further information about the work of the Dutch DBpedia chapter, please visit their website.
A big thanks to the Dutch DBpedia Chapter for supervising this research and providing insights via this post.