How does EpiK Perform Knowledge Extraction?

Apr 4, 2024

Constructing a knowledge graph is the foundation for subsequent applications, and its prerequisite is extracting data from various sources. For a domain-specific knowledge graph, the data mainly comes from two channels. One is the business's own data, which usually lives in the company’s internal database tables and is stored in a structured form. The other is publicly available data crawled from the web, which typically takes the form of web pages and is therefore unstructured.

The former generally needs only simple preprocessing before it can be used as input for subsequent AI systems, whereas the latter usually requires techniques such as natural language processing to extract structured information. For example, in the search example mentioned above, the relationship between Bill Gates and Melinda Gates can be extracted from unstructured sources such as Wikipedia.
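To make the "simple preprocessing" of structured data concrete, here is a minimal sketch of turning table rows into subject-predicate-object triples. The column names, the sample rows, and the `rows_to_triples` helper are all hypothetical illustrations, not part of EpiK's actual pipeline.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def rows_to_triples(rows, key_column):
    """Turn every non-key, non-empty column of a row into one triple."""
    triples = []
    for row in rows:
        subject = row[key_column]
        for column, value in row.items():
            if column == key_column or value in (None, ""):
                continue  # skip the key itself and empty cells
            triples.append(Triple(subject, column, str(value)))
    return triples


# Hypothetical rows, as they might come out of an internal database table.
rows = [
    {"name": "Hilton Times Square", "city": "New York", "type": "Hotel"},
    {"name": "Virgil's BBQ", "city": "New York", "type": "Restaurant"},
]
triples = rows_to_triples(rows, key_column="name")
for triple in triples:
    print(triple)
```

Because the rows are already structured, the mapping is mechanical: the key column names the entity and every other column becomes a relationship or attribute.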

The main challenge of information extraction lies in dealing with unstructured data. The diagram below provides an example: on the left is a piece of unstructured English text, and on the right are the entities and relationships extracted from it. Constructing a knowledge graph like this involves the following natural language processing techniques:

a. Named Entity Recognition (NER): This technique aims to identify and classify named entities in text, such as person names, locations, organizations, etc. It helps identify the entities of interest in unstructured data.

b. Relation Extraction: This technique focuses on extracting relationships between entities mentioned in the text. It involves identifying the subject and object entities and determining the type of relationship between them.

c. Entity Resolution: This technique identifies and links different mentions of the same entity within the text. It aims to unify references to the same entity, even when they use different names or aliases.

d. Coreference Resolution: Coreference resolution deals with determining when different expressions or pronouns in the text refer to the same entity. It helps understand the context and resolve references to entities mentioned earlier in the text.

The specific implementation details of these techniques are beyond the scope of this explanation. Interested readers can refer to relevant resources or explore dedicated courses on natural language processing for a more in-depth understanding.

The first technique is Named Entity Recognition (NER), which extracts entities from the text and assigns each one a label. For example, we can extract the entity “NYC” from the text above and label it “Location”. We can also extract “Virgil’s BBQ” and label it “Restaurant”. NER is a relatively mature technology, and existing tools are available for the task.
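As a rough illustration of the labeling step, the sketch below uses a plain dictionary lookup (a gazetteer) rather than a trained model. Production NER tools are statistical and far more robust. The sample sentence, the gazetteer entries, and the `recognize_entities` function are assumptions made up for this example, since the original text from the diagram is not reproduced here.

```python
import re

# Hypothetical gazetteer mapping surface forms to labels.
GAZETTEER = {
    "NYC": "Location",
    "Times Square": "Location",
    "Virgil's BBQ": "Restaurant",
    "hotel": "Hotel",
}


def recognize_entities(text, gazetteer):
    """Return (start, end, surface, label) for each gazetteer match, in text order."""
    matches = []
    for surface, label in gazetteer.items():
        # Require non-word characters (or string edges) around the match.
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            matches.append((m.start(), m.end(), surface, label))
    return sorted(matches)


# Made-up stand-in for the example text.
text = "The hotel is near Times Square in NYC, a short walk from Virgil's BBQ."
entities = recognize_entities(text, GAZETTEER)
for start, end, surface, label in entities:
    print(f"{surface!r} -> {label} [{start}:{end}]")
```

A lookup like this only finds names it already knows, which is exactly the gap that trained NER models fill.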

The second technique is relation extraction, which extracts the relationships between entities from the text. For instance, the relationship between the entities “hotel” and “Hilton property” could be “in”, the relationship between “hotel” and “Times Square” could be “near”, and so on. These extracted relationships become the edges of the knowledge graph.
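One very simple way to approximate relation extraction is to look at the words between two adjacent entity mentions and match them against trigger words. The sketch below does that; the trigger list, the sample sentence, and the `extract_relations` function are hypothetical, and real systems typically rely on learned classifiers rather than hand-written patterns.

```python
import re

# Hypothetical trigger patterns: words between two mentions decide the relation.
RELATION_PATTERNS = [
    (re.compile(r"\bnear\b"), "near"),
    (re.compile(r"\bin\b"), "in"),
]


def extract_relations(text, entity_spans):
    """entity_spans: (start, end, surface) tuples sorted by start offset.

    Only adjacent pairs of mentions are checked, and the first matching
    pattern wins.
    """
    relations = []
    for (_, end1, left), (start2, _, right) in zip(entity_spans, entity_spans[1:]):
        between = text[end1:start2]
        for pattern, relation in RELATION_PATTERNS:
            if pattern.search(between):
                relations.append((left, relation, right))
                break
    return relations


# Made-up sentence and mentions for illustration.
text = "The hotel is near Times Square in NYC."
mentions = ["hotel", "Times Square", "NYC"]
spans = sorted((text.index(m), text.index(m) + len(m), m) for m in mentions)
relations = extract_relations(text, spans)
print(relations)
```

Each resulting tuple reads as subject, relationship, object, which maps directly onto an edge in the graph.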

In addition, two harder problems arise during named entity recognition and relation extraction. One is entity resolution, which deals with different mentions that refer to the same entity. For example, “NYC” and “New York” are different surface strings but refer to the same city, so they need to be merged into a single entity. Entity resolution reduces the number of distinct entities and helps reduce sparsity in the knowledge graph.
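The merging step can be sketched with an alias table that maps each known surface form to one canonical name. The alias entries, the sample triples, and the function names are assumptions for illustration. In practice the hard part is building such a mapping automatically, which this sketch does not attempt.

```python
# Hypothetical alias table: normalized surface form -> canonical entity name.
ALIASES = {
    "nyc": "New York",
    "new york": "New York",
    "new york city": "New York",
}


def canonical_name(mention, aliases):
    """Normalize case and whitespace, then look the mention up in the alias table."""
    key = " ".join(mention.lower().split())
    return aliases.get(key, mention)  # unknown mentions are left unchanged


def resolve_triples(triples, aliases):
    """Rewrite subjects and objects to canonical names and drop duplicates."""
    resolved = {
        (canonical_name(s, aliases), p, canonical_name(o, aliases))
        for s, p, o in triples
    }
    return sorted(resolved)


# Made-up triples that mention the same city under two names.
triples = [
    ("Times Square", "in", "NYC"),
    ("Times Square", "in", "New York"),
    ("Virgil's BBQ", "in", "NYC"),
]
merged = resolve_triples(triples, ALIASES)
print(merged)
```

After resolution, the two "Times Square" triples collapse into one and both places point at the same city node, which is the reduction in sparsity described above.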

The other issue is coreference resolution, which determines the referent of pronouns such as “it,” “he,” and “she” in the text. For example, in this text, both occurrences of “it” refer to the entity “hotel” that has been labeled. Coreference resolution aims to establish the correct referential links between pronouns and their corresponding entities.
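A naive baseline for this linking step is to attach each pronoun to the nearest entity mention that precedes it. The sketch below implements only that heuristic; the sample text and the `resolve_pronouns` function are made up for illustration, and the heuristic will pick the wrong antecedent whenever another entity sits between the pronoun and its true referent.

```python
import re

PRONOUNS = {"it", "he", "she"}


def resolve_pronouns(text, entity_spans):
    """entity_spans: (start, end, surface) tuples for labeled entities.

    Returns (pronoun_start, pronoun, antecedent) for each pronoun that has
    at least one entity mention before it.
    """
    links = []
    for m in re.finditer(r"\b\w+\b", text):
        if m.group().lower() not in PRONOUNS:
            continue
        preceding = [span for span in entity_spans if span[1] <= m.start()]
        if preceding:
            # Nearest preceding mention = the one that ends last.
            antecedent = max(preceding, key=lambda span: span[1])
            links.append((m.start(), m.group(), antecedent[2]))
    return links


# Made-up text with one labeled entity.
text = "The hotel is great. It is clean, and it is quiet."
start = text.index("hotel")
entity_spans = [(start, start + len("hotel"), "hotel")]
links = resolve_pronouns(text, entity_spans)
print(links)
```

The weakness of this baseline is one reason coreference resolution is usually handled by trained models that consider gender, number, and sentence structure.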

Entity resolution and coreference resolution are more challenging than the previous two tasks.


Written by EpiK Protocol