Why is a Knowledge Graph important? Why does EpiK persist in building a Knowledge Graph?

EpiK Protocol
Feb 29, 2024

A Knowledge Graph is a particular type of graph structure that contains semantic information about entities and captures their relationships and connections. Knowledge Graphs have gained wide attention and application in industry and academia in recent years. Many industries have built their own Knowledge Graph construction platforms to meet their specific needs.

Taking Sage Knowledge Base as an example, EpiK demonstrates the importance of Knowledge Graphs and several standard models used in their construction.

Sage Knowledge Base is a Knowledge Graph platform that enables the application of Knowledge Graphs in various domains, such as question-answering systems, recommendation systems, drug discovery, and stock market prediction. This case study introduces a technique for automatically learning representations of Knowledge Graphs: the entities and relationships in a Knowledge Graph are transformed into numerical vectors (embeddings) that models can compute with. The entire process can itself be seen as a graph with interconnected elements.
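
As a rough illustration of what "numerical representations" means here, the sketch below assigns a random starting vector to every entity and relationship found in a list of triples. The function name `build_embeddings`, the sample triples, and the vector size are all hypothetical, and in practice these vectors would be trained rather than left random.

```python
import random

def build_embeddings(triples, dim, seed=0):
    """Give every entity and relationship a random starting vector (hypothetical helper)."""
    rng = random.Random(seed)
    entities = sorted({e for h, _, t in triples for e in (h, t)})
    relations = sorted({r for _, r, _ in triples})

    def init_vector():
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    entity_vectors = {e: init_vector() for e in entities}
    relation_vectors = {r: init_vector() for r in relations}
    return entity_vectors, relation_vectors

# Made-up triples in (head, relationship, tail) form.
TRIPLES = [
    ("alice", "born_in", "paris"),
    ("paris", "city_of", "france"),
    ("alice", "nationality", "france"),
]
entity_vectors, relation_vectors = build_embeddings(TRIPLES, dim=4)
```

The models discussed in the rest of this article differ mainly in how they score a triple from such vectors and in how the vectors are learned.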

AutoML (Automated Machine Learning) is a family of methods that automatically search for good model configurations and parameter settings when dealing with complex data and varied tasks. AutoML simplifies the modeling process and reduces the need for extensive domain expertise. The following sections explain in detail how to design models for learning Knowledge Graph representations.

1. Triple-based models

There are several types of triple-based models:

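Distance-based models, such as TransE, treat a relationship as a translation from the head entity to the tail entity and score a triple by how small the remaining distance is. Neural network-based models use neural networks to learn representations of the knowledge in triples. For example, some models employ Multi-Layer Perceptrons (MLPs), convolutional neural networks (as in ConvE), or recurrent neural networks (as in RSN) to represent the knowledge within triples.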

Another type is bilinear models, which perform well in learning representations based on triple knowledge. Bilinear models can effectively express knowledge while maintaining relatively low model complexity. However, bilinear models may generalize poorly: a relationship matrix that suits one Knowledge Graph does not necessarily suit another. To address this, Fourth Paradigm proposed a method called AutoSF, which automatically searches for the best relationship matrix and so brings these models under one unified framework.
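
To make the idea of a "relationship matrix" concrete, here is a minimal sketch, assuming each embedding is split into four chunks and the relationship matrix is a 4x4 grid of blocks. The encoding (`k` for a block built from relationship chunk `k`, `-k` for its negation, `0` for an empty block) and the names `block_score`, `DISTMULT`, and `COMPLEX` are illustrative choices, not AutoSF's actual code.

```python
def split4(vec):
    """Split an embedding into four equal chunks."""
    n = len(vec) // 4
    return [vec[i * n:(i + 1) * n] for i in range(4)]

def triple_dot(a, b, c):
    """Sum of element-wise products of three vectors."""
    return sum(x * y * z for x, y, z in zip(a, b, c))

def block_score(structure, h, r, t):
    """Bilinear score of (h, r, t) under a 4x4 block structure."""
    hs, rs, ts = split4(h), split4(r), split4(t)
    total = 0.0
    for i in range(4):
        for j in range(4):
            k = structure[i][j]
            if k != 0:
                sign = 1.0 if k > 0 else -1.0
                total += sign * triple_dot(hs[i], rs[abs(k) - 1], ts[j])
    return total

# DistMult: a purely diagonal relationship matrix (always symmetric in h and t).
DISTMULT = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
# ComplEx written in the same block form (chunks 1-2 real parts, chunks 3-4 imaginary parts).
COMPLEX = [[1, 0, 3, 0], [0, 2, 0, 4], [-3, 0, 1, 0], [0, -4, 0, 2]]

h = [1.0, 2.0, 3.0, 4.0]
r = [1.0, 1.0, 1.0, 1.0]
t = [4.0, 3.0, 2.0, 1.0]
distmult_forward = block_score(DISTMULT, h, r, t)
distmult_backward = block_score(DISTMULT, t, r, h)
complex_forward = block_score(COMPLEX, h, r, t)
complex_backward = block_score(COMPLEX, t, r, h)
```

With these toy vectors the diagonal structure gives the same score in both directions, while the ComplEx-style structure does not. Searching over which blocks are filled is therefore a search over which relationship patterns the model can express.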

AutoSF and its improved version, AutoSF+, introduce algorithms to enhance search efficiency.

In AutoSF, the search begins by finding the relationship matrix that performs best on the validation set, under a constraint that limits the matrix to only four non-zero elements. Then, the number of non-zero elements gradually increases in each iteration, and the best relationship matrix is searched under the corresponding constraint. However, progressive search algorithms may get trapped in local optima. To overcome this, AutoSF+ uses a genetic algorithm as its search strategy. Specifically, the matrices undergo mutation and crossover operations in each iteration, and only the better-performing matrices are retained. Eventually, a better relationship matrix can be found.
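
The progressive search described above can be sketched roughly as follows. This is a simplified illustration and not AutoSF's implementation: it adds one non-zero block per step, keeps a small beam of candidates, and replaces "train the model and measure its performance" with a toy score (`toy_validation_score`) that counts agreement with a made-up `TARGET` structure. It assumes `max_nonzero` is at most 16.

```python
# Hypothetical target structure used only by the toy evaluation below.
TARGET = [[1, 0, 3, 0], [0, 2, 0, 4], [-3, 0, 1, 0], [0, -4, 0, 2]]

def toy_validation_score(structure):
    """Stand-in for real training plus evaluation: count blocks matching TARGET."""
    return sum(structure[i][j] == TARGET[i][j] for i in range(4) for j in range(4))

def count_nonzero(structure):
    return sum(1 for row in structure for k in row if k != 0)

def expand(structure):
    """Yield every structure obtained by filling one empty block."""
    for i in range(4):
        for j in range(4):
            if structure[i][j] == 0:
                for k in (1, 2, 3, 4, -1, -2, -3, -4):
                    child = [row[:] for row in structure]
                    child[i][j] = k
                    yield child

def progressive_search(evaluate, seeds, max_nonzero, top_k=2):
    """Greedily grow the number of non-zero blocks, keeping the top_k candidates."""
    beam = sorted(seeds, key=evaluate, reverse=True)[:top_k]
    best = beam[0]
    for _ in range(count_nonzero(best), max_nonzero):
        children = [child for s in beam for child in expand(s)]
        beam = sorted(children, key=evaluate, reverse=True)[:top_k]
        if evaluate(beam[0]) > evaluate(best):
            best = beam[0]
    return best

seed = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0], [0, 0, 0, 2]]
best = progressive_search(toy_validation_score, [seed], max_nonzero=8)
```

Because each step only extends the current best candidates, a poor early choice is never revisited, which is the local-optimum risk mentioned above.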

Furthermore, the process of selecting matrices also takes domain-specific characteristics into account. Fourth Paradigm designed a filter to skip unnecessary evaluations. In addition, a predictor estimates a model's performance from symmetry-related features of its relationship matrix, using a two-layer MLP to produce the score.
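
A rough sketch of a genetic search with mutation, crossover, and a filter might look like the code below. Everything here is an assumption made for illustration: the filter only skips structures that were already seen, the learned predictor is not modelled at all, and `toy_validation_score` with its made-up `TARGET` again stands in for real training and evaluation.

```python
import random

# Hypothetical target structure used only by the toy evaluation below.
TARGET = [[1, 0, 3, 0], [0, 2, 0, 4], [-3, 0, 1, 0], [0, -4, 0, 2]]

def toy_validation_score(structure):
    return sum(structure[i][j] == TARGET[i][j] for i in range(4) for j in range(4))

def as_key(structure):
    """Hashable form of a structure, used by the duplicate filter."""
    return tuple(tuple(row) for row in structure)

def mutate(structure, rng):
    """Overwrite one randomly chosen block with a random value."""
    child = [row[:] for row in structure]
    i, j = rng.randrange(4), rng.randrange(4)
    child[i][j] = rng.choice([0, 1, 2, 3, 4, -1, -2, -3, -4])
    return child

def crossover(a, b, rng):
    """Build a child by taking each row from one of the two parents."""
    return [(a if rng.random() < 0.5 else b)[i][:] for i in range(4)]

def evolve(evaluate, population, generations, rng):
    seen = {as_key(s) for s in population}
    size = len(population)
    for _ in range(generations):
        children = []
        for _ in range(size):
            a, b = rng.sample(population, 2)
            child = mutate(crossover(a, b, rng), rng)
            if as_key(child) in seen:  # filter: skip already-evaluated structures
                continue
            seen.add(as_key(child))
            children.append(child)
        # Keep only the better-performing structures for the next generation.
        population = sorted(population + children, key=evaluate, reverse=True)[:size]
    return population[0]

start = [
    [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]],
    [[1, 0, 3, 0], [0, 2, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [-3, 0, 1, 0], [0, -4, 0, 2]],
    [[0, 0, 0, 0], [0, 2, 0, 4], [0, 0, 1, 0], [0, 0, 0, 0]],
]
best = evolve(toy_validation_score, start, generations=30, rng=random.Random(0))
```

Since the current population always competes with its children, the best score never decreases from one generation to the next, while crossover lets the search combine useful blocks from different candidates.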

2. Path-based models

Path-based models help us better understand the information contained in triplets. We can obtain more information about their relationship if we connect the head entity and the tail entity with a path. These paths preserve the original triplet information, express more complex relationships, and include long-chain information among multiple triplets.
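
As a small illustration of connecting a head entity and a tail entity with paths, the sketch below enumerates relationship paths up to a maximum length in a toy graph. The function `relation_paths` and the sample triples are hypothetical.

```python
from collections import defaultdict

def relation_paths(triples, head, tail, max_len):
    """Return every relationship path of at most max_len hops from head to tail."""
    out_edges = defaultdict(list)
    for h, r, t in triples:
        out_edges[h].append((r, t))
    found = []

    def walk(node, path, visited):
        if node == tail and path:
            found.append(tuple(path))
            return
        if len(path) == max_len:
            return
        for r, nxt in out_edges[node]:
            if nxt not in visited:  # avoid cycles
                walk(nxt, path + [r], visited | {nxt})

    walk(head, [], {head})
    return found

TRIPLES = [
    ("alice", "born_in", "paris"),
    ("paris", "city_of", "france"),
    ("alice", "nationality", "france"),
    ("alice", "works_in", "lyon"),
    ("lyon", "city_of", "france"),
]
paths = relation_paths(TRIPLES, "alice", "france", max_len=2)
```

Here the direct relationship `nationality` sits alongside two-hop paths such as `born_in` followed by `city_of`, which is the kind of long-chain evidence that path-based models exploit.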

One such model is PTransE, an extension of the TransE model. It replaces the single relationship of a triplet with paths composed of multiple relationships. As in TransE, relationships are represented as translation vectors, so the relationships along a path can be added together to represent the connection between the head and tail entities. However, PTransE has some limitations, such as difficulty handling one-to-many and symmetric relationships.
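
A minimal sketch of the translation idea, assuming L1 distance and simple addition as the way to compose a path (a simplification of PTransE, with hand-picked two-dimensional vectors rather than learned ones):

```python
def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def transe_score(h, r, t):
    """Higher is better: 0 means head + relationship lands exactly on tail."""
    return -l1_distance(vec_add(h, r), t)

def compose_path(relations):
    """Represent a path by adding up the vectors of its relationships."""
    total = [0.0] * len(relations[0])
    for r in relations:
        total = vec_add(total, r)
    return total

def path_score(h, relations, t):
    return transe_score(h, compose_path(relations), t)

# Hand-picked toy vectors (not learned).
alice, paris, france = [0.0, 0.0], [1.0, 0.0], [1.0, 1.0]
born_in, city_of, nationality = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

direct = transe_score(alice, nationality, france)
via_path = path_score(alice, [born_in, city_of], france)
wrong_tail = path_score(alice, [born_in, city_of], paris)
```

The sketch also hints at the one-to-many limitation: head plus relationship lands on a single point, so two different tails for the same head and relationship cannot both receive a perfect score.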

Another model is RSN (Recurrent Skipping Network), which uses a recurrent neural network to model paths. It has a skip-connection structure among entity nodes and outputs the corresponding entity and relationship embeddings. RSN captures long-range information effectively but struggles to capture the semantic information within individual triplets.

To address these issues, Fourth Paradigm proposed the Interstellar model. It treats each triplet in a path as a separate unit and determines the optimal model structure through search. When the connections between triplets in a path are cut, the model reduces to a triplet-based approach. When the tail entity of each triplet in the path is removed (replaced by a zero vector), the model resembles PTransE. In this way, the model can automatically adapt to the semantic information and properties within paths.

Fourth Paradigm took a two-part approach when designing the model's search algorithm. The overall model structure is evaluated independently, while the finer details are assessed with a more efficient evaluation method. This keeps the evaluation of model structures accurate while keeping the search itself efficient.

3. Graph neural network-based models

Several graph neural network-based models exist, such as R-GCN, CompGCN, and KE-GCN. These models take simple embedding representations as input and use graph neural networks to aggregate node information, generating more advanced embedding representations. They then compute triple scores with a scoring function (such as that of TransE or ConvE). However, these models have some limitations. First, they require loading the entire Knowledge Graph, which makes them hard to scale. Second, they still rely on scoring functions, and the improvement in model performance after applying graph neural networks is limited.

To address these issues, a model called GraIL was proposed in 2020. It extracts subgraphs containing the given head and tail entities from the original Knowledge Graph. Then, it labels the entities based on their distances to the head and tail entities, uses graph neural networks to propagate and update information within the subgraphs, and finally obtains scores for triplets composed of the head and tail entities. GraIL can perform inductive reasoning without pre-trained embeddings, meaning it can score entities that were not seen during training. However, subgraph extraction and label generation in GraIL are computationally expensive.
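
The subgraph extraction and distance labeling steps can be sketched as follows. This is a simplified illustration rather than GraIL's actual procedure: it treats edges as undirected, keeps the entities within `k` hops of both the head and the tail, and labels each one with its pair of distances. The function names and the sample triples are hypothetical.

```python
from collections import defaultdict, deque

def bfs_distances(adj, source, max_hops):
    """Breadth-first distances from source, limited to max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def enclosing_subgraph_labels(triples, head, tail, k):
    """Label entities near both head and tail with (distance to head, distance to tail)."""
    adj = defaultdict(set)
    for h, _, t in triples:
        adj[h].add(t)
        adj[t].add(h)
    d_head = bfs_distances(adj, head, k)
    d_tail = bfs_distances(adj, tail, k)
    shared = set(d_head) & set(d_tail)
    return {node: (d_head[node], d_tail[node]) for node in shared}

TRIPLES = [
    ("alice", "born_in", "paris"),
    ("paris", "city_of", "france"),
    ("alice", "nationality", "france"),
    ("alice", "works_in", "lyon"),
    ("lyon", "city_of", "france"),
    ("bob", "born_in", "berlin"),
]
labels = enclosing_subgraph_labels(TRIPLES, "alice", "france", k=2)
```

The labels depend only on graph structure and not on entity identity, which is what allows this style of model to score entities it has never seen. The two breadth-first searches per candidate triplet also show where the extraction cost comes from.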

To overcome the limitations of the models above, Fourth Paradigm proposed the RED-GNN model. First, it pads the paths in the graph to the same length using a special relationship called identity. Then, all the paths are stacked to form a relationship subgraph (a directed graph), which preserves the direction of information propagation.

Because the layers of relationship subgraphs that share a head entity overlap, RED-GNN can model all of those subgraphs together using dynamic programming. Traditional graph neural network computation requires a separate calculation for each relationship subgraph, whereas RED-GNN uses recursive and parallel computation to handle multiple relationship subgraphs simultaneously. The information aggregation in the graph neural network is based on the relationship information between entities and uses attention mechanisms to fuse information adaptively.
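
The dynamic programming idea can be sketched as a layer-by-layer recursion that starts from the head entity and reaches every candidate tail in one pass. This is only an illustration under strong simplifications: messages are "source state plus relationship vector", aggregation is a plain sum, and the attention weights and learned transformations are left out. The function `propagate_from_head` and the toy vectors are hypothetical.

```python
def propagate_from_head(triples, rel_vectors, head, num_layers, dim):
    """Compute a state for every entity reachable from head in num_layers steps."""
    states = {head: [0.0] * dim}
    for _ in range(num_layers):
        # Identity self-loops pad shorter paths to the full length.
        edges = list(triples) + [(e, "identity", e) for e in states]
        new_states = {}
        for source, rel, target in edges:
            if source not in states:
                continue  # not reached yet, so nothing to propagate
            message = [s + r for s, r in zip(states[source], rel_vectors[rel])]
            acc = new_states.setdefault(target, [0.0] * dim)
            for i in range(dim):
                acc[i] += message[i]
        states = new_states
    return states

TRIPLES = [
    ("alice", "born_in", "paris"),
    ("paris", "city_of", "france"),
    ("alice", "nationality", "france"),
    ("alice", "works_in", "lyon"),
    ("lyon", "city_of", "france"),
]
REL_VECTORS = {
    "born_in": [1.0, 0.0],
    "city_of": [0.0, 1.0],
    "nationality": [1.0, 1.0],
    "works_in": [2.0, 0.0],
    "identity": [0.0, 0.0],
}
states = propagate_from_head(TRIPLES, REL_VECTORS, "alice", num_layers=2, dim=2)
```

After two layers, the state of `france` is the sum over all four padded length-2 paths from `alice`, and the states of `paris` and `lyon` come out of the same pass. Reusing each layer's results for every tail, instead of recomputing a subgraph per head and tail pair, is the saving that dynamic programming provides.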

The figure below shows the results of comparative experiments. RED-GNN is a purely subgraph-based model that does not rely on embedding representations, making it suitable for transfer and inductive reasoning. The experimental results indicate that even without using any embedding representations, RED-GNN performs better than most methods. Due to its fewer model parameters and the adoption of dynamic programming algorithms, RED-GNN exhibits significant improvements in computational efficiency compared to GraIL.

EpiK Protocol

The World’s First Decentralized Protocol for AI Data Construction, Storage and Sharing. https://www.epik-protocol.io/ | https://twitter.com/EpikProtocol