Creating a Mathematical Knowledge Graph Using Lean 4’s Mathlib Library

EpiK Protocol
9 min read · Sep 20, 2024

--

1. Introduction

Lean 4 is a computer-assisted proof system, also known as an interactive theorem prover, used for the formal verification of mathematical theorems and of the correctness of computer programs. Mathlib is the mathematical library for Lean 4, maintained by an active community that includes many mathematicians and computer scientists. The system has been applied to cutting-edge mathematical research: in the Liquid Tensor Experiment, initiated by Fields Medalist Peter Scholze, a key theorem of condensed mathematics was successfully verified in Lean. This project builds a knowledge graph based on Lean 4, consisting of the following components:

  1. Data Acquisition
  2. Graph Design
  3. Graph Visualization
  4. Model Training and Inference Based on the Graph

2. Construction and Visualization of the Lean Knowledge Graph

2.1 Data Acquisition and Organization

The mathematical library mathlib4 for the interactive proof system Lean 4 can be obtained from GitHub. After installing Lean 4, mathlib4 can be compiled, and then the doc-gen4 tool can be used to generate documentation in HTML format. The documentation for mathlib4 includes all mathematical objects defined in the library, with the main types being:

  • def (mathematical definitions)
  • abbrev (abbreviations, which unfold automatically)
  • structure (record types, similar to data structures in programming languages)
  • class (typeclasses, which support typeclass inference)
  • instance (typeclass instances, e.g. the natural numbers as an instance of “monoid”)
  • inductive (inductively defined types)
  • axiom (axioms)
  • lemma (lemmas)
  • theorem (theorems)
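To make these kinds concrete, here is a small standalone Lean 4 sketch illustrating a few of them (the declarations are illustrative examples, not taken from mathlib4):

```lean
-- def: a plain mathematical definition
def double (n : Nat) : Nat := n + n

-- abbrev: a reducible abbreviation that unfolds automatically
abbrev NatPair := Nat × Nat

-- structure: a record with named fields
structure Point where
  x : Nat
  y : Nat

-- theorem: a proved statement (here the proof is by definitional equality)
theorem double_eq (n : Nat) : double n = n + n := rfl
```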

To extract information such as the names of these mathematical objects, their hierarchical relationships, connections, and natural language descriptions, the Python library BeautifulSoup is used to analyze the HTML documents and convert them into JSON format.
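A sketch of this extraction step is shown below. The HTML class names (`decl`, `decl_kind`, `decl_name`, `decl_doc`) and the sample snippet are assumptions standing in for doc-gen4’s actual markup, which may differ between versions:

```python
from bs4 import BeautifulSoup
import json

# Assumed shape of one declaration block in the generated HTML docs.
html = """
<div class="decl" id="Nat.add_comm">
  <div class="decl_header">
    <span class="decl_kind">theorem</span>
    <span class="decl_name">Nat.add_comm</span>
  </div>
  <p class="decl_doc">Addition of natural numbers is commutative.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
entries = []
for decl in soup.find_all("div", class_="decl"):
    doc = decl.find(class_="decl_doc")
    entries.append({
        "name": decl.find(class_="decl_name").get_text(strip=True),
        "kind": decl.find(class_="decl_kind").get_text(strip=True),
        "description": doc.get_text(strip=True) if doc else None,
    })

print(json.dumps(entries, indent=2))
```

Running this over every page of the generated documentation yields one JSON entry per mathematical object.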

Figure 1: JSON Hierarchical Structure of a Theorem

2.2 Graph Design

The tree structure of the obtained JSON data is illustrated in Figure 1. The following definitions are made according to its structural characteristics:

Entity Definitions:

  1. Entity Name: The value corresponding to the name field in each JSON entry
  • Attribute: The value corresponding to the kind field in the JSON
  • Node Type: Node
    Explanation: Each mathematical theorem is defined as an entity with its name, and its kind serves as the node’s attribute.
  2. Entity Name: [Node_name]’s description
  • Attribute: The value corresponding to the description field in the JSON
  • Node Type: Description
    Explanation: A corresponding description node is created based on the description associated with each theorem.
  3. Entity Name: [Node_name]’s attribute
  • Attribute: The value corresponding to the attribute field in the JSON
  • Node Type: Attribute
    Explanation: A corresponding attribute node is created based on the attributes associated with each theorem.
  4. Entity Name: Names from the Class_fields list
  • Node Type: Field
    Explanation: A corresponding field node is created based on the fields associated with each theorem.
  5. Entity Name: Names from the Classparent list
  • Node Type: Parent
    Explanation: A corresponding parent node is created based on the theorem’s inheritance relationship.

Figure 2: Visualization of the Knowledge Graph

Relationship Definitions:

  1. Relationship Name: “HAS_LINK_TO”
    Explanation: Represents the relational connection between definitions and theorems.
  2. Relationship Name: “HAS_ATTRIBUTE”
    Explanation: Indicates that a theorem possesses its corresponding attribute.
  3. Relationship Name: “HAS_DESCRIPTION”
    Explanation: Indicates that a theorem has its corresponding description. Some theorems may not have a corresponding description.
  4. Relationship Name: “HAS_FIELD”
    Explanation: Indicates that a theorem has its corresponding field.
  5. Relationship Name: “EXTENDS_TO”
    Explanation: Indicates the parent class of a theorem, showing that the current theorem inherits from its parent.

Using the definitions above, Python is employed to parse each entry in the JSON data and construct the knowledge graph, which is ultimately saved to a Neo4j graph database to complete the preliminary construction of the graph. Due to the extensive content of the graph, only a portion is selected for visualization, as shown in Figure 2.
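A minimal sketch of this construction step, generating Cypher MERGE statements for one JSON entry. The node labels, property keys, and exact JSON field names follow the scheme above but are assumptions; in practice the statements would be executed through the official neo4j Python driver (`session.run`):

```python
def entry_to_cypher(entry):
    """Generate Cypher MERGE statements for one JSON entry, following
    the entity/relationship scheme above (names are assumptions)."""
    name, kind = entry["name"], entry.get("kind", "unknown")
    stmts = [f"MERGE (n:Node {{name: '{name}', kind: '{kind}'}})"]
    if entry.get("description"):
        stmts.append(
            f"MATCH (n:Node {{name: '{name}'}}) "
            f"MERGE (d:Description {{name: \"{name}'s description\"}}) "
            "MERGE (n)-[:HAS_DESCRIPTION]->(d)"
        )
    for field in entry.get("Class_fields", []):
        stmts.append(
            f"MATCH (n:Node {{name: '{name}'}}) "
            f"MERGE (f:Field {{name: '{field}'}}) "
            "MERGE (n)-[:HAS_FIELD]->(f)"
        )
    for parent in entry.get("Classparent", []):
        stmts.append(
            f"MATCH (n:Node {{name: '{name}'}}) "
            f"MERGE (p:Parent {{name: '{parent}'}}) "
            "MERGE (n)-[:EXTENDS_TO]->(p)"
        )
    return stmts

# Illustrative entry (not taken from mathlib4).
entry = {"name": "Monoid", "kind": "class",
         "description": "A monoid is a semigroup with an identity element.",
         "Class_fields": ["one", "mul"], "Classparent": ["Semigroup"]}
for s in entry_to_cypher(entry):
    print(s)
```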

2.3 Analysis of the Relationship Structure Characteristics of the Graph

In the preliminary graph, there is a relationship defined as ‘HAS_LINK_TO’. While this relationship indicates a connection between nodes, it does not reflect the direct logical meaning of the mathematical theorems. To address this, the connection is refined by incorporating the kind of the connected node, redefining the relationship as ‘HAS_’ + [linked node’s kind]. However, since the kind of some nodes is unknown, ‘HAS_LINK_TO’ is retained for those cases.
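This refinement rule is simple to state in code; a sketch assuming a name-to-kind lookup table built from the JSON (the example names are illustrative):

```python
def refine_relation(target_kind):
    """Refine HAS_LINK_TO using the linked node's kind;
    fall back to HAS_LINK_TO when the kind is unknown."""
    if target_kind:
        return "HAS_" + target_kind.upper()
    return "HAS_LINK_TO"

# Assumed lookup built while parsing the JSON entries.
kinds = {"Nat.add_comm": "theorem", "Nat": "inductive"}
print(refine_relation(kinds.get("Nat.add_comm")))  # HAS_THEOREM
print(refine_relation(kinds.get("Foo.bar")))       # HAS_LINK_TO (kind unknown)
```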

The restructured portion of the graph is illustrated in Figure 3.

Figure 3: Reconstructed Portion of the Knowledge Graph

Figure 4: Example of Partial Triples

3. Experimental Analysis Based on the NeuralKG Platform

3.1 Introduction to NeuralKG

NeuralKG is an open-source Python library for learning diverse knowledge graph representations. Through a unified framework, NeuralKG reproduces the link prediction results of many methods on standard benchmarks, sparing users the tedious task of re-implementing them, especially methods originally written in non-Python languages. It is also highly configurable and extensible, offering decoupled modules that can be mixed and matched, so developers and researchers can quickly implement their own models and tune training to reach strong performance efficiently.

3.2 Introduction to Experimental Algorithms

This experiment utilizes three categories of knowledge graph embedding (KGE) algorithms: conventional KGE, GNN-based KGE, and rule-based KGE. The conventional KGE employs TransE, DistMult, and ConvE; the GNN-based KGE uses RGCN and CompGCN; and the rule-based KGE utilizes ComplEx-NNE+AER.

  • TransE is a pioneering knowledge graph embedding method introduced in 2013 that scores triples with a distance metric. It treats entities as points in a vector space and relations as translation vectors in that space: for each fact (head entity, relation, tail entity), TransE tries to make the head entity vector plus the relation vector close to the tail entity vector, thereby learning low-dimensional embeddings of entities and relations. TransE is well suited to one-to-one relations but may struggle with one-to-many, many-to-one, and many-to-many relations.
  • DistMult is a matrix factorization-based knowledge graph embedding model introduced in 2015, aimed at capturing semantic information in knowledge graphs through vector representations of entities and relationships. The model calculates relationship scores using the dot product of entity pairs and relationship vectors, making it particularly suitable for modeling symmetric relationships. The simplicity and efficiency of DistMult have led to its widespread application in link prediction, entity disambiguation, and relation extraction tasks. However, it has limitations in handling asymmetric relationships, as the model assumes relationships are symmetric and uses diagonal matrices to represent relationships, limiting its ability to capture complex interactions. Despite these limitations, DistMult remains popular in knowledge graph research due to its simplicity and efficiency.
  • ConvE learns low-dimensional representations of entities and relationships using convolutional neural networks. It first processes the embedding representations of entities and relationships (usually initialized as random vectors) through an embedding layer. Then, it rearranges or reshapes the entity and relationship embeddings into a two-dimensional matrix, which is used as input for the convolutional neural network. Through convolution operations, ConvE effectively captures local patterns and correlations between entity and relationship embeddings, aiding the model in better understanding and predicting complex relationships between entities.
  • RGCN is a neural network architecture specifically designed to handle graph-structured data, capable of addressing the heterogeneity of nodes and edges in a graph. RGCN introduces relationship-specific weights based on graph convolutional networks (GCN) to effectively encode complex entities and relationships in knowledge graphs. This approach allows RGCN to capture diverse types of relationships between entities, showing excellent performance in tasks like link prediction and entity classification.
  • CompGCN combines graph convolutional networks (GCN) with compositional operations to learn representations of both nodes (entities) and edges (relationships) simultaneously. This way, CompGCN not only learns the embedding representations of entities but also captures interactions and compositional patterns among relationships, enhancing the model’s understanding of knowledge graph structures. Through this innovative approach, CompGCN significantly improves the performance of knowledge graph embeddings, particularly in modeling and predicting complex relationships.
  • ComplEx-NNE+AER extends the complex-valued embedding model ComplEx with two simple constraints: non-negativity of entity representations (NNE) and approximate entailment between relations (AER). The complex space lets the model capture both symmetric and antisymmetric relations, while the constraints shape the embedding space with soft logical rules, improving expressiveness and generalization. This composite approach yields notable performance improvements in knowledge graph tasks such as link prediction.
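The contrast between the two simplest scoring functions above can be made concrete in a few lines of plain Python (toy two-dimensional embeddings; real models learn these vectors by gradient descent):

```python
import math

def transe_score(h, r, t):
    """TransE: negative L2 distance ||h + r - t||; closer to 0 is more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def distmult_score(h, r, t):
    """DistMult: trilinear product sum(h_i * r_i * t_i); symmetric in h and t."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

h, r, t = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
print(transe_score(h, r, t))  # h + r equals t exactly: a perfect translation
# DistMult cannot distinguish the direction of a relation:
print(distmult_score(h, r, t) == distmult_score(t, r, h))  # True
```

This symmetry is exactly the limitation of DistMult on asymmetric relations noted above, and the motivation for complex-valued models such as ComplEx.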

3.3 Experimental Design

In this experiment, the knowledge graph is divided into training, validation, and test sets in a ratio of 0.7, 0.2, and 0.1, respectively. The parameter initialization for each algorithm uses the default settings provided by NeuralKG. The algorithms are iteratively trained on the training set, and the optimal parameters and corresponding models are selected based on their performance on the validation set. Finally, the score of the model on the test set is taken as the final result.
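The 0.7/0.2/0.1 split can be sketched as follows (the seed and the triple format are illustrative; NeuralKG expects the splits as separate triple files):

```python
import random

def split_triples(triples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle triples and split into train/valid/test by the given ratios."""
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    n = len(triples)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    return (triples[:n_train],
            triples[n_train:n_train + n_valid],
            triples[n_train + n_valid:])

# Toy triples in (head, relation, tail) form.
triples = [(f"e{i}", "HAS_ATTRIBUTE", f"a{i}") for i in range(10)]
train, valid, test = split_triples(triples)
print(len(train), len(valid), len(test))  # 7 2 1
```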

3.4 Evaluation Metrics

This experiment employs two evaluation metrics:

  • MRR (Mean Reciprocal Rank): This is a commonly used metric for ranking tasks such as link prediction, particularly prevalent in knowledge graph embedding and recommender systems. For each query, the model returns a ranked list of candidates, and MRR averages the reciprocal rank of the correct answer over all queries. Mathematically, for a query set Q, where the rank of the correct answer for query q_i is denoted rank_i, MRR is defined as:

    MRR = (1 / |Q|) · Σ_{i=1}^{|Q|} (1 / rank_i)

  • Hit@K: This metric measures how often the correct answer is ranked within the top K positions. If the correct answer appears in the model’s top K results for a query, that query counts as a “hit,” and Hit@K is the proportion of hits across all queries, reflecting the model’s ability to place the correct answer among its top K predictions. Mathematically, for a query set Q, let hit_i = 1 if the correct answer for query q_i appears in the top K positions and hit_i = 0 otherwise; then Hit@K is defined as:

    Hit@K = (1 / |Q|) · Σ_{i=1}^{|Q|} hit_i
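Both metrics are straightforward to compute once the rank of the correct answer is known for each query; a minimal sketch:

```python
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank_i over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hit_at_k(ranks, k):
    """Hit@K: fraction of queries whose correct answer ranks in the top K."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 2, 10]  # rank of the correct entity for four toy queries
print(round(mrr(ranks), 3))  # 0.483
print(hit_at_k(ranks, 3))    # 0.75
```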

3.5 Experimental Results

3.6 Experimental Analysis

Based on the experimental results, we can draw the following conclusions:

  • Performance of KGE Models: Among the conventional KGE models, ConvE performs the best, achieving an MRR of 26.1%, with Hit@1, Hit@3, and Hit@10 reaching 17.0%, 30.1%, and 40.5%, respectively. TransE outperforms DistMult on all metrics, with a particularly clear advantage in Hit@1 and Hit@10, indicating a good balance between precision and recall.
  • Performance of GNN Models: Among the GNN-based models, CompGCN exhibits the best overall performance, with an MRR of 27.0%, and it surpasses RGCN and all conventional KGE models in Hit@1, Hit@3, and Hit@10. RGCN performs comparably to ConvE, nearly matching it in MRR and Hit@3, demonstrating the strong capability of GNN models in handling graph data.
  • Performance of Rule-Based Models: The rule-based model ComplEx-NNE+AER shows relatively weak performance, with its MRR and Hit metrics lower than those of the other models. In particular, its Hit@1 is only 10.5%, trailing the other models significantly and indicating limitations in its ability to find the most accurate answers.
  • Model Selection and Graph Features: Considering the specific features of the graph is crucial when selecting a model. The subdivision of the ‘HAS_LINK_TO’ relationship during graph construction has increased the logical complexity. This structural change may affect the performance of different models, especially GNN models that can capture complex graph structures and diverse relationships. The outstanding performance of CompGCN may stem from its effective modeling of the diversity of entity and relationship types within the graph.
  • Model Generalization Ability: The unknown kind of some nodes adds difficulty to model learning. In such cases, the model’s generalization ability becomes particularly important. GNN-based models, by considering the local graph structural information of nodes, may find it easier to generalize to unknown or sparse node types.
