Knowledge Graph of Figures in Modern Chinese History

EpiK Protocol
Dec 12, 2024

1. Introduction

Modern Chinese history has seen the emergence of numerous outstanding figures and events, with abundant historical materials forming a vast knowledge system. This project aims to collect and mine information about historical figures in modern China, constructing a knowledge graph of individuals and their related entities to assist in the retrieval and study of historical data.

2. Knowledge Graph Construction

2.1 Data Sources

Information on figures from modern Chinese history comes primarily from two websites, Baidu Baike and Lishi Ji. Using Python’s Scrapy framework, we obtained structured, semi-structured, and textual data for nearly 1,300 individuals. The structured data mainly includes names, courtesy names, birthplaces, and birth and death dates. The semi-structured data includes relationships and historical achievements. The textual data consists of biographies and comments, which may not be entirely accurate because editorial standards vary between contributors.
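
To make the three kinds of data concrete, here is a minimal sketch of how one scraped record might be represented. The class name, field names, and the sample values are all hypothetical; the article does not describe the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PersonRecord:
    """One scraped individual. All field names are assumptions."""
    # Structured data
    name: str
    courtesy_name: Optional[str] = None
    birthplace: Optional[str] = None
    birth_date: Optional[str] = None   # kept as raw text until cleaning
    death_date: Optional[str] = None
    # Semi-structured data: (relation label, related name) pairs and achievements
    relationships: List[Tuple[str, str]] = field(default_factory=list)
    achievements: List[str] = field(default_factory=list)
    # Textual data
    biography: str = ""
    comments: str = ""
    # Which website the record was scraped from
    source: str = ""


# Placeholder values for illustration only
record = PersonRecord(name="Example Person", source="example-source")
record.relationships.append(("Teacher", "Example Teacher"))
```

Keeping dates as raw text at this stage leaves validation to the later cleaning step rather than failing during scraping.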

2.2 Data Processing

2.2.1 Semi-structured Data Processing

  • Data Transformation: Extracting information from semi-structured data and converting it into a uniform format.

For example: Authored “xxx” -> Literary Work: “xxx”

  • Data Cleaning: Removing obviously incorrect information.

For example: Date of Birth: 1766. This clearly falls outside the scope of modern history and may indicate an error in the website’s linked information.

  • Data Integration: Merging data from the two sources, deleting any inconsistencies.
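
The three steps above can be sketched as follows. This is a minimal illustration, not the project's code: the rule table, the function names, and the birth-year bounds are assumptions, and "deleting any inconsistencies" is read here as dropping an attribute when the two sources disagree on its value.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical rewrite rules: raw phrase pattern -> uniform attribute name
TRANSFORM_RULES = [
    (re.compile(r'^Authored\s+[“"](.+?)[”"]$'), "Literary Work"),
]

# Assumed plausibility window for birth years; the project's real bounds
# are not stated in the article.
MIN_BIRTH_YEAR, MAX_BIRTH_YEAR = 1780, 1949


def transform(raw: str) -> Optional[Tuple[str, str]]:
    """Convert a raw phrase into an (attribute, value) pair, or None."""
    for pattern, attribute in TRANSFORM_RULES:
        match = pattern.match(raw.strip())
        if match:
            return attribute, match.group(1)
    return None


def is_plausible_birth_year(year: int) -> bool:
    """Flag birth years that fall outside the assumed window."""
    return MIN_BIRTH_YEAR <= year <= MAX_BIRTH_YEAR


def integrate(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, str]:
    """Merge two sources, dropping attributes on which they disagree."""
    merged = {}
    for key in a.keys() | b.keys():
        if key in a and key in b and a[key] != b[key]:
            continue  # conflicting values: discard the attribute
        merged[key] = a[key] if key in a else b[key]
    return merged
```

For instance, `transform('Authored “xxx”')` yields `("Literary Work", "xxx")`, and `is_plausible_birth_year(1766)` returns `False` under the assumed window.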

2.2.2 Text Data Processing

The project explored both semantic role labeling and deep learning-based entity-relation extraction methods.

2.2.2.1 Entity-Relation Extraction Based on LTP Semantic Role Labeling

LTP (Language Technology Platform) is an open-source Chinese natural language processing tool developed by Harbin Institute of Technology. It allows users to perform tasks such as word segmentation, part-of-speech tagging, and syntactic parsing.

Using LTP’s semantic role labeling to analyze sentences:

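from ltp import LTP

ltp = LTP()  # load the default model
# Segment the sentence; `hidden` holds the encoded state reused by later tasks
seg, hidden = ltp.seg(["王俊昌于1943年2月加入中国共产党。"])
# Semantic role labeling; keep_empty=False omits tokens that have no roles
srl = ltp.srl(hidden, keep_empty=False)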

The above sentence is decomposed into its central verb (加入), subject (王俊昌), object (中国共产党), and temporal adverbial (1943年2月). By creating rules based on semantic role labeling, relationships conforming to these rules can be accurately extracted, though the rule construction relies on manual effort.
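
A rule of this kind might look like the following sketch. It does not call LTP: the token list and role spans are hand-written stand-ins for the sentence above rather than captured LTP output, and the role labels (`A0`, `A1`, `ARGM-TMP`), the rule table, and the function names are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hand-written stand-ins, NOT real LTP output; role labels are assumed.
tokens = ["王俊昌", "于", "1943年", "2月", "加入", "中国共产党", "。"]
# One entry per predicate: (predicate token index, [(role, start, end), ...])
srl_sentence = [(4, [("A0", 0, 0), ("ARGM-TMP", 2, 3), ("A1", 5, 5)])]

# Hypothetical manual rule table: predicate word -> relation name
PREDICATE_RULES = {"加入": "joined"}


def span_text(tokens: List[str], start: int, end: int) -> str:
    """Join the tokens of an inclusive span back into a string."""
    return "".join(tokens[start:end + 1])


def extract(tokens: List[str],
            srl_sentence: List[Tuple[int, List[Tuple[str, int, int]]]]
            ) -> List[Dict[str, Optional[str]]]:
    """Emit a triple for each predicate that matches a rule and has both
    an agent (A0) and a patient (A1); the time span is optional."""
    triples = []
    for pred_idx, args in srl_sentence:
        relation = PREDICATE_RULES.get(tokens[pred_idx])
        if relation is None:
            continue  # no manual rule covers this predicate
        roles = {role: span_text(tokens, s, e) for role, s, e in args}
        if "A0" in roles and "A1" in roles:
            triples.append({
                "subject": roles["A0"],
                "relation": relation,
                "object": roles["A1"],
                "time": roles.get("ARGM-TMP"),
            })
    return triples


triples = extract(tokens, srl_sentence)
```

Because predicates without a rule are skipped, the extractor favors precision over coverage, which is consistent with the manual effort noted above.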

2.2.2.2 Entity-Relation Extraction Based on OpenUE

OpenUE is a lightweight toolkit for knowledge graph extraction based on pre-trained language models.

We trained and ran the OpenUE toolkit on its default ske dataset. Accuracy is relatively high for simple sentences but falls short across the full body of text, where sentences are more complex and depend on context (for example, sentences with omitted subjects).

2.2.2.3 Entity-Relation Extraction Based on OpenNRE

OpenNRE is an open-source and extensible toolkit that provides a unified framework for implementing relation extraction models. The project attempted to use OpenNRE for Chinese relation extraction.

Following the instructions in the OpenNRE GitHub repository, we trained models and ran relation extraction tasks. Results were good for simple sentence structures but showed omissions and errors on more complex sentence patterns. For the sake of accuracy and historical correctness, the project therefore opted for the extraction method based on semantic role labeling.

3. Knowledge Graph Storage

The project uses the Neo4j graph database to store entity-relation data. There are three main categories of entities: individuals, organizations (schools), and achievements (works). The individual entities include attributes such as name, additional names, birthplace, birth date, death date, occupation, ethnicity, and nationality (for foreign individuals in China).

There are three major categories of entity relations: related individuals, alumni, and creations. Related individuals can be subdivided into seven subcategories and 21 specific relations, as follows:

  • Related Individuals-Family: [‘Father’, ‘Son’, ‘Daughter’, ‘Parent’]
  • Related Individuals-Grandparents: [‘Grandson’, ‘Granddaughter’, ‘Grandfather’, ‘Grandmother’]
  • Related Individuals-Siblings: [‘Brother’, ‘Sister’, ‘Younger Brother’, ‘Older Sister’]
  • Related Individuals-Marriage: [‘Husband’, ‘Wife’]
  • Related Individuals-In-Laws: [‘Son-in-law’, ‘Daughter-in-law’]
  • Related Individuals-Teacher-Student: [‘Student’, ‘Teacher’]
  • Related Individuals-Others: [‘Comrade’, ‘Classmate’, ‘Friend’]
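
As a rough illustration of how such entities and relations could be written to Neo4j, the sketch below builds parameterized Cypher statements as plain strings. The `Person` label, the relationship type names, and the function name are hypothetical; the article does not give the project's actual schema names.

```python
# Hypothetical person-to-person relationship types (a small subset)
ALLOWED_REL_TYPES = {"FATHER", "TEACHER", "FRIEND"}

# Create or update a person node; attribute values travel as parameters
MERGE_PERSON = (
    "MERGE (p:Person {name: $name}) "
    "SET p += $props"
)


def merge_relation_query(rel_type: str) -> str:
    """Build a statement linking two existing persons.

    The relationship type is interpolated into the query text, so it is
    checked against a whitelist first to keep untrusted input out.
    """
    if rel_type not in ALLOWED_REL_TYPES:
        raise ValueError(f"unknown relationship type: {rel_type!r}")
    return (
        "MATCH (a:Person {name: $source}), (b:Person {name: $target}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


query = merge_relation_query("FATHER")
```

Using `MERGE` rather than `CREATE` keeps repeated imports from duplicating nodes and relationships. With the official Neo4j Python driver, such a statement would typically be run through something like `session.run(query, source=..., target=...)`.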

4. Knowledge Graph Applications

The project’s final results are deployed in the cloud using a browser/server (B/S) architecture. The backend is packaged into a Docker image and deployed on Alibaba Cloud ECI, while the frontend is hosted on Alibaba Cloud CDN. The application is available at http://www.zjuwtx.work/project/kg.

4.1 Individual Retrieval

This feature provides basic retrieval of individuals, showing their attributes and their relationships with other entities.

4.2 Graph Inference

Rule-based graph inference is implemented via custom Cypher scripts, including relation inference and attribute completion.
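
The project's Cypher scripts are not shown, so the following is only a sketch of what rule-based relation inference could look like, written over in-memory triples. It assumes that a triple `(x, "Father", y)` means "y is the father of x"; the two rules, the Cypher string, and all names are hypothetical. Attribute completion is not covered here.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# A Cypher statement that would express the grandfather rule under the same
# assumptions (hypothetical labels and relationship types):
GRANDFATHER_RULE_CYPHER = """
MATCH (x:Person)-[:FATHER]->(y:Person)-[:FATHER]->(z:Person)
MERGE (x)-[:GRANDFATHER]->(z)
"""


def infer(triples: Set[Triple]) -> Set[Triple]:
    """Apply the rules until no new facts appear; return only new facts."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for x, rel, y in facts:
            if rel == "Wife":          # y is x's wife, so x is y's husband
                new.add((y, "Husband", x))
            elif rel == "Husband":     # y is x's husband, so x is y's wife
                new.add((y, "Wife", x))
            elif rel == "Father":      # father of father -> grandfather
                for y2, rel2, z in facts:
                    if y2 == y and rel2 == "Father":
                        new.add((x, "Grandfather", z))
        if not new <= facts:
            facts |= new
            changed = True
    return facts - set(triples)


known = {("C", "Father", "B"), ("B", "Father", "A"), ("B", "Wife", "W")}
inferred = infer(known)
```

Here `inferred` contains `("C", "Grandfather", "A")` and `("W", "Husband", "B")`. Returning only the new facts makes it easy to review inferred relations before writing them back to the graph.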

4.3 Knowledge Crowdsourcing

Considering the limited data sources and the inevitable issues in data content and processing, which can lead to missing or erroneous knowledge, the project provides a crowdsourcing feature. Users can quickly submit requests to add or modify data, which will be merged into the existing knowledge graph upon approval.
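
The submit-then-approve flow could be modeled roughly as below. This is a sketch under assumptions: the class, field, and function names are hypothetical, and a plain dictionary stands in for the knowledge graph.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ChangeRequest:
    """A user-submitted request to add or modify one attribute."""
    entity: str
    attribute: str
    new_value: str
    status: Status = Status.PENDING


def review(graph: Dict[str, Dict[str, str]],
           request: ChangeRequest, approve: bool) -> None:
    """Record the reviewer's decision; merge the change only if approved."""
    request.status = Status.APPROVED if approve else Status.REJECTED
    if approve:
        graph.setdefault(request.entity, {})[request.attribute] = request.new_value


# Placeholder data for illustration only
graph: Dict[str, Dict[str, str]] = {"Example Person": {"birthplace": "Unknown"}}
request = ChangeRequest("Example Person", "birthplace", "Example City")
review(graph, request, approve=True)
```

Nothing reaches the graph while a request is pending or after it is rejected, which matches the approval gate described above.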
