Knowledge Graph of the Novel “Dou Po Cang Qiong”

EpiK Protocol
Oct 2, 2024


1. Project Introduction

This project builds a knowledge graph for the novel “Dou Po Cang Qiong.” The data mainly comes from the Baidu Encyclopedia entry for “Dou Po Cang Qiong” and the original text of the novel. The work to construct this graph is divided into the following parts:

  1. Data Acquisition and Preprocessing
  2. Relationship Extraction
  3. Attribute Extraction
  4. Building a Question-Answering System

2. Data Acquisition and Preprocessing

2.1 Data Crawling

In the data crawling step, we crawled character, plot, and other information about the “Dou Po Cang Qiong” novel from Baidu Encyclopedia and wiki sites to serve as the training corpus. We used the requests_html library to fetch the HTML pages, then used the BeautifulSoup library to parse them and extract the useful information. The crawling process is as follows:

First, we needed to obtain the names of characters, names of exotic fires, and names of forces mentioned in the novel from the Baidu Encyclopedia entry on “Dou Po Cang Qiong,” and extract the URLs of these entries for later access. After obtaining these URLs, we needed to individually access them to extract information from each entry. The entries in Baidu Encyclopedia typically consist of a brief introduction and the main text, which need to be processed to become structured data. The crawled data is shown in the figure.
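The first step above — collecting entry names and URLs from the main “Dou Po Cang Qiong” page — can be sketched as follows. The project used requests_html and BeautifulSoup; this illustration uses only the standard library’s HTML parser, and the `/item/` URL prefix is an assumption about how Baidu Encyclopedia entry links are shaped.

```python
from html.parser import HTMLParser

class EntryLinkParser(HTMLParser):
    """Collects (title, url) pairs from <a> tags that point at encyclopedia entries."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("/item/"):   # assumed shape of Baidu Baike entry URLs
                self._href = href

    def handle_data(self, data):
        # Pair the anchor text with the pending entry URL.
        if self._href is not None and data.strip():
            self.links.append((data.strip(), self._href))
            self._href = None

html = '<a href="/item/XiaoYan">Xiao Yan</a> <a href="/other">skip</a>'
parser = EntryLinkParser()
parser.feed(html)
# parser.links now holds [("Xiao Yan", "/item/XiaoYan")]
```

The collected URLs are then visited one by one to extract each entry’s introduction and main text.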

2.2 Building the Entity List

Entity names in online novels differ significantly from real-world names, Baidu Encyclopedia already provides comprehensive entity descriptions, and the graph involves only a limited number of entity types, so we adopted a matching-based entity recognition strategy. First, we extracted entity names and types from the crawled data to build the entity list. Note that in “Dou Po Cang Qiong” the same character may go by different titles; for example, “Yao Chen” may also be referred to as “Yao Lao” or “Yao Zun Zhe.” When building the entity list, the titles of each character must be unified, and the same applies to the later preprocessing of the original text.
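Title unification can be implemented as a simple alias table mapping every known title to a canonical name; the table below contains only the aliases mentioned above, while the real list is built from the crawled entries.

```python
# Alias table: every known title maps to the canonical character name.
ALIASES = {
    "Yao Lao": "Yao Chen",
    "Yao Zun Zhe": "Yao Chen",
}

def canonical(name: str) -> str:
    """Map any known title back to the character's canonical name."""
    return ALIASES.get(name, name)

def normalize_text(text: str) -> str:
    """Replace every alias in a sentence with the canonical name,
    longest alias first so a long title is not clobbered by a shorter match."""
    for alias in sorted(ALIASES, key=len, reverse=True):
        text = text.replace(alias, ALIASES[alias])
    return text

normalize_text("Yao Lao looked at Xiao Yan")  # -> "Yao Chen looked at Xiao Yan"
```

The same `normalize_text` pass is reused when preprocessing the original novel text.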

2.3 Data Preprocessing

Based on the constructed entity list, the original text was preprocessed with the following specific steps:

  1. Remove author-related content from the original text.
  2. Standardize the titles of characters in the original text.
  3. Split the text into sentences based on punctuation.
  4. Match sentences with the entity list, retaining those that contain two or more entities.
  5. Remove overly short sentences to construct the original dataset. The constructed original dataset is shown in the figure.
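Steps 3–5 above can be sketched as a small filter; the entity set, punctuation set, and minimum-length threshold here are illustrative stand-ins for the project’s actual values.

```python
import re

ENTITIES = {"Xiao Yan", "Yao Chen", "Xiao Xiao"}  # stand-in for the full entity list
MIN_LEN = 6                                       # assumed threshold for "overly short"

def preprocess(text: str) -> list:
    """Split on sentence-ending punctuation, then keep sentences that
    mention at least two entities and are not too short."""
    sentences = [s.strip() for s in re.split(r"[。！？!?.]", text) if s.strip()]
    kept = []
    for s in sentences:
        hits = sum(1 for e in ENTITIES if e in s)
        if hits >= 2 and len(s) >= MIN_LEN:
            kept.append(s)
    return kept

corpus = "Xiao Yan bowed to Yao Chen. It rained. Xiao Xiao smiled."
preprocess(corpus)  # keeps only the sentence mentioning two entities
```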

3. Relationship Extraction

3.1 LLM-Based Relationship Extraction

Recently, large language models (LLMs) have shown great potential in text understanding and dialogue-based question answering, and can carry out specified tasks on a given text. We therefore also try using LLMs for relationship extraction on the complex content of online novels. DeepKE already provides an LLM-based relationship extraction tool that encapsulates a set of prompt templates, allowing us to combine our context, entities, and candidate relationships into a set of question-and-answer instructions, which are then sent to ChatGPT to guide it in extracting entity relationships from the context.

3.1.1 Modifying the Prompt Template

The prompt template is encapsulated in the EasyInstruct tool. During usage, we identified some issues with the current prompt and made several modifications. First, the context from which relationships are extracted may not actually express any relationship between the entities, so when no relationship can be extracted we instruct GPT to output “unknown.” In addition, more explicit instructions improve the quality of GPT’s responses. We therefore modified the prompt template as follows:

After the modifications are complete, we need to reinstall the EasyInstruct tool by running `pip install -e .` from the command line.
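In spirit, the modified prompt looks roughly like the sketch below — the wording is illustrative only, not the exact EasyInstruct template; the point is the explicit instructions and the “unknown” fallback.

```python
def build_re_prompt(context, head, tail, relations):
    """Assemble a relation-extraction prompt with an 'unknown' fallback
    (illustrative wording, not the actual EasyInstruct template)."""
    return (
        "You are an expert in relation extraction for the novel Dou Po Cang Qiong.\n"
        f"Context: {context}\n"
        f"Head entity: {head}\nTail entity: {tail}\n"
        f"Candidate relations: {', '.join(relations)}\n"
        "Answer with exactly one candidate relation. "
        "If the context does not express any of them, answer 'unknown'."
    )

prompt = build_re_prompt(
    "Xiao Xiao is the daughter of Xiao Yan.",
    "Xiao Xiao", "Xiao Yan",
    ["father", "siblings", "master"],
)
```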

3.1.2 Data Generation

Due to the cost constraints of using GPT, we do not use the original text of the novel for relationship extraction; instead, we utilize descriptive statements crawled from Baidu Encyclopedia. Each character entity in Baidu Encyclopedia has an “introduction” section, so we organize the context in the form of “(entity) is (character introduction)” as follows:

We have an entity list and use a matching method to identify other entity information from the context, forming head and tail entities for relationship extraction. For example, the relationship extraction between “Xiao Xiao” and “Xiao Yan” will occur as shown in the figure. All entities in the context will undergo extraction.
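The context construction and entity pairing described above can be sketched as follows; the abridged introduction text is a hypothetical sample, not the actual crawled entry.

```python
intros = {  # "introduction" text crawled from each entry (abridged, hypothetical)
    "Xiao Xiao": "the daughter of Xiao Yan and Cai Lin",
}
entity_list = ["Xiao Yan", "Xiao Xiao", "Cai Lin"]

def make_samples(entity, intro, entities):
    """Form the context '(entity) is (introduction)' and pair the head entity
    with every other entity matched inside the context."""
    context = f"{entity} is {intro}"
    pairs = [(entity, other) for other in entities
             if other != entity and other in context]
    return context, pairs

context, pairs = make_samples("Xiao Xiao", intros["Xiao Xiao"], entity_list)
# every matched pair, e.g. ("Xiao Xiao", "Xiao Yan"), is sent for extraction
```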

3.1.3 Relationship Extraction

After data generation, we call the EasyInstruct interface to construct the prompt:

The prompt input is the context used for extraction, and we also need to specify the head and tail entities along with their types and a list of candidate relationships. The candidate relationship list is as follows:

(insert list here). We then call the get_openai_result interface to obtain GPT’s response, storing the results in the form of triples. Considering cost and time issues, we choose to use the gpt-3.5-turbo model for relationship extraction.

3.1.4 Triple Cleaning

The triples extracted by the LLM may contain errors and need cleaning. Although GPT’s temperature was set to 0, some responses still fail to follow the required format and need manual intervention, as shown below:

Additionally, some incorrect relationships or those that need to be merged can be handled manually, ultimately forming a relationship graph based on LLM extraction.
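A minimal sketch of the cleaning step, under the assumption that a well-formed response is exactly one candidate relation: anything else (off-format text or “unknown”) is routed to manual review rather than entering the graph.

```python
def parse_response(head, tail, response, candidates):
    """Turn one GPT response into a triple; return None for malformed or
    'unknown' answers so they can be routed to manual review."""
    answer = response.strip().strip("。.'\"")
    if answer in candidates:
        return (head, answer, tail)
    return None  # off-format or 'unknown' -> manual intervention

candidates = {"father", "siblings", "master"}
parse_response("Xiao Xiao", "Xiao Yan", "father.", candidates)  # well-formed triple
parse_response("Xiao Xiao", "Xiao Yan", "unknown", candidates)  # dropped for review
```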

3.2 DeepKE-Based Relationship Extraction

DeepKE is an open-source knowledge graph extraction tool that can achieve named entity recognition, relationship extraction, and attribute extraction functionalities by customizing the input dataset and model. In this work, we mainly apply this tool for entity recognition and relationship extraction from relevant texts. Given the large volume of original text and the high cost of LLM-based extraction, we attempt to fine-tune a pre-trained model to extract additional relationships from the original text.

3.2.1 Dataset Construction

For the named entity recognition task, we build a vocabulary using the previously obtained entity list and annotate the novel text and the descriptive statements crawled from Baidu Encyclopedia through lookup tables, thereby acquiring a large-scale training dataset. For the relationship extraction task, after determining the list of relationships to be used for model training, we construct the training set using the relationships extracted by GPT and those labeled manually. To facilitate model training, we merge some relationships and label the entities in sentences that lack annotated relationships as “irrelevant.” The final list of relationships used is as follows:

Since the number of labels in our annotated dataset is quite imbalanced, we combine manually added sentences with oversampling and undersampling techniques to achieve a relatively balanced dataset for subsequent training.
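The balancing step can be sketched as a per-label resampler; the uniform per-label target is a simplifying assumption, and the real dataset also mixes in manually written sentences.

```python
import random
from collections import Counter, defaultdict

def balance(samples, target, seed=0):
    """Oversample rare labels and undersample frequent ones so every
    relation label ends up with `target` samples (simplified sketch)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sent, label in samples:
        by_label[label].append((sent, label))
    balanced = []
    for label, group in by_label.items():
        if len(group) >= target:                     # undersample without replacement
            balanced += rng.sample(group, target)
        else:                                        # oversample with replacement
            balanced += group + rng.choices(group, k=target - len(group))
    return balanced

samples = [("s1", "father")] * 10 + [("s2", "siblings")] * 2
counts = Counter(label for _, label in balance(samples, target=5))
# both labels now contribute exactly 5 samples
```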

3.2.2 Entity Recognition

We utilize DeepKE’s named entity recognition model and train it on a BERT-based model using the prepared dataset. The test results are as follows:

After training the entity recognition model, we can use it to extract the names and categories of entities contained in given sentences, serving as a supplementary recognition scheme based on matching to identify entities not included in the matching list.

3.2.3 Relationship Extraction

We use DeepKE’s relationship extraction model, training it on a BERT+LSTM-based model with the dataset filtered by GPT and manually annotated data. The test results are as follows:

We input sentences along with the entities identified from the entity recognition model into the trained relationship extraction model, extracting relationships for each entity pair in a sentence. After obtaining the relationship data extracted by the model, we perform manual data cleaning on this data and add it to the knowledge graph.

4. Attribute Extraction

In the attribute extraction step, we extract the relevant attributes of characters and exotic fires: for example, aliases and gender for characters, and color and ranking for exotic fires. For attribute extraction we can directly use the structured data from Baidu Encyclopedia’s introductions. During data processing we need to filter out abnormal attributes as well as attributes unrelated to the novel, such as information about its TV adaptations. The final extracted attributes are dictionary-like and are written to a JSON file for storage. The content and structure of the attributes are as follows:
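The filtering and storage described above can be sketched as follows; the raw fields and the blocklist of adaptation-related keys are hypothetical samples, not the actual crawled data.

```python
import json

# Raw infobox fields as crawled (hypothetical sample); keys unrelated to the
# novel, such as TV-adaptation credits, are filtered out.
raw = {
    "Xiao Yan": {"alias": "Yan Di", "gender": "male", "TV actor": "(omitted)"},
    "Qinglian Dixin Huo": {"color": "green", "ranking": "19", "TV actor": "(omitted)"},
}
DROP_KEYS = {"TV actor"}  # assumed blocklist of adaptation-related fields

attrs = {
    entity: {k: v for k, v in fields.items() if k not in DROP_KEYS}
    for entity, fields in raw.items()
}

# Persist the dictionary-like attributes as JSON, keeping non-ASCII characters.
with open("attributes.json", "w", encoding="utf-8") as f:
    json.dump(attrs, f, ensure_ascii=False, indent=2)
```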

5. Visualization and Knowledge Question Answering

5.1 Knowledge Graph Visualization

Using Neo4j Desktop, we visualize the extracted entities, relationships, and attributes from Dou Po Cang Qiong. We import the previously exported dump file into Neo4j and create a database to store this data. Once the data is imported, we can start the project and access the corresponding database for operations:

We can view the entire knowledge graph by executing `MATCH (n) RETURN n`.

The graph contains numerous nodes and attributes, so we visualize 25 relationships related to the main character, Xiao Yan, through knowledge querying.
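A query of the kind used for the Xiao Yan visualization — fetch a node by name together with up to 25 of its relationships — can be built like this; the query text is illustrative, and in practice it is issued from the Neo4j browser or a driver session.

```python
def neighborhood_query(name, limit=25):
    """Build a Cypher query returning a node and up to `limit` of its
    relationships (illustrative; property names assume a `name` key)."""
    return (
        f"MATCH (n {{name: '{name}'}})-[r]-(m) "
        f"RETURN n, r, m LIMIT {limit}"
    )

neighborhood_query("Xiao Yan")
```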

5.2 Knowledge Question Answering

Question answering is based on Cypher queries. We use Jieba’s paddle mode to segment the input question; the first extracted noun phrase is treated as the entity and the second as the relationship, and an answer() function parses the question. The question-answering system supports queries about character relationships:

For example, if the input question is: “Who is Xiao Yan’s brother?” the system first segments the sentence and extracts the nouns “Xiao Yan” and “brother.” It treats “Xiao Yan” as the entity and “brother” as the relationship. Since our relationships do not define “brother,” but rather “siblings,” the actual query relationship is set to “siblings,” with the condition gender = ‘male’. The constructed query statement is then used to search the database, resulting in:
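The flow just described can be sketched as below. The real system segments with Jieba’s paddle mode and runs the query against Neo4j; here a toy substring matcher stands in for the segmenter, and the relation map covers only the “brother” example.

```python
# Hypothetical mapping from question words to (graph relation, extra condition).
RELATION_MAP = {"brother": ("siblings", "gender = 'male'")}
ENTITIES = {"Xiao Yan"}

def answer(question):
    """Toy version of the answer() flow: find the entity and relation word,
    then build the Cypher query (the real system uses Jieba segmentation)."""
    entity = next((e for e in ENTITIES if e in question), None)
    word = next((w for w in RELATION_MAP if w in question), None)
    if not entity or not word:
        return None
    relation, condition = RELATION_MAP[word]
    return (
        f"MATCH (n {{name: '{entity}'}})-[:{relation}]-(m) "
        f"WHERE m.{condition} RETURN m.name"
    )

answer("Who is Xiao Yan's brother?")  # builds the siblings + gender query
```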

Other query results are displayed as follows:
