EpiK knowledge graph [Lord-of-the-Rings]

EpiK Protocol
May 21, 2024

The project aims to construct a knowledge graph of the characters, families, and races in the novel “The Lord of the Rings.” We will start by extracting relevant triples from Wikipedia to form an initial knowledge graph.

Web Data Extraction
The code in extract.ipynb is responsible for extracting entities from web pages.
The code in extract_relations.ipynb is responsible for extracting triplets corresponding to the entities.

File Descriptions

  • clean_triplets/character_relation.txt: A file containing triplets obtained through extraction.
  • clean_triplets/character_relation_reduced.txt: A file containing triplets that have been preliminarily cleaned using code.
  • clean_triplets/character_relation_reduced_manual.txt: A file containing triplets that have been further cleaned manually.

Triplet Extraction

The triplets were extracted from the wiki pages for “The Lord of the Rings.”
First, a list of characters was obtained. The website already provides a consolidated character list: “The Lord of the Rings Character List.”
Using this list, the URL of each character page was obtained. Starting with each character as the head entity, the relevant relationships were extracted from that character’s page, mainly from its information sidebar. Taking the character “Thranduil” as an example, the following triplets can be extracted from the sidebar on his page:

e:Thranduil r:belongs to race e:Elf

e:Thranduil r:belongs to family e:House of Finarfin

e:Thranduil r:father character e:Finwë

e:Thranduil r:mother character e:Anairë

e:Thranduil r:sibling character e:Finrod

e:Thranduil r:sibling character e:Arathorn

e:Thranduil r:sibling character e:Alarion

e:Thranduil r:spouse character e:Elenwe

e:Thranduil r:child character e:Legolas

e:Thranduil r:friendly character e:Thranduil

e:Thranduil r:enemy character e:Morgoth

Additionally, we also captured the aliases and titles of the characters for the purpose of data cleaning later on.
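
As an illustration of how lines in this format could be read back into structured triplets, here is a minimal sketch. It assumes every line follows the "e:<head> r:<relation> e:<tail>" layout of the examples above; the `Triplet` type and `parse_triplet` function are hypothetical names, not part of the project code.

```python
import re
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str


# Layout assumed from the example lines: "e:<head> r:<relation> e:<tail>".
TRIPLET_RE = re.compile(r"^e:(?P<head>.+?)\s+r:(?P<relation>.+?)\s+e:(?P<tail>.+)$")


def parse_triplet(line: str) -> Optional[Triplet]:
    """Parse one line; return None for blank or malformed lines."""
    match = TRIPLET_RE.match(line.strip())
    if match is None:
        return None
    return Triplet(
        match["head"].strip(),
        match["relation"].strip(),
        match["tail"].strip(),
    )


if __name__ == "__main__":
    print(parse_triplet("e:Thranduil r:belongs to family e:House of Finarfin"))
```

Returning None for malformed lines, rather than raising, lets a caller count and inspect the rejects separately.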

Triplet Cleaning

  • Script Cleaning: Run web_data_retrieval/clean_triplets/preprocess_relation_txt.py to perform preliminary cleaning of the extracted triplets in web_data_retrieval/character_relation.txt. The output is web_data_retrieval/clean_triplets/character_relation_reduced.txt. The cleaning rules include:

○ Filter out triplets containing words like “unknown” or “none.”

○ Replace “-” with “·”, “(page does not exist)” with an empty string, etc.
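
The two rules above can be sketched as follows. This is only an illustration of the listed rules, not the contents of preprocess_relation_txt.py, which likely applies more of them; the function names and the plain substring matching are assumptions.

```python
from typing import Iterable, List

# Marker words and replacements taken from the rules listed above.
INVALID_MARKERS = ("unknown", "none")
REPLACEMENTS = (("(page does not exist)", ""), ("-", "·"))


def clean_line(line: str) -> str:
    """Apply the string replacements and collapse leftover whitespace."""
    for old, new in REPLACEMENTS:
        line = line.replace(old, new)
    return " ".join(line.split())


def clean_triplets(lines: Iterable[str]) -> List[str]:
    """Drop lines containing an invalid marker, then clean the rest."""
    cleaned = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in INVALID_MARKERS):
            continue
        line = clean_line(line)
        if line:
            cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    raw = [
        "e:A r:father character e:Unknown",
        "e:A-B r:child character e:C (page does not exist)",
    ]
    print(clean_triplets(raw))
```

Substring matching is blunt: it would also drop a name that happens to contain “none”, so a real script might match whole words instead.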

  • Entity Extraction

Use the web_data_retrieval/clean_triplets/extract_candidate_entities.py file to extract entities from the preliminarily cleaned triplets for further processing.

  • Manual Entity Reduction

Manually remove invalid entities, i.e., extracted strings that are descriptions rather than named entities, such as:

○ “Those with male descendants”

○ “Those yet to be born”

○ “A Northern Gondor woman (in other versions)”

  • Manual Entity Replacement

Manually replace some equivalent entities, such as:

○ Replace “Kwen (first)” with “Kwen.”

○ Replace “Rohirrim” with “Rohirrim ethnicity.”

○ Replace “House of Harlindon” with “Harlindon family.”

○ Replace “Prince of Dor Amarthos” with “Dor Amarthos family,” etc.

  • Manual Cleaning:

Based on the results of entity reduction and entity replacement, further clean the preliminarily cleaned triplets in web_data_retrieval/clean_triplets/character_relation_reduced.txt by removing invalid triplets and replacing equivalent entities within the triplets.

The final cleaned triplets are saved in web_data_retrieval/clean_triplets/character_relation_reduced_manual.txt.
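
Although this step was done by hand, the resulting rules could be recorded and replayed in code. The sketch below assumes triplets are (head, relation, tail) tuples and uses the invalid and equivalent entities listed above. The function name is hypothetical, and the de-duplication is an added assumption rather than something the project describes.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]

# Examples copied from the manual reduction and replacement lists above.
INVALID_ENTITIES: Set[str] = {
    "Those with male descendants",
    "Those yet to be born",
    "A Northern Gondor woman (in other versions)",
}
EQUIVALENT_ENTITIES: Dict[str, str] = {
    "Kwen (first)": "Kwen",
    "Rohirrim": "Rohirrim ethnicity",
    "House of Harlindon": "Harlindon family",
    "Prince of Dor Amarthos": "Dor Amarthos family",
}


def apply_manual_rules(triplets: Iterable[Triplet]) -> List[Triplet]:
    """Drop triplets with invalid entities and normalise equivalent ones."""
    result: List[Triplet] = []
    seen: Set[Triplet] = set()
    for head, relation, tail in triplets:
        if head in INVALID_ENTITIES or tail in INVALID_ENTITIES:
            continue
        triplet = (
            EQUIVALENT_ENTITIES.get(head, head),
            relation,
            EQUIVALENT_ENTITIES.get(tail, tail),
        )
        # Replacement can make two triplets identical; keep only the first.
        if triplet not in seen:
            seen.add(triplet)
            result.append(triplet)
    return result


if __name__ == "__main__":
    print(apply_manual_rules([("Kwen (first)", "belongs to race", "Rohirrim")]))
```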

Deep Relation Extraction

Novel Text Processing

  • Text Filtering and Sentence Segmentation

Use the deep_relation_extraction/process_txt/text2sentences.py file to read the novel text from deep_relation_extraction/process_txt/the_lord_of_the_ring.txt, filter out invalid text (e.g., book titles, chapter names, and meaningless separators like “※”), and segment the novel into sentences. The segmented sentences are stored in deep_relation_extraction/process_txt/sentences.npy.

○ Text filtering uses regular expressions to remove irrelevant text, avoiding unnecessary storage and computational overhead.

○ Sentence segmentation limits the amount of text the relation extraction model processes per input, which keeps relation extraction efficient.
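
The filtering and segmentation step could look roughly like the sketch below. It assumes the novel text is Chinese (as the later use of a Chinese segmentation model suggests). The regular expressions are hypothetical stand-ins, since the actual patterns in text2sentences.py are not shown.

```python
import re
from typing import List

# Hypothetical patterns; the real script's expressions are not shown.
SEPARATOR_RE = re.compile(r"^[※\s]*$")  # blank or separator-only lines
CHAPTER_RE = re.compile(r"^第[一二三四五六七八九十百零\d]+[章卷部]")  # e.g. "第一章"
# A sentence is a run of text up to and including 。, ！ or ？.
SENTENCE_RE = re.compile(r"[^。！？]+[。！？]?")


def text_to_sentences(text: str) -> List[str]:
    """Drop separator and chapter-title lines, then split into sentences."""
    sentences: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if SEPARATOR_RE.match(line) or CHAPTER_RE.match(line):
            continue
        for sentence in SENTENCE_RE.findall(line):
            sentence = sentence.strip()
            if sentence:
                sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    sample = "第一章 远行\n※※※\n佛罗多醒了。他看见甘道夫！\n"
    print(text_to_sentences(sample))
```

The resulting list could then be written to a .npy file with numpy.save, matching the output file named above.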

  • Word Segmentation

Use the deep_relation_extraction/process_txt/segment_words.py file to read the sentences from deep_relation_extraction/process_txt/sentences.npy and perform word segmentation on each sentence. Additionally, identify person names in the sentences. The segmented word results are stored in deep_relation_extraction/process_txt/tokens.json.

Specifically, we use HanLP’s Chinese word segmentation model to tokenize and perform part-of-speech tagging on the sentences. Words marked as “nr” are recognized as person names.

  • Training Sample Generation

Use the deep_relation_extraction/process_txt/generate_samples.py file to read the tokenized results from deep_relation_extraction/process_txt/tokens.json and generate training samples based on the tokenized results. The generated samples are written to samples.jsonl. The format of the training samples is as follows:

Other generation details include:

○ If the same person name appears multiple times in a sentence, the position of its first occurrence is used as the input.

○ If a sentence contains more than two person names, every pair of names yields one training sample, for a total of n * (n-1) / 2 training samples, where n is the number of distinct person names.
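
The two rules above can be sketched as follows. The input is assumed to be a list of (word, tag) pairs in which “nr” marks a person name. The dictionary keys in each sample are hypothetical, since the actual sample format is not reproduced here.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def first_name_positions(tokens: Sequence[Token]) -> Dict[str, int]:
    """Map each person name (tag "nr") to the index of its first occurrence."""
    positions: Dict[str, int] = {}
    for index, (word, tag) in enumerate(tokens):
        if tag == "nr" and word not in positions:
            positions[word] = index
    return positions


def generate_samples(tokens: Sequence[Token]) -> List[dict]:
    """Create one sample per unordered pair of distinct person names."""
    positions = first_name_positions(tokens)
    words = [word for word, _ in tokens]
    return [
        {
            "tokens": words,
            "head": head,
            "head_pos": positions[head],
            "tail": tail,
            "tail_pos": positions[tail],
        }
        for head, tail in combinations(positions, 2)
    ]


if __name__ == "__main__":
    sentence = [("佛罗多", "nr"), ("和", "c"), ("山姆", "nr"), ("遇见", "v"), ("甘道夫", "nr")]
    print(len(generate_samples(sentence)))
```

With n distinct names, `combinations` yields exactly n * (n-1) / 2 pairs, so a sentence with fewer than two names produces no samples.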

Model Training and Relation Extraction

Because the number of entity relation triplets obtained from web pages is limited, we plan to extract additional relation triplets from the plain text of the novel. To do so, we first need a relation extraction model that takes a piece of text and two entities as input and predicts the relationship between the two entities. We therefore use a multi-class relation classifier based on Chinese BERT. The network architecture is shown in the diagram below.

We train the complete BERT model by pre-training followed by fine-tuning. The model is first pre-trained on the larger-scale CCKS dataset (https://biendata.com/competition/ccks_2019_ipre/), which gives it a certain level of generalization for relation extraction. We then fine-tune the pre-trained model on the Lord of the Rings relation triplets obtained above so that it can extract the specific character relationships. Additionally, we modify the Chinese BERT model into a multi-class classifier whose number of output classes equals the number of relation types.
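
As a rough formalization of such a classifier (an assumption about a typical BERT classification head, not a description of the project's exact architecture), let h be the BERT representation of a sentence s with the two entities e_1 and e_2 marked, and let K be the number of relation types. A softmax output layer with hypothetical weights w_k and biases b_k, trained with cross-entropy over N labeled samples, would be:

```latex
\[
p(r = k \mid s, e_1, e_2)
  = \frac{\exp\left(w_k^{\top} h + b_k\right)}
         {\sum_{j=1}^{K} \exp\left(w_j^{\top} h + b_j\right)},
  \qquad k = 1, \dots, K
\]
\[
\mathcal{L} = -\sum_{i=1}^{N} \log p\left(r = r_i \mid s_i, e_{1,i}, e_{2,i}\right)
\]
```

Under this reading, pre-training and fine-tuning would minimize the same loss on different data, with the output layer sized to the relation set of each dataset.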

The training command for the model is as follows:

python train_and_inference.py

Through pre-training and relation extraction, we obtained a total of 4854 valid entity relation triplets.

Knowledge Graph Construction
Using the “construct_kg_file/construct_json.py” file, we can build a JSON file based on the triplets extracted from web pages and text, which can be used for knowledge graph visualization and automated question answering.
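
One possible shape for such a file is sketched below: one JSON object per head entity, written as one line each. This layout, the key names ("name", "relations"), and the function names are assumptions for illustration, not the actual output of construct_json.py.

```python
import json
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]


def triplets_to_records(triplets: Iterable[Triplet]) -> List[dict]:
    """Group triplets by head entity into one record per entity."""
    records: Dict[str, dict] = {}
    for head, relation, tail in triplets:
        record = records.setdefault(head, {"name": head, "relations": {}})
        tails = record["relations"].setdefault(relation, [])
        if tail not in tails:
            tails.append(tail)
    return list(records.values())


def to_json_lines(records: Iterable[dict]) -> str:
    """Serialise records as one JSON object per line, keeping non-ASCII text."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


if __name__ == "__main__":
    data = [
        ("Thranduil", "belongs to race", "Elf"),
        ("Thranduil", "child character", "Legolas"),
    ]
    print(to_json_lines(triplets_to_records(data)))
```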

Visualization and Question Answering
For the overall approach, we referred to the project “Construction of Medical Knowledge Graph and Automated Question Answering.”

Knowledge Graph Visualization
The “build_graph.py” script is responsible for storing the knowledge graph in a database. We are using Neo4j for this purpose.
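
A minimal sketch of how triplets might be turned into Cypher statements for Neo4j is shown below. The node label `Entity`, the relationship type `RELATION`, and the function name are hypothetical; build_graph.py may model the graph differently.

```python
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]

# Relationship types cannot be passed as Cypher parameters, so this sketch uses
# one generic relationship type and stores the relation name as a property.
MERGE_QUERY = (
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (h)-[:RELATION {type: $relation}]->(t)"
)


def build_statements(triplets: Iterable[Triplet]) -> List[Tuple[str, Dict[str, str]]]:
    """Return (query, parameters) pairs, one per triplet."""
    return [
        (MERGE_QUERY, {"head": head, "relation": relation, "tail": tail})
        for head, relation, tail in triplets
    ]


if __name__ == "__main__":
    for query, params in build_statements([("Thranduil", "child character", "Legolas")]):
        print(query, params)
```

Each pair could then be executed with a Neo4j client, for example by passing the query and parameters to a session's `run` method in the official Python driver. Using MERGE rather than CREATE keeps repeated imports from duplicating nodes and edges.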

Automatic Question Answering

The automatic question answering system consists of the following components:

  • question_classifier.py: Responsible for classifying the natural language expressions into corresponding question types.

  • question_parser.py: Performs question parsing.

  • answer_search.py: Responsible for searching and organizing answer sentences.

  • automatic_qa.py: Initiates the automatic question answering program.

Supported Question Types:

Question Type          Example
character_culture      What culture does Arwen belong to?
character_clan         Which clan does Frodo belong to?
character_enemy        Who is Legolas’ enemy?
character_mate         Who is the mate of Gandalf?
character_father       Who is Frodo’s father?
character_mother       Who is Aragorn’s mother?
character_child        Does Arwen have any children?
character_successor    Who is Theoden’s successor?
character_siblings     Does Legolas have any siblings?
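
To make the classification step concrete, here is a minimal keyword-based sketch that maps a question to one of the types above and picks out the character it mentions. The keyword lists and the function name are assumptions, and the actual question_classifier.py may work differently.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical trigger words for each supported question type.
QUESTION_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "character_culture": ("culture", "race"),
    "character_clan": ("clan", "family"),
    "character_enemy": ("enemy", "enemies"),
    "character_mate": ("mate", "spouse"),
    "character_father": ("father",),
    "character_mother": ("mother",),
    "character_child": ("child", "children"),
    "character_successor": ("successor",),
    "character_siblings": ("sibling", "siblings"),
}


def classify_question(
    question: str, known_characters: Iterable[str]
) -> Optional[Tuple[str, str]]:
    """Return (question_type, character), or None if either is not found."""
    lowered = question.lower()
    # Prefer the longest matching name so "Rolo Bofur" wins over "Rolo".
    names = sorted(known_characters, key=len, reverse=True)
    character = next((name for name in names if name.lower() in lowered), None)
    words = set(re.findall(r"[a-z]+", lowered))
    question_type = next(
        (qtype for qtype, keys in QUESTION_KEYWORDS.items() if words & set(keys)),
        None,
    )
    if character is None or question_type is None:
        return None
    return question_type, character


if __name__ == "__main__":
    print(classify_question("Does Arwen have any children?", ["Arwen", "Frodo"]))
```

The returned pair would then be handed to the parsing and search steps, which turn it into a graph query and an answer sentence.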

Display of Question Results:

Question: What is the race of Radagast Took?
Answer: The race of Radagast Took is “Hobbit”.

Question: Which clan does Lindis belong to?
Answer: Lindis belongs to the clan “Elros”.

Question: Does Turgon have any enemies?
Answer: Turgon has a mortal enemy named “Morgoth”.

Question: Who is the mate of Arwen?
Answer: Arwen’s mate is “Fingon”.

Question: Who is the father of Thal-Panther?
Answer: The father of Thal-Panther is “Al-Kimizor”.

Question: Who is the mother of Rudolf Borje?
Answer: The mother of Rudolf Borje is “Afrida Borje”.

Question: Who are the children of Thal-Karmaqil?
Answer: The children of Thal-Karmaqil are “Thal-Ardamin, Kimizagar”.

Question: Does Thal-Atanamir have a successor?
Answer: The successor of Thal-Atanamir is “Thal-Ancalimon”.

Question: Who are the siblings of Rolo Bofur?
Answer: The siblings of Rolo Bofur are “Hugo Bofur, Uffo Bofur, Primrose Belt”.
