Knowledge Graph of Game-of-Thrones

EpiK Protocol
10 min read · Sep 14, 2024

This project aims to construct a knowledge graph of characters, families, and castles from the novel A Song of Ice and Fire. We will first scrape relevant triples from Wikipedia to create an initial knowledge graph. Due to certain errors and inconsistencies in the text on the web pages, we will perform some data cleaning. Additionally, the relationships between entities on the web pages are incomplete, so we will use reasoning to fill in the parts of the graph that can be completed using rules. Subsequently, we will learn a relation extraction model from the original text of the novel and use this model to extract entity relationships that were not present in the original graph. Finally, we will display the entire knowledge graph via a web interface.

The project consists of three main parts:

  1. Data scraping
  2. Data preprocessing and relation extraction
  3. Relation logic reasoning and knowledge graph visualization

The following sections describe each part of the work:

Data Acquisition

  • The code in triple_crawler.py is responsible for data scraping and cleaning.
  • The code in entity_linking.py is responsible for identifying entities in the novel text.
  • The code in tail_matching.py handles the entity reduction process.

File Descriptions

  • asoiaf.ttl: A cleaned triple file, where all triples can be considered correct.
  • candidate_entity_replacement_list_v6.jsonl: Triples along with identified replacement entities for the tail entities; recognition may not be accurate.
  • candidate_entity_replacement_list_v5.jsonl: Triples along with identified replacement entities for the tail entities, where multiple candidate entities are recorded. For example, for the triple e: Joffrey Baratheon, r: Heir, Tommen I, it records multiple entities that could replace Tommen I: [‘e: Joffrey Velaryon’, ‘e: Tommen Lannister I’, ‘e: Tommen Tully’, ‘e: Tommen Baratheon’, ‘e: Jeyne Baratheon’]. During matching, any of these entities can be used to replace Tommen I. This approach was taken primarily because a good method for filtering the correct entities was not found. The number of candidate entities for each triple varies, with a maximum of 14.
  • asoiaf_after_inference.ttl: The knowledge graph after inference completion; specific methods can be found in the inference section of the knowledge graph.
  • graphs_json/: A folder containing files related to graph visualization; specifics can be found in the knowledge graph visualization section.

Triple Scraping

Triples are scraped from the Chinese Wikipedia page for A Song of Ice and Fire.
First, we scrape the character list, castle list, and family list from the following categories:

  • Category: Characters: Scrape the character list
  • Category: Noble Families: Scrape the family list
  • Category: Castles: Scrape the castle list

From these lists, we can obtain the URL for each character, family, or castle page. These form the basic entities, and we scrape related relationships from their pages, mainly extracting relationship data from the information boxes on the pages.
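As a rough illustration, an infobox row can be turned into (relation, value) pairs along the following lines, assuming a requests + BeautifulSoup setup; the actual scraping logic lives in triple_crawler.py, and the CSS class name is an assumption about the wiki page layout:

import requests
from bs4 import BeautifulSoup

def scrape_infobox(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")  # assumed class name on the wiki page
    pairs = []
    if infobox is None:
        return pairs
    for row in infobox.find_all("tr"):
        header, value = row.find("th"), row.find("td")
        if header and value:
            relation = header.get_text(strip=True)      # e.g. "title", "house", "alias"
            obj = value.get_text(" ", strip=True)        # the tail entity or literal
            pairs.append((relation, obj))
    return pairs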

For example, consider the character Tyrion Lannister and the information box on his webpage.

From this information box, we can extract the following triples:

  • e: Tyrion Lannister r: title “Former Hand of the King”.
  • e: Tyrion Lannister r: title “Former Master of Coin”.
  • e: Tyrion Lannister r: house e: House Lannister.
  • e: Tyrion Lannister r: faction “The Second Sons”.
  • e: Tyrion Lannister r: religion “Faith of the Seven”.
  • e: Tyrion Lannister r: alias “The Imp”.
  • e: Tyrion Lannister r: alias “Halfman”.
  • e: Tyrion Lannister r: alias “Boyman”.
  • e: Tyrion Lannister r: alias “Yollo”.
  • e: Tyrion Lannister r: alias “Hugor Hill”.
  • e: Tyrion Lannister r: birth “273 AC, born in Casterly Rock”.
  • e: Tyrion Lannister r: spouse “First, Tysha”.
  • e: Tyrion Lannister r: spouse “Second, Sansa Stark”.
  • e: Tyrion Lannister r: friend e: Podrick.
  • e: Tyrion Lannister r: conflict e: Cersei Lannister.
  • e: Tyrion Lannister r: romance e: Shae.

Additionally, to facilitate matching in the text later, we add four relationships for each entity: name, given name, surname, and type. The tail entity for type can be character, castle, or house, corresponding to characters, castles, or families, respectively. For the above entity, we add the following four triples (a serialization sketch follows the list):

  • e: Tyrion Lannister r: name “Tyrion Lannister”.
  • e: Tyrion Lannister r: type “character”.
  • e: Tyrion Lannister r: given name “Tyrion”.
  • e: Tyrion Lannister r: surname “Lannister”.
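As a minimal sketch, such triples could be assembled and serialized to Turtle with rdflib; the namespace URIs and prefixes below are placeholders and not necessarily those used in asoiaf.ttl:

from rdflib import Graph, Literal, Namespace

# Hypothetical namespaces; the real prefixes in asoiaf.ttl may differ.
E = Namespace("http://example.org/entity/")
R = Namespace("http://example.org/relation/")

g = Graph()
g.bind("e", E)
g.bind("r", R)

tyrion = E["Tyrion_Lannister"]
g.add((tyrion, R["name"], Literal("Tyrion Lannister")))
g.add((tyrion, R["type"], Literal("character")))
g.add((tyrion, R["given_name"], Literal("Tyrion")))
g.add((tyrion, R["surname"], Literal("Lannister")))

g.serialize(destination="asoiaf.ttl", format="turtle")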

Triple Cleaning Process

  1. Character Conversion: Convert Traditional Chinese characters to Simplified Chinese.
  2. Relationship Replacement:
  • r: Queen → r: Spouse
  • r: Heir → r: Inheritor
  • r: Husband → r: Spouse
  3. Error Correction:
  • Correct specific triples related to entities and titles.
  4. Removal of Unwanted Information:
  • Eliminate references to original sources, seasons, episode numbers, and purely English triples.
  5. Triple Formation:
  • Compile the cleaned triples into asoiaf.ttl.
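Steps 1 and 2 above could be expressed roughly as follows, assuming the OpenCC library for the Traditional-to-Simplified conversion; the mapping mirrors the replacements listed above and the function name is illustrative:

from opencc import OpenCC  # assumption: opencc-python is available

t2s = OpenCC("t2s")  # Traditional Chinese -> Simplified Chinese

RELATION_MAP = {
    "Queen": "Spouse",
    "Heir": "Inheritor",
    "Husband": "Spouse",
}

def clean_triple(subject, relation, obj):
    # Step 1: character conversion applied to every element of the triple.
    subject, relation, obj = (t2s.convert(x) for x in (subject, relation, obj))
    # Step 2: normalize relation names.
    relation = RELATION_MAP.get(relation, relation)
    return subject, relation, obj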

Statistical Information

  • Total Entities: 2,870 (2,260 characters, 439 families, 171 castles).
  • Total Triples: 19,103.
  • Unmatched Entities: 7,399 tail entities recorded as strings.

Literal Entity Reduction

  • Replace unmatched literal entities with existing entities to improve matching in sentences.
For example, replace the literal “Jaime Lannister (de facto)” with the entity e: Jaime Lannister.

Replacement Methodology

  1. Similarity Computation:
  • For each unmatched triple with head entity s and literal tail o, compute similarity scores between the literal and the existing entities.
  2. Similarity Metrics:
  • Use Levenshtein distance and the longest common substring as the similarity metrics.
  3. Candidate Selection:
  • Keep candidates with similarity scores greater than 0.4; if there are none, lower the threshold to 0.1.
  4. Co-occurrence Count:
  • Count the occurrences of each candidate in the text.
  5. Feature Extraction:
  • Each candidate thus has three features: similarity with o, similarity with s, and co-occurrence count.
  6. Skyline Point Identification:
  • Filter the candidates down to the skyline points, i.e. those not dominated by any other candidate on all three features.
  7. Lower Bound Selection:
  • Sort the skyline entities by their attributes and select a single replacement entity.
  8. Record Replacements:
  • Record the selected replacement entities in candidate_entity_replacement_list_v6.jsonl.
  9. Skyline Entity Documentation:
  • Record the full skyline set in candidate_entity_replacement_list_v5.jsonl for flexible matching during relation extraction.
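A rough sketch of the similarity scoring and skyline filtering described above, assuming the python-Levenshtein package; candidates are represented here as plain feature dicts, and the exact weighting and sorting used in tail_matching.py may differ:

import Levenshtein  # assumption: python-Levenshtein is available

def longest_common_substring_len(a, b):
    # Classic dynamic-programming longest common substring length.
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def similarity(a, b):
    # Blend normalized edit distance with the longest-common-substring ratio.
    denom = max(len(a), len(b), 1)
    lev = 1 - Levenshtein.distance(a, b) / denom
    lcs = longest_common_substring_len(a, b) / denom
    return (lev + lcs) / 2

def skyline(candidates):
    # candidates: list of dicts with keys "sim_o", "sim_s", "cooccur".
    # Keep the candidates that no other candidate dominates on all three features.
    def dominated(c):
        return any(all(o[k] >= c[k] for k in c) and any(o[k] > c[k] for k in c)
                   for o in candidates if o is not c)
    return [c for c in candidates if not dominated(c)]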

Data Preprocessing and Relation Extraction

The web crawling stage gives us character, family, and castle entities, together with the cleaned triples among them, which we treat as ground truth. In this section we preprocess that data into training, validation, and test sets the model can learn from, and then apply the deep neural network models in the models folder, specifically bert.py and bertEntity.py, to train, validate, and run relation-extraction predictions on the corresponding datasets.

Converting Novel Text into Trainable Samples

  1. Use the process_crawled_triples.py file in the root directory to further process the candidate entities and true entities stored in preprocessed_data/candidate_entity_replacement_list_v5.jsonl and asoiaf.ttl, which were extracted and cleaned in the previous stage. This produces the candidate and true entities that need special annotation during text segmentation (written to preprocessed_data/literal_vocabulary), as well as preprocessed_data/ent2id.json, which maps the different surface forms of an entity to a unified ID.
  2. Use the preprocessor.py file in the root directory to parse the novel text stored in raw_data/novel_text.xhtml and tokenize it with HanLP's NLPTokenizer. During tokenization, the candidate entities and true entities obtained in the previous step are added to the dictionary so that they can be annotated when generating samples (a tokenization sketch follows the sample format below). The parsed output is written to preprocessed_data/preprocessed_data.jsonl, containing 68,064 sample sentences, one per line, in the following format:

{
  "text": "A paragraph from the novel (one paragraph per <p></p> tag in the xhtml)",
  "hanlp_tokens": ["list of HanLP tokens"],
  "hanlp_pos": ["part-of-speech tags such as nr, true_entity, ..."],
  "meta": {
    "source": "source filename",
    "chapter": "chapter name",
    "id": "sequentially assigned id"
  }
}
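For illustration, the tokenization step could look like the sketch below, assuming the pyhanlp bindings; the surface forms and the generic nature tag are placeholders (the project annotates crawled entities as true_entity / candidate_entity, and the actual corpus is the Chinese novel text):

from pyhanlp import JClass

NLPTokenizer = JClass("com.hankcs.hanlp.tokenizer.NLPTokenizer")
CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")

# Register crawled entity surface forms so the tokenizer keeps them whole.
for surface in ["Tyrion Lannister", "House Lannister"]:  # illustrative values
    CustomDictionary.add(surface, "nz 1024")

terms = NLPTokenizer.segment("Tyrion Lannister raised a cup to House Lannister.")
hanlp_tokens = [term.word for term in terms]
hanlp_pos = [str(term.nature) for term in terms]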

  3. Use the tag_entity.py file in the root directory to annotate entities and triple relationships in the parsed file and to generate entity-pair samples. Note that we initially tried removing stop words from the sentences, but found that most current Chinese relation extraction methods keep them and that removing them might hurt semantic coherence, so the final pipeline does not filter stop words. We enumerate all pairs of identified tokens of type nr (HanLP name recognition), true_entity, and candidate_entity in each sentence, so that each sample contains exactly one entity pair. Following the idea of distant supervision, if an entity pair appears in the relation triples captured by our web crawler, we label the sample with the corresponding relation (a sketch of this labeling step follows the sample format below). This yields 92,592 entity-pair samples, of which 840 carry a triple relationship (covering 25 relation types, including reverse relationships such as father(e1, e2) and father(e2, e1)). The remaining 91,752 samples have no matching triple in the crawled data; these are the samples from which we hope to extract new relations. Finally, we split the data into training, validation, and test sets in a 7:1:2 ratio, stored in:
  • datasets/GOT/corpus_train.jsonl
  • datasets/GOT/corpus_dev.jsonl
  • datasets/GOT/corpus_eval.jsonl

Each line in the files represents a sample, and the format for an entity pair sample is as follows:

{
  "text": "A paragraph from the novel (one paragraph per <p></p> tag in the xhtml)",
  "hanlp_tokens": ["list of HanLP tokens"],
  "hanlp_pos": ["part-of-speech tags such as nr, true_entity, ..."],
  "meta": {
    "source": "source filename",
    "chapter": "chapter name",
    "id": "sequentially assigned id"
  },
  "token": ["filtered HanLP tokens"],
  "filtered_pos": ["filtered part-of-speech tags"],
  "ents": [
    ["entity name text", "entity label (true_entity/candidate_entity/nr*)", ["index in token list"]],
    ["entity name text", "entity label (true_entity/candidate_entity/nr*)", ["index in token list"]]
  ],
  "label": "relationship label id",
  "h": {
    "name": "head entity text",
    "pos": ["head entity start position", "head entity end position"],
    "tag": "head entity label (true_entity/candidate_entity/nr*)"
  },
  "t": {
    "name": "tail entity text",
    "pos": ["tail entity start position", "tail entity end position"],
    "tag": "tail entity label (true_entity/candidate_entity/nr*)"
  }
}
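The distant-supervision labeling in tag_entity.py can be pictured with the sketch below; the field names follow the sample format above, while representing the crawled knowledge graph as a dict keyed by (head, tail) ID pairs is an assumption:

from itertools import permutations

def label_pairs(sample, kb_relations, ent2id):
    """Yield one entity-pair sample per ordered pair of entities in the sentence."""
    for h, t in permutations(sample["ents"], 2):
        h_text, h_tag, h_pos = h
        t_text, t_tag, t_pos = t
        # Distant supervision: if the crawled graph links this pair, copy the
        # relation as the label; otherwise keep the pair as an unlabeled candidate.
        key = (ent2id.get(h_text), ent2id.get(t_text))
        label = kb_relations.get(key, "NA")
        yield {**sample,
               "h": {"name": h_text, "tag": h_tag, "pos": h_pos},
               "t": {"name": t_text, "tag": t_tag, "pos": t_pos},
               "label": label}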

Model Training and New Relation Extraction

Because only a limited number of sentences in the novel text match the relation triples obtained from web crawling, we first fine-tune a pre-trained BERT model on the dataset from the CCKS character relation extraction task (https://biendata.com/competition/ccks_2019_ipre/), which has far more labeled data, so that the model acquires relation extraction capability. We then run the fine-tuned BERT model over the text of A Game of Thrones to see which character relationships it can extract (the types and names of the extracted relationships are determined by the training dataset located in the datasets/CCKS_IPRE folder).

The command to train the model is:

python notransfer_main.py

You can keep the default parameters.
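Under the hood, this amounts to fine-tuning a sequence classifier over entity-marked sentences. A minimal, hypothetical sketch with the Hugging Face transformers library follows; the model name, marker tokens, and relation count are assumptions, and the project's real training code lives in notransfer_main.py and models/bert.py:

import torch
from transformers import BertForSequenceClassification, BertTokenizer

NUM_RELATIONS = 35  # assumption: size of the CCKS-IPRE relation inventory

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=NUM_RELATIONS)

# Mark the entity pair so the classifier knows which relation to predict.
markers = ["[E1]", "[/E1]", "[E2]", "[/E2]"]
tokenizer.add_special_tokens({"additional_special_tokens": markers})
model.resize_token_embeddings(len(tokenizer))

def encode(sample):
    text, h, t = sample["text"], sample["h"]["name"], sample["t"]["name"]
    marked = text.replace(h, f"[E1]{h}[/E1]", 1).replace(t, f"[E2]{t}[/E2]", 1)
    return tokenizer(marked, truncation=True, max_length=128,
                     padding="max_length", return_tensors="pt")

# Training is then a standard cross-entropy loop over corpus_train.jsonl:
# outputs = model(**encode(sample), labels=torch.tensor([sample["label"]]))
# outputs.loss.backward(); optimizer.step()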

Using the trained model, we extract the character relationships defined by the CCKS-IPRE task from the novel text, ultimately generating 8,163 new relation triples, stored in the snapshot/source_True_pred_notNoneCCKS_triples.jsonl file. Additionally, because the nr tokens recognized by HanLP do not necessarily correspond to actual character entities, we focus on the true entities and their variants (candidate entities) that we crawled: among the 8,163 new relation triples, 3,003 occur exclusively between true entities or candidate entities (this subset is stored in source_True_pred_notNoneCCKS_triples_notNR.jsonl; a filtering sketch follows).
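A sketch of that filtering step; the JSONL field names ("h", "t") and the use of ent2id.json as the inventory of true/candidate entities are assumptions:

import json

with open("preprocessed_data/ent2id.json", encoding="utf-8") as f:
    known_entities = set(json.load(f))  # true entities and their candidate variants

kept = []
with open("snapshot/source_True_pred_notNoneCCKS_triples.jsonl", encoding="utf-8") as src:
    for line in src:
        triple = json.loads(line)
        # Keep only triples whose head and tail are both crawled entities,
        # discarding pairs that exist solely because HanLP tagged a token as nr.
        if triple["h"] in known_entities and triple["t"] in known_entities:
            kept.append(triple)

with open("snapshot/source_True_pred_notNoneCCKS_triples_notNR.jsonl", "w", encoding="utf-8") as dst:
    for triple in kept:
        dst.write(json.dumps(triple, ensure_ascii=False) + "\n")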

Knowledge Graph Inference and Completion

Development and Runtime Environment

  • Operating System: Windows
  • Java Version: 12.0.2
  • Jena Version: 3.13.1

Model Creation

  1. Creating the Model:
  • Use Jena’s ModelFactory tool to create a default model.
  2. Loading Data:
  • Use Jena’s FileManager tool to automatically read the TTL files into the created model.

Model Creation Details

Refer to the accompanying diagram for detailed steps.

Inference Implementation

  1. Sibling Relationship Inference:
  • Consider all 51 relationship types present in the triples.
  • For the siblings relationship, the inference rule is: if <A, siblings, B> and <B, siblings, C>, then infer <A, siblings, C>.
  • Process all 19,103 triples in asoiaf.ttl, collecting the entities connected by the sibling relationship.
  • If two inferred siblings do not already have a sibling relationship, create a new triple for it.
  • Result: 1,034 new triples with the relationship “siblings”.
  2. Spouse Relationship Inference:
  • For the mother, father, and spouse relationships, the inference rule is: if <A, mother, B> and <A, father, C>, then infer <B, spouse, C>.
  • Process all 19,103 triples in asoiaf.ttl, collecting the entities with mother and father relationships.
  • If B and C do not already have a spouse relationship, create a new triple for it.
  • Result: 4 new triples with the relationship “spouse”.
  3. Storing Inferred Triples:
  • The completed triples after inference are stored in asoiaf_after_inference.ttl.
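The two rules above can be pictured with the following sketch over plain (subject, relation, object) tuples; the project actually runs them with Jena, and the English relation names here mirror the ones used in this write-up:

from collections import defaultdict
from itertools import combinations

def infer(triples):
    """triples: a set of (subject, relation, object) tuples; returns the new triples."""
    new = set()

    # Rule 1: <A, siblings, B> and <B, siblings, C>  =>  <A, siblings, C>.
    sibling_of = defaultdict(set)
    for s, r, o in triples:
        if r == "siblings":
            sibling_of[s].add(o)
            sibling_of[o].add(s)
    for b, group in sibling_of.items():
        for a, c in combinations(sorted(group), 2):
            if (a, "siblings", c) not in triples and (c, "siblings", a) not in triples:
                new.add((a, "siblings", c))

    # Rule 2: <A, mother, B> and <A, father, C>  =>  <B, spouse, C>.
    mother = {s: o for s, r, o in triples if r == "mother"}
    father = {s: o for s, r, o in triples if r == "father"}
    for child, m in mother.items():
        f = father.get(child)
        if f and (m, "spouse", f) not in triples and (f, "spouse", m) not in triples:
            new.add((m, "spouse", f))

    return new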

Knowledge Graph Visualization

The entity_clustering.py file is responsible for clustering entities in the knowledge graph and calculating PageRank. Here’s how the process works:

Steps for Visualization

  1. PageRank Calculation:
  • Compute the importance of each entity in the graph using the PageRank algorithm, assigning each node a score based on its importance within the network.
  2. Spectral Clustering:
  • Perform spectral clustering on the entity network to categorize the entities into different groups.
  3. Visualization with JavaScript:
  • Use JavaScript to display the knowledge graph in a web format.
  • Graph representation: circles represent entities; the size of a circle reflects its PageRank score, and its color shows the category assigned by spectral clustering.
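A compact sketch of steps 1 and 2 with networkx and scikit-learn; the cluster count and the use of the adjacency matrix as a precomputed affinity are assumptions about entity_clustering.py:

import networkx as nx
from sklearn.cluster import SpectralClustering

def rank_and_cluster(edges, n_clusters=8):
    """edges: iterable of (head_entity, tail_entity) pairs from the knowledge graph."""
    graph = nx.Graph()
    graph.add_edges_from(edges)

    # Circle size in the web view: PageRank importance of each entity.
    pagerank_scores = nx.pagerank(graph)

    # Circle color: spectral-clustering category of each entity.
    nodes = list(graph.nodes())
    affinity = nx.to_numpy_array(graph, nodelist=nodes)
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(affinity)

    return pagerank_scores, dict(zip(nodes, labels))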

Accessing the Visualizations

Family Entities Graph

  • Navigate to the graphs_json/house_graph folder.
  • Run the command:

python JS_graph.py

  • Open your browser and visit:
  • Hover over the circles to view the entity names.

Castle Entities Graph

  • Navigate to the graphs_json/castle_graph folder.
  • Run the command:

python JS_graph.py

  • Hover over the circles to view the entity names.

All Entities Graph

  • Navigate to the graphs_json/all_graph folder.
  • Run the command:

python JS_graph.py

  • Open your browser and visit:
  • Hover over the circles to view the entity names.

Insights from the Visualization

  • The graphs reveal that there are no direct links between different castles. Typically, castles do not have direct relationships and are connected indirectly through families or characters.
