Harry Potter Character Relationship Knowledge Graph
1. Knowledge Extraction
The fan-maintained Chinese Harry Potter Wiki covers many aspects of the Harry Potter world and is the main source of knowledge for this project. Each character page on the wiki includes a tabular overview and a detailed text introduction of the character.
1.1 Extracting Knowledge from Semi-structured Data
The wiki character introduction pages contain a large amount of relationship information we need. Therefore, we wrote a crawler program to parse the semi-structured data on the pages (extracting the character’s attribute information and relationship information).
Our crawler is based on Scrapy, an application framework for crawling websites and extracting structured data. Scrapy consists of components such as the engine, scheduler, downloader, downloader middleware, and pipelines. The crawler code we implemented is located at ./crawler/spiders/wiki_spider.py. The crawled information goes through preliminary preprocessing in process_data.py (replacing variant characters and unifying the names used for the same concept) and is then stored in ./harry_potter_property.json. Here is an example:
"Lily Evans": {
    "Birth": "January 30, 1960 England",
    "Death": "October 31, 1981 (aged 21) Godric's Hollow, England",
    "Blood Status": "Muggle-born",
    …………}
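The preprocessing step described above can be sketched roughly as follows. This is not the project's process_data.py: the dot variants, the alias table, and the function names are all hypothetical and only illustrate how variant characters and differently named keys might be unified.

```python
import unicodedata

# Hypothetical variants of the middle dot used in transliterated names.
DOT_VARIANTS = {ord(c): "·" for c in "・•‧∙"}

# Hypothetical aliases: different infobox labels that mean the same thing.
KEY_ALIASES = {"Born": "Birth", "Date of birth": "Birth", "Died": "Death"}


def normalize_text(text: str) -> str:
    """Fold compatibility forms (e.g. full-width letters) and unify dot variants."""
    text = unicodedata.normalize("NFKC", text)
    return text.translate(DOT_VARIANTS).strip()


def normalize_record(record: dict) -> dict:
    """Return a copy of one character's attributes with unified keys and values."""
    cleaned = {}
    for key, value in record.items():
        key = normalize_text(key)
        cleaned[KEY_ALIASES.get(key, key)] = normalize_text(value)
    return cleaned
```

Keeping the alias table as plain data means new synonyms can be added without touching the code.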
1.2 Extracting Relationships from Pure Text Data
We also tried to use DeepKE (an open-source Chinese relation extraction tool based on deep learning from Zhejiang University) to extract relationships from the pure text character introductions.
Because no training data was available for this domain, we first searched for open-source person-relationship datasets and wrote a program to convert them into the format DeepKE accepts. We then split the data into training, test, and validation sets at a ratio of 8:1:1 and cleaned it. The final dataset contains 3,097 training samples, 306 test samples, and 387 validation samples across 14 relationship categories: unknown, spouse, parent-child, sibling, superior-subordinate, teacher-student, friend, classmate, cooperation, colleague, lover, grandparent-grandchild, schoolmate, and relative.
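A minimal sketch of an 8:1:1 split is shown below. It is an illustration only, not the conversion program used in the project; the function name, the fixed seed, and the list-based sample representation are assumptions. Because cleaning removes samples after the split, the final set sizes need not match the ratio exactly.

```python
import random


def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    """Shuffle samples and split them into train/test/validation lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_test = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]  # the remainder goes to validation
    return train, test, valid
```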
We trained the CNN model with the default training configuration. The best validation accuracy was 50%, and the test accuracy was 48%.
After training the model, we used predict.py to extract person relationships from the pure text character introductions on the Harry Potter Wiki. Here are three examples:
[Example 1]
Input sentence: Draco Malfoy is a pure-blood wizard, the only child of Lucius Malfoy and Narcissa Malfoy (née Black).
Input head entity: Draco Malfoy
Input head entity type (can be empty, press enter to skip): Person
Input tail entity: Lucius Malfoy
Input tail entity type (can be empty, press enter to skip): Person
[main][INFO] — The relationship between “Draco Malfoy” and “Lucius Malfoy” in the sentence is “parent-child”, with a confidence of 0.20.
[Example 2]
Input sentence: Draco eventually married Astoria Greengrass and had at least one child — Scorpius Malfoy.
Input head entity: Draco
Input head entity type (can be empty, press enter to skip): Person
Input tail entity: Astoria Greengrass
Input tail entity type (can be empty, press enter to skip): Person
[main][INFO] — The relationship between “Draco” and “Astoria Greengrass” in the sentence is “spouse”, with a confidence of 0.34.
[Example 3]
Input sentence: Hermione also took good care of her children. When Ron joked that whoever wasn’t sorted into Gryffindor would be disinherited, Hermione comforted her daughter Rose and nephew Albus.
Input head entity: Hermione
Input head entity type (can be empty, press enter to skip): Person
Input tail entity: Rose
Input tail entity type (can be empty, press enter to skip): Person
[main][INFO] — The relationship between “Hermione” and “Rose” in the sentence is “parent-child”, with a confidence of 0.41.
Since the relationships extracted from the semi-structured data have a finer granularity, higher accuracy, and cover a more comprehensive range, we only used the information obtained in Section 1.1 for the subsequent graph construction.
2. Knowledge Storage
We first wrote a program to convert ./harry_potter_property.json into a file of relationship triples, harry_potter.json, for storage. Sample entries are as follows:
{"entity1": "Arthur Weasley", "entity2": "Ron Weasley", "relation": "father"}
{"entity1": "Molly Weasley, née Prewett", "entity2": "Ron Weasley", "relation": "mother"}
{"entity1": "Bill Weasley", "entity2": "Ron Weasley", "relation": "brother"}
{"entity1": "Charlie Weasley", "entity2": "Ron Weasley", "relation": "brother"}
{"entity1": "Percy Weasley", "entity2": "Ron Weasley", "relation": "brother"}
{"entity1": "Fred Weasley", "entity2": "Ron Weasley", "relation": "brother"}
{"entity1": "George Weasley", "entity2": "Ron Weasley", "relation": "brother"}
{"entity1": "Ginny Weasley", "entity2": "Ron Weasley", "relation": "sister"}
{"entity1": "Hermione Granger", "entity2": "Ron Weasley", "relation": "wife"}
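The conversion from per-character records to triples might look like the sketch below. The set of relationship keys, the assumption that a value is either one name or a list of names, and the function names are hypothetical; the source does not show the actual conversion program.

```python
# Hypothetical set of keys treated as relationships rather than plain attributes.
RELATION_KEYS = {"father", "mother", "brother", "sister", "wife", "husband"}


def record_to_triples(name, record):
    """Yield one triple dict per related person found in a character record."""
    for key, value in record.items():
        if key not in RELATION_KEYS:
            continue  # plain attributes such as "Birth" produce no triple
        # Assumption: a value is either a single name or a list of names.
        related = value if isinstance(value, list) else [value]
        for other in related:
            # Follows the sample format: entity1 is the <relation> of entity2.
            yield {"entity1": other, "entity2": name, "relation": key}


def properties_to_triples(properties):
    """Flatten {character: record} into a list of triples."""
    triples = []
    for name, record in properties.items():
        triples.extend(record_to_triples(name, record))
    return triples
```

Each resulting dict can be written out with json.dumps, one per line, to match the sample entries.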
As the course progressed, we moved on to using Neo4j to store and apply the knowledge graph. Neo4j is an ACID-compliant graph database management system that, unlike relational databases, natively supports graph storage and processing. To prepare the data for import, we wrote json2rdf.py to convert the crawled semi-structured data into RDF, serialized in Turtle format.
RDF (Resource Description Framework) is a resource description language that can represent the semantic relationships between things using a set of triples.
The resource entities in the converted RDF files include characters, organizations, schools, etc. Here is an example:
character:Luna Lovegood
relation:name “Luna Lovegood”;
a relation:Character;
relation:born “13 February 1981, Britain”;
relation:blood-status blood:Pure-blood or Half-blood;
relation:marital-status “Married”;
relation:species “Human”;
relation:gender “Female”;
relation:hair-color “Dirty blonde”;
relation:eye-color “Light silvery-gray”;
relation:skin-color “Pale”;
relation:mother character:Pandora Lovegood;
relation:spouse character:Rolf Scamander;
relation:son character:Lorcan Scamander;
relation:son character:Lysander Scamander;
relation:grandparent character:Newt Scamander;
relation:occupation “Magizoologist”;
relation:house house:Ravenclaw;
relation:affiliated group:Lovegood family;
relation:affiliated group:Scamander family;
relation:affiliated group:Hogwarts School of Witchcraft and Wizardry;
relation:affiliated group:Ravenclaw;
relation:affiliated group:Dumbledore’s Army;
relation:affiliated group:Order of the Phoenix;
relation:affiliated group:The Quibbler.
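As a rough illustration of the JSON-to-Turtle step, the sketch below serializes one character. It is not the project's json2rdf.py. The example.org namespace, the function names, and the input shapes are assumptions, and names are percent-encoded into absolute IRIs because Turtle identifiers cannot contain raw spaces.

```python
import json
from urllib.parse import quote

BASE = "http://example.org/hp/"  # hypothetical namespace
PREFIXES = f"@prefix relation: <{BASE}relation/> .\n"


def iri(kind, name):
    """Build an absolute IRI; percent-encoding avoids spaces in identifiers."""
    return f"<{BASE}{kind}/{quote(name)}>"


def literal(value):
    """Quote a string literal; JSON string escaping is compatible with Turtle here."""
    return json.dumps(value, ensure_ascii=False)


def character_to_turtle(name, attributes, relations):
    """Serialize one character as a Turtle block.

    attributes: dict of predicate -> literal string
    relations:  list of (predicate, target kind, target name) tuples
    """
    pairs = [("a", "relation:Character"), ("relation:name", literal(name))]
    pairs += [(f"relation:{p}", literal(v)) for p, v in attributes.items()]
    pairs += [(f"relation:{p}", iri(kind, target)) for p, kind, target in relations]
    body = " ;\n".join(f"    {p} {o}" for p, o in pairs)
    return f"{iri('character', name)}\n{body} .\n"
```

Writing PREFIXES followed by one block per character gives a file that a Turtle parser should accept, provided every predicate name is a valid prefixed name.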
Next, we used the Neosemantics plugin in Neo4j Labs to import the RDF file into the Neo4j database. Using Neo4j’s visualization tool, we can see the overall structure of the graph.
The node types include characters, organizations, and schools. We query the database with Cypher, a declarative graph query language designed for expressive and efficient queries over property graphs. Unlike SQL, which operates on relational tables, Cypher works on data structured as nodes and relationships and focuses on how entities are connected. For example, a Cypher query can retrieve character-related information such as all members of the Order of the Phoenix.
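A membership query of this kind could be issued from Python roughly as follows. The label Character, the relationship type affiliated, and the name property are guesses based on the RDF example; depending on how the import is configured, the names stored in Neo4j may carry namespace prefixes. The connection URI and credentials are placeholders.

```python
# Assumed schema: (:Character)-[:affiliated]->(group node with a `name` property).
MEMBERS_QUERY = """
MATCH (c:Character)-[:affiliated]->(g {name: $group})
RETURN c.name AS name
ORDER BY name
"""


def connect(uri="bolt://localhost:7687", user="neo4j", password="password"):
    """Open a connection; the URI and credentials are placeholders."""
    from py2neo import Graph  # imported lazily; needs py2neo and a running Neo4j
    return Graph(uri, auth=(user, password))


def find_members(graph, group):
    """Return the names of all characters affiliated with the given group.

    `graph` is expected to behave like py2neo.Graph: run() returns a cursor
    whose data() method gives a list of dicts.
    """
    rows = graph.run(MEMBERS_QUERY, group=group).data()
    return [row["name"] for row in rows]
```

With a live database, find_members(connect(), "Order of the Phoenix") would return the matching names. Passing the group name as a query parameter rather than formatting it into the string avoids quoting problems with names such as "Dumbledore's Army".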
3. Knowledge Computation
After storing the obtained data in the Neo4j database, we used the Cypher query language to perform a series of mainstream knowledge computation tasks.
3.1 Basic Graph Information
We computed and visualized some basic statistics of the knowledge graph, including the total number of nodes, node degrees, and shortest path lengths. Here are some examples.
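For readers without a Neo4j instance, the same kinds of statistics can be approximated directly on the triple file. The sketch below is a plain-Python illustration rather than the Cypher queries used in the project; it treats every triple as an undirected edge, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(triples):
    """Build an undirected adjacency map from entity1/entity2 triples."""
    adjacency = defaultdict(set)
    for t in triples:
        adjacency[t["entity1"]].add(t["entity2"])
        adjacency[t["entity2"]].add(t["entity1"])
    return adjacency


def shortest_path_length(adjacency, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

The node count is len(adjacency), and the degree of a node is len(adjacency[node]).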
3.2 PageRank
PageRank is an algorithm used by Google to rank web pages in its search engine results. It estimates the importance of a page from the number and quality of the links pointing to it. In this work, we used the Graph Data Science (GDS) library provided by Neo4j to compute PageRank scores and to perform the subsequent community detection. The results (top-scoring characters) are as follows:
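To make the idea concrete, here is a small power-iteration sketch of PageRank. It is a textbook formulation, not the GDS implementation, whose scores are scaled differently; the damping factor of 0.85 and the iteration count are conventional choices assumed here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to a list of nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node, targets in links.items():
            for target in targets:
                # Each node shares its rank equally among its outgoing links.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank
```

In this formulation the scores always sum to 1, so they can be read as the share of attention each node receives.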
3.3 Community Detection
In graph theory, a subset of nodes with relatively dense internal connections corresponds to a community. Communities where node sets do not overlap are called disjoint communities, while those with overlapping node sets are called overlapping communities. The phenomenon of communities in a network graph is called community structure, which is a common feature of networks. The process of finding the community structure of a given network graph is called community detection. A key function of community detection algorithms is to extract useful information from the network.
In this work, we used the Louvain algorithm, one of the community detection algorithms provided by GDS, to partition the graph. Entities belonging to the same family are mostly grouped into the same community, which suggests that the detected communities are meaningful.
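Louvain works by greedily increasing modularity, a score that compares the number of edges inside communities with the number expected if edges were placed at random. The sketch below only computes modularity for a given partition of an undirected, unweighted graph with at least one edge; it does not implement Louvain itself, and the function name and input format are assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs; community_of: dict node -> community id.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    # Q = sum over communities of (internal edge share - expected share).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)
```

A partition that follows the dense regions of the graph scores higher than one that lumps everything together, which is the signal Louvain climbs.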
4. Knowledge Application: Rule-based Question Answering
Building on the work above, we tried rule-based question answering over the knowledge graph as an application scenario. During interaction, the user enters a question, either a wh-question or a yes/no question; the question answering algorithm extracts keywords from it, runs a database query, and returns the corresponding result.
The question forms we currently support include:
- Querying the attributes of a person, including birth/death/blood status/species/gender/height/marriage/occupation/house
- Querying the people related to a person, such as "Who is Harry Potter's father?" and "Whose father is Harry Potter?"
- Querying organization membership, including which organizations a person belongs to and which members an organization has
The specific effects are as follows:
After receiving the question text, the algorithm first performs word segmentation and identifies the following kinds of keywords in it: attribute words, proper nouns, organization words, relationship words, and prepositions indicating membership. Using these keywords, it connects to the Neo4j database through the py2neo library and runs Cypher queries. Finally, it assembles the returned results into the answer.
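The keyword-to-query step can be sketched as below. This is a simplified illustration, not the project's code: it matches English keywords with regular expressions instead of running word segmentation, the vocabularies are tiny hypothetical samples, and the graph schema (a Character label, a name property, relationships pointing from a person to their relative as in the RDF example) is assumed.

```python
import re

# Hypothetical vocabularies; a real system would derive these from the graph.
ATTRIBUTES = {"birth", "death", "species", "gender", "occupation", "house"}
RELATIONS = {"father", "mother", "brother", "sister", "spouse", "son"}
PERSONS = {"Harry Potter", "James Potter", "Luna Lovegood"}


def find_first(question, vocabulary):
    """Return the vocabulary entry that occurs earliest in the question, if any."""
    hits = []
    for word in vocabulary:
        match = re.search(rf"\b{re.escape(word)}\b", question, re.IGNORECASE)
        if match:
            hits.append((match.start(), word))
    return min(hits)[1] if hits else None


def build_query(question):
    """Map a question to (cypher, parameters), or None if no rule matches."""
    person = find_first(question, PERSONS)
    if person is None:
        return None
    attribute = find_first(question, ATTRIBUTES)
    if attribute:
        cypher = f"MATCH (c:Character {{name: $name}}) RETURN c.`{attribute}` AS answer"
        return cypher, {"name": person}
    relation = find_first(question, RELATIONS)
    if relation:
        # "Whose father is X?" asks for the other end of the relationship.
        if re.search(r"\bwhose\b", question, re.IGNORECASE):
            pattern = f"(other)-[:`{relation}`]->(c:Character {{name: $name}})"
        else:
            pattern = f"(c:Character {{name: $name}})-[:`{relation}`]->(other)"
        return f"MATCH {pattern} RETURN other.name AS answer", {"name": person}
    return None
```

Attribute and relationship names are interpolated only from the fixed vocabularies, while the person's name is passed as a query parameter. The resulting pair can be handed to py2neo as graph.run(cypher, **parameters).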