Demon Slayer Knowledge Graph Project Report

EpiK Protocol
Aug 15, 2024


Data Preprocessing for the Knowledge Graph

Data Sources

The data for this graph comes from the Fandom Demon Slayer wiki and the Moegirl Encyclopedia entries. The project constructs a knowledge graph primarily focused on characters, organizations, and skills. The main steps include:

  1. Data Crawling: Using web crawlers to scrape semi-structured data and text data from the wiki.
  2. Data Processing: Converting the scraped semi-structured data into triples and extracting knowledge from the text data.
  3. Data Merging and Cleaning: Merging the obtained triples, manually checking and cleaning them, and converting them into RDF/XML format.
  4. Database Import: Importing the knowledge graph into a Neo4j database for visualization and building a Q&A application.

Data Preprocessing Sources

The semi-structured data used in this project is sourced from the Chinese-language Demon Slayer wiki on Fandom, available at Fandom Wiki. This site provides detailed descriptions of the world and story of Demon Slayer, along with rich information on character traits and relationships. The text data used for deeper knowledge extraction is obtained by merging information from the Moegirl Encyclopedia entry on Demon Slayer, accessible at Moegirl Encyclopedia.

Entity Data Scraping

The entity data required for constructing the Demon Slayer knowledge graph is presented in semi-structured form on the Fandom wiki. The following image illustrates how entities such as "Tanjirou Kamado," "Inosuke Hashibira," and "Zenitsu Agatsuma" from the main protagonist trio are displayed on the directory page. Analysis shows that the entity URLs share a common prefix followed by the individual character name, and this pattern allows us to scrape character entity information from the site.

We use the BeautifulSoup and requests_html libraries to scrape all entity data from the website. The code is as follows:

from requests_html import HTMLSession

def Extraction_entity(keywords, web):
    all_entities = []
    current_keywords = keywords.copy()
    init_keywords = keywords.copy()
    while len(current_keywords) >= 1:
        # take the next category page to visit
        seed = current_keywords.pop(0)
        print('visiting', seed)
        url = web + seed
        session = HTMLSession()
        response = session.get(url)
        a_list = response.html.find('a')
        # collect the member links listed on this category page
        current_ents = []
        for a in a_list:
            if a.attrs.get('class', '') == ('category-page__member-link', ):
                current_ents.append(a.attrs['title'])
        for t in current_ents:
            if 'Template' in t:
                continue
            if 'Category' in t:
                # sub-categories not seen before are queued for crawling
                if t not in init_keywords:
                    current_keywords.append(t)
                    init_keywords.append(t)
            else:
                # ordinary pages are collected as entities
                all_entities.append(t)
    return all_entities

# wiki of Demon Slayer
web = 'https://kimetsu-no-yaiba.fandom.com/zh/wiki/'

# seed categories: Demon Slayer Corps, Twelve Kizuki, Blood Demon Arts, Breathing Styles,
# Hashira, Infinity Castle, Humans, Demons, protagonist group
keywords = ['Category:鬼杀队',
            'Category:十二鬼月',
            'Category:血鬼术',
            'Category:呼吸法',
            'Category:柱',
            'Category:无限城',
            'Category:人类',
            'Category:鬼',
            'Category:主角团']

all_entities = Extraction_entity(keywords, web)

Semi-Structured Entity Relationship Extraction and Transformation

Taking the character entity “Zenitsu Agatsuma” as an example, its data page is shown in the image below. As seen, the main information about “Zenitsu Agatsuma” is presented on the right side in the form of a knowledge card.

We perform semi-structured entity relationship extraction and transformation on the right side of the knowledge card using dictionary key-value pairs. The final scraped triple relationships for the character entity “Zenitsu Agatsuma” are shown in the image below:
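The extraction itself can be sketched as follows (a minimal illustration, assuming Fandom's portable-infobox markup with pi-data-label / pi-data-value classes; the extract_knowledge_card helper is ours for illustration, not the project's actual code):

from requests_html import HTMLSession

def extract_knowledge_card(entity, web):
    """Read one entity's knowledge card into (entity, attribute, value) triples."""
    session = HTMLSession()
    response = session.get(web + entity)
    triples = []
    # each infobox row pairs an attribute label with a value
    for row in response.html.find('.pi-item.pi-data'):
        label = row.find('.pi-data-label', first=True)
        value = row.find('.pi-data-value', first=True)
        if label and value:
            triples.append([entity, label.text.strip(), value.text.strip()])
    return triples

relations = []
for entity in all_entities:
    relations.extend(extract_knowledge_card(entity, web))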

Simplified and Traditional Chinese Conversion

Since the directly scraped entity and relationship information is in Traditional Chinese, we use the zhconv library to convert it to Simplified Chinese. The code is as follows:

import zhconv

def trans_to_zhhans_entites(use_entities):
    trans_entities = []
    for i in use_entities:
        # convert each Traditional Chinese string to Simplified Chinese
        trans = zhconv.convert(i, 'zh-hans')
        trans_entities.append(trans)
    return trans_entities
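For example, the entity list scraped earlier can be converted in a single call (variable names follow the earlier snippets):

all_entities = trans_to_zhhans_entites(all_entities)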

Data Cleaning

For the scraped entity attributes and entity relationships, we clean the data in the following steps:

1. Remove Invalid Information:
Among the scraped triple relationships, we only keep the information necessary for building the knowledge graph, such as profession, relatives, and abilities. Therefore, we skip other key-value pairs. The specific code is as follows:

# remove unwanted info: skip attributes that are not needed for the graph
wanted_relations = []
for d in relations:
    if d[1] in ['日文名', '状态', '罗马字', '体重', '出生日期', '年龄', '身高', '死因', '物种',
                '动画', '漫画', '首次登场', '日本声优', '舞台演员', '香港配音员', '中国配音员',
                '美国配音员', '台湾声优', '喜欢的食物', '兴趣', '使用者', '出生地', '伙伴', '日文']:
        continue
    wanted_relations.append(d)

2. Correcting Formatting Errors

The relationship data scraped directly from the website often contains formatting errors. For example, in the “relatives” attribute, the relationships for the entity “Kanao Tsuyuri” (such as “father,” “mother,” and “younger brother”) are presented in parentheses following the entity name, as shown in the image below.

To address this, we use the following code to convert each relationship into triples.

clean_relations = []
for d in wanted_relations:
    if d[1] not in ['亲属']:
        # attributes other than "relatives" are kept as-is
        clean_relations.append([d[0], d[2], d[1]])
        continue
    else:
        if d[2][-1] == ')':
            # the "relatives" value looks like "Name (relation) Name (relation) ..."
            relation = d[2][:-1].strip().split(')')
            for relate in relation:
                if ' (' in relate:
                    r = relate.strip().split(' (')
                    r[0] = clean_r(r[0])  # clean_r: small project helper (not shown) that tidies the name string
                    clean_relations.append([d[0], r[0], r[1]])
                elif '、' in relate:
                    # several names share one relation, e.g. "A、B (relation)"
                    r = relate.strip().split('(')
                    e = r[0].split('、')
                    for i in e:
                        clean_relations.append([d[0], i, r[1]])

The relative relationships are ultimately organized into triples in the following format:

  • (Entity, Relationship, Relative Name)

For example:

  • (Kanao Tsuyuri, father, Kanao’s Father)
  • (Kanao Tsuyuri, mother, Kanao’s Mother)
  • (Kanao Tsuyuri, younger brother, Kanao’s Younger Brother)

Text Knowledge Extraction for the Knowledge Graph

Entity Recognition

  • Model: A Bidirectional Long Short-Term Memory network (BiLSTM) is used.
  • Annotation Set: The BMOES format is employed, focusing primarily on named entities (people). The annotation format is as follows:

炭 B-NAME
治 M-NAME
郎 E-NAME
前 O
往 O
花 O
街 O
执 O
行 O
任 O
务 O

  • Dataset

Initially, the dataset utilized the BosonNLP_NER_6C entity corpus, where named entities like person names were annotated as shown in the image below. The annotations need to be converted to the BMOES format mentioned above.

The initial training results were not ideal. For example, the name "Tanjiro" (炭治郎) was not recognized at all, while "花街" (the entertainment district) was picked up as an organization instead. The reason is that the entities to be recognized are names from the Japanese anime Demon Slayer, i.e. Chinese transliterations of Japanese names, whereas the training dataset contained only native Chinese names, so the model could not recognize them.

['炭', '治', '郎', '前', '往', '花', '街', '执', '行', '任', '务', '。']
['O', 'O', 'O', 'O', 'O', 'B-ORG', 'E-ORG', 'O', 'E-ORG', 'O', 'O']

To better recognize Japanese names, we first found a Chinese corpus of Japanese names on Gitee. The names in this corpus were in traditional Chinese, so we converted them to simplified Chinese. Then, we replaced the names in the BosonNLP_NER_6C entity corpus with our curated data and proceeded with training.
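A minimal sketch of this replacement over a character-per-line BMOES corpus (the S-NAME handling and the random sampling of replacement names are our assumptions for illustration, not the project's exact script):

import random

def retag_name(name):
    """Produce BMOES tags for a replacement name."""
    if len(name) == 1:
        return [(name, 'S-NAME')]
    return ([(name[0], 'B-NAME')]
            + [(c, 'M-NAME') for c in name[1:-1]]
            + [(name[-1], 'E-NAME')])

def replace_names(sentence, jp_names):
    """sentence: list of (char, tag) pairs; swap every NAME span for a random Japanese name."""
    out, i = [], 0
    while i < len(sentence):
        char, tag = sentence[i]
        if tag in ('B-NAME', 'S-NAME'):
            i += 1
            # consume the rest of the original (Chinese) name span ...
            while i < len(sentence) and sentence[i][1] in ('M-NAME', 'E-NAME'):
                i += 1
            # ... and splice in a converted Japanese name with fresh tags
            out.extend(retag_name(random.choice(jp_names)))
        else:
            out.append((char, tag))
            i += 1
    return out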

Finally, after training and testing, we obtained the following results:

Results from the original corpus:

['灶', '门', '炭', '治', '郎', '前', '往', '花', '街', '执', '行', '任', '务', '。']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'E-ORG', 'O', 'E-ORG', 'O', 'O']

Results from the corpus with replaced Japanese names:

['灶', '门', '炭', '治', '郎', '前', '往', '花', '街', '执', '行', '任', '务', '。']
['O', 'O', 'M-NAME', 'M-NAME', 'E-NAME', 'O', 'O', 'B-ORG', 'E-ORG', 'O', 'O', 'O', 'O']

Results when a Japanese name from the training data appears in the test sample:

['阿', '保', '刚', '前', '往', '花', '街', '执', '行', '任', '务', '。']
['B-NAME', 'M-NAME', 'E-NAME', 'O', 'O', 'B-ORG', 'E-ORG', 'O', 'O', 'O', 'O']

Compared with the original corpus, the corpus with Japanese names substituted in is better at recognizing the ending characters of names, while the beginning characters remain difficult to identify. Comparing this with the result for a name seen during training, it is evident that the generalization ability of our trained model is still not ideal.
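For reference, the kind of BiLSTM sequence tagger described above can be sketched as follows (PyTorch; the hyperparameters, vocabulary size, and the plain softmax output layer instead of a CRF are assumptions, not the project's actual implementation):

import torch
import torch.nn as nn

TAGS = ['O', 'B-NAME', 'M-NAME', 'E-NAME', 'S-NAME']  # BMOES tags for person names

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_tags=len(TAGS)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden_dim, num_tags)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) indices of Chinese characters
        emb = self.embed(char_ids)
        out, _ = self.lstm(emb)
        return self.fc(out)  # (batch, seq_len, num_tags) tag scores

model = BiLSTMTagger(vocab_size=5000)
criterion = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks padded positions
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(char_ids, tag_ids):
    logits = model(char_ids)
    loss = criterion(logits.view(-1, len(TAGS)), tag_ids.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()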

Plain Text Knowledge Extraction

In comparison, DeepKE performs better in entity recognition and plain text knowledge extraction. Therefore, we ultimately decided to abandon our own trained model and instead use DeepKE’s model for plain text knowledge extraction.

For the extracted samples, we first performed the following preprocessing.

As shown in the image above, the data source consists of encyclopedia entries, which have issues such as missing subjects and multiple aliases for the same entity. We began by segmenting the crawled text into sentences at the punctuation marks ，、；、。、！ and ？. Then we prefixed each sentence with the character entity it describes, so that every sentence has a complete subject. Finally, we manually removed some obviously problematic short sentences to form the extraction samples, as illustrated in the image below.
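A minimal sketch of this preprocessing (the regular expression, the subject argument, the length threshold, and the example sentence are illustrative assumptions):

import re

def preprocess(raw_text, subject, min_len=4):
    """Split encyclopedia text into sentences and prefix each with an explicit subject."""
    # split on the Chinese comma, semicolon, full stop, exclamation and question marks
    pieces = re.split(r'[，；。！？]', raw_text)
    samples = []
    for piece in pieces:
        piece = piece.strip()
        if len(piece) < min_len:  # drop fragments too short to be useful
            continue
        if not piece.startswith(subject):
            piece = subject + piece  # complete the missing subject
        samples.append(piece)
    return samples

samples = preprocess('性格胆小怕事，使用雷之呼吸。', subject='我妻善逸')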

Lastly, we used DeepKE for knowledge extraction and standardized the aliases of the same entity in the extraction results. These were then merged with the previously created semi-structured triples to form the final triples.
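Alias standardization can be done with a simple mapping from aliases to one canonical name per entity; a small illustrative sample (assuming triples ordered as head, relation, tail; the alias table below is not the project's full list):

# map aliases to a canonical entity name
alias_map = {
    '善逸': '我妻善逸',
    '炭治郎': '灶门炭治郎',
    '祢豆子': '灶门祢豆子',
}

def normalize_aliases(triples, alias_map):
    return [[alias_map.get(h, h), r, alias_map.get(t, t)] for h, r, t in triples]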

Triples Converted to RDF

We converted the knowledge graph triples into RDF/XML format, placing all resources under a default URI namespace; the resulting RDF/XML file is shown below.
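A minimal sketch of such a conversion with rdflib (the namespace URI, the output file name, and the final_triples variable are illustrative assumptions):

from urllib.parse import quote
from rdflib import Graph, Namespace, URIRef

NS = Namespace('http://example.org/kimetsu/')  # default namespace (illustrative)
g = Graph()

for head, relation, tail in final_triples:  # merged triples from the previous steps
    # URI-encode the Chinese labels so they form valid URIs under the default namespace
    g.add((URIRef(NS + quote(head)),
           URIRef(NS + quote(relation)),
           URIRef(NS + quote(tail))))

g.serialize(destination='kimetsu_kg.rdf', format='xml')  # 'xml' produces RDF/XML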

Application of the Knowledge Graph (Visualization and Q&A)

Importing the Knowledge Graph into the Neo4j Database

To import the knowledge graph into the Neo4j database, we will install Neo4j using Docker:

docker pull neo4j:4.4.12

Run the container, setting up the port and volume mappings as well as the username and password:

sudo docker run -d --name neo4j-4.4.12-container \
    -p 7474:7474 -p 7687:7687 \
    -v /home/touch/neo4j4.4.12/data:/data \
    -v /home/touch/neo4j4.4.12/logs:/logs \
    -v /home/touch/neo4j4.4.12/conf/:/var/lib/neo4j/conf \
    -v /home/touch/neo4j4.4.12/import/:/var/lib/neo4j/import \
    -v /home/touch/neo4j4.4.12/plugins/:/var/lib/neo4j/plugins \
    --env NEO4J_AUTH=neo4j/123456 neo4j:4.4.12

Using the Neo4j Python library, together with the neosemantics and APOC plugins, we write a Python script to import the acquired data into the Neo4j database:

python ./src/create_graph.py
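The script itself is not reproduced in this post; a rough sketch of how such an import could look with the official neo4j Python driver and the neosemantics (n10s) procedures, under assumed connection details and file paths, is:

from neo4j import GraphDatabase

# connection details match the docker run command above (assumed)
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', '123456'))

with driver.session() as session:
    # neosemantics requires a uniqueness constraint on Resource.uri and a graph config
    session.run(
        'CREATE CONSTRAINT n10s_unique_uri IF NOT EXISTS '
        'ON (r:Resource) ASSERT r.uri IS UNIQUE')
    session.run('CALL n10s.graphconfig.init()')
    # import the RDF/XML file placed in the mounted import directory
    session.run(
        'CALL n10s.rdf.import.fetch($url, "RDF/XML")',
        url='file:///var/lib/neo4j/import/kimetsu_kg.rdf')

driver.close()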

Knowledge Graph Visualization and Querying

After importing the data into the Neo4j database, you can connect via a web browser using the configured port to execute Cypher statements for database operations, such as displaying all nodes:

MATCH (n) RETURN (n)

The result is as follows. Each node entity is categorized by type with labels: red for character entities (e.g., Tanjiro Kamado, Giyu Tomioka), blue for skill entities (e.g., Sun Breathing, Water Breathing), and green for organization entities (e.g., Demon Slayer Corps, Twelve Kizuki). Directed edges connect the relationships between entities, and each entity can be clicked to view its attributes (e.g., gender, hair color, eye color, weapon, occupation).
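Beyond returning all nodes, more targeted Cypher queries can be issued, for example retrieving the neighborhood of a single character (the name property and its value are assumptions about how the nodes were created):

MATCH (c {name: '灶门炭治郎'})-[r]->(m)
RETURN c, r, m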
