Knowledge Graph and Q&A for Detective Conan

EpiK Protocol
Jun 8, 2024

The EpiK protocol is dedicated to building high-quality knowledge graphs covering a wide range of domains. Leveraging its well-developed label-to-earn mechanism, EpiK has attracted over 120,000 registered users from around the world to participate in knowledge curation and annotation collectively.

Today, we are delighted to present the knowledge graph of “Detective Conan” created by the Chinese-speaking community. This graph encompasses rich content related to the character relationships, case plots, and scientific principles featured in the Conan manga and anime series, providing a systematic learning and exchange platform for Conan fans.

Let us explore this captivating “Detective Conan” knowledge graph together, unraveling the mysteries surrounding classic characters like Shinichi, Ran, and Haibara, and immersing ourselves in the allure of detective fiction. This is sure to be a truly valuable knowledge journey.

1. Knowledge Graph Construction

1.1 Data Sources

Data Crawling
Baidu Baike has relatively detailed entries for the characters and works of the Detective Conan series, so most of this project’s knowledge comes from that site. We crawled the encyclopedia entries and collected two kinds of content for later processing: the structured basic-information table of each character and the unstructured text.

When writing the crawler program, we mainly used the Python requests library to obtain the page source code, and then located the required content using the XPath parsing method.

First, we crawled the unstructured information. We chose episode plots and main character introductions as the targets for crawling unstructured text. In the “Episode Plot” entries, we could crawl the titles and main plots of each episode, and also obtain the basic introductions of each character in the “Character Introduction” entries. We wrote the crawled data to a txt document, obtaining 53 major characters and the plots of 1068 anime episodes. The well-organized txt document is shown below.

Next, we crawled the structured information. The image above shows the character introduction part of the “Detective Conan” entry. From the image, we can find that each major character has a corresponding entry hyperlink. We found the XPath of the hyperlink part in the source code and obtained the list of sub-pages to be crawled. On each character’s corresponding sub-page, there is a table format describing the character’s basic information. We crawled this part of the content and organized it into a JSON format file to obtain the structured triple information describing the character relationships.

The crawler program used to crawl the character information table is as follows:
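A minimal sketch of such a crawler, not the project’s actual code. To stay runnable with the standard library alone, it swaps requests and XPath for urllib and html.parser, and it assumes the basic-information table is marked up as <dt>/<dd> pairs; the function names and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class InfoboxParser(HTMLParser):
    """Collect <dt>/<dd> pairs (the assumed layout of the information table)."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # finished (attribute, value) pairs
        self._tag = None   # "dt" or "dd" while inside one, else None
        self._key = ""
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing tag of a nested element such as <a>
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if tag == "dt":
            self._key = text
        elif self._key:
            self.pairs.append((self._key, text))
            self._key = ""
        self._tag = None


def parse_infobox(html):
    parser = InfoboxParser()
    parser.feed(html)
    return dict(parser.pairs)


def fetch(url):
    # Network call; a browser-like User-Agent header is assumed to be needed.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def to_triples(name, info):
    """Turn one character's attribute table into (entity, attribute, value) triples."""
    return [(name, attr, value) for attr, value in info.items()]


# Offline demonstration on a hand-written snippet instead of a live page.
sample = """
<dl><dt>Gender</dt><dd>Male</dd>
<dt>Occupation</dt><dd><a href="#">High school student</a>, detective</dd></dl>
"""
info = parse_infobox(sample)
triples = to_triples("Shinichi Kudo", info)
```

In a real run, fetch would be called once per character sub-page and the resulting dictionaries written out as JSON.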

Data Cleaning

(1) After building the entity set from the crawled data, we tallied the attributes found in the semi-structured data, merged attributes with the same meaning, removed characters that were neither Chinese nor English, and dropped low-frequency attributes. We retained the following 15 attributes; the cleaned entities and their attributes can be viewed in the file ./data/new_property_clean.csv or ./data/new_property_json.csv.

Chinese Name: string, Alias: string, Birthday: string, Alma Mater: string, Gender: string, Occupation: string, School: string, Police Rank: string, Identity: string, Unit: string, Age: string, Death Day: string, Appeared Works: string, True Identity: string, First Appearance: string
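The cleaning in step (1) can be sketched roughly as follows. The synonym map, the frequency threshold, and the sample records are all hypothetical; the project’s real merge rules are not shown in this article.

```python
import re
from collections import Counter

# Hypothetical synonym map: raw attribute name -> canonical attribute name.
SYNONYMS = {"Date of Birth": "Birthday", "Sex": "Gender"}

# Keep only Chinese characters, ASCII letters, and spaces in attribute names.
NON_TEXT = re.compile(r"[^\u4e00-\u9fffA-Za-z ]")


def normalize(attr):
    cleaned = NON_TEXT.sub("", attr).strip()
    return SYNONYMS.get(cleaned, cleaned)


def clean_attributes(entities, min_count=2):
    """entities: {name: {raw_attr: value}} -> same shape with cleaned attributes.

    Attributes seen on fewer than min_count entities are dropped
    (the threshold is an assumption).
    """
    counts = Counter(normalize(a) for attrs in entities.values() for a in attrs)
    kept = {a for a, c in counts.items() if c >= min_count}
    return {
        name: {normalize(a): v for a, v in attrs.items() if normalize(a) in kept}
        for name, attrs in entities.items()
    }


# Placeholder values; only the attribute names matter for this sketch.
raw = {
    "Shinichi Kudo": {"Sex": "Male", "Birthday:": "(value)", "Blood Type": "(value)"},
    "Ran Mouri": {"Gender": "Female", "Date of Birth": "(value)"},
}
cleaned = clean_attributes(raw)
```

Here "Sex" and "Gender" are merged, the stray colon is stripped from "Birthday:", and "Blood Type" is dropped because it appears only once.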

(2) After building the entity set from the crawled data, we filtered out relationship triples whose head or tail entity does not belong to the entity set. We manually split some two-sided relationships into directed ones, such as converting “parent-child” into “father” and “child”. For symmetric relationships, such as “classmates” and “colleagues”, we swapped the head and tail entities to generate new triples, filling in the missing data.
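The filtering and mirroring in step (2) might look like the sketch below. The set of symmetric relations contains only the two examples named above, and the sample triples are illustrative.

```python
# Relations treated as symmetric; only the two named in the text are listed.
SYMMETRIC = {"Classmates", "Colleague"}


def clean_triples(triples, entities):
    """Drop triples that mention unknown entities, then mirror symmetric ones."""
    kept = {(h, r, t) for h, r, t in triples if h in entities and t in entities}
    mirrored = {(t, r, h) for h, r, t in kept if r in SYMMETRIC}
    return sorted(kept | mirrored)


entities = {"Shinichi Kudo", "Ran Mouri", "Kogoro Mouri"}
triples = [
    ("Shinichi Kudo", "Classmates", "Ran Mouri"),
    ("Ran Mouri", "Child", "Kogoro Mouri"),
    ("Ran Mouri", "Friend", "Unknown Person"),  # tail is not in the entity set
]
result = clean_triples(triples, entities)
```

Using a set makes the step idempotent: a mirrored triple that was already present in the crawl is not duplicated.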

(3) We mapped the crawled relationship data onto concepts supported by cnSchema. Because this project deals with the Detective Conan world, some relationships have no counterpart among cnSchema’s existing concepts, so we added new relationship concepts and defined 18 relationships in total:

[‘Frenemies’, ‘Admired’, ‘Junior’, ‘Collaborate’, ‘Teacher-Student’, ‘Parent-Child’, ‘Former Colleagues’, ‘Child’, ‘Classmates’, ‘Senior’, ‘Enemy’, ‘Romantic Partner’, ‘Sibling’, ‘Ex-Partner’, ‘Spouse’, ‘Friend’, ‘Like’, ‘Colleague’]

1.2 Text-based Deep Relation Extraction

In Section 1.1, we used web crawlers to extract semi-structured character information. From the extracted list of character names, we identified the entities in the Detective Conan knowledge graph and built the character-to-character relationships from the crawled semi-structured data. In addition, we attempted to extract character relationships from the episode plot summaries and character introductions crawled in Section 1.1.

Model Training

Following the reference project http://openkg.cn/dataset/a-harry-potter-kg, the deep relation extraction model is built on the Chinese BERT-wwm pre-trained model and trained on the person relationship dataset provided by https://github.com/taorui-plus/OpenNRE. The dataset contains fourteen types of person relationships, such as parent-child, spouse, grandparent, teacher-student, and classmates. The trained model is saved in ./ckpt/people_chinese_bert_softmax.pth.tar.

The training results are as follows:
=== Epoch 0 train ===
100%|██████████████████████████████████████████████████████████████| 3094/3094 [40:12<00:00, 1.28it/s, acc=0.773, loss=0.687]
=== Epoch 0 val ===
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:06<00:00, 2.42it/s, acc=0.934]
Best ckpt and saved.
=== Epoch 1 train ===
100%|██████████████████████████████████████████████████████████████| 3094/3094 [38:17<00:00, 1.35it/s, acc=0.923, loss=0.235]
=== Epoch 1 val ===
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:05<00:00, 2.78it/s, acc=0.972]
Best ckpt and saved.
=== Epoch 2 train ===
100%|██████████████████████████████████████████████████████████████| 3094/3094 [22:43<00:00, 2.27it/s, acc=0.961, loss=0.121]
=== Epoch 2 val ===
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:05<00:00, 2.71it/s, acc=0.986]
Best ckpt and saved.
Best acc on val set: 0.986000
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:06<00:00, 2.64it/s, acc=0.986]
Accuracy on test set: 0.986

After fine-tuning from BERT-wwm, the model reaches an accuracy of 0.986 on the test set of the Chinese person relationship extraction task.

Relation Extraction from Text

(1) Entity Matching
For the input text, we first split it into sentences, and then match the entity names within the text sentences. The entity names in the text can be in the form of full names or aliases. We have crawled the Japanese surname table from http://htmfiles.englishhome.org/Japsurnames/Japsurnames.htm, and for each Chinese name of the person entity, we extract the surname and given name (e.g., “工藤新一” is split into “工藤” and “新一”). For the less common aliases, we manually curate and supplement some prior knowledge. The aliases and their corresponding entity IDs are stored in the ./data/names file.

We match every possible alias, record the position of each entity in the sentence, and store the matched entities and their positions in a list. Because several characters can share a surname, we de-duplicate the matches so that no extracted string is a substring of one extracted earlier.
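A rough sketch of the name splitting and alias matching described above. The entity IDs, the tiny surname set, and the alias table are hypothetical stand-ins for the crawled surname table and the ./data/names file; matching longest aliases first is one way to implement the substring de-duplication.

```python
def split_name(full_name, surnames):
    """Split a Chinese-rendered Japanese name into (surname, given name)."""
    for surname in sorted(surnames, key=len, reverse=True):
        if full_name.startswith(surname) and len(full_name) > len(surname):
            return surname, full_name[len(surname):]
    return None, full_name


def match_entities(sentence, alias_to_id):
    """Return (alias, entity_id, start, end) tuples, end exclusive.

    Longer aliases are tried first, and a match is skipped when it lies
    inside a span that has already been claimed.
    """
    found = []
    for alias in sorted(alias_to_id, key=len, reverse=True):
        start = sentence.find(alias)
        while start != -1:
            end = start + len(alias)
            if not any(s <= start and end <= e for _, _, s, e in found):
                found.append((alias, alias_to_id[alias], start, end))
            start = sentence.find(alias, end)
    return sorted(found, key=lambda m: m[2])


# Hypothetical alias table; "p1" and "p2" are made-up entity IDs.
aliases = {"工藤新一": "p1", "工藤": "p1", "新一": "p1", "毛利兰": "p2", "兰": "p2"}
matches = match_entities("工藤新一和毛利兰是同学", aliases)
parts = split_name("工藤新一", {"工藤", "毛利"})
```

In this example the short aliases “工藤”, “新一”, and “兰” are all suppressed because they fall inside the full-name matches.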

(2) Relationship Prediction
For clauses containing two or more entities, we predict the relationship between every pair of entities appearing in that clause. The final relationship list retains only predictions with a confidence score greater than 0.95.
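The pairwise prediction step can be sketched as below. The infer argument is any callable that returns a (relation, score) pair; the input dictionary mirrors what we understand OpenNRE’s infer method to accept, which should be treated as an assumption. A stub stands in for the trained model so the sketch runs on its own.

```python
from itertools import combinations


def predict_pairs(text, mentions, infer, threshold=0.95):
    """mentions: [(name, start, end)]; keep predictions scoring above threshold."""
    results = []
    for (h_name, h_s, h_e), (t_name, t_s, t_e) in combinations(mentions, 2):
        if h_name == t_name:
            continue  # two mentions of the same character
        item = {"text": text, "h": {"pos": (h_s, h_e)}, "t": {"pos": (t_s, t_e)}}
        relation, score = infer(item)
        if score > threshold:
            results.append((h_name, t_name, relation, score))
    return results


def fake_infer(item):
    # Stand-in for the trained model: fixed answers keyed on the head position.
    return ("friend", 0.99) if item["h"]["pos"][0] == 0 else ("colleague", 0.60)


mentions = [("A", 0, 1), ("B", 3, 4), ("C", 6, 7)]
pairs = predict_pairs("A, B, C.", mentions, fake_infer)
```

A clause with fewer than two mentions yields no pairs, so it produces no predictions.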

Example predictions:

1. Clause data[‘text’]: After Conan Edogawa, he sought the help of Professor Agasa, and when questioned about his name by his childhood friend Ran Mouri, he used the alias Conan Edogawa.
Entities: Ran Mouri t_pos: (23,25)
Shinichi Kudo h_pos: (0,4)
Prediction result: [‘Ran Mouri’, ‘Conan Edogawa’, (‘couple’, 0.9562152028083801)]

2. Clause data[‘text’]: Based on Professor Agasa’s suggestion, Conan Edogawa lived with Ran’s father Kogoro Mouri, solving various cases while secretly investigating the Black Organization.
Entities: Ran (alias of Ran Mouri) t_pos: (19,19)
Kogoro Mouri h_pos: (23,27)
Prediction result: [‘Kogoro Mouri’, ‘Ran’, (‘parent-child’, 0.9952424764633179)]

3. Clause data[‘text’]: Haibara Ai, a character in the Japanese manga “Detective Conan” and its derivative works, is a first-grade student in Class B of Teitan Elementary School, a member of the Detective Boys, and an original scientist of the Black Organization.
Entities: Haibara Ai t_pos: (0,2)
Conan (alias of Shinichi Kudo) h_pos: (15,16)
Prediction result: [‘Haibara Ai’, ‘Conan’, (‘friend’, 0.9857942461967468)]

However, when a clause from a character introduction mentions several entities, the extra mentions interfere with the prediction, and some inferred relationships need to be cleaned manually.

(3) Cleaning Relationship Prediction Results
Based on the model’s prediction results, there may be multiple relationships between the same pair of entities. We determine the relationship between two entities based on the frequency of the predicted relationships and the confidence scores of the predictions. We prioritize the relationship with the highest frequency, and when the frequencies are the same, we choose the relationship with the higher confidence score. Some of the cleaned prediction results are as follows:
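The selection rule above can be sketched as follows. Taking the highest score of each relation as its confidence is an assumption (an average would also fit the description), and the sample predictions are made up.

```python
from collections import defaultdict


def resolve(predictions):
    """predictions: [(head, tail, relation, score)] -> {(head, tail): (relation, score)}

    Prefer the most frequent relation for a pair; break ties by the higher score.
    """
    stats = defaultdict(lambda: defaultdict(list))
    for head, tail, relation, score in predictions:
        stats[(head, tail)][relation].append(score)

    resolved = {}
    for pair, by_relation in stats.items():
        relation, scores = max(
            by_relation.items(), key=lambda kv: (len(kv[1]), max(kv[1]))
        )
        resolved[pair] = (relation, max(scores))
    return resolved


# Made-up predictions for illustration only.
predictions = [
    ("A", "B", "friend", 0.96),
    ("A", "B", "friend", 0.97),
    ("A", "B", "spouse", 0.99),   # higher score, but lower frequency
    ("C", "D", "sibling", 0.96),
    ("C", "D", "friend", 0.98),   # same frequency, higher score wins
]
final = resolve(predictions)
```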

Kyogoku Makoto, Suzuki Sonoko, couple, 0.9990885257720947
Kyogoku Makoto, Ran Mouri, spouse, 0.9921330809593201
Wataru Takagi, Masumi Sera, siblings, 0.9439747333526611
Heiji Hattori, Conan Edogawa, friends, 0.9980144500732422
Heizo Hattori, Heiji Hattori, parent-child, 0.9996293783187866
Kazuha Toyama, Heiji Hattori, friends, 0.982006311416626
Heiji Hattori, Shinichi Kudo, friends, 0.9943497776985168
Kazuha Toyama, Conan Edogawa, parent-child, 0.9980663657188416
Ginshiro Toyama, Kazuha Toyama, parent-child, 0.9994105100631714
Eri Kisaki, Kogoro Mouri, spouse, 0.9956905245780945
Yusaku Kudo, Shinichi Kudo, friends, 0.9704906940460205
Yukiko Kudo, Yusaku Kudo, spouse, 0.9478380084037781

The relationships crawled directly in Section 1.1 are more complete and accurate, while the valid relationships predicted from text are relatively few and less accurate. We therefore manually cleaned the text-extracted results and used them only as a partial supplement, relying mainly on the relationships crawled in Section 1.1 to construct the knowledge graph. The final relationships can be viewed in the file ./data/web_relation_new.csv.

2. Data Storage

Database

We chose the neo4j graph database to store the Detective Conan character relationship knowledge graph.
neo4j version: 4.4.16
JDK version: 11.0.17

Schema
Nodes:
Node type: People
Node properties:
[peopleId:ID(People), Chinese Name:string, Alias:string, Birthday:string, Alma Mater:string, Gender:string, Occupation:string, Student:string, Rank:string, Identity:string, Organization:string, Age:string, Death Day:string, Appeared Works:string, True Identity:string, First Appearance:string]
Edges:
Edge types:
[‘Friend and Foe’, ‘Liked’, ‘Junior’, ‘Cooperate’, ‘Teacher-Student’, ‘Parent-Child’, ‘Former Colleague’, ‘Child’, ‘Classmate’, ‘Senior’, ‘Enemy’, ‘Couple’, ‘Sibling’, ‘Ex’, ‘Spouse’, ‘Friend’, ‘Like’, ‘Colleague’]
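The peopleId:ID(People) property uses the header syntax of neo4j’s CSV bulk import, so the node and edge files may have looked roughly like the output of the sketch below. This is an assumption about the loading step, not a description of the project’s files; the IDs are hypothetical and only three node properties are shown.

```python
import csv
import io

# Header syntax follows the neo4j-admin import CSV convention.
NODE_HEADER = ["peopleId:ID(People)", "Chinese Name:string", "Gender:string", ":LABEL"]
REL_HEADER = [":START_ID(People)", ":END_ID(People)", ":TYPE"]


def write_csv(header, rows, out):
    """Write one import file (header row plus data rows) to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


# Hypothetical ids p1/p2; real ids would come from the entity set.
people = [
    ("p1", "Kogoro Mouri", "Male", "People"),
    ("p2", "Eri Kisaki", "Female", "People"),
]
relations = [("p1", "p2", "Spouse"), ("p2", "p1", "Spouse")]

nodes_out, rels_out = io.StringIO(), io.StringIO()
write_csv(NODE_HEADER, people, nodes_out)
write_csv(REL_HEADER, relations, rels_out)
```

Writing to real files instead of io.StringIO only requires passing an open file created with newline="".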

Data Visualization Display

Global display: match (n) return n

Display Kyogoku Makoto’s related relationships: match (n {`Chinese Name`: 'Kyogoku Makoto'})-[r]->(b) return n,r,b

Display Kyogoku Makoto’s attributes:

Data Storage
We use neo4j to perform a full offline database backup, dumping the database to a single file archive called <database>.dump.
Filename: kenan.dump
Import method:

1. Keep the neo4j service stopped.

2. Go to the bin directory of the neo4j installation directory, or set the environment variable.

3. Execute the command: neo4j-admin.bat load --from=??? --database=kenan --force (replace ??? with the path to the dump file)

4. Modify the neo4j.conf file in the conf directory of the neo4j installation directory, adding the following line:
dbms.default_database=kenan
As shown in the figure:

5. Start the neo4j service, for example by executing the command neo4j.bat console
Note: The commands above use the Windows format. On Linux the .bat suffix is not needed; for example, neo4j.bat console becomes neo4j console.

3. Knowledge Graph Application: Semantic Analysis-based Knowledge Graph Q&A

Introduction

Based on the structured data and graph database mentioned above, we have implemented a Q&A application on the knowledge graph. Users can enter some sentences, and the Q&A program will perform keyword analysis and semantic analysis on the input sentences, search for corresponding results in the knowledge graph database, and finally generate the answer to return to the user.

Specific algorithm flow

Receive the user’s question as input.

Perform keyword analysis on the user’s question. Use the AC automaton algorithm to match and identify whether there are keywords such as character names, relationships, attributes, organizations, etc. in the question.

Classify the question based on the keywords identified in the question.

Determine the question type, and connect to the neo4j graph database based on the corresponding keywords to find the corresponding results.

Return the results obtained from the database to the user.
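The keyword-matching and classification steps can be sketched with a small Aho-Corasick automaton. The keyword table, the category names, and the classification rules are hypothetical; the article does not list the project’s actual rules.

```python
from collections import deque


class AhoCorasick:
    """Multi-keyword matcher: a trie plus failure links built breadth-first."""

    def __init__(self, words):
        self.goto, self.fail, self.out = [{}], [0], [[]]
        for word in words:
            node = 0
            for ch in word:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(word)
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, nxt in self.goto[node].items():
                queue.append(nxt)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] += self.out[self.fail[nxt]]

    def find(self, text):
        """Return (start_index, keyword) for every keyword occurrence in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            hits.extend((i - len(w) + 1, w) for w in self.out[node])
        return hits


def classify(question, matcher, keywords):
    """Map the matched keyword categories to one of the question types."""
    found = {}
    for _, word in matcher.find(question):
        found.setdefault(keywords[word], word)
    if "person" in found and "attribute" in found:
        return "attribute", found
    if "person" in found and "relation" in found:
        return "relation", found
    if "organization" in found:
        return "members", found
    return "unknown", found


# Hypothetical keyword table: keyword -> category.
KEYWORDS = {
    "Ran Mouri": "person", "Shinichi Kudo": "person", "parents": "relation",
    "birthday": "attribute", "Black Organization": "organization",
}
matcher = AhoCorasick(KEYWORDS)
kind, found = classify("Who are Ran Mouri's parents?", matcher, KEYWORDS)
```

The automaton scans the question once regardless of how many keywords are loaded, which is the usual reason for choosing it over repeated substring searches.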

Currently supported question forms

Query character attributes: including Chinese name, first appearance, alias, organization, student, age, gender, alma mater, birthday, appeared works, true identity, occupation, rank. For example: What is Shinichi Kudo’s birthday?

Query character relationship objects: including siblings, ex, colleagues, classmates, liked, cooperated, friends, children, couples, enemies, parents. For example: Who are Ran Mouri’s parents?

Query organization members: including police, Black Organization, etc. For example: Who are the members of the Black Organization?
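Each of the three question forms could be turned into a parameterized Cypher query along the lines below. The query templates and the mapping from question words to edge types are hypothetical; the property and edge names are taken from the schema in Section 2. Relationship types cannot be passed as query parameters in this neo4j version, so the sketch restricts them to a fixed whitelist.

```python
# Hypothetical mapping from question words to edge types in the schema.
RELATION_TYPES = {
    "friends": "Friend",
    "colleagues": "Colleague",
    "classmates": "Classmate",
    "siblings": "Sibling",
}


def build_query(kind, found):
    """Return (cypher, params) for a classified question, or None if unsupported."""
    if kind == "attribute":
        cypher = "MATCH (n:People {`Chinese Name`: $name}) RETURN n[$attr] AS answer"
        return cypher, {"name": found["person"], "attr": found["attribute"]}
    if kind == "relation":
        rel_type = RELATION_TYPES.get(found["relation"])
        if rel_type is None:
            return None
        cypher = (
            "MATCH (n:People {`Chinese Name`: $name})"
            f"-[:`{rel_type}`]->(m:People) "
            "RETURN m.`Chinese Name` AS answer"
        )
        return cypher, {"name": found["person"]}
    if kind == "members":
        cypher = (
            "MATCH (n:People) WHERE n.Organization CONTAINS $org "
            "RETURN n.`Chinese Name` AS answer"
        )
        return cypher, {"org": found["organization"]}
    return None


query, params = build_query(
    "relation", {"person": "Heiji Hattori", "relation": "friends"}
)
```

The returned pair can then be handed to a database session, for example session.run(query, params) with the official neo4j Python driver.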
