Knowledge Graph of ‘Lord of the Mysteries’
1. Project Introduction
This project constructs a knowledge graph for the novel “Lord of the Mysteries.” The data comes mainly from Baidu Encyclopedia and the original text of the novel. The work is divided into four parts:
- Data crawling and preprocessing
- Relationship extraction
- Attribute extraction
- Building a question-answering system
2. Data Crawling and Preprocessing
2.1 Data Crawling from Baidu Encyclopedia
We conducted data crawling for the online novel “Lord of the Mysteries” to create a training corpus containing information about characters, plots, and more. We used the requests_html library to access target web pages and retrieve HTML content. Then, we employed the BeautifulSoup library for in-depth parsing to extract valid information.
The crawling process was as follows. First, we extracted key information from the Baidu Encyclopedia entries related to “Lord of the Mysteries,” such as character names, sequence names, and power names, and recorded each entry’s URL for later access. We then visited each URL to extract that entry’s details. A Baidu Encyclopedia entry typically consists of a summary and a main text, both of which had to be converted into a structured format for subsequent analysis and application.
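As an illustration, a minimal crawler for a single entry might look like the sketch below. The CSS class names (`lemma-summary`, `para`) and the request parameters are assumptions about Baidu Encyclopedia's markup and need to be verified against the live pages.

```python
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()

def fetch_entry(url: str) -> dict:
    """Fetch one Baidu Encyclopedia entry and return its summary and body text."""
    resp = session.get(url, timeout=10)
    soup = BeautifulSoup(resp.html.html, "html.parser")
    # NOTE: the CSS class names below are assumptions; Baidu Encyclopedia's
    # markup changes over time and must be checked against the live page.
    summary = soup.find("div", class_="lemma-summary")
    paragraphs = soup.find_all("div", class_="para")
    return {
        "url": url,
        "summary": summary.get_text(strip=True) if summary else "",
        "body": "\n".join(p.get_text(strip=True) for p in paragraphs),
    }
```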
The crawled data is as follows:
2.2 Building the Entity List
For the entity recognition task on the novel “Lord of the Mysteries,” we adopted a matching-based strategy. Since entity names in web novels often differ significantly from real-world names, and Baidu Encyclopedia provides comprehensive introductions to these entities, we used these existing resources to assist entity recognition.
Specifically, we extracted entity names and types from the crawled data to build an entity list. We noted that the same character in “Lord of the Mysteries” might have multiple different names. For instance, the character “Klein Moretti” may also be referred to as “Merlin Hermes,” “The Fool,” or “Zhou Mingrui.” Therefore, during the construction of the entity list, we needed to unify these different names. To ensure the accuracy of text preprocessing, we also replaced and standardized these different names in the original text to maintain consistency and accuracy in entity recognition.
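The unification step can be as simple as a longest-alias-first string replacement. The alias table below is a small illustrative excerpt, not the full list built from the crawled data.

```python
# Map every alias to one canonical name. This table is illustrative; in the
# project it would be built from the crawled Baidu Encyclopedia entries.
ALIAS_TO_CANONICAL = {
    "Zhou Mingrui": "Klein Moretti",
    "Merlin Hermes": "Klein Moretti",
    "The Fool": "Klein Moretti",
}

def normalize_names(text: str) -> str:
    """Replace every alias in the text with its canonical entity name."""
    # Replace longer aliases first so a short alias does not clobber a
    # longer overlapping one.
    for alias in sorted(ALIAS_TO_CANONICAL, key=len, reverse=True):
        text = text.replace(alias, ALIAS_TO_CANONICAL[alias])
    return text
```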
2.3 Text Data Preprocessing
To conduct effective text analysis on the novel “Lord of the Mysteries,” we first preprocessed the original text to improve the accuracy of subsequent entity recognition and information extraction. The preprocessing steps, sketched in code after the list, include:
- Cleaning the Original Text: We removed content related to the author, as it is usually unrelated to the actual plot and may interfere with later data analysis.
- Name Unification: For the various names of characters in the novel, we standardized them to a single name based on the previously constructed entity list to eliminate ambiguity.
- Sentence Segmentation: We segmented the text based on punctuation (such as periods, question marks, exclamation marks) to break the long text into independent sentence units for easier processing.
- Entity Matching: In the segmented text, we matched based on the entity list and retained only those sentences containing two or more entities. These sentences are more likely to contain key information valuable for building the knowledge graph and entity relationship network.
- Filtering Short Sentences: To ensure the quality of the dataset, we removed overly short sentences, as they often lack sufficient information for subsequent analysis.
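Assuming names have already been unified with `normalize_names` from Section 2.2, the segmentation, entity-matching, and length filters can be combined into a single pass. The entity list, punctuation set, and length threshold below are illustrative.

```python
import re

ENTITY_LIST = ["Klein Moretti", "Alger Wilson", "Beckland"]  # illustrative subset
MIN_SENTENCE_LEN = 8  # assumed threshold for "overly short" sentences

def build_dataset(text: str) -> list[str]:
    """Split name-unified text into sentences and keep those that are long
    enough and mention at least two known entities."""
    sentences = re.split(r"[。！？.!?]", text)  # segment on sentence-ending punctuation
    kept = []
    for s in sentences:
        s = s.strip()
        if len(s) < MIN_SENTENCE_LEN:  # drop overly short sentences
            continue
        if sum(e in s for e in ENTITY_LIST) >= 2:  # keep multi-entity sentences
            kept.append(s)
    return kept
```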
Through these steps, we constructed a preprocessed dataset, laying a solid foundation for the subsequent entity recognition and information extraction tasks. Sample data from the dataset is as follows:
3. Relationship Extraction
3.1 Relationship Extraction Based on LLM
We defined the following relationships among characters, regions, and sequences:
Large language models (LLMs) have strong text understanding and processing capabilities. Using the EasyInstruct framework, we employed an LLM to perform preliminary relationship annotation on the novel dataset. To control API costs, we used OpenAI’s GPT-3.5 Turbo model to annotate 5% of the original text data, producing quadruples of the form (original text, relationship, head entity, tail entity).
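The core of this annotation step, shown here with the OpenAI client directly rather than EasyInstruct's own wrappers, is a relation-constrained prompt plus JSON parsing. The relation names and prompt wording below are illustrative, not the project's actual prompt.

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RELATIONS = ["father_of", "member_of", "located_in"]  # illustrative relation set

PROMPT = (
    "Extract one relation triple from the sentence below. "
    f"Use only these relations: {', '.join(RELATIONS)}. "
    'Reply with JSON only: {"relation": ..., "head": ..., "tail": ...}\n\n'
    "Sentence: "
)

def annotate(sentence: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT + sentence}],
        temperature=0,
    )
    # json.loads raises if the model ignores the requested format -- exactly
    # the kind of output that has to be cleaned out afterwards.
    return json.loads(resp.choices[0].message.content)
```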
However, due to limited context and the “hallucination” problem, an LLM may fail to follow our instructions, for example by returning relationships outside the specified set. In such cases, manual data cleaning is necessary to remove non-conforming entries so they do not affect subsequent training.
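The automatic part of that cleaning reduces to a validity check applied before manual review, e.g.:

```python
RELATIONS = {"father_of", "member_of", "located_in"}  # same illustrative set as above

def is_valid(record: dict) -> bool:
    """Discard hallucinated outputs: relations outside the predefined set,
    or records with an empty head or tail entity."""
    return (
        record.get("relation") in RELATIONS
        and bool(record.get("head"))
        and bool(record.get("tail"))
    )
```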
3.2 Relationship Extraction Based on DeepKE
After the initial LLM annotation and further manual corrections, we obtained a training set suitable for relationship extraction training with DeepKE. Because the training set is small (about 1,000 entries), we fine-tuned a BERT+LSTM model on top of pre-trained BERT weights. Since BERT supports a maximum input length of 512 tokens, we first had to filter out overly long texts.
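The length filter can be applied with the same tokenizer used in training; the checkpoint name below is an assumption (the corpus is Chinese, so a Chinese BERT checkpoint is the natural choice).

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def within_bert_limit(sentence: str, max_len: int = 512) -> bool:
    # add_special_tokens=True counts [CLS] and [SEP] toward the 512 limit
    return len(tokenizer.encode(sentence, add_special_tokens=True)) <= max_len
```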
After training the BERT+LSTM model for 50 epochs, we obtained the final predictive model. The prediction results are as follows:
We predicted all entity pairs in the novel dataset and filtered out predictions with low confidence. The predicted data is stored as (relationship, head entity, head entity type, tail entity, tail entity type).
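A sketch of the confidence filter and the storage step, with an assumed 0.8 cutoff and CSV as an assumed on-disk format; the field order follows the description above.

```python
import csv

CONF_THRESHOLD = 0.8  # assumed cutoff for "low confidence"

def save_predictions(predictions, path="predicted_relations.csv"):
    """predictions: iterable of (relation, head, head_type, tail, tail_type, confidence)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["relation", "head", "head_type", "tail", "tail_type"])
        for rel, head, h_type, tail, t_type, conf in predictions:
            if conf >= CONF_THRESHOLD:  # drop low-confidence predictions
                writer.writerow([rel, head, h_type, tail, t_type])
```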
We then manually cleaned this data further and merged it with the previously annotated training set, obtaining the final relationship data.
4. Attribute Extraction
In the attribute extraction stage, our goal is to extract relevant attributes of characters and sequences from the web novel “Lord of the Mysteries.” These attributes include, but are not limited to, characters’ aliases, gender, and promotion pathways, as well as country names and their capitals. To achieve this goal, we took the following steps (a code sketch follows the list):
- Utilization of Structured Data: We first extracted structured data from the introduction section of Baidu Encyclopedia. Since the introduction typically presents key information in table format, this facilitated our attribute extraction.
- Anomaly Filtering: During the extraction process, we noticed that some attributes might be anomalous or unrelated to the novel’s plot. To ensure data quality and relevance, we filtered and removed these attributes.
- Irrelevant Attribute Filtering: In addition to anomalous attributes, we also needed to remove those attributes that were unrelated to the novel’s plot, such as information about TV adaptations or other derivative works.
- Attribute Dictionary Construction: After filtering and screening, we organized the extracted attributes in a dictionary-like format, with each attribute represented as a key-value pair.
- Writing to JSON File: Finally, we saved the constructed attribute dictionary into a JSON file for ease of subsequent data analysis and application.
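A sketch of the infobox parsing and JSON export: the `basicInfo-item` selectors reflect Baidu Encyclopedia's historical markup and must be verified against the live pages, and the blacklist of plot-irrelevant keys is illustrative.

```python
import json
from bs4 import BeautifulSoup

IRRELEVANT_KEYS = {"TV adaptation", "Publisher"}  # illustrative blacklist

def extract_attributes(html: str) -> dict:
    """Parse the key-value infobox of one encyclopedia entry page."""
    soup = BeautifulSoup(html, "html.parser")
    attrs = {}
    # Selector names are assumptions about Baidu Encyclopedia's markup.
    names = soup.select("dt.basicInfo-item.name")
    values = soup.select("dd.basicInfo-item.value")
    for name, value in zip(names, values):
        key = name.get_text(strip=True)
        if key not in IRRELEVANT_KEYS:  # drop anomalous / irrelevant attributes
            attrs[key] = value.get_text(strip=True)
    return attrs

def save_attributes(attrs: dict, path: str = "attributes.json") -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(attrs, f, ensure_ascii=False, indent=2)
```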
Example of the Content and Structure of Attributes:
```json
{
  "characters": [
    {
      "name": "Klein Moretti",
      "aliases": ["Zhou Mingrui", "The Fool"],
      "gender": "Male",
      "promotion_path": "The Fool pathway"
    },
    {
      "name": "Alger Wilson",
      "aliases": ["The Hanged Man", "The Overthrower"],
      "gender": "Male",
      "promotion_path": "The Tyrant pathway"
    }
  ],
  "countries": [
    {
      "name": "Ruen",
      "capital": "Beckland",
      "official_language": "Ruenese",
      "major_religions": ["Church of the Night Goddess", "Church of the Storm", "Church of Steam and Machinery", "Church of the Earth Mother"],
      "currency": ["Copper Penny", "Silver Suler", "Gold Pound"]
    }
  ]
}
```
This summarizes our work and results in the attribute extraction section. Through these steps, we effectively extracted useful attribute information from the original text and saved it as structured data.
5. Visualization and Knowledge Q&A
We imported the knowledge graph into Neo4j for visualization, with some parts of the graph displayed below.
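One way to load the extracted triples into Neo4j is through the official Python driver; the connection settings, node labels, and relation name below are assumptions to be adapted to the actual graph schema.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def import_triple(tx, head, h_type, rel, tail, t_type):
    # MERGE keeps nodes unique across triples. Labels and relationship types
    # cannot be parameterized in Cypher, so they are interpolated here and
    # should be whitelisted first.
    tx.run(
        f"MERGE (h:`{h_type}` {{name: $head}}) "
        f"MERGE (t:`{t_type}` {{name: $tail}}) "
        f"MERGE (h)-[:`{rel}`]->(t)",
        head=head, tail=tail,
    )

with driver.session() as session:
    session.execute_write(import_triple,
                          "Klein Moretti", "Character", "MEMBER_OF",
                          "Tarot Club", "Organization")
```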
We also implemented a simple question-answering system on top of the graph.
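One possible shape for such a system maps question templates to Cypher queries over the Neo4j graph. The templates, labels, and relation types below are illustrative rather than the project's actual implementation.

```python
import re
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Each template pairs a question pattern with a Cypher query.
TEMPLATES = [
    (re.compile(r"What is the capital of (.+)\?"),
     "MATCH (c:Country {name: $name})-[:CAPITAL]->(x) RETURN x.name AS answer"),
    (re.compile(r"Which pathway does (.+) belong to\?"),
     "MATCH (p:Character {name: $name})-[:PROMOTION_PATH]->(x) RETURN x.name AS answer"),
]

def answer(question: str):
    for pattern, cypher in TEMPLATES:
        m = pattern.match(question)
        if m:
            with driver.session() as session:
                record = session.run(cypher, name=m.group(1)).single()
                return record["answer"] if record else None
    return None  # no template matched
```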