World of Warcraft Character Information Knowledge Graph

EpiK Protocol
Apr 12, 2024

“World of Warcraft” (abbreviated as WoW) is a massively multiplayer online role-playing game (MMORPG) developed by Blizzard Entertainment. The storyline of World of Warcraft begins after the events of “Warcraft III: The Frozen Throne.”

Based on the World of Warcraft wiki, we crawled, cleaned, and extracted relationship information from the game's character data and built a knowledge graph. The basic character information and relationships are stored in JSON format, from which we generated RDF files whose nodes represent characters, locations, organizations, and races. Finally, we imported the data into a Neo4j database for analysis and implemented a simple question-answering function on top of it.

1. Data Collection and Organization

1.1 Data Source
We used the Chinese version of the World of Warcraft Wiki, the user-edited Warcraft database "huijiwiki" (huijiwiki.com), as the primary data source for constructing the knowledge graph. This website is currently the most comprehensive Chinese-language online resource for World of Warcraft, providing information on characters, maps, equipment, and other game content. Among these, the character pages are particularly important, since the characters drive the game's background story. We therefore focused on collecting and organizing data from the 524 official character profiles documented in the wiki.

1.2 Data Crawling
Initially, we browsed multiple character pages on the Chinese World of Warcraft Wiki and discovered that each page contained an overview table with basic character information. This table included details such as the character’s name, titles, level, faction, and other power-related information. Additionally, there were sections dedicated to character relationships, including relatives and companions. The knowledge graph format is well-suited for storing and processing this type of data. Therefore, we chose to crawl, parse, and organize the basic information tables for each character.

A typical character introduction page includes a table on the right side that provides basic information, power-related details, and character relationships. For data crawling, we used the Selenium browser-automation framework and implemented the crawler in Python. The crawler mimics user behavior on the wiki website: it first collects all relevant links to character pages, then visits each page and parses the page elements to locate the character's summary table. The table data is extracted in a semi-structured format of simple key-value pairs (e.g., {"Status": "Deceased (Story)\nDefeatable (In-game)"}) and handed to the program for further parsing and processing. A minimal sketch of this approach follows.
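The sketch below assumes Chrome, a placeholder page URL, and guesses at the infobox markup (the CSS selector and row layout are assumptions); the real crawler also handles link collection and many page variants:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Visit one character page (the URL is a placeholder)
driver.get("https://warcraft.huijiwiki.com/wiki/<character_page>")

# Locate the summary table on the right side of the page;
# the selector and row structure are assumptions about the wiki markup
info = {}
for row in driver.find_elements(By.CSS_SELECTOR, "table.infobox tr"):
    keys = row.find_elements(By.TAG_NAME, "th")
    values = row.find_elements(By.TAG_NAME, "td")
    if keys and values:
        info[keys[0].text] = values[0].text

driver.quit()
print(info)  # e.g. {"Status": "Deceased (Story)\nDefeatable (In-game)"}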

1.3 Data Parsing and Cleaning
First, we identified the attribute categories that may appear in the character tables and classified each character's data into three categories for processing (a parsing sketch follows the list):

Single-valued attributes: These are attributes where a character typically has only one corresponding value, such as gender, age, etc. For these attributes, we parse the textual data based on the attribute’s data type (e.g., string, enumeration, integer).

Multi-valued data: In addition to single-valued attributes, a character may have lists of values for attributes like titles, locations, etc. The formatting of this data in the source web pages is often inconsistent. To handle this, we utilized multiple regular expressions and manually addressed any exceptional cases to organize the character’s multi-valued data.

Relationship data: Building upon the multi-valued data, characters often have data related to interpersonal relationships, affiliations, and other entities. To clean this data, we used regular expressions and manual adjustments to match different names referring to the same entity, achieving a higher level of consistency.
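A minimal sketch of how the three categories might be handled; the delimiter set and the alias table are illustrative, and the real cleaning code additionally handles many exceptional cases by hand:

import re

# Alias table mapping different surface forms to one canonical entity (illustrative)
ALIASES = {
    "Archimonde the Defiler": "Archimonde",
}

def parse_single_valued(text):
    # Single-valued attribute: trim and keep the raw value
    return text.strip()

def parse_multi_valued(text):
    # Multi-valued attribute: split on delimiters commonly seen in the wiki markup
    return [t.strip() for t in re.split(r"[\n,、;/]", text) if t.strip()]

def parse_relationship(text):
    # Relationship data: split like multi-valued data, then unify entity names
    return [ALIASES.get(name, name) for name in parse_multi_valued(text)]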

Finally, we obtained preliminary cleaned data in .json format for further construction of the knowledge graph. Below is an example of the data for one character:

{
    "Name": "Akimond",
    "Former Power": [
        "The Awakened"
    ],
    "Power": [
        "Burning Legion"
    ],
    "Titles": [
        "The Defiler",
        "Eredar Overlord of the Legion Forces",
        "Great One"
    ],
    "Mentors": [
        "Sargeras",
        "Sargil"
    ],
    "Gender": "Male",
    "Status": [
        "Deceased (Story)",
        "Defeatable (In-game)"
    ],
    "Race": [
        "Man'ari (Demon)"
    ],
    "Class": [
        "Warlock"
    ],
    "English Name": "Archimonde",
    "Identity": [
        "Lord of the Burning Legion"
    ],
    "Disciples": [
        "Gul'dan"
    ],
    "Faction": [
        "Neutral"
    ]
}

2. Converting JSON to RDF Format

In order to import the file into a Neo4j database for further processing, we first need to convert it to RDF format.

To begin with, we need to determine which attributes should become nodes. To do this, we sorted all the keys based on their frequency of occurrence:

Name: 524
Race: 523
English Name: 523
Gender: 480
Status: 466
Faction: 429
Location: 424
Power: 417
Title: 308
Identity: 278
Attitude Towards Alliance Players: 249
Attitude Towards Horde Players: 249
Occupation: 220
Level: 151
Former Power: 133
Health Points: 122
Mana Points: 54
Disciple: 45
Mentor: 44

Please note that this is just a partial list of keys and their frequencies.
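These counts can be computed directly from the cleaned JSON file; a minimal sketch (the file name is a placeholder):

from collections import Counter
import json

# Load the cleaned character records ("characters.json" is a placeholder name)
with open("characters.json", encoding="utf-8") as f:
    records = json.load(f)

# Count how often each attribute key appears across all character records
key_counts = Counter(key for record in records for key in record)
for key, count in key_counts.most_common():
    print(f"{key}: {count}")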

Among these keys, enumeration-valued attributes (such as status, gender, and attitude) and single-valued attributes (such as the Chinese name, occupation, level, health points, and mana points) are suitable only as node properties, not as nodes. Keys that occur with low frequency are also unsuitable as nodes.
In the end, we selected "Character," "Race," "Power," and "Location" as node types.
In RDF, they are defined as follows:

group:ShadowAssaultCamp
    relation:name "Shadow Assault Camp" ;
    a relation:Group .

It is worth mentioning that, for convenience, each node's identifier directly reuses the entity's name. This is not ideal, because raw names can contain spaces and punctuation that are illegal in RDF identifiers, so we must sanitize each node's name with fairly aggressive regular expressions during subsequent processing. The relevant code is as follows:

import re

def preprocess_item(item):
    # Drop parenthesized annotations, e.g. "Deceased (Story)" -> "Deceased"
    item = re.sub(r"\(.*?\)", "", item)
    # Remove quote marks and punctuation that are illegal in RDF identifiers
    item = re.sub(r"['’“”‘+\-*/?\[\],&†()\"‧\n]", "", item)
    # Collapse any remaining whitespace
    item = re.sub(r"\s+", "", item)
    return item

Finally, the code used for converting to RDF is recorded in the json2rdf.py file.
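As an illustration of what json2rdf.py does, here is a minimal sketch that turns each cleaned record into Turtle triples. The namespace prefixes and the relation:belongTo predicate are assumptions, and preprocess_item is the cleaning function defined above:

import json

# Hypothetical namespace prefixes; the real ones are defined in json2rdf.py
PREFIXES = (
    '@prefix character: <http://example.org/character/> .\n'
    '@prefix group: <http://example.org/group/> .\n'
    '@prefix relation: <http://example.org/relation/> .\n\n'
)

def record_to_turtle(record):
    # Sanitize the entity name so it forms a legal Turtle identifier
    subject = "character:" + preprocess_item(record["Name"])
    predicates = [f'relation:name "{record["Name"]}"', "a relation:Character"]
    for power in record.get("Power", []):
        # relation:belongTo is an illustrative predicate name
        predicates.append("relation:belongTo group:" + preprocess_item(power))
    return subject + "\n    " + " ;\n    ".join(predicates) + " .\n"

with open("characters.json", encoding="utf-8") as f:
    records = json.load(f)
with open("wow.rdf", "w", encoding="utf-8") as out:
    out.write(PREFIXES)
    for record in records:
        out.write(record_to_turtle(record))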

3. Importing Knowledge Graph into Neo4j Graph Database

To visualize the knowledge graph and use it effectively in subsequent applications, we import it into the Neo4j graph database.
Neo4j is a high-performance NoSQL graph database widely used for storing knowledge graphs and building applications on top of them. Since Neo4j does not provide built-in support for importing RDF graphs, we use the open-source plugin neosemantics for the import.

Neo4j: https://neo4j.com/
neosemantics: https://github.com/neo4j-labs/neosemantics

After creating an empty local database, you only need to run a command to import the graph:

call n10s.rdf.import.fetch("<path of .rdf file>", "Turtle")
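Note that neosemantics needs a one-time setup before the import will run: a uniqueness constraint on resource URIs and an initialized graph configuration. A minimal py2neo sketch (connection details and the file path are placeholders; the constraint uses Neo4j 5 syntax):

from py2neo import Graph

# Connection details are placeholders
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# One-time setup required by neosemantics (Neo4j 5 constraint syntax)
graph.run("CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE")
graph.run("CALL n10s.graphconfig.init()")

# Import the Turtle file generated by json2rdf.py
graph.run('CALL n10s.rdf.import.fetch("file:///wow.rdf", "Turtle")')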

Using the Cypher language, we can perform simple graph queries.

For example, we can use the match keyword to query node information in the graph.

match(n) return count(n) // Query the number of nodes
match (n) where not ()-->(n) return count(distinct n), labels(n) // Query root node information

In this graph, there are a total of 1,897 nodes, of which 379 are root nodes and belong to the Character category.

This knowledge graph thus describes the relationships between "heroes" along multiple dimensions. Taking "Power" nodes as an example, the following figure shows a partial subgraph obtained by querying the relationships between heroes and powers.

Furthermore, by combining Cypher functions, we can selectively retrieve statistical characteristics of nodes in the graph. The following figure shows the nodes sorted by degree.

Using the Cypher query language, we can further implement many more complex graph algorithms for learning and reasoning on the knowledge graph. Neo4j maintains a popular graph algorithm library called the Neo4j Graph Data Science Library.

GDS: https://github.com/neo4j/graph-data-science

Here, we provide two simple examples of using graph algorithms from GDS.

Before using them, we first need to project a named graph from the database; in this case, it contains all of the nodes and relationships, as in the sketch below.
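A sketch of the projection, continuing with py2neo ("wow" is an arbitrary graph name; the UNDIRECTED orientation makes degree count edges in both directions):

# Project all nodes and relationships into an in-memory graph named "wow"
graph.run("""
    CALL gds.graph.project(
        'wow',
        '*',
        {ALL: {type: '*', orientation: 'UNDIRECTED'}}
    )
""")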

Then, we invoke the Degree Centrality algorithm to assess the importance of nodes in the graph based on their degree; both in-degree and out-degree are counted. As can be seen, "Humans" is the most popular node in the graph: many entities are related to humans.
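A sketch of the degree query; the name property key is an assumption about how the RDF name predicate was mapped during import:

# Stream degree scores and list the ten best-connected nodes
top = graph.run("""
    CALL gds.degree.stream('wow')
    YIELD nodeId, score
    RETURN gds.util.asNode(nodeId).name AS name, score
    ORDER BY score DESC
    LIMIT 10
""").data()
print(top)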

Furthermore, through algorithms, we can also uncover hidden information in the graph. For example, using the Common Neighbors algorithm, we can predict whether two nodes are likely to be connected based on their shared neighboring nodes.
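A sketch using the alpha-tier Common Neighbors function from GDS; the node label, property key, and the Antonidas/Jaina pair are illustrative:

# Count shared neighbors between two characters as a link-prediction score
score = graph.run("""
    MATCH (a:Character {name: $a}), (b:Character {name: $b})
    RETURN gds.alpha.linkprediction.commonNeighbors(a, b) AS score
""", a="Antonidas", b="Jaina").evaluate()
print(score)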

In the visualization of the graph, we can see that although the two nodes are not directly connected, the algorithm predicts a potential connection between them through their shared neighbors.

4. Application of Knowledge Graph — Knowledge Question Answering based on Graph Database

Based on the work mentioned above, we have explored the application of knowledge question answering based on a graph database using the knowledge graph.

In actual execution, the user inputs a question (or just keywords); the question-answering algorithm extracts the keywords, searches the graph database, and returns the corresponding results.

The question formats we currently support are as follows:
(1) Querying the attributes of heroes, including race/gender/faction/location/power/title/profession/level.
(2) Querying the relationship objects of heroes, for example, "the apprentice of Antonidas is Jaina." The relationships include relatives/apprentices/mentors/companions.
(3) Querying the hierarchical relationships of location/race/power, including which powers the heroes belong to, which heroes are present in a certain location, etc.

During processing, after obtaining the query text, the algorithm first performs word segmentation on it.

The following entities are identified: attribute words, hero names, power names, race names, location names, affiliation words, and relationship words.

Then, using these entities as keywords in the query, we connect to the Neo4j database using py2neo and perform the query using Cypher statements in the database.

Finally, the returned results are assembled to provide the answer.
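A minimal sketch of this pipeline, assuming jieba for word segmentation and hand-built keyword dictionaries; the entity lists, relationship type names, property key, and the example question are all illustrative:

import jieba
from py2neo import Graph

# Connection details are placeholders
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Illustrative keyword dictionaries; the real system derives these from the graph
HERO_NAMES = {"Antonidas", "Jaina"}
RELATION_WORDS = {"apprentice": "hasDisciple", "mentor": "hasMentor"}  # types are assumptions

def answer(question):
    tokens = jieba.lcut(question)  # word segmentation
    hero = next((t for t in tokens if t in HERO_NAMES), None)
    rel = next((RELATION_WORDS[t] for t in tokens if t in RELATION_WORDS), None)
    if hero is None or rel is None:
        return "Sorry, I could not parse the question."
    # The relationship type and property key depend on how the RDF was imported
    records = graph.run(
        f"MATCH (h {{name: $name}})-[:{rel}]->(x) RETURN x.name AS answer",
        name=hero,
    ).data()
    return [r["answer"] for r in records]

print(answer("Who is the apprentice of Antonidas?"))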
