Knowledge Graph for the 2022 FIFA World Cup in Qatar
This project builds a knowledge graph of the entities (players, national teams, clubs, etc.) and the relationships between them (hierarchical relationships between people and organizations, as well as interpersonal relationships) for the 2022 FIFA World Cup in Qatar.
We first used web crawling techniques to extract all entity names from the Chinese Wikipedia website on the 2022 FIFA World Cup in Qatar to build an entity repository. We then extracted the entity relationships related to each entity in the repository to form the initial knowledge graph triples. Since the text information on the web pages is presented in a semi-structured way, we preprocessed the crawled data to make it conform to the storage standards of the knowledge graph.
To further expand the initially constructed knowledge graph, we first used a pre-trained entity recognition model to extract richer entity names from the player information to expand the entity repository. We then used a pre-trained relationship extraction model to extract the relationships between the existing entities, thereby expanding the entity relationships and constructing the final knowledge graph. After completing the construction of the knowledge graph, we first used the Neo4j graph database to store and visualize the knowledge graph. We then built a simple knowledge question-answering application based on it, using techniques such as natural language part-of-speech tagging and knowledge graph querying.
The main work includes four parts: (1) data acquisition and preprocessing; (2) entity recognition and relationship extraction; (3) knowledge graph import; (4) visualization and knowledge Q&A.
1. Data Acquisition and Preprocessing
This step uses web crawling techniques to obtain information on all players participating in the 2022 FIFA World Cup in Qatar from the Chinese Wikipedia website, as well as various specific known entity relationships; and preprocesses the acquired entities and entity relationships to build an initial knowledge graph and prepare for subsequent deep relationship extraction.
Extracting All Entity Names from Semi-Structured Data
The Chinese Wikipedia pages for the 2022 FIFA World Cup in Qatar provide all entity names in semi-structured form; the data pages look like the figure below:
Features of the page:
1. The list of participating countries is concentrated in a single module, with corresponding Wikipedia hyperlinks.
2. The player list and individual information for each country are in tables under the same module.
By recursively crawling the table contents, we can obtain information on all the players participating in the 2022 FIFA World Cup in Qatar.
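A minimal sketch of this crawling step is shown below; the actual logic lives in crawler.py, and the URL and the column test here are illustrative assumptions:

import requests
import pandas as pd
from io import StringIO

# Hypothetical squads page; crawler.py holds the real target URL.
URL = "https://zh.wikipedia.org/wiki/2022年國際足協世界盃參賽名單"

html = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}).text
tables = pd.read_html(StringIO(html))  # one table per national squad

# Keep only tables whose header mentions players ("球員" is an assumption).
squads = [t for t in tables if "球員" in "".join(map(str, t.columns))]
for squad in squads:
    for _, row in squad.iterrows():
        print(row.to_dict())  # one semi-structured player record per row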
Operational instructions:
python crawler.py
The structured personal information obtained is as follows:
{"Number": "10", "Position": "Forward", "Player Name": "Neymar Júnior", "Date of Birth": "(30 years old)", "Appearances": "121", "Goals": "75", "Club": "Paris Saint-Germain", "Unnamed: 7": nan, "Detailed Information": [{"Full Name": "Neymar da Silva Santos Júnior[1]"}, {"Birthplace": "Mogi das Cruzes, Brazil[1]"}, {"Height": "1.75 m (5 ft 9 in)"}, {"Position": "Left Winger, Attacking Midfielder"}, {"Current Club": "Paris Saint-Germain"}, {"Jersey Number": "10"}, {"Date of Birth": "1992-02-05"}, {"Honors": "Represented Brazil, Runner-up Copa América 2021, Winner Campeonato Brasileiro Série A 2013, Olympic Team 2012 London, Team 2016 Rio de Janeiro, U-20 South American Championship 2011 Peru"}]},
The obtained personal descriptions have the following form:
In 2016, Roma announced the signing of Alisson for a transfer fee of €75 million (around ₣325 million).
On July 19, 2018, Liverpool announced the signing of Alisson for a transfer fee of £66.8 million (€72.5 million), making him Liverpool’s second-most expensive signing, only behind the £75 million signing of Virgil van Dijk, and also making Alisson the second-most expensive goalkeeper in history.
Ederson was born in Osasco, São Paulo, Brazil, and joined the youth academy of local club São Paulo in 2008, where he spent a season.
On June 27, 2015, Ederson joined Benfica, the reigning Primeira Liga champions.
Data Cleaning and Relationship Extraction
Although the raw data has a certain structure, it still needs to be cleaned before further processing. The main data cleaning tasks, sketched in the snippet after this list, include:
- Converting traditional Chinese to simplified Chinese
- Removing English names and special characters, keeping only Chinese text
- Aligning attributes across different files
- Finding missing values and uniformly filling them with -1
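A minimal sketch of these cleaning steps, assuming the opencc package for the script conversion (clean.py implements the actual pipeline):

import re
from opencc import OpenCC  # assumes the opencc-python-reimplemented package

cc = OpenCC("t2s")  # traditional -> simplified Chinese

def clean_value(value):
    # Uniform fill for missing values
    if value is None or value != value:  # catches both None and NaN
        return -1
    text = cc.convert(str(value))
    # Keep only Chinese characters and digits; drop English names and symbols
    text = re.sub(r"[^\u4e00-\u9fa5\d]", "", text)
    return text if text else -1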
The raw data is in the ‘rawdata/’ directory, containing information on 31 teams. Running:
python clean.py
will generate the cleaned data, stored in the ‘newdata/’ directory, including:
- clubs.json: Information on all clubs
- country.json: Information on all countries
- players.json: Information on all players, including attributes such as number, position, name, age, appearances, goals, and current club. Among these attributes, we choose club, country, and position as nodes in the knowledge graph, with the remaining attributes as player node properties.
- positions.json: Information on player positions (defender, goalkeeper, etc.)
- relations.csv: Contains three main relationships: player->work_for->club, player->play_the_role_of->position, and player->come_from->country, connecting the four types of nodes.
Finally, we obtained 744 player nodes, 29 country nodes, 331 club nodes, 14 position nodes, and 2,232 relationships.
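For reference, the rows of relations.csv take roughly the following head, relation, tail form (the column order and these example values are illustrative, not taken from the actual file):

Neymar Júnior,work_for,Paris Saint-Germain
Neymar Júnior,play_the_role_of,Forward
Neymar Júnior,come_from,Brazil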
2. Entity Recognition and Relation Extraction
Since the crawled semi-structured data only contains each player's current club, we also want to include players' historical club affiliations in the knowledge graph. This information usually appears in unstructured text, so we extract the entities and their relations with an entity recognition model.
We also crawled data from the web, obtaining text information for 31 teams in the rawtxt/ directory.
Data Cleaning and Annotation
First, we converted the crawled text from traditional to simplified Chinese, then segmented it into sentences and removed special characters. The processed data still contained a large amount of redundant information, so we manually selected sentences of appropriate length that contained clear player and club names, yielding the sentences in the cleantxt/ directory.
Next, we performed entity recognition. Because the available pre-trained models were trained on general Chinese datasets, they performed poorly at recognizing transliterated foreign names, so we manually annotated train and dev data using annotation tools: 57 sentences for train and 37 for dev. The test data contained all 324 sentences. We added our annotated train data to the original training set to improve the model's ability to recognize foreign names, and replaced the original dev and test sets with our annotated data, using the dev set to monitor training and the test set to obtain the final recognition results.
We annotated data for 10 countries from Argentina to Costa Rica, with the results stored in the label/ directory; running python clean_txt.py will generate the data structure required for model training, which can be found in the nerdata/ directory. Here is an example input:
{"text": ["2", "0", "0", "9", "年", "1", "1", "月", "2", "0", "日", ",", "施", "捷", "斯", "尼", "被", "外", "借", "至", "英", "甲", "球", "会", "宾", "福", "特", ",", "为", "期", "一", "个", "月", "。"], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-NAME", "I-NAME", "I-NAME", "I-NAME", "O", "O", "O", "O", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O", "O", "O", "O"]}
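Conceptually, producing this character-level BIO format from an annotated sentence works as sketched below; clean_txt.py holds the actual implementation, and the (start, end, type) span format is an assumption:

def to_bio(text, spans):
    # spans: list of (start, end, entity_type) tuples, e.g. (12, 16, "NAME")
    labels = ["O"] * len(text)
    for start, end, ent_type in spans:
        labels[start] = "B-" + ent_type       # first character of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + ent_type       # remaining characters
    return {"text": list(text), "label": labels}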
Entity Recognition
For the named entity recognition model, we chose LEBERT, a Chinese NER model that fuses lexical information into the encoder and has achieved good results on multiple Chinese NER datasets. We added our extracted train data to the training dataset and replaced the original dev and test sets, hoping the model would transfer to the sports domain. We started from the model pre-trained on the Resume dataset, since its label set includes B-ORG and I-ORG tags that can effectively identify club names.
We used LEBERT as the base model with a softmax decoder, loaded the bert-base-chinese pre-trained checkpoint from the transformers library, and trained for 30 epochs. On the original dataset, the model achieved an F1 score of 0.96.
On our annotated dev set, the model achieved an F1 score of 0.89, indicating that the model training was effective and the performance was relatively good.
The details of the model training can be found in the LEBERT-NER-CHINESE/README.md file.
We stored the model's output as pred.json and extracted new relationships from it. The model outputs a sequence of labels such as B-NAME; from these we decoded the player and club names in each text and stored them as pairs in the nerdata/raw_relations.txt file, produced by running:
python get_relation_from_ner.py
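Concretely, the decoding amounts to collecting consecutive B-/I- spans and pairing each recognized player name with the club names in the same sentence. A minimal sketch of such decoding follows; the all-pairs pairing heuristic is an assumption, since the manual cleaning step below filters wrong pairs:

def decode_entities(chars, labels):
    # Collect maximal B-X/I-X spans into (type, surface form) tuples
    entities, current, ent_type = [], [], None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):
            if current:
                entities.append((ent_type, "".join(current)))
            current, ent_type = [ch], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == ent_type:
            current.append(ch)
        else:
            if current:
                entities.append((ent_type, "".join(current)))
            current, ent_type = [], None
    if current:
        entities.append((ent_type, "".join(current)))
    return entities

def extract_pairs(chars, labels):
    ents = decode_entities(chars, labels)
    names = [e for t, e in ents if t == "NAME"]
    orgs = [e for t, e in ents if t == "ORG"]
    return [(n, o) for n in names for o in orgs]  # pair every name with every org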
Then, we manually cleaned the extracted new relationships and obtained the final nerdata/final_relations.txt file.
3. Importing the Knowledge Graph
We imported the processed data into the Neo4j database using the py2neo library. We first imported the structured data, including the nodes and edges, mainly through the create_nodes() function in build_graph_wc.py:
def create_nodes(self, label):  # label is one of clubs/country/positions
    print("import {} nodes".format(label))
    with open(os.path.join(self.path, "{}.json".format(label))) as f:
        data = json.load(f)
    for i, d in enumerate(data):
        node = Node(label, name=d)
        self.g.create(node)
        print("{}-th node of {}".format(i, label))
Note that when importing the player nodes, since they carry additional attributes, we imported those as node properties as well.
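A sketch of this step, assuming players.json maps each player name to a dict of attributes (the exact keys are assumptions):

# Sketch of importing player nodes with their attributes as properties;
# the structure of players.json is an assumption.
def create_player_nodes(self):
    with open(os.path.join(self.path, "players.json")) as f:
        players = json.load(f)
    for name, attrs in players.items():
        node = Node("players", name=name,
                    age=attrs.get("age", -1),
                    number=attrs.get("number", -1),
                    appearance=attrs.get("appearance", -1),
                    goal=attrs.get("goal", -1))
        self.g.create(node)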
Next, we imported the edges by reading the relations file and finding the two endpoint nodes of each relationship:
def create_edges(self):
    data = pd.read_csv(os.path.join(self.path, "relations.csv"))
    data = data.values.tolist()
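The rest of create_edges() looks up the two endpoint nodes of each triple and connects them; a minimal sketch using py2neo's NodeMatcher (the CSV column layout is an assumption):

from py2neo import NodeMatcher, Relationship

# Sketch of creating one edge; assumes each row provides the node labels
# alongside the head and tail names.
def create_one_edge(self, head, head_label, rel_type, tail, tail_label):
    matcher = NodeMatcher(self.g)
    h = matcher.match(head_label, name=head).first()
    t = matcher.match(tail_label, name=tail).first()
    if h is not None and t is not None:
        self.g.create(Relationship(h, rel_type, t))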
Finally, we imported the relationships obtained from entity recognition. We first checked whether the players and clubs already existed in the knowledge graph, creating them if not; then we created the "used_to_work_for" relationship:
def create_txt_realtions(self):
    with open(os.path.join(self.path, "players.json")) as f:
        players = json.load(f)
    with open(os.path.join(self.path, "clubs.json")) as f:
        clubs = json.load(f)
    with open(os.path.join("nerdata/final_relations.txt")) as f:
        rel = f.read().split("\n")
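The remainder of the function performs the existence check and creates the edges; a minimal sketch, assuming each line of final_relations.txt holds a comma-separated player/club pair:

# Sketch of the existence check and edge creation; the "player,club"
# line format is an assumption.
matcher = NodeMatcher(self.g)
for line in rel:
    if not line.strip():
        continue
    player, club = line.split(",")
    p = matcher.match("players", name=player).first()
    if p is None:                       # create the player node if missing
        p = Node("players", name=player)
        self.g.create(p)
    c = matcher.match("clubs", name=club).first()
    if c is None:                       # create the club node if missing
        c = Node("clubs", name=club)
        self.g.create(c)
    self.g.create(Relationship(p, "used_to_work_for", c))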
The process of importing the knowledge graph can be executed by running:
python build_graph_wc.py
This will take approximately 90 minutes.
4. Knowledge Graph Results Visualization
We now display the resulting knowledge graph. The overall visualization is as follows:
The visualization of the top 25 player nodes is as follows:
The visualizations of all positions and countries are as follows:
Positions
Countries
The relationship graph between players and clubs is as follows:
The relationship graph between players and positions is as follows:
The relationship graph between players and countries is as follows:
5. Knowledge Question Answering
After importing the knowledge graph, we built a knowledge question-answering system on top of it using Cypher queries against the existing relationships. First, we use the Jieba and PaddlePaddle libraries to perform part-of-speech tagging and entity recognition on the user's Chinese question. We take the recognized person entity as the player to query, and the noun keywords as the attributes or relationships to look up. For example, the word-segmentation and query-construction code for querying player attributes is as follows:
def cut_words(self, sentence):
    # POS-tag the question; paddle mode enables NER tags such as PER
    words_flags = pseg.cut(sentence, use_paddle=True)
    person = ''
    words = []
    for word, flag in words_flags:
        if flag == 'PER':  # the recognized person entity is the player
            person = word
        if flag == 'n':    # map each noun keyword to a graph property name
            words.append(self.neo.similar_words[word])
    logging.debug(str(words))
    return person, words

def answer(self, sentence):
    try:
        # strip English letters, digits, and punctuation from the question
        sentence = re.sub(r"[A-Za-z0-9!%\[\],。]", "", sentence)
        sentence = re.sub(r'\W+', '', sentence).replace("_", '')
        person, words = self.cut_words(sentence)
        # assemble relationship-path fragments (used for relation queries)
        words_ = [0 for i in range(len(words))]
        for i in range(len(words)):
            words_[i] = '-[r' + str(i) + ':' + words[len(words)-i-1] + ']'
            if i != len(words)-1:
                words_[i] += '->(n' + str(i) + ':Person)'
        # build the Cypher query for the requested player attribute
        query = 'match (p:players{name: "' + person + '"}) ' + \
                'return p.' + words[0]
    except Exception:
        return 'No Answer'
    try:
        data = self.neo.graph.run(query)
        data = list(data)[0]
        logging.debug(str(data))
        result = person + "'s " + words[0] + " is " + str(data["p." + words[0]])
        return result
    except Exception:
        return 'No Answer'
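Assuming the two methods above live on a wrapper class, called QASystem here purely for illustration, that holds the Neo4j connection in self.neo, usage would look like:

qa = QASystem()                       # hypothetical wrapper class
print(qa.answer("克里斯滕森的年龄"))    # e.g. -> "克里斯滕森's age is 26"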
Some sample questions, outputs, and the corresponding Cypher statements are as follows:
(1) User: What is the age of Andreas Christensen?
Answer: Andreas Christensen’s age is 26
Cypher:
match (p:players{name: "Andreas Christensen"}) return p.age
(2) User: How many goals has Andreas Christensen scored?
Answer: Andreas Christensen has scored 2 goals
Cypher:
match (p:players{name: "Andreas Christensen"}) return p.goal
(3) User: What is the position of Andreas Christensen?
Answer: Andreas Christensen’s position is defender
Cypher:
match (p:players{name: "Andreas Christensen"})-[:play_the_role_of]->(q:positions) return q.name
(4) User: What is the country of Andreas Christensen?
Answer: Andreas Christensen’s country is Denmark
Cypher:
match (p:players{name: "Andreas Christensen"})-[:come_from]->(q:country) return q.name
(5) User: What is the club of Andreas Christensen?
Answer: Andreas Christensen’s club is Barcelona
Cypher:
match (p:players{name: "Andreas Christensen"})-[:work_for]->(q:clubs) return q.name