Innovation and Practices of Knowledge Graph in the Era of Multimodal Big Data

EpiK Protocol
Mar 26, 2024

Since Google first proposed the concept of a knowledge graph in 2012, major internet companies and research institutions have recognized its importance, placing it on par with deep learning and conducting in-depth research as a critical technology for cognitive intelligence. So, what can a knowledge graph do?

In short, there are two main aspects: first, it enables machines to better understand data, and second, it helps machines to better explain phenomena. In the context of our topic, knowledge graphs have also started to integrate with various perceptual intelligence technologies, such as image recognition and speech recognition, among other deep learning techniques. Furthermore, the dynamic and multimodal nature of the knowledge graph is emerging as a trend.

The lifecycle of a knowledge graph consists of three main parts:

  1. Sourcing knowledge and constructing the knowledge graph efficiently.
  2. Utilizing knowledge: building on existing knowledge, how do we discover implicit knowledge and generate additional value?
  3. Applying the completed knowledge graph at scale across industries and internet applications, enabling intelligent capabilities.

Figure 2: Comparative Analysis of Methods for Constructing Multimodal Knowledge Graphs

Traditionally, the construction of knowledge graphs, especially multimodal ones, involves extracting information from different modalities separately and then integrating them into a final multimodal graph, as shown in the left half of Figure 2. Specifically, information extraction is performed based on text and images, and the resulting specific modality graphs are fused to form a multimodal knowledge graph.

However, this approach has certain limitations. Because each modality is processed separately, dependencies and correspondences between the modality features are never considered at the source, so the fused result fails to accurately capture the various associations inherent in the multimodal data. To address this, we propose a further advancement in which the knowledge graph possesses multimodal characteristics from the outset. The constructed multimodal knowledge graph can assist in understanding multimodal data, support tasks such as visual relationship recognition and cross-modal entity linking, and be further applied in question answering, search, visual analytics, and decision support.

From Knowledge Graph to Multimodal Graph

How can we extend a traditional knowledge graph into a multimodal graph? For each entity or concept in the graph, we associate corresponding images, aiming to collect content from different angles, perspectives, and themes to better characterize multimodal knowledge, especially visual relationships. Initially, the associated images may be few, so we further employ approximate k-nearest-neighbor search to expand the image set. This ensures both relevance and diversity, resulting in a more comprehensive representation of the corresponding graph nodes.
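As a minimal sketch of the expansion step (assuming precomputed image embeddings; scikit-learn's NearestNeighbors stands in for a production approximate-KNN index such as FAISS, and all names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def expand_image_set(seed_embeddings, candidate_embeddings, k=5):
    """Expand a small seed set of entity images with visually similar
    candidates, approximating the k-nearest-neighbor expansion step.

    seed_embeddings:      (n_seed, d) embeddings of images already linked
                          to the entity.
    candidate_embeddings: (n_cand, d) embeddings of unlabeled images.
    Returns indices of candidate images to add to the entity's image set.
    """
    index = NearestNeighbors(n_neighbors=k, metric="cosine")
    index.fit(candidate_embeddings)
    # Query with every seed image and pool the neighbors.
    _, neighbor_ids = index.kneighbors(seed_embeddings)
    # Deduplicate so diverse seeds contribute distinct candidates.
    return sorted(set(neighbor_ids.ravel().tolist()))

# Hypothetical usage with random vectors standing in for CNN features.
rng = np.random.default_rng(0)
seeds = rng.normal(size=(3, 128))
candidates = rng.normal(size=(1000, 128))
print(expand_image_set(seeds, candidates, k=5))
```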

Figure 3: From KG to Multimodal KG: Image Selection and Expansion Strategies

Long-Tail Visual Relationship Recognition

Figure 4: Long-Tail Nature of Visual Relationships and Visual Relationship Detection

Given an image, we can identify multiple objects, and further, we can detect the relationships between different objects.

As shown in Figure 4, different objects are represented by different colored bounding boxes:

In the second image, the red bounding box represents a person, the green bounding box represents a motorcycle, and the visual relationship between them is “person-on-motorcycle.”

In the third image, the green bounding box represents a helmet, and the visual relationship is “person-wear-helmet,” indicating that the person is wearing a helmet.

In the last image, the red bounding box represents a motorcycle, and the green bounding box represents a wheel. The visual relationship detected is “motorcycle-has-wheel.”
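The detector's output is naturally expressed as subject-predicate-object triples grounded in bounding boxes. A minimal sketch of such a structure, with illustrative coordinates:

```python
from dataclasses import dataclass
from typing import Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)

@dataclass
class VisualRelation:
    subject: str        # e.g. "person"
    predicate: str      # e.g. "on", "wear", "has"
    obj: str            # e.g. "motorcycle"
    subject_box: BBox   # bounding box of the subject
    object_box: BBox    # bounding box of the object

# The three relations described in Figure 4, with invented boxes.
relations = [
    VisualRelation("person", "on", "motorcycle", (40, 20, 180, 300), (30, 120, 220, 380)),
    VisualRelation("person", "wear", "helmet", (40, 20, 180, 300), (70, 20, 130, 80)),
    VisualRelation("motorcycle", "has", "wheel", (30, 120, 220, 380), (60, 280, 140, 380)),
]
for r in relations:
    print(f"{r.subject}-{r.predicate}-{r.obj}")
```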

Figure 5: Typical Techniques for Visual Relationship Detection

Visual relationship detection is a critical component of visual scene understanding. However, due to the sparsity of visual relationships, effectively predicting the many long-tail relationships is often challenging. A work published at the top computer vision conference CVPR 2017 introduced VTransE, which extends the classic translation-based representation learning method TransE using knowledge graph embedding techniques. VTransE maps the visual feature space of images into a relation space in which the sum of the head-entity vector and the relation vector lies close to the tail-entity vector.
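A minimal numpy sketch of the translation idea behind VTransE, under the assumption that subject and object visual features are projected into a relation space where head + relation ≈ tail; the matrices, dimensions, and features here are random stand-ins, not the published model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_rel = 512, 100            # visual / relation-space dimensions (illustrative)
W_s = rng.normal(size=(d_rel, d_vis)) * 0.01   # projection for subject features
W_o = rng.normal(size=(d_rel, d_vis)) * 0.01   # projection for object features
r_on = rng.normal(size=d_rel)                  # learned vector for relation "on"

def transe_score(subj_feat, rel_vec, obj_feat):
    """Lower score = more plausible triple: || W_s h + r - W_o t ||."""
    return np.linalg.norm(W_s @ subj_feat + rel_vec - W_o @ obj_feat)

person_feat = rng.normal(size=d_vis)   # CNN feature of the "person" box
bike_feat = rng.normal(size=d_vis)     # CNN feature of the "motorcycle" box
print(transe_score(person_feat, r_on, bike_feat))
```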

While VTransE is simple to implement, it inherits TransE's limitations when dealing with one-to-many or many-to-many relationship predictions. Subsequent work presented at CVPR 2019 addressed this issue by learning dynamic tree structures to capture visual context and predicting visual relationships based on that context, which partially alleviates the difficulty of detecting long-tail relationships.

Figure 6: Long-Tail Visual Relationship Recognition: Our Approach

Building upon the work above, we employ a multimodal graph-based approach to further improve the recognition of long-tail visual relationships. First, in scenarios where features are extremely sparse, we exploit interactions among features from different modalities to expand the feature space. Second, we leverage similarity graphs built from objects or relationships across different images and employ message passing to alleviate data-level sparsity.
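A minimal sketch of the message-passing step, assuming a precomputed similarity graph over object or relation instances from different images; the smoothing scheme here is a generic propagation rule, not our exact model:

```python
import numpy as np

def message_passing(features, adjacency, steps=2, alpha=0.5):
    """Propagate features over a similarity graph so that rare, long-tail
    instances borrow evidence from similar instances in other images.

    features:  (n, d) representations of objects/relations across images.
    adjacency: (n, n) nonnegative similarity weights between instances.
    alpha:     how much each node mixes in its neighbors' messages.
    """
    # Row-normalize so each node averages over its neighbors.
    row_sums = adjacency.sum(axis=1, keepdims=True)
    norm_adj = adjacency / np.maximum(row_sums, 1e-8)
    h = features
    for _ in range(steps):
        h = (1 - alpha) * h + alpha * (norm_adj @ h)
    return h
```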

Cross-Modal Entity Linking

Figure 7: Typical Architecture for Cross-Modal Entity Linking

Entity linking is widely used in the intelligent processing of text. In the text on the left side of Figure 7 (a), for the highlighted mention “Michael Jordan,” the task is to automatically disambiguate it and link it to either the basketball legend or the renowned expert in machine learning and statistics. We refer to this task as entity linking. Figure 7 (b) extends entity linking to a multimodal scenario: given an image and its corresponding textual description, the goal is to automatically link the objects in the image to the entities they refer to.

For cross-modal entity linking, different neural networks are typically employed for each input: Convolutional Neural Networks (CNNs) for images, and bidirectional LSTMs or their variants for textual descriptions and the mentions to be linked. The mention representations obtained from these networks, refined by modality-specific attention mechanisms, are combined with candidate entity representations derived from the graph structure and entity label descriptions; semantic matching and ranking over these representations then completes the cross-modal entity linking.
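A compact PyTorch sketch of this typical architecture, with all dimensions and layer choices illustrative rather than those of any published system:

```python
import torch
import torch.nn as nn

class CrossModalLinker(nn.Module):
    """Sketch of the architecture described above: a CNN encodes the image,
    a BiLSTM encodes the textual description, attention pools the tokens,
    and the fused mention representation is matched against candidate
    entity embeddings."""

    def __init__(self, vocab_size=10000, emb_dim=128, hid=128, ent_dim=256):
        super().__init__()
        self.img_encoder = nn.Sequential(            # stand-in for a pretrained CNN
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, hid),
        )
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.text_encoder = nn.LSTM(emb_dim, hid // 2,
                                    bidirectional=True, batch_first=True)
        self.attn = nn.Linear(hid, 1)                # attention over tokens
        self.fuse = nn.Linear(2 * hid, ent_dim)

    def forward(self, image, tokens, candidate_entities):
        img_vec = self.img_encoder(image)                        # (B, hid)
        txt, _ = self.text_encoder(self.word_emb(tokens))        # (B, T, hid)
        weights = torch.softmax(self.attn(txt).squeeze(-1), -1)  # (B, T)
        txt_vec = (weights.unsqueeze(-1) * txt).sum(dim=1)       # (B, hid)
        mention = self.fuse(torch.cat([img_vec, txt_vec], -1))   # (B, ent_dim)
        # Semantic matching: dot-product scores against candidates (B, C, ent_dim).
        return torch.einsum("bd,bcd->bc", mention, candidate_entities)
```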

Figure 8: Semantic Visual Entity Linking Based on Cross-Modal Interaction Learning

We further consider the relationships between different modalities. When extracting modal features, we consider the correlations between different visual objects in the image, forming a scene graph. At the same time, we employ state-of-the-art (SOTA) models to extract the named entities contained within the textual description. These named entities serve as candidate options for subsequent linking.
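The text does not tie us to a particular extractor; as an illustration, an off-the-shelf spaCy pipeline can produce such candidate mentions:

```python
import spacy

# An off-the-shelf pipeline standing in for the SOTA extractor mentioned above.
nlp = spacy.load("en_core_web_sm")

caption = "Michael Jordan drives past a defender during the NBA finals."
doc = nlp(caption)

# Named entities become candidate options for cross-modal linking.
candidates = [(ent.text, ent.label_) for ent in doc.ents]
print(candidates)  # e.g. [('Michael Jordan', 'PERSON'), ('NBA', 'ORG')]
```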

Additionally, we take into account the attention mechanism for modality, allowing for the consideration of both textual and visual features during the selection process.

Now that we have discussed the construction of a multimodal graph, what can we do with the acquired knowledge? To discover implicit knowledge, an important task is knowledge reasoning, which involves inferring new knowledge or facts based on existing knowledge. Generally, there are four types of knowledge reasoning:

1. The first type is deductive reasoning, which is based on symbolic logic and derives conclusions from premise conditions (a minimal sketch follows this list).

2. The second type is inductive reasoning, which infers general principles or mechanisms from limited observed phenomena. Various machine-learning techniques fall into this category.

3. The third type is abductive reasoning, which infers causes from observed results. It is often used for problem localization and root-cause analysis in fault detection and diagnosis.

4. The fourth type is analogical reasoning, which maps and aligns different objects or spaces. It is widely used in tasks such as textual entailment and semantic similarity computation.
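As a toy illustration of the first (deductive) type, a minimal forward-chaining rule applied to knowledge-graph triples; the facts and rule are invented for the example:

```python
# Facts as (head, relation, tail) triples.
facts = {("alice", "parent_of", "bob"), ("bob", "parent_of", "carol")}

def deduce_grandparents(facts):
    """Deductive rule: parent_of(x, y) and parent_of(y, z) => grandparent_of(x, z)."""
    derived = set()
    for (x, r1, y1) in facts:
        for (y2, r2, z) in facts:
            if r1 == r2 == "parent_of" and y1 == y2:
                derived.add((x, "grandparent_of", z))
    return derived

print(deduce_grandparents(facts))  # {('alice', 'grandparent_of', 'carol')}
```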

Figure 10: Neural Network Approaches for Knowledge Graph Reasoning

Deep learning methods have increasingly been applied to knowledge reasoning tasks. First, knowledge graphs or knowledge bases are often incomplete; in such cases, we aim to expand the graph. Knowledge graph representation learning and, more recently, graph neural networks have been used for this completion task.
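A minimal sketch of embedding-based completion: given learned entity and relation embeddings, score every candidate tail for an incomplete (head, relation, ?) query and rank them. DistMult is used here as one representative scoring function, and the embeddings are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
entities = ["beijing", "china", "paris", "france"]
relations = ["capital_of"]
E = {e: rng.normal(size=64) for e in entities}     # stand-in entity embeddings
R = {r: rng.normal(size=64) for r in relations}    # stand-in relation embeddings

def distmult(h, r, t):
    """DistMult plausibility score: sum_i h_i * r_i * t_i."""
    return float(np.sum(E[h] * R[r] * E[t]))

def complete(head, relation):
    """Rank candidate tails for the incomplete triple (head, relation, ?)."""
    scores = {t: distmult(head, relation, t) for t in entities if t != head}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(complete("beijing", "capital_of"))
```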

Moreover, architectures such as recurrent neural networks with attention mechanisms, hierarchical graph convolutional networks, and their extensions are widely used in complex knowledge question-answering tasks that require multi-hop reasoning. However, these methods rest on statistical inference, remain limited to shallow reasoning, and cannot cover the full range of logical deductions. This limitation also entails a trade-off with interpretability.

While neural network approaches provide powerful tools for knowledge graph reasoning, they often prioritize performance over explainability. As a result, there is ongoing research and development to address the trade-off between the expressiveness and interpretability of such models.

Figure 11: Neural Network Approaches for Knowledge Graph Reasoning (Continued)

Furthermore, many works have designed neural networks to perform specific logical reasoning or axiom-proving tasks (as shown on the right side of Figure 11). These statistically learned models, which preserve semantic equivalence with their symbolic counterparts, can be further integrated into knowledge graph management systems, enabling support for both precise logical calculation and data-driven, probabilistic inference (as shown on the left side of Figure 11).

Deep learning methods often require large amounts of data, yet even in multimodal problems we frequently face limited data and significant sparsity. How can knowledge and graph structure help address these challenges? Beneficial attempts include using knowledge graphs for data augmentation and transfer learning through distant supervision, and supporting representation learning with more expressive forms of knowledge, such as rules. These approaches help knowledge graphs support deep learning in limited and sparse data scenarios.
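A minimal sketch of the distant-supervision idea: any sentence mentioning an entity pair that the knowledge graph already relates is weakly labeled as a training example for that relation. The triples, aliases, and sentences are invented:

```python
# KG triples provide the (weak) labels.
kg = {("steve_jobs", "founder_of", "apple"), ("bill_gates", "founder_of", "microsoft")}
aliases = {"steve_jobs": "Steve Jobs", "apple": "Apple",
           "bill_gates": "Bill Gates", "microsoft": "Microsoft"}

corpus = [
    "Steve Jobs unveiled the first Apple computer in 1976.",
    "Bill Gates stepped down from Microsoft's board in 2020.",
]

def distant_supervision(corpus, kg):
    """Label any sentence containing both entities of a KG triple."""
    examples = []
    for sentence in corpus:
        for (h, rel, t) in kg:
            if aliases[h] in sentence and aliases[t] in sentence:
                examples.append((sentence, h, t, rel))  # weakly labeled sample
    return examples

for ex in distant_supervision(corpus, kg):
    print(ex)
```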

Figure 12: Using Knowledge Graph for Interpreting Intermediate Results in Neural Networks

As mentioned earlier, interpretability is crucial for cognitive intelligence. To better utilize neural network models in various decision-making tasks, the intermediate results obtained through nonlinear transformations can be decoded and mapped to corresponding nodes in the knowledge graph, facilitating better human understanding.
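A minimal sketch of this decoding step: an intermediate activation is mapped to the knowledge-graph nodes whose embeddings are most similar, yielding human-readable labels for an otherwise opaque vector. The node embeddings here are random stand-ins, and in practice a learned projection would align the two spaces:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings for KG nodes, assumed to share the hidden space.
kg_nodes = {name: rng.normal(size=32) for name in ["dog", "cat", "car", "tree"]}

def explain(hidden_state, kg_nodes, top_k=2):
    """Return the KG nodes closest (by cosine similarity) to a hidden state."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cosine(hidden_state, emb) for name, emb in kg_nodes.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

hidden = rng.normal(size=32)  # an intermediate activation from some layer
print(explain(hidden, kg_nodes))
```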

In addition to knowledge reasoning, another typical application in multimodal scenarios is question answering. Question answering has evolved through several stages, from retrieval-based QA in the 1990s to community-based QA and, more recently, knowledge-based QA in personal assistants and various industries. We aim to integrate these complementary technologies to support multi-strategy question answering over different data types.

Specifically, as shown in Figure 13, retrieval-based question-answering techniques (IRQA) can be used for question-answer pairs. Knowledge-based question-answering (KBQA) can be applied to well-structured graph data. Machine reading comprehension-based question-answering (MRCQA) can be utilized for text or corpora data.
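A minimal sketch of such multi-strategy dispatch; the routing heuristics and handler names are illustrative, not a production design:

```python
def irqa(question):  return f"[IRQA] retrieve a matching Q&A pair for: {question}"
def kbqa(question):  return f"[KBQA] query the knowledge graph for: {question}"
def mrcqa(question): return f"[MRCQA] read documents to answer: {question}"

def answer(question, has_qa_pair, has_kg_entity, has_documents):
    """Dispatch in priority order; fall through when a source can't answer."""
    if has_qa_pair:            # curated question-answer pairs
        return irqa(question)
    if has_kg_entity:          # well-structured graph data
        return kbqa(question)
    if has_documents:          # free text / corpora
        return mrcqa(question)
    return "No strategy available."

print(answer("Who founded Apple?",
             has_qa_pair=False, has_kg_entity=True, has_documents=True))
```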

With the popularity of pretrained models, from word2vec/GloVe to more context-aware models such as ELMo, GPT, and BERT, we can train models on large-scale general-purpose corpora and then fine-tune them with a smaller amount of task- and domain-specific data, such as question-answering pairs. This transfer-learning approach lets us leverage pretrained models and achieve better performance on downstream tasks like question answering.
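As an illustration of reusing a pretrained model, the Hugging Face transformers library (our choice here; the text names no toolkit) can load a BERT-style model already fine-tuned for extractive QA:

```python
from transformers import pipeline

# Load a pretrained extractive-QA model; fine-tuning on domain-specific
# question-answer data would follow the same pretrain-then-adapt recipe.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(
    question="When was the knowledge graph concept proposed?",
    context="Google first proposed the concept of a knowledge graph in 2012.",
)
print(result["answer"])  # expected: "2012"
```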

Figure 13: Typical Question-Answering Data and Technical Paradigms

Moreover, each question-answering technique has its own prerequisites, applicable range of problems, and corresponding advantages and limitations (as described in detail in Figure 14). To build a practical question-answering system for real-world scenarios, a multi-strategy approach is needed to integrate the strengths of the different techniques.

Here are a few examples of typical applications of multimodal knowledge graphs by EpiK Protocol:

Financial Securities Domain

A typical application in the financial securities domain is identifying ultimate beneficial owners. By integrating data from various sources, especially multimodal data scattered across different locations, we can uncover the connections and traces between them and ultimately identify hidden ultimate beneficial owners. The same approach also applies to credit risk assessment and detecting related-party transactions.
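A minimal sketch of the underlying computation: multiply ownership percentages along every path of a shareholding graph and aggregate them per natural person. The graph below is invented and assumed acyclic:

```python
# company/person -> list of (owner, ownership fraction); invented data.
ownership = {
    "TargetCo": [("HoldCo A", 0.6), ("HoldCo B", 0.4)],
    "HoldCo A": [("Alice", 1.0)],
    "HoldCo B": [("HoldCo A", 0.5), ("Bob", 0.5)],
}

def ultimate_owners(entity, stake=1.0, result=None):
    """Accumulate each natural person's effective stake along all paths."""
    if result is None:
        result = {}
    for owner, fraction in ownership.get(entity, []):
        effective = stake * fraction
        if owner in ownership:                 # intermediate company: recurse
            ultimate_owners(owner, effective, result)
        else:                                  # natural person: accumulate
            result[owner] = result.get(owner, 0.0) + effective
    return result

print(ultimate_owners("TargetCo"))  # {'Alice': 0.8, 'Bob': 0.2}
```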

Industrial Internet

A common application in the industrial internet is fault detection in power systems. This involves multidisciplinary knowledge and multimodal data, where various computations and neural networks produce empirical-formula results. Techniques such as causal reasoning can then identify anomalies and address the corresponding fault detection and classification problems, making it possible to discover potential causes and recommend relevant detection strategies.
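A minimal sketch of the causal step: given a hand-built causal graph over fault symptoms, walk upstream from an observed anomaly to enumerate candidate root causes. The graph is invented:

```python
# effect -> possible direct causes; a hand-built causal graph (invented).
causal_graph = {
    "transformer_overheat": ["cooling_failure", "overload"],
    "overload": ["line_fault", "demand_spike"],
    "cooling_failure": ["pump_failure"],
}

def root_causes(symptom, graph):
    """Walk the causal graph upstream from an observed anomaly and
    return the leaf-level candidate root causes."""
    causes = graph.get(symptom)
    if not causes:                      # no further upstream cause: a root
        return {symptom}
    found = set()
    for cause in causes:
        found |= root_causes(cause, graph)
    return found

print(root_causes("transformer_overheat", causal_graph))
# {'pump_failure', 'line_fault', 'demand_spike'}
```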

In the era of artificial intelligence, multimodal knowledge graphs will play an increasingly significant role in fields such as finance, customer service, education, and healthcare.
