Enabling Machines to Understand Human Language by Knowledge Graphs

Can machines think like humans?


Language is the tool of thinking

It is the ability to speak and understand language that distinguishes us from animals.
Enabling machines to understand human language is the essential path to intelligent information processing and a smart robot brain.

Obstacles to machine language understanding
• Language understanding by machines needs knowledge bases that are
• Large scale
• Semantically rich
• Friendly structured
• Traditional knowledge representations cannot satisfy these requirements
• Ontologies
• Semantic networks
• Texts

Knowledge Graph

• A knowledge graph is a large-scale semantic network consisting of entities/concepts as well as the semantic relationships among them
• Higher coverage of entities and concepts
• Richer semantic relationships
• Usually organized as RDF
• Quality assurance by crowdsourcing
• Why knowledge graphs?
• Understanding the semantics of text needs background knowledge
• A robot brain needs a knowledge base to understand the world
• Examples: YAGO, WordNet, Freebase, Probase, NELL, Cyc, …


How can we enable machines to understand human language with knowledge graphs?

Understanding Human Languages
• Understanding a concept/category (IJCAI2016)
• Understanding a set of entities (under review)
• Understanding a bag of words (IJCAI2015)
• Understanding verb phrases (AAAI2016)
• Understanding short texts (EMNLP2016)
• Understanding natural language questions (VLDB2017)
• Inference of missing facts (AAAI2017)


Language Cognitive Ability
• Conceptualization
• Newton-> Scientist
• Association
• Microsoft-> Bill Gates
• Inference
• Man has brain, brain can think-> Man can think
• Induction
• Ceremony, bride, rose-> wedding
• Categorization
• Sex=man, Marriage status=unmarried -> Bachelor



Probase and Probase+

Extracted by Hearst patterns:
• NP such as NP, NP, …, and|or NP
• such NP as NP,* or|and NP
• NP, NP,* or other NP
• NP, NP,* and other NP
• NP, including NP,* or|and NP
• NP, especially NP,* or|and NP
• domestic animals such as cats and dogs …
• China is a developing country.
• Life is a box of chocolate.
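As a rough illustration, the first and fourth patterns above can be approximated with regular expressions. This is a simplified sketch of my own: Probase applies such patterns over POS-tagged, web-scale text, whereas the snippet below only handles clean, lower-case noun phrases.

```python
import re

# Two simplified Hearst patterns over plain text (toy sketch, not Probase's extractor).
PATTERNS = [
    # "NP such as NP, NP, ... and|or NP"  ->  hypernym first
    re.compile(r"(?P<hyper>[\w ]+?) such as (?P<hypos>[\w ,]+?(?: (?:and|or) [\w ]+)?)(?:[.;]|$)"),
    # "NP, NP, ... and other NP"          ->  hypernym last
    re.compile(r"(?P<hypos>[\w ,]+?),? and other (?P<hyper>[\w ]+)(?:[.;]|$)"),
]

def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs matched by the patterns above."""
    pairs = []
    for pat in PATTERNS:
        m = pat.search(sentence)
        if not m:
            continue
        hyper = m.group("hyper").strip()
        hypos = re.split(r",| and | or ", m.group("hypos"))
        pairs.extend((h.strip(), hyper) for h in hypos if h.strip())
    return pairs

print(extract_isa("domestic animals such as cats and dogs"))
```

The copula sentences in the slide ("China is a developing country.") show why pattern matching alone is not enough: "is a" sometimes asserts isA and sometimes, as in "Life is a box of chocolate.", does not.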




DBPedia and CN-DBPedia

• DBpedia

• Extract structured information
from Wikipedia
• Make this information available on
the Web under an open license
• Interlink the DBpedia dataset with
other datasets on the Web
• Contributors
• Freie Universität Berlin (Germany)
• Universität Leipzig (Germany)
• OpenLink Software (UK)
• Linking Open Data

CN-DBpedia: a Chinese counterpart of DBpedia
• Developed by Knowledge Works at Fudan University
• Rich structured information for entities
• Contains many categories and tags for entities


Understanding a Concept/Category


How do we understand a concept/category?


Bachelor : Sex=man Marriage status=unmarried

What Are the Defining Features of a Category?

Defining features are assumed to establish the
necessary and sufficient conditions to
characterize the meaning of the category.
• Any entity with the defining features should
belong to the category
• Any entity belonging to the category must
contain the defining features

E.g., Category “Jay Chou albums”
Defining Features
{(Type, album), (Singer, Jay Chou)}
Non-Defining Features
{(Type, album), (Singer, Jay Chou), (genre, Pop music)}
{(Type, single), (Singer, Jay Chou)}
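The necessary-and-sufficient conditions above can be illustrated with a toy check. The entity table below is hand-made for this example; the actual work tests candidates against DBpedia-style entity attributes.

```python
# Toy attribute table (hypothetical data, mimicking infobox facts).
entities = {
    "Fantasy":     {"Type": "album",  "Singer": "Jay Chou", "Genre": "Pop music"},
    "The One":     {"Type": "album",  "Singer": "Jay Chou"},
    "Qing Hua Ci": {"Type": "single", "Singer": "Jay Chou"},
    "Thriller":    {"Type": "album",  "Singer": "Michael Jackson"},
}
category = {"Fantasy", "The One"}                    # entities in "Jay Chou albums"
candidate = {("Type", "album"), ("Singer", "Jay Chou")}

def has_features(entity, features):
    return all(entities[entity].get(k) == v for k, v in features)

# Sufficient: every entity with the features belongs to the category.
sufficient = all(e in category for e in entities if has_features(e, candidate))
# Necessary: every entity in the category has the features.
necessary = all(has_features(e, candidate) for e in category)

print(sufficient and necessary)  # True: the candidate is a defining feature set
```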


Solution and Results

How to measure the goodness of a set of features?

Challenge and Solutions

The search space of candidate feature sets is exponentially large.
Solution: use frequent pattern mining to find candidate defining feature sets that are frequent enough.
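The frequent-pattern-mining step can be sketched with a minimal Apriori-style count over entity feature sets. The data and thresholds below are toy values of my own; the real system mines at DBpedia scale with proper pruning.

```python
from itertools import combinations
from collections import Counter

# Feature sets of the entities in one category (hypothetical toy data).
entity_features = [
    {("Type", "album"), ("Singer", "Jay Chou"), ("Genre", "Pop music")},
    {("Type", "album"), ("Singer", "Jay Chou")},
    {("Type", "album"), ("Singer", "Jay Chou"), ("Genre", "R&B")},
]

def frequent_feature_sets(feature_sets, min_support=2, max_size=2):
    """Return candidate feature sets appearing in >= min_support entities."""
    counts = Counter()
    for fs in feature_sets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(fs), size):
                counts[frozenset(combo)] += 1
    return {c for c, n in counts.items() if n >= min_support}

candidates = frequent_feature_sets(entity_features)
print(frozenset({("Type", "album"), ("Singer", "Jay Chou")}) in candidates)  # True
```

Only frequent candidates survive; rare combinations such as {(Genre, Pop music)} are pruned before the score function is applied.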


Solution Framework

Repeat until no new DFs can be found:
Step 1: Use a score function to find the DFs of some categories
Steps 2 & 3: Use a rule-based method to get more DFs of categories
Step 4: Populate DBpedia using the DFs of categories discovered so far

We finally obtain 60,247 new C-DFs with an average quality score of 2.82.

Understanding a Set of Entities

Given a set of entities, can we understand its concept and recommend the most related entity?

E-commerce: if a user is searching for Samsung S6 and iPhone 6, what should we recommend, and why?







Understanding a Set of Entities

A naive solution:
Use a taxonomy, such as Probase, to find the nearest common ancestor of the entities.
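The naive approach amounts to intersecting the concepts of the entities. A toy sketch (hand-made isA table, not Probase):

```python
# Toy isA table: entity -> set of concepts it belongs to.
isa = {
    "china":  {"developing country", "country"},
    "india":  {"developing country", "country"},
    "brazil": {"developing country", "country"},
    "france": {"developed country", "country"},
}

def common_concepts(entity_list):
    """Concepts shared by all entities: the naive 'nearest common ancestor'."""
    return set.intersection(*(isa[e] for e in entity_list))

print(common_concepts(["china", "india", "brazil"]))
```

The problems listed next follow directly: the intersection may miss the intended concept entirely (no "BRIC" node), or return concepts far too broad to rank a recommendation.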

• Problems:
• The concept does not necessarily exist
• We can find China, Russia, Brazil and India under the topic 'developing country', but there is no exact topic 'BRIC'.
• A concept may cover many non-relevant entities
• Under the topic 'developing country' there are many other countries, which makes it difficult to find the most related entity
• The best concept is in most cases implicit
• The information in Probase is not clean


Model 1: use concepts as hidden variables and penalize concepts with too many member entities.

Model 2: choose the entity whose union with the query entities best preserves the concept distribution of the query.
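Model 2's idea can be illustrated numerically: score each candidate by how little the query's concept distribution shifts when the candidate is added, here measured with KL divergence. This is my own simplification with hand-made concept weights, not the paper's exact model.

```python
import math

def concept_dist(entity_list, isa_weights):
    """Average per-entity concept distributions into one distribution."""
    dist = {}
    for e in entity_list:
        for c, w in isa_weights[e].items():
            dist[c] = dist.get(c, 0.0) + w / len(entity_list)
    return dist

def kl(p, q, eps=1e-9):
    """KL divergence with a small floor to avoid log(0)."""
    concepts = set(p) | set(q)
    return sum(p.get(c, eps) * math.log(p.get(c, eps) / q.get(c, eps))
               for c in concepts)

# Toy concept weights (hypothetical).
isa_weights = {
    "samsung s6": {"smartphone": 0.9, "product": 0.1},
    "iphone 6":   {"smartphone": 0.9, "product": 0.1},
    "iphone 7":   {"smartphone": 0.9, "product": 0.1},
    "ipad":       {"tablet": 0.8, "product": 0.2},
}

query = ["samsung s6", "iphone 6"]
p = concept_dist(query, isa_weights)
best = min(["iphone 7", "ipad"],
           key=lambda e: kl(p, concept_dist(query + [e], isa_weights)))
print(best)  # iphone 7: it preserves the query's concept distribution
```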


Understanding Verb Phrase

E.g., I watched The Amazing Spider-Man 2 and thought it was impressive.
How do we understand "The Amazing Spider-Man 2" using the verb "watch"?
Pattern: watch $movie -> "The Amazing Spider-Man 2" isA movie
Linguists [Sinclair 1990] found two principles for verb phrases:
• idiom patterns: kick the ass / watch one's step
• conceptualized patterns: eat fruit (apple, banana, etc.), drink beverage (wine, tea, etc.)
Model: extract the patterns of verb phrases

Conceptualization using verbs
E.g., "The apple (object) he ate (verb) yesterday has a bad taste."
Pattern: eat $food -> apple isA food
Parsing: finding the subject/object/etc. of a sentence
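Once a verb's pattern is known, conceptualizing its object is a lookup. A minimal sketch with a hypothetical pattern table (the paper learns these patterns from corpora):

```python
# Toy verb -> pattern table (hypothetical; learned from text in the real system).
verb_patterns = {
    "watch": "$movie",
    "eat":   "$food",
    "drink": "$beverage",
}

def conceptualize(verb, obj):
    """Map a verb's direct object to the concept its pattern asserts."""
    pattern = verb_patterns.get(verb)
    if pattern and pattern.startswith("$"):
        return f"{obj} isA {pattern[1:]}"
    return None  # idiom pattern or unknown verb: no concept asserted

print(conceptualize("watch", "The Amazing Spider-Man 2"))
# The Amazing Spider-Man 2 isA movie
```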

Understanding Verb Phrase

Challenge: trade-off between generality and specificity

• Generality: one general pattern is better than several specific patterns.
• Specificity: a pattern and the entities assigned to it should match.

Solution: balance the two by using the Minimum Description Length (MDL) principle.




Our approach outperforms the competitors
Verb patterns are helpful for conceptualization

Understanding Short Texts

• Short texts are, for example,
• web queries
• instant messages

Example queries:
• cover of iPhone 6 plus
• distance from earth from moon
• thai food located in houston

Understanding the semantics of short texts requires syntactic parsing.

Understanding Short Texts

Syntactic parsing of short texts is challenging:
• Grammatical signals from function words and word order are not available
• There are no labeled dependency trees (treebank) for web queries, nor is there a standard for constructing such dependency trees

Our solution:
• Infer dependency trees from complete sentences by heuristics
• e.g., words connected via function words
• Train a transition-based dependency parser on the inferred trees

Understanding Short Texts

• The Stanford Parser relies heavily on grammar signals such as function words and word order, while QueryParser relies more on the semantics of the query
• QueryParser consistently outperformed competitors on the short-query parsing task

Understanding a bag of words


Given a bag of words, can we infer what the article is talking about?
• china, japan, india, korea -> asian country
• dinner, lunch, food, child, girl -> meal, child
• bride, groom, dress, celebration -> wedding

Challenge: how to measure the "goodness" of the labels we assign to a bag of words
• Coverage: the conceptual labels should cover as many words and phrases in the input as possible
• Minimality: the number of conceptual labels should be as small as possible

Topic labelling
• A topic is a bag of words that do not have
explicit semantics
• Conceptual labeling turns each topic into
a small set of meaningful concepts
Language understanding
• Verb role labeling
• Can summarize the verb eat's direct objects, apple, breakfast, pork, beef, bullet, into a small set of concepts, such as fruit, meal, meat, bullet




Minimum Description Length

The best concepts capture the regularities of the words as much as possible, which enables us to compress the data as much as possible.

Problem: given a bag of words {x1, …, xm}, find the set of conceptual labels that minimizes the total description length.
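The coverage and minimality criteria can be illustrated with a greedy sketch: repeatedly pick the concept covering the most uncovered words. This is my own simplification, not the paper's exact MDL objective, and the word-to-concept table is hand-made.

```python
# Toy concept -> words table (hypothetical).
concept_words = {
    "wedding":  {"bride", "groom", "dress", "celebration"},
    "clothing": {"dress"},
    "ceremony": {"celebration"},
}

def label(words):
    """Greedy coverage: few labels, covering as many input words as possible."""
    uncovered, labels = set(words), []
    while uncovered:
        best = max(concept_words, key=lambda c: len(concept_words[c] & uncovered))
        gain = concept_words[best] & uncovered
        if not gain:          # remaining words are noise: ignore them
            break
        labels.append(best)
        uncovered -= gain
    return labels

print(label({"bride", "groom", "dress", "celebration", "xyzzy"}))  # ['wedding']
```

Note how the behavior matches the results reported below: one specific label covers the input, and the noise word is ignored.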


• Our solutions can find minimum
number of concepts to label a
bag of words
• Most conceptual labels are specific
• Noise words will be ignored






• Our models can fully employ the
attributes of concepts to
generate a better label




Understanding Natural Language Questions

The online procedure parses and answers a question:
• Question Parsing: convert questions to templates by NER
• Predicate Lookup: look up the entity and the predicate of the given template and return the corresponding value
The offline procedure learns the mapping from templates to predicates:
• Template Extraction: learn templates and their corresponding predicates
• Predicate Expansion: learn expanded predicates



Key idea:

Understanding a question's intent by its template.



A probabilistic generative model for template-based predicate inference:
1. Starting from question q, generate its entity e according to the
distribution P(e|q).
2. Generate the template t according to the distribution P(t|q,e).
3. Infer predicate p by P(p|t), where the predicate p only
depends on t.
4. Generate the answer value v by P(v|e,p).
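The four steps above chain into P(v|q) = Σ P(e|q) · P(t|q,e) · P(p|t) · P(v|e,p). A toy numeric sketch, with all probability tables hand-made for illustration (only the birth year is a real fact):

```python
# Hand-made probability tables for the question "when was Obama born".
P_e_given_q  = {"Barack Obama": 1.0}                     # step 1: entity linking
P_t_given_qe = {"when was $person born": 1.0}            # step 2: template
P_p_given_t  = {"birth_date": 0.9, "birth_place": 0.1}   # step 3: learned offline
P_v_given_ep = {("Barack Obama", "birth_date"):  {"1961": 1.0},      # step 4:
                ("Barack Obama", "birth_place"): {"Honolulu": 1.0}}  # KB lookup

def answer_dist():
    """Marginalize over entities, templates, and predicates to score answers."""
    dist = {}
    for e, pe in P_e_given_q.items():
        for t, pt in P_t_given_qe.items():
            for p, pp in P_p_given_t.items():
                for v, pv in P_v_given_ep[(e, p)].items():
                    dist[v] = dist.get(v, 0.0) + pe * pt * pp * pv
    return dist

dist = answer_dist()
print(max(dist, key=dist.get))  # 1961
```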



KBQA finds significantly more templates and predicates than its competitors, even though the bootstrapping competitors use a larger corpus.

Missing isA Facts Inference

There are many missing links in a data-driven conceptual taxonomy such as Probase, e.g.:
• Newton isA scientist
• Steve Jobs isA billionaire

Can we infer missing facts from the existing facts in a knowledge base?

Data bias: many common-sense facts cannot be observed from data.

Can we infer that Steve Jobs is a billionaire from the fact that Bill Gates is a billionaire?

Missing isA Facts Inference: Ideas and Results

Inference from similar instances

Inference from similar concepts
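The "inference from similar instances" idea can be sketched as: if an entity shares most of its concepts with entities known to have concept c, predict that it has c too. The isA table, similarity measure, and threshold below are my own toy choices, not the paper's features or model.

```python
# Toy isA table (hypothetical).
isa = {
    "bill gates": {"entrepreneur", "ceo", "billionaire"},
    "steve jobs": {"entrepreneur", "ceo"},
    "newton":     {"physicist"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def infer(entity, concept, threshold=0.5):
    """Predict entity isA concept if a sufficiently similar entity has it."""
    witnesses = [e for e, cs in isa.items() if e != entity and concept in cs]
    return any(jaccard(isa[entity], isa[e] - {concept}) >= threshold
               for e in witnesses)

print(infer("steve jobs", "billionaire"))  # True: similar to bill gates
print(infer("newton", "billionaire"))      # False
```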

• Our features are effective at finding missing facts
• Our models consistently achieve 90% precision
• The more similar the entities/concepts, the higher the accuracy

Open Challenges
• Common sense knowledge
• humans cannot fly
• the sun rises in the east
• objects fall to the ground without any support
• Reasoning in language
• Obama is a white man?

• Why is understanding common sense knowledge challenging?
• No one mentions it explicitly in texts
• No source to extract it from
• Why is reasoning so hard?
• Hard inference always suffers from exceptions
• birds can fly, but ostriches cannot

Research Outline


Graph Analytics
1. Models for symmetry (Physical Review E 2008)
2. Graph simplification (Physical Review E 2008)
3. Complexity/distance measurement (Pattern Recognition 2008, Physica A 2008)
4. Graph index compression (EDBT2009)
5. Graph anonymization (EDBT2010)

Knowledge Graph Construction
1. IsA taxonomy completion (TKDE2017)
2. Implicit isA relation inference (AAAI2017)
3. Erroneous isA correction (AAAI2017)
4. Cross-lingual type inference (DASFAA2016)
5. End-to-end knowledge harvesting
6. Domain-specific knowledge harvesting

Natural Language Understanding by KG
1. Understanding a bag of words (IJCAI2015)
2. Understanding a set of entities
3. Understanding verb phrases (AAAI2016)
4. Understanding a concept (IJCAI2016)
5. Understanding short texts (EMNLP2016)
6. Understanding natural languages (IJCAI2016)

Knowledgeable Search/Recommendation
1. Recommendation by KG (WWW2014, DASFAA2015)
2. User profiling by KG (ICDM2015, CIKM2015)
3. Categorization by KG (CIKM2015)
4. Entity suggestion with conceptual explanation
5. Entity search by long concept query

Big Graph Management
1. Big graph systems (SIGMOD2012)
2. Overlapping community search (SIGMOD2013)
3. Local community search (SIGMOD2014)
4. Big graph partitioning (ICDE2014)
5. Shortest distance query (VLDB2014)
6. Fast graph exploration (VLDB2016)



1. CN-DBpedia: an effort to extract structured information from Chinese encyclopedia sites, such as Baidu Baike, and make this information available on the Web. CN-DBpedia allows you to ask sophisticated queries against Chinese encyclopedia sites and to link the different data sets on the Web to Chinese encyclopedia data.
2. Probase Plus: Probase is a web-scale taxonomy that contains 10 million concepts/entities and 16 million isA relations. ProbasePlus is an updated taxonomy with more isA relations inferred from the original Probase. They are useful for conceptualization, reasoning, etc.
3. Verb Base: a verb pattern is a probabilistic semantic representation of verbs. We introduce verb patterns to represent verb semantics, such that each pattern corresponds to a single sense of the verb. We constructed the verb patterns with consideration of both their generality and their specificity.


• Knowledge Works@FUDAN
• http://Kw.fudan.edu.cn
• Knowledge works is a studio focusing on building
and managing large scale knowledge graphs of
high quality as well as the applications of
knowledge graphs in text understanding,
intelligent search and robot brain.
• Graph Data Management Lab@FUDAN
• http://gdm.fudan.edu.cn
• GDM@FUDAN focuses on studying and developing effective and efficient solutions to manage and mine graph data, aiming at understanding real graphs and supporting real applications built on large real graphs. Recently, we are especially interested in knowledge graphs and their applications.

Our Mission: The construction, management and application of large scale
knowledge graphs


Knowledge Graph
A kind of semantic network that consists of entities/concepts as well as their semantic relationships. Higher coverage of entities and concepts, more abundant semantic relationships, more automatic construction, and higher accuracy are expected.
The key to intelligent information processing
KG has shown its potential power in solving problems such as search intent understanding, relationship explaining, and user profiling. It is of great business value in intelligent search, intelligent software, cyber security, and intelligent business.
The key to building a machine that thinks like a human
KG provides the necessary background knowledge to enable machines to understand language and think like humans.

