Link Analysis of Life Science Linked Data
Wei Hu¹, Honglei Qiu¹, and Michel Dumontier²
¹ State Key Laboratory for Novel Software Technology, Nanjing University, China
² Center for Biomedical Informatics Research, Stanford University
@micheldumontier::ISWC 2015
Linked Data offers links between datasets, but they are often incomplete and may contain errors.
Network Analysis
• Network analysis has long been
used to study link structures
– The structure of the Web
– Network medicine: cellular
networks and implications
A power-law degree distribution is scale-free.
A graph exhibits the small-world phenomenon if its clustering coefficient is significantly higher than that of a random graph on the same node set, and if it has a shorter average distance.
[Figure: BTC2010]
The clustering coefficient of a node quantifies how close its neighbors are to being a clique. The average distance is the average shortest-path length between all pairs of nodes in the graph.
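The two small-world measures defined above can be sketched in a few lines of Python. This is a minimal illustration on a toy undirected graph, not the tooling used in the study; the function names and the adjacency-dict representation are assumptions made for the example.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected graph."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        # Count the edges that actually exist among the neighbors of `node`.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_distance(adj):
    """Mean shortest-path length over all connected ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy graph (hypothetical): a triangle a-b-c with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(clustering_coefficient(toy), average_distance(toy))
```

To test for the small-world property, the same two functions would be applied to a random graph on the same node set and the values compared.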
Dataset link analysis (using RDF data model)
Entity link analysis (using cross-references)
Term link analysis (using ontology matching)
Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the
representation and interlinking of biological data using RDF.
Coverage: chemicals/drugs/formulations; genomes/genes/proteins and domains; interactions, complexes and pathways; animal models and phenotypes; diseases, genetic markers and treatments; terminologies and publications.
• Release 3 (June 2014)
• 35 datasets
• 11B RDF triples
• 1B entities
• 2K classes
• 4K properties
Dataset Links
Network Properties
1. Well linked
2. Hubs and authorities
3. Small-world phenomenon
– Average distance = 2.77 vs 6
– Clustering coefficient = 0.22 vs 0.13
4. Robust to systematic removal of nodes
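One way to probe robustness to node removal is to delete nodes in a fixed order and track the size of the largest connected component. The sketch below assumes "systematic removal" means removing the highest-degree nodes first; the slides do not state the exact procedure, and the function names and toy graph are hypothetical.

```python
def largest_component_size(adj, removed):
    """Size of the largest connected component, ignoring `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def removal_profile(adj, steps):
    """Largest-component size after removing 0, 1, ..., `steps` nodes.

    Nodes are removed in order of decreasing degree in the original graph
    (an assumption for this sketch); ties are broken by node name.
    """
    order = sorted(adj, key=lambda n: (-len(adj[n]), n))
    removed = set()
    profile = [largest_component_size(adj, removed)]
    for node in order[:steps]:
        removed.add(node)
        profile.append(largest_component_size(adj, removed))
    return profile


# Toy dataset graph (hypothetical): hub h linked to a, b, c, plus an a-b link.
toy_hub = {
    "h": {"a", "b", "c"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h"},
}
print(removal_profile(toy_hub, steps=2))  # [4, 2, 1]
```

A graph is robust in this sense when the profile decays slowly as nodes are removed.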
Entity Link Analysis
How well do entities link to each other?
• 76% of entity links involve a special kind of RDF triple
– e.g. <kegg:D03455, kegg:x-drugbank, drugbank:DB00002>
– x-relations have under-specified semantics
• May be truly identical, may refer to another related entity …
• Degree distribution
– Some do not follow power law
• Exponent is too large (close to 5)
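To make the exponent check concrete, the sketch below estimates a power-law exponent from a list of node degrees with the standard maximum-likelihood estimator for the continuous case, alpha = 1 + n / sum(ln(x_i / x_min)). The slides do not say how the exponent was fitted, so this is an illustration only; `powerlaw_exponent` and `x_min` are names chosen for the example.

```python
import math


def powerlaw_exponent(degrees, x_min=1):
    """Continuous-case MLE of the power-law exponent for degrees >= x_min."""
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no degrees at or above x_min")
    log_sum = sum(math.log(d / x_min) for d in tail)
    if log_sum == 0:
        raise ValueError("all degrees equal x_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sample, for illustration only.
print(powerlaw_exponent([2, 4, 8], x_min=2))
```

For integer degrees this estimator is only an approximation, and a large estimate on its own does not settle whether a power law fits; a goodness-of-fit test would still be needed.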
[Figure: BTC2010]
Symmetry of entity links varies between different pairs of datasets
• Over 99% of links are reciprocated in DrugBank-PharmGKB and OMIM-HGNC
– Suggests link sharing and synchronization
• Only 58% of DrugBank-KEGG links and 51% of OMIM-Orphanet links are reciprocal
– Suggests incomplete mapping
• 28% of OMIM-Orphanet links are malposed
– Suggests variation in model (omim:Phenotype to orphanet:Disorder)
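The reciprocity figures above can be read as the fraction of directed entity links whose reverse link also exists. A minimal sketch of that measure, assuming links are given as (source, target) pairs; only the KEGG/DrugBank pair comes from the slides, and the other identifiers are hypothetical.

```python
def reciprocity(links):
    """Fraction of directed links (s, t) whose reverse (t, s) is also present."""
    link_set = set(links)
    if not link_set:
        return 0.0
    mutual = sum(1 for s, t in link_set if (t, s) in link_set)
    return mutual / len(link_set)


example_links = [
    ("kegg:D03455", "drugbank:DB00002"),
    ("drugbank:DB00002", "kegg:D03455"),  # reverse link present
    ("kegg:X1", "drugbank:Y1"),           # hypothetical, no reverse link
    ("kegg:X2", "drugbank:Y2"),           # hypothetical, no reverse link
]
print(reciprocity(example_links))  # 0.5
```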
Transitivity Analysis:
Find mismatches and discover new links
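A rough sketch of the idea, under stated assumptions: entity links are treated as undirected, a two-step chain a–b–c with no direct a–c link yields a candidate new link, and a chain that connects two different entities from the same dataset (same CURIE prefix) is flagged as a possible mismatch. The slide gives no algorithm, so this heuristic, the function name and all identifiers are hypothetical.

```python
from collections import defaultdict


def transitive_candidates(links):
    """Return (candidate_new_links, possible_mismatches) from two-step chains."""
    nbrs = defaultdict(set)
    for s, t in links:
        nbrs[s].add(t)
        nbrs[t].add(s)
    new_links, conflicts = set(), set()
    for around in nbrs.values():
        for a in around:
            for c in around:
                if a >= c or c in nbrs[a]:
                    continue  # skip self, duplicates and already-linked pairs
                if a.split(":")[0] == c.split(":")[0]:
                    conflicts.add((a, c))  # two entities of one dataset merged
                else:
                    new_links.add((a, c))
    return new_links, conflicts


chain_links = [
    ("drugbank:D1", "kegg:K1"),
    ("kegg:K1", "pharmgkb:P1"),
    ("kegg:K1", "drugbank:D2"),
]
new_links, conflicts = transitive_candidates(chain_links)
print(sorted(new_links), sorted(conflicts))
```

Here `kegg:K1` bridges both DrugBank entries to `pharmgkb:P1` (two candidate links) and also implies that `drugbank:D1` and `drugbank:D2` are the same entity, which is flagged for review.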
Evaluation of Entity Matching
How accurate are current entity matching approaches?
• Built a benchmark from the reciprocal links between similarly-typed entities
• Evaluated several entity matching approaches
– Label similarity: Levenshtein, Jaro-Winkler, N-gram, Jaccard
– Machine learning: Linear regression, logistic regression with 5 properties
• Many-to-one links are difficult to discover
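Two of the label-similarity measures named above can be sketched with the standard library alone. This is an illustration, not the evaluation code behind the slides: the normalization of the edit distance and the whitespace tokenization for Jaccard are assumptions, and the example labels are made up.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance scaled to [0, 1], where 1 means identical labels."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_tokens(a, b):
    """Jaccard overlap of the lower-cased, whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


print(levenshtein_similarity("abc", "abd"))
print(jaccard_tokens("breast cancer", "Cancer of breast"))
```

A matcher built on these scores would accept a pair when the similarity exceeds a chosen threshold. Because each score depends only on the two labels, such a matcher cannot tell apart several source entities that share near-identical labels, one reason many-to-one links are hard to recover.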
Term Link Analysis
How similar are the topics in the data network?
• Use ontology matching to generate term link graph
– Falcon-AO (linguistic matchers + structural matcher + synonyms)
• Created 83K class mappings, 1.5K object property mappings, and 858 data property mappings
– Similarity threshold = 0.9
– Top-5 popular labels for classes and properties
• Significant overlap in topics; the term link graph does not follow a power law, unlike the broader Semantic Web
Correlation of Link Graphs
To what degree are the three link graphs correlated?
• Spearman’s rank correlation coefficient:
– Entity link graph → dataset pairs: entity links / entities
– Term link graph → dataset pairs: term mappings / terms
– Dataset link graph → dataset pairs: shortest path length
• All positively correlated
– Closer datasets in distance have more linked entities and terms
– Number of linked entities contributes little to overlap of topics
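Spearman's rank correlation is the Pearson correlation of the ranks of two paired samples. The sketch below computes it with average ranks for ties; in this setting each position in the two input lists would correspond to one dataset pair. The function names and the sample values are assumptions for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


# Hypothetical per-dataset-pair scores, for illustration only.
entity_link_ratio = [0.10, 0.40, 0.25, 0.80]
term_mapping_ratio = [0.05, 0.30, 0.20, 0.60]
print(spearman(entity_link_ratio, term_mapping_ratio))
```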
Summary of Findings
• Dataset, entity and term link graphs do not necessarily share the same characteristics with the Hypertext / Semantic Web
– Degree distribution of entity links does not follow power law
– Data hubs
• A significant number of entities have been linked using x-relations, but their intended semantics differ
– Classes are identical or equivalent → entity links represent logical equivalence
• Symmetric and transitive entity links do exist, but their utility is weakened by their small number
– Meanings of entity links may shift during transitive closure
• Matching only the labels of entities may fail, while combining different properties and using simple learning algorithms achieves good accuracy
dumontierlab.com
michel.dumontier@stanford.edu
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier
