Machine Learning and Deep
Contemplation of Data
Joel Saltz
Department of Biomedical Informatics
Stony Brook University
CCDSC
October 5, 2016
From BDEC: “Domain”: Spatio-temporal Sensor
Integration, Analysis, Classification
• Multi-scale material/tissue structural, molecular, functional
characterization. Design of materials with specific structural, energy
storage properties, brain, regenerative medicine, cancer
• Integrative multi-scale analyses of the earth, oceans, atmosphere,
cities, vegetation etc – cameras and sensors on satellites, aircraft,
drones, land vehicles, stationary cameras
• Digital astronomy
• Hydrocarbon exploration, exploitation, pollution remediation
• Solid printing integrative data analyses
• Data generated by numerical simulation codes – PDEs, particle
methods
Things that Need to be Done with Spatio Temporal Data
• Generation of Features
• Sanity Checking and Data Cleaning
• Qualitative Exploration
• Descriptive Statistics
• Classification
• Identification of Interesting Phenomena
• Prediction
• Control
• Save Data for Later (Compression)
Precision Medicine Meta Application
• Predict treatment
outcome, select,
monitor treatments
• Reduce inter-observer
variability in diagnosis
• Computer assisted
exploration of new
classification schemes
• Multi-scale cancer
simulations
Imaging and Precision Medicine - Pathomics, Radiomics
Identify and segment trillions of objects – nuclei, glands,
ducts, nodules, tumor niches … from Pathology,
Radiology imaging datasets
Extract features from objects and spatio-temporal
regions
Support queries against ensembles of features extracted
from multiple datasets
Statistical analyses and machine learning to link
Radiology/Pathology features to “omics” and outcome
biological phenomena
Principle based analyses to bridge spatio-temporal scales
– linked Pathology, Radiology studies
Things that Need to be Done with Spatio Temporal Data
• Generation of Features
• Sanity Checking and Data Cleaning
• Qualitative Exploration
• Descriptive Statistics
• Classification
• Identification of Interesting Phenomena
• Prediction
• Control
• Save Data for Later (Compression)
Current Driving Applications
• Checkpoint Inhibitors –
when to use, when to
stop
• Pathology, Imaging data
obtained prior to and
during treatment
• Integration of “omics”,
tissue and imaging to
manage treatment
• Non Small Cell Lung
Cancer, Melanoma, Brain
• Virtual Tissue Repository
• SEER Cancer
Epidemiology
• 500K Cancer Patients per
year
• DOE/NCI pilot involving
text
• Our co-located
companion Virtual Tissue
Repository pilot targets
SEER images
Radiomics
Decoding tumour phenotype
by noninvasive imaging using
a quantitative radiomics
approach
Hugo J. W. L. Aerts et al.
Nature Communications 5, Article
number: 4006
doi:10.1038/ncomms5006
Features
Patients
Integrative
Morphology/“omics”
Quantitative Feature Analysis
in Pathology: Emory In Silico
Center for Brain Tumor
Research (PI = Dan Brat,
PD= Joel Saltz)
NLM/NCI: Integrative
Analysis/Digital Pathology
R01LM011119, R01LM009239
(Dual PIs Joel Saltz, David
Foran)
J Am Med Inform Assoc. 2012
Integrated morphologic
analysis for the
identification and
characterization of disease
subtypes.
Pathomics
Lee Cooper, Jun Kong
Things that Need to be Done with Spatio Temporal Data
• Generation of Features
• Sanity Checking and Data Cleaning
• Qualitative Exploration
• Descriptive Statistics
• Classification
• Identification of Interesting Phenomena
• Prediction
• Control
• Save Data for Later (Compression)
Robust Nuclear Segmentation
• Robust ensemble algorithm to segment nuclei across tissue types
• Optimized algorithm tuning methods
• Parameter exploration to optimize quality
• Systematic Quality Control pipeline encompassing tissue image
quality, human generated ground truth, convolutional neural
network critique
• Yi Gao, Allen Tannenbaum, Dimitris Samaras, Le Hou, Tahsin Kurc
Cell Morphometry Features
Things that Need to be Done with Spatio Temporal Data
• Generation of Features
• Sanity Checking and Data Cleaning
• Qualitative Exploration
• Descriptive Statistics
• Classification
• Identification of Interesting Phenomena
• Prediction
• Control
• Save Data for Later (Compression)
3D Slicer Pathology – Generate High Quality
Ground Truth
Apply Segmentation Algorithm
Adjust algorithm parameters, manual fine tuning
Sanity Check Features
Relationship Between Image and Features
Leveraging Visualization to Aid in Feature Management
Step 1: Choose a case from the TCGA atlas (case #20)
Step 2: Select two features of interest; X
axis (area), Y axis (perimeter)
Step 3: Zoom in on region of interest
Step 4: Pick a specific nucleus of interest.
Each dot represents a single nucleus
Step 5: Evaluate the features selected in the context of
the specific nucleus and where this nucleus is located
within the whole slide image
The tool provides visual context for feature evaluation. This technique maps both intuitive features (e.g.
size, shape, color) and non-intuitive features (e.g. wavelets, texture) to the ground truth of source
images through an interactive web-based user interface.
Selected nucleus
geolocated within
whole slide image
Detects
elongated
nucleus
Going from the whole slide data set to selected features and back to the image
Adding a visual perspective by using a live web-based interactive tool (http://sbu-bmi.github.io/featurescape/u24/Preview.html)
Select Feature Pair – dots correspond to nuclei
Subregion selected – form of gating analogous to flow
cytometry
Sample Nuclei from Gated Region
Gated Nuclei in Context
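The gating step can be sketched as a rectangular selection in a two-feature plane. This is only a minimal illustration, not the FeatureScape implementation: the record layout, the names `Nucleus` and `gate`, and all numeric values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Nucleus:
    """Hypothetical per-nucleus record: slide position plus two features."""
    x: float          # centroid column in the whole slide image (pixels)
    y: float          # centroid row in the whole slide image (pixels)
    area: float
    perimeter: float


def gate(nuclei: List[Nucleus],
         area_range: Tuple[float, float],
         perimeter_range: Tuple[float, float]) -> List[Nucleus]:
    """Return the nuclei whose (area, perimeter) point lies inside the gate."""
    a_lo, a_hi = area_range
    p_lo, p_hi = perimeter_range
    return [n for n in nuclei
            if a_lo <= n.area <= a_hi and p_lo <= n.perimeter <= p_hi]


# Made-up values, for illustration only.
nuclei = [
    Nucleus(x=120, y=340, area=55.0, perimeter=28.0),
    Nucleus(x=900, y=210, area=140.0, perimeter=75.0),
    Nucleus(x=455, y=610, area=60.0, perimeter=30.0),
]
gated = gate(nuclei, area_range=(100.0, 200.0), perimeter_range=(60.0, 100.0))

# Each gated nucleus keeps its (x, y), so it can be shown in slide context.
gated_positions = [(n.x, n.y) for n in gated]
```

Because every record carries its slide coordinates, a selection made in feature space maps straight back to locations in the image, which is the round trip the slides describe.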
Compare Algorithm Results
Heatmap – Depicts Agreement Between Algorithms
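One way to build such an agreement heatmap is to score each tile of two segmentation masks with an overlap measure. The slide does not say which measure is used; the Dice coefficient below is an assumption, as are the function names and the toy masks.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = pixel inside a segmented nucleus


def dice(a: Mask, b: Mask) -> float:
    """Dice coefficient of two equally sized binary masks (1.0 if both empty)."""
    inter = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 1.0 if total == 0 else 2.0 * inter / total


def agreement_map(a: Mask, b: Mask, tile: int) -> List[List[float]]:
    """Dice score for each tile x tile block; one heatmap cell per block."""
    rows, cols = len(a), len(a[0])
    heat = []
    for r in range(0, rows, tile):
        heat_row = []
        for c in range(0, cols, tile):
            sub_a = [row[c:c + tile] for row in a[r:r + tile]]
            sub_b = [row[c:c + tile] for row in b[r:r + tile]]
            heat_row.append(dice(sub_a, sub_b))
        heat.append(heat_row)
    return heat


# Toy outputs of two hypothetical algorithms on the same 4 x 4 region.
mask_a = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]
mask_b = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 1],
          [0, 0, 0, 1]]
heat = agreement_map(mask_a, mask_b, tile=2)
```

Cells close to 1.0 mark tiles where the two algorithms agree; low values point to regions worth a closer look.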
Things that Need to be Done with Spatio Temporal Data
• Generation of Features
• Sanity Checking and Data Cleaning
• Qualitative Exploration
• Descriptive Statistics
• Classification
• Identification of Interesting Phenomena
• Prediction
• Control
• Save Data for Later (Compression)
Auto-tuning and feature extraction
• Goal – correctly segment trillions of objects (nuclei)
• Adjust algorithm parameters
• Autotuning – finds parameters that best match ground
truth in an image patch
• Region template runtime support to optimize
generation and management of multi-parameter
algorithm results
• Eliminates redundant computation, manages locality
• Active Harmony – Jeff Hollingsworth!!
• Collaboration – George Teodoro, Tahsin Kurc
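The autotuning idea, finding the parameters that best match ground truth in an image patch, can be sketched with a plain exhaustive search. This is a stand-in for illustration and not how Active Harmony searches the space; the toy `segment` function, its two parameters, and the Dice score used as the quality measure are all assumptions.

```python
from itertools import product
from typing import Dict, List, Tuple

Image = List[List[float]]
Mask = List[List[int]]


def segment(patch: Image, threshold: float, invert: bool) -> Mask:
    """Toy stand-in for a segmentation algorithm with two tunable parameters."""
    return [[int((v < threshold) if invert else (v >= threshold)) for v in row]
            for row in patch]


def dice(a: Mask, b: Mask) -> float:
    """Dice overlap between a candidate mask and the ground truth mask."""
    inter = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 1.0 if total == 0 else 2.0 * inter / total


def autotune(patch: Image, truth: Mask,
             thresholds: List[float],
             inverts: List[bool]) -> Tuple[Dict[str, object], float]:
    """Score every parameter combination against ground truth; keep the best."""
    best_score, best_params = -1.0, {}
    for t, inv in product(thresholds, inverts):
        score = dice(segment(patch, t, inv), truth)
        if score > best_score:
            best_score, best_params = score, {"threshold": t, "invert": inv}
    return best_params, best_score


# Made-up 3 x 3 patch and hand-drawn ground truth.
patch = [[0.9, 0.8, 0.2],
         [0.7, 0.3, 0.1],
         [0.2, 0.1, 0.1]]
truth = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
params, score = autotune(patch, truth,
                         thresholds=[0.25, 0.5, 0.75],
                         inverts=[False, True])
```

A real tuner would replace the exhaustive loop with a guided search, since the number of combinations grows quickly with the number of parameters.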
E=Eliminate Duplicate Computations
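Eliminating duplicate computation in a parameter sweep can be illustrated with simple memoization: when several parameter sets share the same early-stage parameters, the early stage runs once and its result is reused. This is a sketch of the general idea only, not the region template runtime; the two-stage pipeline and the names `normalize` and `segment` are hypothetical.

```python
from functools import lru_cache
from itertools import product

calls = {"normalize": 0, "segment": 0}

PIXELS = (0.9, 0.8, 0.2, 0.7, 0.3, 0.1)  # toy image patch, flattened


@lru_cache(maxsize=None)
def normalize(gain: float) -> tuple:
    """Stage 1: depends only on `gain`, so its result is cached per gain value."""
    calls["normalize"] += 1
    return tuple(min(1.0, v * gain) for v in PIXELS)


def segment(gain: float, threshold: float) -> tuple:
    """Stage 2: reuses the cached stage-1 output for the same gain."""
    calls["segment"] += 1
    return tuple(int(v >= threshold) for v in normalize(gain))


gains = [1.0, 1.2]
thresholds = [0.25, 0.5, 0.75]
results = {(g, t): segment(g, t) for g, t in product(gains, thresholds)}
```

Six parameter combinations are evaluated, but stage 1 executes only once per distinct gain value, as the `calls` counters show.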
Performance Optimization
256 nodes of Stampede. Each node of the cluster has dual-socket Intel Xeon
E5-2680 processors, an Intel Xeon Phi SE10P co-processor, and 32 GB RAM. The
nodes are interconnected via Mellanox FDR InfiniBand switches.
                Good    Bad
Test as Good    2916     33
Test as Bad       28   2094
Machine Learning and Quality Critiquing
SVM Approach
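The counts in the confusion matrix above can be turned into summary rates with a few lines of arithmetic. The sketch assumes that the columns are the reference quality labels, that the rows are the classifier output, and that "Good" is the positive class; the slide does not state this explicitly.

```python
# Counts copied from the slide; variable names are illustrative.
test_good_actual_good = 2916
test_good_actual_bad = 33
test_bad_actual_good = 28
test_bad_actual_bad = 2094

total = (test_good_actual_good + test_good_actual_bad
         + test_bad_actual_good + test_bad_actual_bad)

# Fraction of all cases where the classifier matches the reference label.
accuracy = (test_good_actual_good + test_bad_actual_bad) / total

# Fraction of truly good cases that test as good.
sensitivity = test_good_actual_good / (test_good_actual_good + test_bad_actual_good)

# Fraction of truly bad cases that test as bad.
specificity = test_bad_actual_bad / (test_bad_actual_bad + test_good_actual_bad)

print(f"n={total} accuracy={accuracy:.4f} "
      f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```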
Things that Need to be Done with Spatio Temporal Data
• Generation of Features
• Sanity Checking and Data Cleaning
• Qualitative Exploration
• Descriptive Statistics
• Classification
• Identification of Interesting Phenomena
• Prediction
• Control
• Save Data for Later (Compression)
Feature Explorer - Integrated Pathomics Features, Outcomes
and “omics” – TCGA NSCLC Adeno Carcinoma Patients
Collaboration with MGH – Feature Explorer – Radiology Brain
MR/Pathology Features
Collaboration with SBU Radiology – TCGA NSCLC Adeno Carcinoma
Integrative Radiology, Pathology, “omics”, outcome
Mary Saltz, Mark Schweitzer SBU Radiology
Things that Need to be Done with Spatio Temporal Data
• Generation of Features
• Sanity Checking and Data Cleaning
• Qualitative Exploration
• Descriptive Statistics
• Classification
• Identification of Interesting Phenomena
• Prediction
• Control
• Save Data for Later (Compression)
Classification
• Automated or semi-automated identification of
tissue or cell type
• Variety of machine learning and deep learning
methods
• Classification of Neuroblastoma
• Classification of Gliomas
• Quantification of lymphocyte infiltration
Classification and Characterization of Heterogeneity
Gurcan, Shimada, Kong, Saltz
Hiro Shimada, Metin Gurcan, Jun Kong, Lee Cooper, Joel Saltz
BISTI/NIBIB Center for Grid Enabled Image Analysis - P20 EB000591, PI Saltz
Neuroblastoma Classification
FH: favorable histology; UH: unfavorable histology
CANCER 2003; 98:2274-81
[Figure: classification decision tree. Branch points: Schwannian development (≥50%, or none to <50%); grossly visible nodule(s) (absent or present); microscopic neuroblastic foci (absent or present); mitotic & karyorrhectic cells (<100, 100-200, or ≥200 per 5,000 cells); age (<1.5 yr, ≥1.5 yr, <5 yr, ≥5 yr, or any age).]
• Ganglioneuroma (Schwannian stroma-dominant), maturing and mature subtypes: FH
• Ganglioneuroblastoma, Intermixed (Schwannian stroma-rich): FH
• Ganglioneuroblastoma, Nodular (composite, Schwannian stroma-rich/stroma-dominant and stroma-poor): UH/FH*; variant forms*
• Neuroblastoma (Schwannian stroma-poor), undifferentiated, poorly differentiated and differentiating subtypes: UH or FH, depending on mitotic & karyorrhectic cell count and age
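The mitotic and karyorrhectic cell count cut-offs named on this slide can be expressed as a small binning function. This is an illustrative sketch only: the function name is hypothetical, the handling of the boundary at exactly 200 (placed in the ≥200 bin) is an assumption, and the sketch does not assign FH or UH, which also depends on subtype and age.

```python
def mk_cell_bin(mk_cells: int, cells_counted: int = 5000) -> str:
    """Bin a mitotic & karyorrhectic cell count using per-5,000-cell cut-offs."""
    if cells_counted <= 0:
        raise ValueError("cells_counted must be positive")
    if mk_cells < 0:
        raise ValueError("mk_cells must be non-negative")
    # Rescale so counts taken over a different number of cells are comparable.
    per_5000 = mk_cells * 5000 / cells_counted
    if per_5000 < 100:
        return "<100/5,000"
    if per_5000 < 200:
        return "100-200/5,000"
    return ">=200/5,000"


# Made-up counts, for illustration only.
bins = [mk_cell_bin(50), mk_cell_bin(150), mk_cell_bin(240)]
```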
Multi-Scale Machine Learning Based Shimada Classification System
• Background Identification
• Image Decomposition (Multi-resolution
levels)
• Image Segmentation (EMLDA)
• Feature Construction (2nd order statistics,
Tonal Features)
• Feature Extraction (LDA) + Classification
(Bayesian)
• Multi-resolution Layer Controller
(Confidence Region)
[Figure: flowchart with TRAINING and TESTING paths.
TRAINING boxes: Training Tiles, Down-sampling, Segmentation, Feature Construction, Feature Extraction, Classifier Training.
TESTING boxes: Image Tile, Initialization (I = L), Background? (Yes/No), Create Image I(L), Segmentation, Feature Construction, Feature Extraction, Classification, Within Confidence Region? (Yes/No), I = I - 1, I > 1? (Yes/No), Label.]
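The testing path of the flowchart can be read as a loop over resolution levels that stops as soon as the classifier is confident enough. The sketch below is one interpretation under stated assumptions: level L is taken to be the coarsest image and level 1 the full-resolution one, the "confidence region" test is replaced by a simple scalar threshold, and `classify_at` is a hypothetical callback standing in for segmentation, feature construction, feature extraction and classification at one level.

```python
from typing import Callable, Tuple


def classify_tile(classify_at: Callable[[int], Tuple[str, float]],
                  is_background: bool,
                  coarsest_level: int,
                  min_confidence: float) -> Tuple[str, int]:
    """Return (label, level used), refining resolution only when needed."""
    if coarsest_level < 1:
        raise ValueError("coarsest_level must be at least 1")
    if is_background:
        # Background tiles are labelled without running the pipeline.
        return "background", coarsest_level
    level = coarsest_level                      # I = L
    while True:
        label, confidence = classify_at(level)
        # Accept a confident decision, or the last one at full resolution.
        if confidence >= min_confidence or level == 1:
            return label, level
        level -= 1                              # I = I - 1


# Made-up classifier outputs per level, for illustration only.
fake_outputs = {3: ("class_a", 0.60), 2: ("class_b", 0.70), 1: ("class_b", 0.95)}
strict = classify_tile(fake_outputs.__getitem__, False, 3, min_confidence=0.90)
lenient = classify_tile(fake_outputs.__getitem__, False, 3, min_confidence=0.65)
```

The appeal of this design is cost: most tiles can be decided on small down-sampled images, and only ambiguous tiles pay for processing at higher resolution.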
Brain Tumor Classification – CVPR 2016
Combining Information from Patches
Brain Tumor Classification Results
Le Hou, Dimitris Samaras, Tahsin Kurc, Yi Gao, Liz Vanner, James
Davis, Joel Saltz
Tumor Infiltrating Lymphocyte quantification
• Convolutional neural
network to classify
lymphocyte
infiltration in tissue
patches
• Convolutional neural
network and random
forest to classify
individual
segmented nuclei
• Extensive collection
of ground truth
• Joint work with
Emory and TCGA
PanCanAtlas
Immune group
Unsupervised Autoencoder – 100 feature dimensions
Lymphocyte identification
[Figure panels: Lymphocyte Infiltration; No Lymphocyte Infiltration]
Receiver Operating Characteristic – Area Under Curve – 95%
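For reference, the area under the ROC curve equals the probability that a randomly chosen positive patch receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the toy labels and scores are assumptions and are unrelated to the 95% figure reported on the slide.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up patch labels (1 = infiltrated) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of patches, so it suits small checks like this one; rank-based formulas are preferable for large evaluation sets.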
Lymphocyte Classification Heat Map
Trained with 22.2K image patches
Pathologist corrects and edits
Commonalities
• Provided a quick but fairly deep dive into aspects of
spatio-temporal data analytics
• Requirements, methods and, I think, core
infrastructure can be shared between disparate
application classes
• These application classes are definitely data-driven, but
their spatio-temporal aspects fit the HPC community
context well
• Most of this holds for analysis of data generated by
scientific programs – ORNL Klasky collaborations
ITCR Team
Stony Brook University
Joel Saltz
Tahsin Kurc
Yi Gao
Allen Tannenbaum
Erich Bremer
Jonas Almeida
Alina Jasniewski
Fusheng Wang
Tammy DiPrima
Andrew White
Le Hou
Furqan Baig
Mary Saltz
Emory University
Ashish Sharma
Adam Marcus
Oak Ridge National Laboratory
Scott Klasky
Dave Pugmire
Jeremy Logan
Yale University
Michael Krauthammer
Harvard University
Rick Cummings
Funding – Thanks!
• This work was supported in part by U24CA180924-01,
NCIP/Leidos 14X138 and HHSN261200800001E
from the NCI; R01LM011119-01 and R01LM009239
from the NLM
• This research used resources provided by the
National Science Foundation XSEDE Science
Gateways program under grant TG-ASC130023 and
the Keeneland Computing Facility at the Georgia
Institute of Technology, which is supported by the
NSF under Contract OCI-0910735.
Thanks!


Editor's Notes

  • #41 This is Dr. Shimada’s prognosis classification