Some Things I Wish I Had Known Before Scaling Machine Learning Solutions
Invector Labs
Today’s session is about differentiating BS from reality…
Agenda
• Myths and realities of machine learning solutions in the real world
• 15 lessons I learned when building large-scale machine learning systems
• For each lesson: the challenge, what we learned, and a solution
The different dimensions of machine intelligence solutions…
We can discuss the theoretical definitions or, instead, focus on the pragmatic one…
But the reality is that building machine learning solutions remains brutally difficult
And not just for the obvious reasons…
Challenges of Machine Learning in the Real World
• High technological barrier
• Limited talent availability
• Cost of labeled datasets
• …
A lifecycle we haven’t seen before…
We are dealing with a new app lifecycle…
Traditional App Lifecycle: Design → Implementation → Deployment → Management/Monitoring
Machine Learning App Lifecycle: Experimentation → Model Creation → Training → Testing → Regularization → Deployment → Monitoring → Optimization
The Ecosystem is Incredibly Crowded
The Aspects of a Machine Learning Solution that Will Drive You Crazy
• Strategy & processes
• Data engineering
• Experimentation
• Model training
• Model operationalization
• Runtime execution
• Security
• Lifecycle management
• Optimization
• …
Lessons learned when building large-scale machine learning solutions…
Strategy & Processes…
Lesson #1: Data scientists make horrible engineers…
Challenges
• Data scientists are great at experimentation, not so much at writing high-quality code
• Deep learning frameworks that excel at experimentation don’t necessarily make great production frameworks (e.g., PyTorch vs. TensorFlow)
Some Ideas to Consider
• Divide the data science and data engineering teams:
• Data Science Team: writes notebooks and experimentation models
• Engineering Team: refactors or rewrites models for production environments; automates training and optimization jobs
• DevOps Team: deploys models; monitors, retrains, and optimizes models
Lesson #2: Neither Agile nor Waterfall Methodologies Work in Machine Learning
Challenges
• Waterfall methods don’t work because you rarely know up front which machine learning methods are going to work for a specific problem
• Agile methods don’t work because parts of the lifecycle demand very specific requirements
Some Ideas to Consider
• Split the development lifecycle into agile and waterfall iterations: Agile → Waterfall → Agile
Data Engineering…
Lesson #3: Feature extraction can become a reusability nightmare…
Challenges
• Different models require the same features from a dataset
• Feature extraction jobs are computationally expensive
• Different teams create proprietary ways to capture and store feature information
Some Ideas to Consider
Dataset Preparation Job1…JobN → Representation Learning Task1…TaskN → Feature Store → Model 1…Model N
 Implement a centralized feature store (a minimal sketch follows below)
 Leverage representation learning to extract relevant features from a dataset
 Look for reference architectures, e.g., Uber’s Michelangelo
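As a rough illustration of the centralized feature store idea (not a reference to any specific product), a minimal sketch in Python: feature-extraction jobs are registered once under shared names, and any model can read the cached results. The FeatureStore class and its method names are hypothetical.

```python
import pandas as pd

class FeatureStore:
    """Minimal illustrative feature store: compute features once, reuse everywhere."""

    def __init__(self):
        self._features = {}    # feature group name -> computed feature values
        self._extractors = {}  # feature group name -> extraction job

    def register(self, name, extractor):
        """Register a feature-extraction job under a shared name."""
        self._extractors[name] = extractor

    def get(self, name, raw_data):
        """Return cached features, running the expensive job only on first access."""
        if name not in self._features:
            self._features[name] = self._extractors[name](raw_data)
        return self._features[name]

# Usage: two models share one expensive extraction job.
store = FeatureStore()
store.register("user_activity", lambda df: df.groupby("user_id")["clicks"].sum().to_frame())
raw = pd.DataFrame({"user_id": [1, 1, 2], "clicks": [3, 4, 5]})
features_for_model_1 = store.get("user_activity", raw)  # computed here
features_for_model_2 = store.get("user_activity", raw)  # served from cache
```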
Lesson #4: Data labeling is so easy to underestimate
Challenges
• Data experts spend a lot of time labeling datasets
• The logic for data labeling is often not reusable
• Subjective data labeling strategies fail to differentiate between useful and useless features
Some Ideas to Consider
 Implement an automated data labeling strategy
 Generative learning can help to structure more effective labels
 Project Snorkel is one of the leading automated data labeling frameworks in the market (see the sketch below)
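To make the Snorkel idea concrete, here is a minimal, hedged sketch of programmatic labeling, assuming the Snorkel 0.9.x API; the labeling rules and data are made up for illustration.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

# Labeling functions encode reusable labeling heuristics as code.
@labeling_function()
def lf_contains_offer(x):
    return SPAM if "free offer" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return NOT_SPAM if len(x.text) < 20 else ABSTAIN

df_train = pd.DataFrame({"text": ["Claim your FREE OFFER now!!!", "see you at 5", "lunch?"]})

# Apply all labeling functions, then let the label model denoise their votes.
applier = PandasLFApplier(lfs=[lf_contains_offer, lf_short_message])
L_train = applier.apply(df=df_train)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100)
probabilistic_labels = label_model.predict_proba(L=L_train)
```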
Model Experimentation…
Lesson #5: The single machine learning framework fallacy
Challenges
• Enterprises like to standardize on a single machine learning framework
• Different teams have different technology preferences
• Providing a consistent machine learning platform across different machine learning frameworks is no easy task
Some Ideas to Consider
Experimentation Framework → Intermediate Representation → Production Framework
 Optimize for productivity, not consistency
 Enable enough flexibility to leverage different frameworks for experimentation and production
 ONNX is a great solution for intermediate representations (see the sketch below)
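As a concrete illustration of the intermediate-representation pattern, the sketch below exports a toy PyTorch model to ONNX and runs it in ONNX Runtime. The model and file name are placeholders, and the calls assume recent torch and onnxruntime releases.

```python
import numpy as np
import torch
import onnxruntime as ort

# Experimentation side: a toy PyTorch model.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
model.eval()
dummy_input = torch.randn(1, 4)

# Export to the ONNX intermediate representation.
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["features"], output_names=["logits"])

# Production side: run the same model in a framework-neutral runtime.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"features": np.random.randn(1, 4).astype(np.float32)})
print(outputs[0])  # logits computed by ONNX Runtime
```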
Lesson #6: Too much time going from notebooks to production programs
Challenges
• Notebooks are ideal for model experimentation and testing
• Notebooks typically have performance challenges when executed at scale
• Scaling notebook environments can be challenging
• Parametrizing notebook executions is far from trivial
Some Ideas To Consider
• Model experimentation: Jupyter, Zeppelin
• Scheduling notebooks: Papermill, Netflix’s Meson
• Running complex workflows: Docker containers, Kubernetes
 Enable an infrastructure to operationalize data science notebooks (see the Papermill sketch below)
 Use containers for the most complex machine learning workflows
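Papermill directly addresses the parametrization problem above: it injects parameters into a tagged cell and executes the notebook headlessly. A minimal sketch, assuming a notebook train.ipynb with a cell tagged `parameters` (both names are placeholders):

```python
import papermill as pm

# Execute the notebook headlessly, overriding the values in its `parameters` cell.
pm.execute_notebook(
    "train.ipynb",                # input notebook (placeholder name)
    "train_run_lr_0.01.ipynb",    # executed copy, kept as an audit artifact
    parameters={"learning_rate": 0.01, "epochs": 20},
)
```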
Lesson #7: Model selection can be a machine learning problem
Challenges
• Data scientists make very subjective decisions when it comes to model selection
• The same problem can be solved using different machine learning models
• Very often it is almost impossible to differentiate between similar models
Some Ideas To Consider
Problem Dataset → AutoML → Proposed Models
 Represent machine learning requirements as a dataset with an objective attribute
 Leverage AutoML-based techniques for model selection (see the sketch below)
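One way to make model selection less subjective is to hand the search to an AutoML library. A hedged sketch using TPOT, one of several options (the deck doesn’t prescribe a specific tool):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TPOT searches over pipelines and hyperparameters against an objective metric,
# replacing a subjective model choice with a measured one.
automl = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
automl.export("proposed_pipeline.py")  # the winning pipeline as plain sklearn code
```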
Machine learning training…
Lesson #8: Training is a continuous task…
Challenges
• The No Free Lunch Theorem
• Trained models can perform poorly against new datasets
• New engineers and DevOps need to understand how to re-train existing models
Some Ideas to Consider
Data Lake → Data Outcomes/Feature Store → Training Job1…JobN
 Automate training jobs
 Orchestrate scheduled execution of training jobs (see the sketch below)
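To illustrate scheduled training, here is a minimal sketch assuming Apache Airflow 2.x as the orchestrator (the deck doesn’t name one); the DAG id, schedule, and train_model function are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def train_model():
    # Placeholder: pull fresh data from the feature store and retrain the model.
    print("retraining model on latest data")

# A daily retraining job, so training is a continuous task rather than a one-off.
with DAG(dag_id="retrain_model", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    retrain = PythonOperator(task_id="retrain", python_callable=train_model)
```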
Lesson #9: Training should be incremental…
Challenges
• Training machine learning models can be computationally expensive
• Most machine learning models need to be retrained entirely upon the arrival of new data
• It’s nearly impossible to quantify the impact that new datasets have on the performance of a model
Some Ideas to Consider
 Implement continual learning models
 Consider transfer learning as a fundamental enabler (see the sketch below)
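A hedged sketch of the transfer-learning idea in PyTorch, assuming torchvision ≥ 0.13: reuse a pretrained backbone and retrain only a small head, so new data doesn’t mean retraining everything from scratch (the 10-class head is an arbitrary example):

```python
import torch
import torchvision

# Start from a backbone pretrained on ImageNet instead of training from scratch.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Freeze the expensive-to-train backbone...
for param in model.parameters():
    param.requires_grad = False

# ...and replace only the classification head for the new task.
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Only the head's parameters are updated when new data arrives.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```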
Lesson #10: Training a model requires as much coding as creating it…
Challenges
• Data engineers spend a lot of time writing training routines for machine learning models
• Comparing the performance of different models on the same datasets remains tricky
• Changes to a training dataset often imply changes to the training code
Some Ideas to Consider
 Explore a configuration-driven training process
 Uber’s Ludwig is an innovative, no-code framework for training machine learning models (see the sketch below)
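With Ludwig, the training routine collapses into a declarative configuration. A minimal sketch using Ludwig’s Python API; the feature names and CSV path are placeholders:

```python
from ludwig.api import LudwigModel

# The entire training routine is expressed as configuration, not code.
config = {
    "input_features": [{"name": "review_text", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

model = LudwigModel(config)
results = model.train(dataset="reviews.csv")  # placeholder dataset path
```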
Executing Machine Learning Models…
Lesson #11: Different models require different execution patterns…
Challenges
• Not all models can be executed via APIs
• Some models take a long time to run
• In some scenarios, different models need to be executed at the same time based on a specific condition
Some Ideas to Consider
• On-demand activation: models invoked through a model API gateway
• Pub-sub activation: models triggered through an event gateway
• Scheduled activation: models executed on a schedule
Enable different execution modes based on the client’s requirements (an on-demand sketch follows below)
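As one illustration of the on-demand pattern, here is a minimal model-serving endpoint sketch using FastAPI (an arbitrary choice; the deck doesn’t mandate a framework, and the scoring logic is a placeholder):

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    features: List[float]

@app.post("/predict")
def predict(request: PredictionRequest):
    # Placeholder scoring logic; a real gateway would route to the loaded model.
    score = sum(request.features) / max(len(request.features), 1)
    return {"score": score}

# Run with: uvicorn serve:app --port 8000  (module name is a placeholder)
```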
Lesson #12: Mobile deep learning is more complicated than you think
Challenges
• Centralized cloud deep learning models don’t scale
• On-device deep learning models are hard to distribute and train
• Tons of privacy challenges
Some Ideas to Consider
 Consider using federated learning or similar patterns for mobile-based machine learning (see the sketch below)
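The heart of federated learning is that devices share model updates, not raw data. A minimal federated-averaging (FedAvg) sketch in PyTorch, with everything simulated in one process for illustration:

```python
import copy
import torch

# A tiny shared model architecture; each "device" holds its own copy.
def make_model():
    return torch.nn.Linear(4, 1)

global_model = make_model()
device_models = [copy.deepcopy(global_model) for _ in range(3)]

# In a real deployment, each device would train locally on private data here;
# only the resulting weights leave the device, never the data itself.

def federated_average(models):
    """Average the parameters of locally trained models (FedAvg)."""
    avg_state = copy.deepcopy(models[0].state_dict())
    for key in avg_state:
        avg_state[key] = torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)
    return avg_state

global_model.load_state_dict(federated_average(device_models))
```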
Machine Learning Operationalization…
Lesson #13: Debugging is a nightmare
Challenges
• The accuracy-interpretability friction
• The unpredictability factor
• Limited toolset
Some Ideas to Consider
• Visualize the network and its results: use tools like TensorBoard to visualize the structure of neural networks
• Compare training and test errors: high training error is a sign of underfitting; high test error with low training error is a sign of overfitting
• Test with small datasets: helps to determine whether the error is in the code or in the data
• Monitor activations and gradient values: monitor the number of activations in hidden units
• Interpretability: understanding how nodes are activated, what hidden layers do, and how concepts are formed
 Establish systematic practices to debug machine learning models
 Onboard model visualization and interpretability tools (a TensorBoard sketch follows below)
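A minimal sketch of the TensorBoard suggestion, using PyTorch’s SummaryWriter to log the scalar and histogram signals mentioned above (the model and training loop are placeholders):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
writer = SummaryWriter(log_dir="runs/debugging")

for step in range(100):
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Log training loss to compare against a held-out test loss over time.
    writer.add_scalar("loss/train", loss.item(), step)
    # Log weight and gradient distributions to spot dead units or exploding gradients.
    writer.add_histogram("weights", model.weight, step)
    writer.add_histogram("gradients", model.weight.grad, step)

writer.close()  # then inspect with: tensorboard --logdir runs
```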
Security…
Lesson #14: Machine learning models are so easy to hack
Challenges
• Most neural networks are vulnerable to adversarial attacks
• Attackers don’t need access to the models; they can simply manipulate input datasets
• Most of the time, adversarial attacks go undetected
Some Ideas to Consider
 Test your neural networks for adversarial robustness (see the sketch below)
 IBM’s Adversarial Robustness Toolbox is one of the leading stacks in neural network security
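A hedged sketch of adversarial-robustness testing with IBM’s Adversarial Robustness Toolbox, assuming the ART 1.x API; the tiny model and random data are placeholders standing in for a real classifier and test set:

```python
import numpy as np
import torch
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import PyTorchClassifier

# Placeholder model wrapped in ART's framework-agnostic classifier interface.
model = torch.nn.Sequential(torch.nn.Linear(20, 2))
classifier = PyTorchClassifier(
    model=model,
    loss=torch.nn.CrossEntropyLoss(),
    input_shape=(20,),
    nb_classes=2,
)

x_test = np.random.randn(100, 20).astype(np.float32)
y_test = np.random.randint(0, 2, size=100)

# Craft adversarial examples and measure how much accuracy degrades.
attack = FastGradientMethod(estimator=classifier, eps=0.1)
x_adv = attack.generate(x=x_test)

clean_acc = (classifier.predict(x_test).argmax(axis=1) == y_test).mean()
adv_acc = (classifier.predict(x_adv).argmax(axis=1) == y_test).mean()
print(f"accuracy: clean={clean_acc:.2f}, adversarial={adv_acc:.2f}")
```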
Lesson #15: Data privacy is the elephant in the machine learning room
Challenges
• Machine learning models intrinsically build knowledge about private datasets
• Most machine learning techniques require clear access to data which, in many cases, contains sensitive information
• There are no established techniques for evaluating the privacy robustness of machine learning models
Some Ideas to Consider
 Private machine learning is an emerging area of research
 Leverage techniques such as secure multi-party computation or zero-knowledge proofs to obfuscate training datasets (a secret-sharing sketch follows below)
 PySyft is an emerging framework to enable privacy in machine learning models
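To give a feel for the multi-party computation idea, here is a toy additive secret-sharing sketch in plain Python: a private value is split into random shares so that no single party learns it, yet the parties can jointly compute a sum. This is a didactic illustration, not a production protocol:

```python
import secrets

PRIME = 2**61 - 1  # shares live in a finite field

def share(value, n_parties=3):
    """Split a private value into n additive shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two data owners split their private values across three compute parties.
a_shares, b_shares = share(42), share(100)

# Each party adds only the shares it holds; no party ever sees 42 or 100.
sum_shares = [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 142
```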
Some not-so-well-known reference architectures that might help…
• DAWN Project from Stanford University
• Michelangelo from Uber
• MLflow from Databricks
• FBLearner from Facebook
• TFX from Google
The challenges go beyond the obvious…
Three Foundational Challenges for the Mainstream Adoption of Machine Learning
Lowering the Technological Entry Point
• Can mainstream developers embrace machine learning stacks?
Talent Availability
• Can companies and governments nurture local data science talent?
Data Democratization
• Can rich datasets stop being a privilege of large corporations and governments?
Some Initiatives to Consider
Lowering the Technological Entry Point
• AutoML, low-code machine learning frameworks
Talent Availability
• Google AI Academy, Coursera, Udacity…
Data Democratization
• Decentralized AI platforms
Summary
• Implementing machine learning solutions in the real world remains
incredibly challenging
• There is a large gap between the advancements in AI research and the
practical viability of those techniques
• Machine learning applications require a new lifecycle different from
traditional software models
• Each aspect of that lifecycle brings a unique set of challenges
• Start small, iterate…
Thanks
jr@invectoriq.com
jr@intotheblock.io
https://medium.com/@jrodthoughts
https://twitter.com/jrdothoughts
