Cluster schedulers
Agenda
• What is a cluster scheduler and why would one need it?

• Cluster scheduler architectures

• Specifics of YARN, Kubernetes, Mesos and Nomad:

• Architecture

• Specific features / positioning

• Pros and cons
What is a cluster scheduler?
Do I really need it?
• Software component (monolith or distributed) with two major functions:

• Allocate resources on node(s) for incoming workload

• Maintain task lifecycle on allocated resources (distribute, run, keep up, shut down)

• Cluster scheduler is different from application scheduler

• You need one (and are probably already using one) if you run a distributed application

• You need a real one if you run more than one application and need some elasticity
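The two core functions above can be sketched as a toy scheduler (all class and method names here are hypothetical, for illustration only):

```python
# Toy sketch of a cluster scheduler's two core functions:
# (1) allocate resources on a node, (2) manage task lifecycle.

class Node:
    def __init__(self, name, cpus, mem_mb):
        self.name, self.cpus, self.mem_mb = name, cpus, mem_mb

class Scheduler:
    def __init__(self, nodes):
        self.nodes = nodes          # nodes with free resources
        self.tasks = {}             # task id -> (node, state)

    def allocate(self, task_id, cpus, mem_mb):
        """Function 1: find a node with enough free resources."""
        for node in self.nodes:
            if node.cpus >= cpus and node.mem_mb >= mem_mb:
                node.cpus -= cpus
                node.mem_mb -= mem_mb
                self.tasks[task_id] = (node, "RUNNING")
                return node.name
        return None                 # no capacity: queue or reject

    def shutdown(self, task_id, cpus, mem_mb):
        """Function 2 (simplified): lifecycle transition, freeing resources."""
        node, _ = self.tasks[task_id]
        node.cpus += cpus
        node.mem_mb += mem_mb
        self.tasks[task_id] = (node, "FINISHED")

sched = Scheduler([Node("n1", cpus=4, mem_mb=8192)])
assert sched.allocate("t1", 2, 4096) == "n1"
assert sched.allocate("t2", 4, 1024) is None   # only 2 cpus left on n1
```

A real scheduler also handles failure detection, restarts, and placement constraints; this only shows the shape of the problem.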
Monolith architecture
• Scheduler is a single process that controls everything about workloads

• Examples: Hadoop JobTracker, Kubernetes (kube-scheduler)

• Simple initial implementation

• Hard to implement different requirements for different workloads
* Picture source: http://www.firmament.io/blog/scheduler-architectures.html
Two-level architecture
• Task lifecycle is separated from resource allocation

• Examples: YARN (you have to see it), Mesos

• Easy to add different types of applications

• Hard to implement anti-interference measures and priority-based cross-application preemption
* Picture source: http://www.firmament.io/blog/scheduler-architectures.html
Shared-state architecture
• Each scheduler (i.e. per application type) maintains its own view of the cluster state and commits changes as transactions (which can succeed or fail)

• Example: Nomad

• State synchronisation has to be done
* Picture source: http://www.firmament.io/blog/scheduler-architectures.html
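The optimistic, transactional commit described above can be sketched as follows (a minimal model, not any real scheduler's API):

```python
# Sketch of shared-state scheduling: each scheduler works against its
# own snapshot of cluster state and commits a transaction that fails
# if the authoritative state changed underneath it.

class ClusterState:
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.version = 0

    def snapshot(self):
        return (self.version, self.free_cpus)

    def commit(self, snapshot_version, cpus_wanted):
        """Atomically claim cpus; fail on version conflict (stale view)."""
        if snapshot_version != self.version or cpus_wanted > self.free_cpus:
            return False            # conflict: scheduler must re-sync and retry
        self.free_cpus -= cpus_wanted
        self.version += 1
        return True

state = ClusterState(free_cpus=8)
v1, _ = state.snapshot()            # scheduler A snapshots the state
v2, _ = state.snapshot()            # scheduler B snapshots the same state
assert state.commit(v1, 4) is True  # A commits first
assert state.commit(v2, 4) is False # B's transaction fails: its view is stale
```

The failed committer re-reads the state and retries, which is exactly the synchronisation cost the last bullet refers to.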
Distributed architecture
• No centralised resource allocation, simplified model

• Example: Sparrow

• Has great advantages for fine-grained tasks randomly distributed across a large cluster

• Any synchronisation (e.g. to avoid interference) is hard
* Picture source: http://www.firmament.io/blog/scheduler-architectures.html
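Sparrow's core trick is random sampling: probe a few workers per task and pick the least loaded one, with no global state at all. A toy sketch (names hypothetical):

```python
# Sparrow-style distributed placement: probe a few random workers and
# pick the one with the shortest queue ("power of two choices").

import random

def place_task(worker_queue_lengths, probes=2, rng=random):
    """Probe `probes` random workers; return the least-loaded one's index."""
    sampled = rng.sample(range(len(worker_queue_lengths)), probes)
    return min(sampled, key=lambda w: worker_queue_lengths[w])

# With two workers and two probes, both are always sampled,
# so the least-loaded worker is always chosen:
assert place_task([10, 0], probes=2) == 1
```

Because each placement decision is independent and local, this scales very well, but it also shows why cross-task coordination (the last bullet) has nowhere to live.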
YARN: Yet Another Resource Negotiator
History
• MapReduce JobTracker generalisation (decoupled into Resource Manager and Application Master), one of the two parts of “Hadoop”

• Resource allocation based on requests

• Works fine with large containers and batch processes, not so much with fine-grained tasks / services

• All Hadoop frameworks have first-class support for YARN (MRv2, Pig, Hive, Spark)

• Supports pluggable schedulers (cluster-level), containerisation
Architecture
* Picture source: Apache Hadoop Website
Specific features / issues
• Pluggable “queue management” scheduler:

• FairScheduler: memory-fair by default, with an optional DRF policy per queue

• CapacityScheduler: pluggable resource calculator; DominantResourceCalculator supports CPU and memory

• Data locality support possible (e.g. MRv2)

• Preemption: across queues and within queues (2.8.0/3.0.0)

• Kerberos authentication, ACLs on queue and cluster

• Awful metrics system, no support for metric collection from “frameworks”

• No volume management
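The Dominant Resource Fairness idea behind the DominantResourceCalculator can be illustrated with made-up numbers: a queue's dominant share is its largest fractional use of any one resource, and the scheduler favours the queue with the smallest dominant share.

```python
# DRF sketch with invented queue names and capacities.

TOTAL = {"cpu": 100, "mem_gb": 400}

def dominant_share(used):
    """Largest fraction of any single resource this queue consumes."""
    return max(used[r] / TOTAL[r] for r in TOTAL)

queues = {
    "analytics": {"cpu": 30, "mem_gb": 40},   # cpu-dominant: 0.30
    "etl":       {"cpu": 10, "mem_gb": 160},  # memory-dominant: 0.40
}

next_queue = min(queues, key=lambda q: dominant_share(queues[q]))
assert dominant_share(queues["analytics"]) == 0.30
assert dominant_share(queues["etl"]) == 0.40
assert next_queue == "analytics"   # lowest dominant share is served next
```

This is why DRF handles mixed CPU-heavy and memory-heavy queues more sensibly than memory-only fairness.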
Google Kubernetes
History
• Kubernetes grew out of Google’s internal “Borg” project

• Initially: a greenfield implementation of container orchestration targeted at services

• kube-scheduler is a small part of what K8s does 

• Best for microservices in the cloud

• Huge momentum

• Very ops friendly; Google is dogfooding it (Google Container Engine is upstream K8s)
Architecture
* Picture source: Wikipedia
Specific features
• Pod / Controllers / Services 

• Controllers: ReplicaSets / StatefulSets / DaemonSets

• Volumes!

• Resources, oversubscription and QoS

• Service Discovery / Load Balancing

• Secrets

• Authentication / Authorizations / Admission Controls

• Monitoring: Heapster / cAdvisor

• Federation!

• …
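The "Resources, oversubscription and QoS" bullet is worth unpacking: Kubernetes derives a pod's QoS class from its resource requests and limits. A simplified single-container sketch of the rules:

```python
# Simplified Kubernetes QoS classification (single-container pod):
#   Guaranteed  - requests == limits, set for both cpu and memory
#   BestEffort  - no requests or limits at all
#   Burstable   - anything in between

def qos_class(requests, limits):
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits and set(requests) == {"cpu", "memory"}:
        return "Guaranteed"
    return "Burstable"

assert qos_class({"cpu": "500m", "memory": "1Gi"},
                 {"cpu": "500m", "memory": "1Gi"}) == "Guaranteed"
assert qos_class({}, {}) == "BestEffort"
assert qos_class({"cpu": "250m"}, {"cpu": "500m"}) == "Burstable"
```

Under node pressure, BestEffort pods are evicted first and Guaranteed pods last, which is how oversubscription stays safe(ish).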
Issues
• Many concepts, hard to master and reason about (e.g. controllers are like schedulers, but not really)

• The monolithic kube-scheduler can be slow

• No IO isolation, so not suitable for analytical workloads on large on-premise clusters

• No real enterprise support (that I know of)
Apache Mesos
History
• UC Berkeley 2009, Apache top-level project 2013

• Clean two-level architecture implementation

• Resource allocation based on offers

• Initially part of the BDAS stack, targeted at Big Data first (Apache Spark was a proof of concept for Mesos)

• Popularised by Mesosphere in DC/OS product
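The offer-based model mentioned above inverts YARN's request flow: the master offers resources, and each framework's scheduler accepts or declines. A toy sketch (these classes are illustrative, not the Mesos API):

```python
# Toy model of Mesos-style offer-based allocation.

class Master:
    def __init__(self, agent_resources):
        self.agents = agent_resources       # agent name -> free cpus

    def make_offer(self, agent):
        """Offer an agent's free resources to a framework scheduler."""
        return {"agent": agent, "cpus": self.agents[agent]}

class FrameworkScheduler:
    def __init__(self, cpus_needed):
        self.cpus_needed = cpus_needed

    def resource_offer(self, offer):
        """Accept the offer if it covers our need, otherwise decline."""
        if offer["cpus"] >= self.cpus_needed:
            return ("ACCEPT", self.cpus_needed)
        return ("DECLINE", 0)

master = Master({"agent1": 2, "agent2": 8})
framework = FrameworkScheduler(cpus_needed=4)
assert framework.resource_offer(master.make_offer("agent1")) == ("DECLINE", 0)
assert framework.resource_offer(master.make_offer("agent2")) == ("ACCEPT", 4)
```

Declined offers go back into the pool and are offered to other frameworks, which is what makes the two-level split clean.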
Architecture
* Picture source: Apache Mesos Website
Specific features
• Flexible in terms of the resources that can be allocated: cpus, memory, disks / volumes, gpus

• Pluggable: schedulers (called frameworks), containerizers, loggers, networking (CNI/libnetwork)

• Oversubscription, revocable resources, quotas

• Some volume management

• Very rough around the edges
Framework support
• Although it’s very common these days to run X on Y, Mesos is the leader in hosting other systems

• It’s really easy to develop a Mesos framework

• Some examples:

• Marathon/Aurora for container orchestration (some people have even tried K8s, but that is too much)

• HDFS/Kafka/NoSQL DBs - if you like to live on the edge

• Jenkins/Artifactory/Gitlab

• Spark/TF/Flink/Storm
Real world example
Hashicorp Nomad
History
• 2015, developed by Hashicorp

• Shared-state architecture (service/batch/system schedulers); first and foremost a Docker scheduler

• Depends on other Hashicorp tools: Consul, Vault
Architecture
* Picture source: Nomad Website
Specific features & issues
• Multi-DC and multi-region support based on a gossip protocol

• Service/batch/system schedulers

• No authorisation, only basic TLS on communication

• No volume management

• No IO isolation

• Preemption?
Q & A
