USENIX NSDI 2016
Session: Resource Sharing
2016-05-29 @oraccha
Co-located Events
• ACM Symposium on SDN Research 2016 (SOSR), March 13-17
• 2016 Open Networking Summit (ONS), March 14-17
• The 12th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS '16), March 17-19
• The 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI '16)
• The USENIX Workshop on Cool Topics in Sustainable Data Centers (CoolDC '16), March 19
Session: Resource Sharing
• "Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics," Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica, University of California, Berkeley
• "Cliffhanger: Scaling Performance Cliffs in Web Memory Caches," Asaf Cidon and Assaf Eisenman, Stanford University; Mohammad Alizadeh, MIT CSAIL; Sachin Katti, Stanford University
• "FairRide: Near-Optimal, Fair Cache Sharing," Qifan Pu and Haoyuan Li, University of California, Berkeley; Matei Zaharia, Massachusetts Institute of Technology; Ali Ghodsi and Ion Stoica, University of California, Berkeley
• "HUG: Multi-Resource Fairness for Correlated and Elastic Demands," Mosharaf Chowdhury, University of Michigan; Zhenhua Liu, Stony Brook University; Ali Ghodsi and Ion Stoica, University of California, Berkeley, and Databricks Inc.
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
• Who?: A graduate student at UC Berkeley's AMPLab, the group behind Spark and Mesos. His focus is systems and algorithms for large-scale data analytics, with publications at SoCC '12, EuroSys '13, OSDI '14, and SIGMOD '16.
• What?: Proposes a framework for efficiently predicting the performance of data-analytics workloads in the cloud, such as machine learning and genome analysis.
[Slide: "Do choices matter?" — run times of Matrix Multiply (400K by 1K) and QR Factorization (1M by 1K) across instance choices (1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge, 16 r3.large); one workload is network bound, the other memory-bandwidth bound.]
[Slide: "Do choices matter? Matrix Multiply" — run time of a 400K by 1K matrix multiply across instance choices (1 r3.8xlarge, 2 r3.4xlarge, 4 r3.2xlarge, 8 r3.xlarge), each totaling 16 cores, 244 GB of memory, and $2.66/hr.]
[Slide: "Keystone-ML TIMIT Pipeline" — raw data flows through a Cosine Transform, Normalization, and a Linear Solver over ~100 iterations; the job is iterative (each iteration spawns many jobs), long running and therefore expensive, and numerically intensive.]
[Slide: "Do choices matter?" — actual vs. ideal scaling of QR Factorization (1M by 1K) on r3.4xlarge instances from 0 to 600 cores; computation + communication leads to non-linear scaling.]
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
• How?: Predicts performance from the results of small-scale training jobs, and uses optimal experiment design to cut down the number of training jobs needed (see the sketches after the slide notes below).
[Slide: "Optimal Design of Experiments" — candidate training runs combine input fractions (1%, 2%, 4%, 8%) with machine counts (1, 2, 4, 8); the design problem is solved with an off-the-shelf solver (CVX).]
[Slide: "Using Ernest" — given a job binary, a target machine count, and an input size, Ernest runs experiment-designed training jobs (using only a few iterations for training) and fits a linear model that predicts time as a function of machines.]
[Slide: "Ernest basic model" — time = x1 + x2 * (input / machines) + x3 * log(machines) + x4 * machines, where the four terms model serial execution, (linear) computation, a tree DAG, and an all-to-one DAG; collect training data, then fit a linear regression.]
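As a concrete illustration of the "fit a linear regression" step, here is a minimal sketch. It is not the authors' code: the training runs below are made-up numbers, and the use of scipy's non-negative least squares is an assumption for illustration.

    # Illustration only: fit the Ernest-style basic model from a few small runs.
    import numpy as np
    from scipy.optimize import nnls

    def features(input_frac, machines):
        # [serial, computation, tree DAG, all-to-one DAG] terms of the basic model
        return [1.0, input_frac / machines, np.log(machines), machines]

    # (input fraction, machines, measured time) from small training jobs
    runs = [(0.01, 1, 12.0), (0.02, 2, 9.5), (0.04, 4, 8.1), (0.08, 8, 7.9)]

    A = np.array([features(f, m) for f, m, _ in runs])
    y = np.array([t for _, _, t in runs])
    x, _ = nnls(A, y)  # coefficients x1..x4, constrained to be non-negative

    def predict(input_frac, machines):
        return float(np.dot(features(input_frac, machines), x))

    print(predict(1.0, 64))  # predicted time for the full input on 64 machines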
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
• Results:
[Slide: "Training time: Keystone-ML" — the TIMIT pipeline on r3.xlarge instances, 100 iterations; training uses 7 data points, up to 16 machines, and up to 10% of the data, and its time is compared against the running time on 42 machines.]
[Slide: "Is experiment design useful?" — prediction error (%) for Regression, Classification, KMeans, PCA, and TIMIT, comparing experiment design against a cost-based baseline.]
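To make the "optimal design of experiments" step concrete, here is a rough sketch of how such a design could be set up with an off-the-shelf convex solver. It is an illustration only: a D-optimal objective solved with cvxpy stands in for the CVX-based formulation mentioned on the slide, and the cost proxy and candidate set are assumptions, not Ernest's actual formulation.

    # Illustration only: pick informative small training runs within a cost budget.
    import numpy as np
    import cvxpy as cp

    # Candidate training configurations: (input fraction, number of machines).
    candidates = [(f, m) for f in (0.01, 0.02, 0.04, 0.08) for m in (1, 2, 4, 8)]
    X = np.array([[1.0, f / m, np.log(m), m] for f, m in candidates])  # model features
    cost = np.array([f * m for f, m in candidates])                    # rough cost proxy

    lam = cp.Variable(len(candidates), nonneg=True)   # weight on each candidate run
    M = sum(lam[i] * np.outer(X[i], X[i]) for i in range(len(candidates)))

    # Maximize the information in the chosen runs (D-optimal) under a cost budget.
    prob = cp.Problem(cp.Maximize(cp.log_det(M)), [cost @ lam <= 1.0])
    prob.solve()

    chosen = [c for c, w in zip(candidates, lam.value) if w > 1e-3]
    print(chosen)   # the small runs that would actually be executed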
Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
• Who?: A Stanford CS alumnus, now CEO and co-founder of the cloud-security company Sookasa. He works on cloud storage, with publications at SIGCOMM '12 and USENIX ATC '13 and '15.
• What?: Improves Memcached's dynamic cache allocation (slab allocator) to cope with performance cliffs.
[Slide: "Hit-rate Curve" — hit rate vs. number of items in the LRU queue (Application 19, Slab 0) together with its concave hull; the region where the curve plateaus and then jumps is a performance cliff (Talus [HPCA15]). A +1% cache hit rate translates into roughly a +35% speedup, and the hit rate of Facebook's Memcached pool is already 98.2% [SIGMETRICS12].]
Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
• How?: shadow queues
– Hill climbing algorithm: moves memory from the queue (slab) whose hit-rate curve has the smaller gradient to the queue with the larger gradient (a rough sketch follows the slide notes below).
– Cliff scaling algorithm: finds where a performance cliff (the concave region) begins and ends.
[Slide: "Using Shadow Queues to Estimate Local Gradient" — each queue has a physical queue plus a shadow queue of recently evicted keys; shadow-queue hits earn credits (e.g., Queue 1: +2, Queue 2: -2), and the credits drive queue resizing.]
[Slide: "Cliffhanger Runs Both Algorithms in Parallel" — the original queue is partitioned, with partitions tracking the left and right of a pointer as well as the hill-climbing state. Algorithm 1 incrementally optimizes memory across queues (across slab classes and across applications); Algorithm 2 scales performance cliffs.]
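A minimal sketch of the shadow-queue hill-climbing idea (illustration only; the data structures, credit accounting, and resize step below are assumptions, not Cliffhanger's Memcached implementation):

    # Illustration only: count hits in a "shadow" region beyond each queue's
    # physical capacity, then shift memory toward the queue with the steeper
    # hit-rate gradient.
    from collections import OrderedDict

    class SlabQueue:
        def __init__(self, capacity, shadow_capacity):
            self.items = OrderedDict()    # physical LRU queue (keys only)
            self.shadow = OrderedDict()   # shadow queue of recently evicted keys
            self.capacity = capacity
            self.shadow_capacity = shadow_capacity
            self.credits = 0

        def access(self, key):
            if key in self.items:         # physical hit
                self.items.move_to_end(key)
                return True
            if key in self.shadow:        # shadow hit: more memory would have helped
                self.credits += 1
                del self.shadow[key]
            self.items[key] = True        # admit the key
            while len(self.items) > self.capacity:
                evicted, _ = self.items.popitem(last=False)
                self.shadow[evicted] = True
                if len(self.shadow) > self.shadow_capacity:
                    self.shadow.popitem(last=False)
            return False

    def hill_climb(q1, q2, step=1):
        # Move memory from the flatter queue to the steeper one, then reset credits.
        if q1.credits > q2.credits and q2.capacity > step:
            q1.capacity, q2.capacity = q1.capacity + step, q2.capacity - step
        elif q2.credits > q1.credits and q1.capacity > step:
            q2.capacity, q1.capacity = q2.capacity + step, q1.capacity - step
        q1.credits = q2.credits = 0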
Cliffhanger: Scaling Performance Cliffs in Web Memory Caches
• A technique that seems broadly applicable. Unlike FairRide in the next talk, it does not address fairness.
[Slide: "Cliffhanger Reduces Misses and Can Save Memory" — average misses reduced by 36.7%; average potential memory savings of 45%.]
[Slide: "Cliffhanger Outperforms Default and Optimized Schemes" — average Cliffhanger hit-rate increase of 1.2%.]
FairRide: Near-Optimal, Fair Cache Sharing
• Who?: A graduate student at UC Berkeley's AMPLab, with publications at MobiCom '13 and SIGCOMM '15.
• What?: Proposes a file-cache sharing policy that satisfies isolation guarantee and strategy proofness while coming near-optimally close to Pareto efficiency.
[Slide: cache sharing today — caches are either statically allocated per user or globally shared in front of the backend (storage/network); what we want is isolation, strategy-proofness, higher utilization, and shared data.]
[Slide: properties table — existing policies (max-min fairness, priority allocation / max-min rate, static allocation) each fail at least one of Isolation Guarantee, Strategy Proofness, and Pareto Efficiency; FairRide satisfies the first two and is near-optimal on Pareto Efficiency.]
SIP theorem: when sharing files, the three properties above (isolation guarantee, strategy proofness, Pareto efficiency) cannot all be satisfied at the same time.
FairRide: Near-Optimal, Fair Cache Sharing
• How?
– Adds probabilistic blocking to the max-min policy, creating a disincentive to cheat (a sketch of the blocking rule follows the slide notes below).
– Implemented on top of Alluxio (Tachyon) [SoCC14].
[Slide: Figure 3 from the paper — an example with 2 users, 3 files, and a total cache size of 2, where numbers are access frequencies: (a) max-min fairness, (b) the second user cheats with extra accesses, (c) FairRide blocks the free-riding access.]
Probabilistic blocking
• FairRide blocks a user with probability p(nj) = 1/(nj + 1)
– nj is the number of other users caching file j
– e.g., p(1) = 50%, p(4) = 20%
• This is the best you can do in the general case
– Less blocking does not prevent cheating
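A minimal sketch of the blocking rule (illustration only; the read path and the cached_by bookkeeping below are assumptions, not the Alluxio-based implementation):

    # Illustration only: FairRide-style probabilistic blocking on the read path.
    import random

    def blocking_probability(n_other):
        # p(n_j) = 1 / (n_j + 1), where n_j is the number of *other* users
        # caching file j; e.g., p(1) = 50%, p(4) = 20%.
        return 1.0 / (n_other + 1)

    def read(user, file_id, cached_by, read_from_cache, fetch_from_backend):
        owners = cached_by.get(file_id, set())   # users whose allocation holds the file
        if not owners:
            return fetch_from_backend(file_id)
        if user not in owners and random.random() < blocking_probability(len(owners)):
            # Treat the access as a miss: the free rider pays the backend cost,
            # so not caching a shared file no longer pays off.
            return fetch_from_backend(file_id)
        return read_from_cache(file_id)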
FairRide: Near-Optimal, Fair Cache Sharing
[Slide: "Cheating under FairRide" — miss ratio over time for two users as user 2 and then user 1 start cheating; FairRide dis-incentivizes users from cheating.]
[Slide: Facebook experiments — average response time (ms); FairRide outperforms max-min fairness by 29%. A second chart shows the reduction in median job time (%) per bin of job size (#tasks) for max-min vs. FairRide.]
HUG: Multi-Resource Fairness for Correlated and Elastic Demands
• Who?: An assistant professor at the University of Michigan and a UC Berkeley AMPLab alumnus. His focus is networking (coflow-based networking, multi-resource allocation in datacenters, compute and storage for big data, network virtualization), and he publishes at SIGCOMM almost every year. This work builds on DRF [NSDI11] and FairCloud [SIGCOMM12].
• What?: An optimization problem for allocating network bandwidth among tenants.
[Slide: network model — machines M1..MN attach to a congestion-less core through links L1..L2N and host Tenant-A's and Tenant-B's VMs; how should the links be shared between multiple tenants so as to (1) provide optimal performance guarantees and (2) maximize utilization?]
HUG: Multi-Resource Fairness for Correlated and Elastic Demands
• Highest Utilization with the Optimal Isolation Guarantee
[Slide: isolation guarantee vs. utilization — a chart places Per-Flow Fairness, PS-P, DRF, and HUG along two axes, isolation guarantee (low to optimal) and utilization (low to work-conserving). In the cooperative setting HUG provides (1) the optimal isolation guarantee and (2) work conservation; in the non-cooperative setting it provides (1) the optimal isolation guarantee, (2) the highest utilization, and (3) strategy-proofness.]
Excerpt from the paper: "Intuitively, we want to maximize the minimum progress over all tenants, i.e., maximize min_k M_k, where min_k M_k corresponds to the isolation guarantee of an allocation algorithm. We make three observations. First, when there is a single link in the system, this model trivially reduces to max-min fairness. Second, getting more aggregate bandwidth is not always better. For tenant-A in the example, ⟨50Mbps, 100Mbps⟩ is better than ⟨90Mbps, 90Mbps⟩ or ⟨25Mbps, 200Mbps⟩, even though the latter ones have more bandwidth in total. Third, simply applying max-min fairness to individual links is not enough. In our example, max-min fairness allocates equal resources to both tenants on both links, resulting in allocations ⟨1/2, 1/2⟩ on both links (Figure 1b). The corresponding progress (M_A = M_B = 1/2) results in a suboptimal isolation guarantee (min{M_A, M_B} = 1/2). Dominant Resource Fairness (DRF) [33] extends max-min fairness to multiple resources [...]"
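HUG's first stage maximizes this minimum progress. A minimal sketch of that computation for correlated demands (illustration only; the function below and the demand vectors, which are chosen to match the flavor of the example above, are assumptions, and the work-conserving second stage is omitted):

    # Illustration only: the largest progress M that every tenant can be
    # guaranteed simultaneously, given correlated (normalized) demand vectors.
    def optimal_isolation_guarantee(demands, capacities):
        # demands[k][i]: tenant k's normalized demand on link i.
        # Giving every tenant progress M needs M * sum_k demands[k][i] on link i,
        # so the best guarantee is min_i capacities[i] / sum_k demands[k][i].
        best = float("inf")
        for i, cap in enumerate(capacities):
            total = sum(d[i] for d in demands)
            if total > 0:
                best = min(best, cap / total)
        return best

    # Two tenants on two unit-capacity links: A demands <1, 0.5>, B demands <0.5, 1>.
    print(optimal_isolation_guarantee([[1.0, 0.5], [0.5, 1.0]], [1.0, 1.0]))  # 2/3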
[Slide: cloud network sharing taxonomy — dynamic sharing vs. reservation (SecondNet, Oktopus, Pulsar, Silo; uses admission control). Dynamic sharing splits into flow-level (per-flow fairness; no isolation guarantee), VM-level (Seawall, GateKeeper; no isolation guarantee), and tenant-/network-level. At the tenant/network level, non-cooperative environments require strategy-proofness and are covered by HUG (Highest Utilization for the Optimal Isolation Guarantee); cooperative environments do not require strategy-proofness and contain DRF (optimal isolation guarantee but low utilization), PS-P/EyeQ/NetShare (work-conserving but suboptimal isolation guarantee), and HUG (work-conserving with the optimal isolation guarantee).]
HUG: Multi-Resource Fairness for Correlated and Elastic Demands
• Experiments on 100 EC2 instances.
• Three tenants:
– Tenants A and C: pairwise one-to-one communication
– Tenant B: all-to-all communication
[Figure 10 from the paper — bandwidth consumption (total allocation in Gbps) over time for three tenants arriving over time in a 100-machine EC2 cluster, under (a) per-flow fairness (TCP) and (b) HUG. Each tenant has 100 VMs but uses a different communication pattern (§5.1.1); with TCP, tenant-B dominates the network by creating more flows, while HUG isolates tenants A and C from tenant B.]
Impressions
• The common theme of this session is resource management inside the datacenter.
• None of the papers hinges on a radically new idea, but most are model examples of research that carefully formalizes a problem and then builds a practical system on top of that formulation. As expected of NSDI.
• Being able to hear every talk in a single track is nice, but 20 minutes per talk is short (some parts are hard to follow from the slides alone).
• UC Berkeley's AMPLab is strong.
• I would love to have the Facebook trace data.
All figures used in this material are taken from the proceedings and slides on the NSDI 2016 website.