Shmulik Ladkani, 2018
Building Network Functions with eBPF & BCC
This work is licensed under a Creative Commons Attribution 4.0 International License.
Agenda
● Intro
● Theory
○ Classical BPF
○ eBPF
○ BCC
● Practice
○ Examples and demo
Berkeley Packet Filter
Berkeley Packet Filter
"The BSD Packet Filter: A New Architecture for User-level Packet Capture"
● McCanne/Jacobson 1993
● Standardized API
● Performant
Berkeley Packet Filter
● Allows user program to attach a filter onto a socket
● Available on most *nix systems
Design
● Abstract-machine architecture
○ Registers, memory, addressing modes…
○ Instruction set (load, store, branch, ALU…)
● In-kernel interpreter
Example program: assembly / machine instructions
(000) ldh [12]                 { 0x28, 0, 0, 0x0000000c },
(001) jeq #0x800 jt 2 jf 5     { 0x15, 0, 3, 0x00000800 },
(002) ldb [23]                 { 0x30, 0, 0, 0x00000017 },
(003) jeq #0x6 jt 4 jf 5       { 0x15, 0, 1, 0x00000006 },
(004) ret #262144              { 0x6, 0, 0, 0x00040000 },
(005) ret #0                   { 0x6, 0, 0, 0x00000000 },
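To make the listing above concrete, here is a minimal user-space sketch of an interpreter that understands only the four opcodes this program uses. The names `struct insn`, `tcp_ipv4_filter` and `run_filter` are made up for illustration; the in-kernel interpreter handles the full instruction set and is structured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as the { code, jt, jf, k } tuples in the listing. */
struct insn {
    uint16_t code;
    uint8_t  jt, jf;
    uint32_t k;
};

/* The six-instruction program from the slide: accept IPv4 + TCP, drop the rest. */
const struct insn tcp_ipv4_filter[] = {
    { 0x28, 0, 0, 0x0000000c },  /* (000) ldh [12]    : A = EtherType      */
    { 0x15, 0, 3, 0x00000800 },  /* (001) jeq #0x800  : IPv4, else -> 005  */
    { 0x30, 0, 0, 0x00000017 },  /* (002) ldb [23]    : A = IP protocol    */
    { 0x15, 0, 1, 0x00000006 },  /* (003) jeq #0x6    : TCP, else -> 005   */
    { 0x06, 0, 0, 0x00040000 },  /* (004) ret #262144 : accept             */
    { 0x06, 0, 0, 0x00000000 },  /* (005) ret #0      : drop               */
};

/* Run prog over pkt; returns the number of bytes to accept, 0 to drop. */
uint32_t run_filter(const struct insn *prog, size_t n,
                    const uint8_t *pkt, size_t len)
{
    uint32_t a = 0;  /* accumulator register */

    for (size_t pc = 0; pc < n; pc++) {
        const struct insn *i = &prog[pc];

        switch (i->code) {
        case 0x28:  /* ldh: big-endian half-word at absolute offset k */
            if ((size_t)i->k + 2 > len)
                return 0;  /* out-of-bounds load drops the packet */
            a = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
            break;
        case 0x30:  /* ldb: byte at absolute offset k */
            if (i->k >= len)
                return 0;
            a = pkt[i->k];
            break;
        case 0x15:  /* jeq: jt/jf are offsets relative to the next instruction */
            pc += (a == i->k) ? i->jt : i->jf;
            break;
        case 0x06:  /* ret: return the constant k */
            return i->k;
        default:    /* opcode outside this sketch */
            return 0;
        }
    }
    return 0;
}
```

Note how `jf 5` in the assembly column becomes a relative `3` in the machine instruction at (001): the target is computed from the instruction that follows the jump.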
Modus Operandi
struct sock_filter code[] = {
/* ... machine instructions ... */
};
struct sock_fprog bpf = {
.filter = code,
.len = ARRAY_SIZE(code),
};
sock = socket(...);
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
Applications
● Libpcap
○ Tcpdump, Wireshark, Nmap...
● DHCP stacks
● WPA 802.1x stacks
● Android 464XLAT
● android.net.NetworkUtils
● Custom user-space protocol stacks
Linux Enhancements
Packet Metadata Access
Extension    Description
len          skb->len
proto        skb->protocol
type         skb->pkt_type
ifidx        skb->dev->ifindex
hatype       skb->dev->type
mark         skb->mark
rxhash       skb->hash
vlan_tci     skb_vlan_tag_get(skb)
vlan_avail   skb_vlan_tag_present(skb)
vlan_tpid    skb->vlan_proto
nla          Netlink attribute of type X with offset A
nlan         Nested Netlink attribute of type X with offset A
Linux Enhancements
Just-In-Time Compiler
● Converts BPF instructions directly into native code
● As of v3.0 (x86_64)
○ SPARC, PowerPC, ARM, ARM64, MIPS, s390 followed
Linux Enhancements
Hooking Points
● IPTables xt_bpf
○ Competitive with traditional u32 match
○ As of v3.9
○ iptables -A OUTPUT -m bpf --bytecode '4,48 0 0 9,21 0 1 6,6 0 0 1,6 0 0 0' -j ACCEPT
● TC cls_bpf
○ Alternative to ematch / u32 classification
○ As of v3.13
○ tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1
○ tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
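The bytecode strings passed to iptables and tc above are a decimal instruction count followed by comma-separated "code jt jf k" tuples. The sketch below parses that format back into instruction structs. `struct insn` and `parse_bytecode` are illustrative names, not part of iptables or tc.

```c
#include <stdint.h>
#include <stdio.h>

struct insn {
    uint16_t code;
    uint8_t  jt, jf;
    uint32_t k;
};

/* Parse "N,code jt jf k,code jt jf k,..." into out[].
 * Returns N on success, -1 on malformed input or if N exceeds max. */
int parse_bytecode(const char *s, struct insn *out, int max)
{
    int n, used;

    if (sscanf(s, "%d%n", &n, &used) != 1 || n < 1 || n > max)
        return -1;
    s += used;

    for (int i = 0; i < n; i++) {
        unsigned code, jt, jf, k;

        /* Each tuple is introduced by a comma; all fields are decimal. */
        if (sscanf(s, ",%u %u %u %u%n", &code, &jt, &jf, &k, &used) != 4)
            return -1;
        if (code > 0xffff || jt > 0xff || jf > 0xff)
            return -1;
        out[i].code = (uint16_t)code;
        out[i].jt = (uint8_t)jt;
        out[i].jf = (uint8_t)jf;
        out[i].k = k;
        s += used;
    }
    return n;
}
```

Applied to the iptables example, the four tuples decode to `ldb [9]`, `jeq #6 jt 0 jf 1`, `ret #1`, `ret #0`. Since xt_bpf sees the packet starting at the IP header, offset 9 is the protocol field and the rule matches TCP.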
Linux Enhancements
Seccomp BPF
● Filters system calls using a BPF filter
○ Operates on syscall number and syscall arguments
○ As of v3.5
● Used by Chrome, Firefox, OpenSSH, Android…
static struct sock_filter filter[] = {
/* ... */
// load syscall number
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
// only allow 'read'
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, SYS_read, 0, 1),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
};
/* ... */
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filterprog);
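For experimenting without killing the process, a variant of the fragment above can fail one syscall with an error instead. The sketch below is an assumption-laden illustration, not the slide's filter: it makes `getppid` return EPERM and allows everything else, the function name `deny_getppid` is made up, and it omits the architecture check (`seccomp_data.arch`) that a production filter should perform first.

```c
#include <errno.h>
#include <stddef.h>         /* offsetof */
#include <linux/filter.h>   /* BPF_STMT, BPF_JUMP, struct sock_fprog */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */
#include <sys/prctl.h>
#include <sys/syscall.h>    /* __NR_getppid */
#include <unistd.h>         /* syscall(), for callers that want to probe the filter */

/* Install a filter that fails getppid with EPERM. Returns 0, or -1 with errno set. */
int deny_getppid(void)
{
    struct sock_filter filter[] = {
        /* load syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* getppid: fall through to the ERRNO return; anything else: skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Unprivileged processes must set no_new_privs before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After a successful call, `syscall(__NR_getppid)` returns -1 with `errno` set to EPERM, while other syscalls behave as before. The filter cannot be removed for the lifetime of the process.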
Summary
● Fixed filter program
● Few injection points
● Two domains
○ Packet filtering
○ Syscall filtering
● Functional, stateless
● Kernel data is immutable
● No kernel interaction
User-program injected into kernel to control behavior
Extended BPF
eBPF
● Abstract-machine engine running injected user programs
● On steroids
○ New domain (tracing/profiling)
○ Numerous hooking points
○ LLVM backend
○ Actions (mutates data)
○ Data-structures (“maps”)
○ Kernel callable helper functions
Applications (network)
● Network Security (DDoS, IDS, IPS …)
● Load Balancers
● Custom Statistics
● Monitoring
● Container Networking
● Custom Forwarding Stacks
● Network Functions
Modus Operandi
● Write
○ Restricted C
● Compile
○ clang & llc
● Load
○ bpf(BPF_PROG_LOAD, ...)
● Attach
○ Subsystem dependent
struct bpf_map_def SEC("maps") my_map = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(u32),
.value_size = sizeof(long),
.max_entries = 256,
};
SEC("socket1") int bpf_prog1(struct __sk_buff *skb)
{
int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
long *value;
if (skb->pkt_type != PACKET_OUTGOING)
return 0;
value = bpf_map_lookup_elem(&my_map, &index);
if (value)
__sync_fetch_and_add(value, skb->len);
return 0;
}
samples/bpf/sockex1_kern.c
load_bpf_file(filename); // assigns prog_fd, map_fd
sock = open_raw_sock("lo");
setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd, sizeof(prog_fd[0]));
f = popen("ping -c5 localhost", "r");
for (i = 0; i < 5; i++) {
long long tcp_cnt, udp_cnt, icmp_cnt;
key = IPPROTO_TCP;
bpf_map_lookup_elem(map_fd[0], &key, &tcp_cnt);
key = IPPROTO_UDP;
bpf_map_lookup_elem(map_fd[0], &key, &udp_cnt);
key = IPPROTO_ICMP;
bpf_map_lookup_elem(map_fd[0], &key, &icmp_cnt);
printf("TCP %lld UDP %lld ICMP %lld bytes\n", tcp_cnt, udp_cnt, icmp_cnt);
sleep(1);
}
samples/bpf/sockex1_user.c
eBPF Maps
● Key-value store
○ Keeps program state
○ Accessible from the eBPF program
○ Accessible from userspace
● Allows context aware behavior
● Numerous data structures
BPF_MAP_TYPE_HASH
BPF_MAP_TYPE_ARRAY
BPF_MAP_TYPE_LRU_HASH
BPF_MAP_TYPE_LPM_TRIE
more ...
Program Types
Determines: context, whence, access rights
BPF_PROG_TYPE_SOCKET_FILTER   packet filter
BPF_PROG_TYPE_SCHED_CLS       tc classifier
BPF_PROG_TYPE_SCHED_ACT       tc action
BPF_PROG_TYPE_LWT_*           lightweight tunnel filter
BPF_PROG_TYPE_KPROBE          kprobe filter
BPF_PROG_TYPE_TRACEPOINT      tracepoint filter
BPF_PROG_TYPE_PERF_EVENT      perf event filter
BPF_PROG_TYPE_XDP             packet filter from XDP
BPF_PROG_TYPE_CGROUP_SKB      packet filter for control groups
BPF_PROG_TYPE_CGROUP_SOCK     same, allowed to modify socket options
Helper Functions
● eBPF program may call a predefined set of functions
● Differs by program type
● Examples:
BPF_FUNC_skb_load_bytes
BPF_FUNC_csum_diff
BPF_FUNC_skb_get_tunnel_key
BPF_FUNC_get_hash_recalc
...
BPF_FUNC_skb_store_bytes
BPF_FUNC_skb_pull_data
BPF_FUNC_l3_csum_replace
BPF_FUNC_l4_csum_replace
BPF_FUNC_redirect
BPF_FUNC_clone_redirect
BPF_FUNC_skb_vlan_push
BPF_FUNC_skb_vlan_pop
BPF_FUNC_skb_change_proto
BPF_FUNC_skb_set_tunnel_key
...
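Helpers such as BPF_FUNC_l3_csum_replace exist because a program that rewrites a header field must also fix the checksum, and recomputing it from scratch is wasteful. The user-space sketch below illustrates the underlying arithmetic, the incremental update from RFC 1624. It is not the kernel helper itself, and the names `csum_full` and `csum_replace16` are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold carries back into the low 16 bits (one's-complement addition). */
static uint32_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return sum;
}

/* Full Internet checksum over len bytes; len is assumed even and the
 * checksum field inside data is assumed to be zero. */
uint16_t csum_full(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)data[i] << 8 | data[i + 1];
    return (uint16_t)~fold(sum);
}

/* Incremental update when one 16-bit word changes from 'from' to 'to':
 * HC' = ~(~HC + ~m + m')   (RFC 1624, equation 3). */
uint16_t csum_replace16(uint16_t check, uint16_t from, uint16_t to)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~from;
    sum += to;
    return (uint16_t)~fold(sum);
}
```

For example, decrementing the TTL of an IPv4 header changes only the TTL/protocol word, so one call to `csum_replace16` with the old and new value of that word yields the same checksum as a full recomputation over all 20 bytes.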
BCC
BPF Compiler Collection
● Toolkit for creating and using eBPF
● Makes eBPF programs easier to write
○ Kernel instrumentation in C
○ Frontends in Python and Lua
● Numerous examples
● Documentation and tutorials
Example #1
Custom Statistics
Histogram of packets by their size
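The slides do not include the demo source, so the following is only a sketch of the bucketing logic such a histogram typically uses: one slot per power of two, indexed by floor(log2(length)). In an eBPF program the array would be a map updated per packet. Here it is a plain array, and `NUM_SLOTS`, `size_slot` and `record_packet` are hypothetical names.

```c
#include <stdint.h>

#define NUM_SLOTS 16  /* assumption: the last slot collects everything >= 32768 bytes */

/* Slot index for a packet length: floor(log2(len)), with 0 and 1 both in slot 0. */
unsigned size_slot(uint32_t len)
{
    unsigned slot = 0;

    while (len > 1) {
        len >>= 1;
        slot++;
    }
    return slot < NUM_SLOTS ? slot : NUM_SLOTS - 1;
}

/* Count one packet of the given length. */
void record_packet(uint64_t hist[NUM_SLOTS], uint32_t len)
{
    hist[size_slot(len)]++;
}
```

With this layout a 64-byte packet and a 100-byte packet both land in slot 6 (64 to 127 bytes), and a 1500-byte packet lands in slot 10 (1024 to 2047 bytes).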
Example #2
Custom Filtering
Drop egress ARP Requests for specific Target Addresses
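The decision this filter has to make can be sketched in plain C, independent of how it is attached. The sketch assumes untagged Ethernet frames carrying IPv4-over-Ethernet ARP. `struct ipv4` and `drop_arp_request` are illustrative names. In a tc eBPF program the blocked addresses would more likely live in a map than in an array, which the slides do not show.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ipv4 {
    uint8_t b[4];
};

/* Byte offsets inside an untagged Ethernet frame carrying ARP. */
enum {
    OFF_ETHERTYPE = 12,  /* 0x0806 for ARP                   */
    OFF_HLEN      = 18,  /* hardware address length, 6       */
    OFF_PLEN      = 19,  /* protocol address length, 4       */
    OFF_OPER      = 20,  /* 1 = request, 2 = reply           */
    OFF_TPA       = 38,  /* target protocol (IPv4) address   */
    ARP_FRAME_LEN = 42,
};

/* Returns true if the frame is an ARP request whose target address is blocked. */
bool drop_arp_request(const uint8_t *frame, size_t len,
                      const struct ipv4 *blocked, size_t n_blocked)
{
    if (len < ARP_FRAME_LEN)
        return false;
    if (frame[OFF_ETHERTYPE] != 0x08 || frame[OFF_ETHERTYPE + 1] != 0x06)
        return false;  /* not ARP */
    if (frame[OFF_HLEN] != 6 || frame[OFF_PLEN] != 4)
        return false;  /* the offsets above hold only for Ethernet/IPv4 ARP */
    if (frame[OFF_OPER] != 0x00 || frame[OFF_OPER + 1] != 0x01)
        return false;  /* not a request */

    for (size_t i = 0; i < n_blocked; i++)
        if (memcmp(frame + OFF_TPA, blocked[i].b, 4) == 0)
            return true;
    return false;
}
```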
Example #3
Custom Network Function
Network Load Balancer
Example #3 - Topology
● Test Machine
○ 10.33.33.10, 10.33.33.11, 10.33.33.12, 10.33.33.13, 10.33.33.14
● Load Balancer
○ 192.0.2.50 dev multigre0
○ Set GRE tunnel destination by flow hash
● Server1
○ VIP 192.0.2.50
○ 10.50.1.9
● Server2
○ VIP 192.0.2.50
○ 10.50.2.9
● Packet from the test machine
○ Src: 10.33.33.10, Dst: 192.0.2.50
● Same packet, GRE-encapsulated towards Server1
○ Outer Src: 10.50.1.1, Dst: 10.50.1.9
○ Inner Src: 10.33.33.10, Dst: 192.0.2.50
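The core of "set GRE tunnel destination by flow hash" is a stable mapping from a flow to one of the servers. The slides do not show the demo code, so this is only a user-space sketch of that mapping. `struct flow`, `flow_hash` and `pick_backend` are made-up names, and FNV-1a stands in for whatever hash is really used. An eBPF program could instead take the kernel's flow hash (see BPF_FUNC_get_hash_recalc) and apply the result with BPF_FUNC_skb_set_tunnel_key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 5-tuple identifying a flow; addresses are host-order IPv4. */
struct flow {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* FNV-1a over the 5-tuple bytes. Assumption: any stable hash would do here. */
uint32_t flow_hash(const struct flow *f)
{
    const uint8_t key[13] = {
        (uint8_t)(f->saddr >> 24), (uint8_t)(f->saddr >> 16),
        (uint8_t)(f->saddr >> 8),  (uint8_t)f->saddr,
        (uint8_t)(f->daddr >> 24), (uint8_t)(f->daddr >> 16),
        (uint8_t)(f->daddr >> 8),  (uint8_t)f->daddr,
        (uint8_t)(f->sport >> 8),  (uint8_t)f->sport,
        (uint8_t)(f->dport >> 8),  (uint8_t)f->dport,
        f->proto,
    };
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof(key); i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h;
}

/* Pick the tunnel destination for a flow: same flow, same backend. */
uint32_t pick_backend(const struct flow *f, const uint32_t *backends, size_t n)
{
    assert(n > 0);
    return backends[flow_hash(f) % n];
}
```

With the topology's two servers, `backends` would hold 10.50.1.9 and 10.50.2.9. Because the hash depends only on the 5-tuple, every packet of a given connection is tunneled to the same server.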
Further Topics
● bpfilter
● Open vSwitch eBPF datapath
● XDP
● Hardware Offloads
● Tracing / Profiling
Thank You!
