High-Performance I/O using io_uring via osmo_io

Intro

Who is speaking to you?

  • Developing system-level software on Linux for > 25 years

  • coincidentally a CCC member for even longer than that

  • former Linux kernel networking hacker

    • left that field about 15 years ago

    • not involved in the development of any of the technologies presented

  • developing FOSS software implementing cellular protocol stacks ever since

Overview

  • context

  • traditional I/O Models

  • problem statement

  • io_uring introduction

  • eBPF

  • XDP

Audience

This is a fairly technical talk. It assumes some general background in

  • computer networking

  • operating system architecture, specifically Linux

  • the system call boundary between kernel and processes

  • BSD / POSIX APIs for developing networked applications

You’ve been warned!

Context

  • programs doing network I/O use the BSD sockets interface, designed ~40 years ago

  • network speeds have increased from kilobits to gigabits

  • CPUs, RAM, etc also got faster - but it’s not all about throughput

  • packet sizes have not really scaled

    • MTUs have gone from 500 to 1500, only occasionally 9000

    • lots of packets are inherently smaller than MTU anyway

  • there is a fixed overhead processing each packet

Classic network stack / sockets

how we’ve done networking since BSD sockets were invented:

  • kernel receives packet from NIC

  • kernel processes packet layer by layer (Ethernet, IP, TCP, …​)

  • various sub-systems like tunnels, traffic shaping, IPsec, NAT, packet filter

  • eventually, data from the packet gets enqueued to the socket receive queue

  • userspace process gets woken up

  • data gets copied to userspace process

Classic network stack / sockets

  • this all worked fine with modem speeds and up to Gigabit Ethernet.

  • higher wire speeds mean higher pps (packets per second) rates

  • amount of software processing per packet is high

  • touching lots of sub-systems means lots of cache misses, branches, etc.

  • we usually copy data several times while packet is processed

  • system call boundary is expensive and adds latency

  • scheduler can schedule other tasks; caches no longer hot

Bandwidth vs pps

Due to the fixed overhead of processing a packet

  • most people outside [software] networking think in bandwidth (gigabits)

  • people working on networking software think in packets-per-second

bits per second    bytes per second    worst-case packets per second
10 Mbps            1.25 MByte/s        14.8 kpps
100 Mbps           12.5 MByte/s        148 kpps
1 Gbps             125 MByte/s         1.48 Mpps
10 Gbps            1.25 GByte/s        14.8 Mpps
100 Gbps           12.5 GByte/s        148 Mpps
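
For illustration, a quick sanity check of the "worst-case" column (the helper function is mine, purely illustrative): a minimum-size Ethernet frame of 64 bytes plus 20 bytes of preamble, SFD and inter-frame gap occupies 84 bytes, i.e. 672 bit times, on the wire.

#include <stdio.h>

/* worst case: minimum Ethernet frame (64 bytes) plus preamble, SFD and
 * inter-frame gap (20 bytes) = 84 bytes = 672 bit times on the wire */
static double worst_case_pps(double bits_per_second)
{
	return bits_per_second / ((64 + 20) * 8);
}

int main(void)
{
	/* prints ~1488095, i.e. the 1.48 Mpps from the table above */
	printf("%.0f\n", worst_case_pps(1e9));
	return 0;
}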

Host vs. Forwarding

Packet processing significantly differs between

  • hosts (end nodes sending or receiving data)

    • they mostly use socket based I/O from application programs

  • routers, bridges/switches, firewalls, load balancers, …

    • they mostly use packet processing inside the kernel from one NIC to another

Chapter 1: io_uring

In the context of this talk, I/O refers primarily to read/write operations on file descriptors, mostly sockets. This includes the various socket-specific APIs like

  • recv, recvfrom, recvmsg

  • send, sendto, sendmsg

Traditional POSIX / Linux I/O models

  • Blocking I/O

  • Non-blocking I/O

    • via select()

    • via poll()

    • via epoll()

    • via POSIX AIO

    • via libaio

Blocking I/O

  • the most naive way of doing I/O: calling thread is blocked until I/O operation completes

  • only works in very trivial applications, or with applications spawning threads for each I/O operation

non-blocking select()

  • only works for file descriptors < 1024

  • the fd-set needs to be re-initialized every time you enter select()

  • complexity of the inner loop is O(max_fd + 1)

    • relatively expensive even if you only have very few fd’s
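
A minimal sketch of such a select() loop (illustrative only, no error handling), showing why the fd-set has to be rebuilt and scanned on every iteration:

#include <sys/select.h>
#include <unistd.h>

/* sketch: fds[] / nfds are assumed to be set up elsewhere */
void select_loop(int *fds, int nfds)
{
	for (;;) {
		fd_set rfds;
		int i, maxfd = -1;

		/* the fd_set is modified by the kernel, so it must be
		 * re-initialized before every call */
		FD_ZERO(&rfds);
		for (i = 0; i < nfds; i++) {
			FD_SET(fds[i], &rfds);
			if (fds[i] > maxfd)
				maxfd = fds[i];
		}

		if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
			break;

		/* both kernel and application effectively scan 0..maxfd */
		for (i = 0; i < nfds; i++) {
			if (FD_ISSET(fds[i], &rfds)) {
				char buf[2048];
				read(fds[i], buf, sizeof(buf));
			}
		}
	}
}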

non-blocking poll()

  • no limit on maximum file descriptor number

  • the array of polled fds does not need to be re-initialized on every iteration

  • complexity of the inner loop is O(n)

  • kernel-side code is mostly shared with select
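
The corresponding poll() sketch: the pollfd array survives across iterations, but both kernel and application still walk all n entries.

#include <poll.h>
#include <unistd.h>

/* sketch: pfds[] is assumed to be set up once, e.g. pfds[i].events = POLLIN */
void poll_loop(struct pollfd *pfds, nfds_t nfds)
{
	for (;;) {
		nfds_t i;

		/* no re-initialization needed; the kernel only writes revents */
		if (poll(pfds, nfds, -1) < 0)
			break;

		/* still O(n): every entry has to be checked */
		for (i = 0; i < nfds; i++) {
			if (pfds[i].revents & POLLIN) {
				char buf[2048];
				read(pfds[i].fd, buf, sizeof(buf));
			}
		}
	}
}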

epoll()

  • Introduced in Linux 2.5.x (2.6.x stable series)

  • complexity of the inner loop is O(ready_fds)

    • so if you have many fds and only few are readable/writable, then it outperforms poll()
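
A sketch of the epoll equivalent: descriptors are registered once via epoll_ctl(), and epoll_wait() only returns the ones that are actually ready.

#include <sys/epoll.h>
#include <unistd.h>

/* sketch: register fds[] once, then only handle ready descriptors */
void epoll_loop(int *fds, int nfds)
{
	struct epoll_event ev, events[64];
	int epfd = epoll_create1(0);
	int i;

	for (i = 0; i < nfds; i++) {
		ev.events = EPOLLIN;
		ev.data.fd = fds[i];
		epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
	}

	for (;;) {
		/* O(ready_fds): only ready descriptors are reported */
		int n = epoll_wait(epfd, events, 64, -1);
		if (n < 0)
			break;
		for (i = 0; i < n; i++) {
			char buf[2048];
			read(events[i].data.fd, buf, sizeof(buf));
		}
	}
	close(epfd);
}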

POSIX AIO

  • aio_{read,write,fsync,error,return,suspend,cancel}() functions

  • notifies I/O completion using signals :(

  • mostly a userspace implementation in glibc using threads, so no performance improvement, just a portability wrapper!

libaio

  • doesn’t support sockets, so not relevant in this context

io_uring

  • first introduced (by Jens Axboe) to the Linux kernel in 5.1

  • liburing library provides convenience wrapper API around raw syscall

  • pro

    • literally, the best thing since sliced bread in terms of I/O efficiency

  • con

    • Google’s security team reported that 60% of Linux kernel exploits submitted to their bug bounty program in 2022 were exploits of io_uring vulnerabilities :(

io_uring basics

  • memory-mapped submission + completion queues between kernel and userspace program

    • application can add I/O requests to submission queue

    • kernel adds completed I/O events to completion queue

  • mapped nature avoids copying submission/completion events
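
A minimal liburing sketch of that flow (function name and simplifications are mine): one recv is described on the submission queue, and the kernel later posts its result on the completion queue.

#include <liburing.h>

/* sketch: submit a single recv on 'fd' and wait for its completion */
int uring_recv_once(int fd, void *buf, unsigned int len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int rc;

	io_uring_queue_init(8, &ring, 0);		/* SQ + CQ with 8 entries */

	sqe = io_uring_get_sqe(&ring);			/* grab a free SQ entry */
	io_uring_prep_recv(sqe, fd, buf, len, 0);	/* describe the recv() */
	io_uring_submit(&ring);				/* hand it to the kernel */

	io_uring_wait_cqe(&ring, &cqe);			/* wait for the completion */
	rc = cqe->res;					/* like recv()'s return value */
	io_uring_cqe_seen(&ring, cqe);			/* release the CQ entry */

	io_uring_queue_exit(&ring);
	return rc;
}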

io_uring visualization

uring diagram

io_uring basics

  • io_uring_setup() to create an instance

  • io_uring_enter() will notify kernel we have submitted work

  • completion can be notified via an eventfd

    • this affords excellent integration with existing select/poll I/O loops.

    • Some of your I/O can go via io_uring, while other I/O can use existing poll/select based mechanisms.
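
A sketch of that integration (assuming liburing’s io_uring_register_eventfd()): the kernel signals CQ activity on an eventfd, which is simply watched from a classic poll() loop.

#include <liburing.h>
#include <sys/eventfd.h>
#include <poll.h>
#include <unistd.h>
#include <stdint.h>

/* sketch: make io_uring completions visible to an existing poll() loop */
void poll_loop_with_uring(struct io_uring *ring)
{
	int efd = eventfd(0, EFD_NONBLOCK);

	/* the kernel signals efd whenever new completions are posted */
	io_uring_register_eventfd(ring, efd);

	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (pfd.revents & POLLIN) {
			uint64_t cnt;
			struct io_uring_cqe *cqe;

			read(efd, &cnt, sizeof(cnt));	/* clear the eventfd counter */
			while (io_uring_peek_cqe(ring, &cqe) == 0) {
				/* ... handle cqe->res / cqe->user_data ... */
				io_uring_cqe_seen(ring, cqe);
			}
		}
	}
}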

io_uring evolution

  • more and more opcodes have been added over time, supporting not just read and write, but for example also sendmsg and recvmsg, as needed for SCTP sockets.

  • all the features we need in libosmocore / osmo_io are present in kernel versions shipped by Debian 11 or later.
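
For example, a recvmsg on an SCTP socket could be queued roughly like this (sketch only; the msghdr and iovec must stay valid until the completion arrives):

#include <liburing.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* sketch: enqueue a recvmsg (e.g. on an SCTP socket) instead of a plain recv */
void queue_recvmsg(struct io_uring *ring, int fd, struct iovec *iov)
{
	static struct msghdr msgh;	/* must remain valid until completion */

	msgh.msg_iov = iov;		/* iov must remain valid as well */
	msgh.msg_iovlen = 1;

	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	io_uring_prep_recvmsg(sqe, fd, &msgh, 0);
	io_uring_sqe_set_data(sqe, &msgh);	/* user_data to match the CQE later */
	io_uring_submit(ring);
}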

Personal experience with io_uring

  • In Osmocom we often have applications that deal with lots of small I/O with thousands of peers

    • packets are always much smaller than the MTU (typically dozens to hundreds of bytes)

    • almost nothing is TCP, most traffic is either SCTP or UDP. If it’s TCP, the window is never full

  • we’ve converted many of our libraries and applications to io_uring

  • performance gains are incredible (CPU load reduction of 50..70%)

typical perf trace with recv/send/poll

perf bsc classic

CPU cycle reduction with io_uring

70 to 20 percent

Only the green graph is relevant here. Guess when io_uring was enabled…​

Early osmocom io_uring experience

  • I learned about io_uring when it was first being released in 2019

  • Always wanted to use it in a project, but in osmocom we rarely live at the bleeding edge of performance requirements

  • Finally, in 2021: gtp-load-gen

The path to io_uring in libosmocore

Problem in migrating osmo-* to io_uring?

  • the existing osmo_fd API (src/core/select.c) is an abstraction around notifying I/O availability, while the actual read/write is done by the application itself

  • io_uring is based around performing the actual I/O asynchronously

Result: New API / abstraction level needed. We called it osmo_io

Example: I/O using old osmo_fd

  static int my_fd_cb(struct osmo_fd *ofd, unsigned int what)
  {
    if (what & OSMO_FD_READ) {
      struct msgb *msg = msgb_alloc(...);
      int rc = read(ofd->fd, msg->tail, msgb_tailroom(msg));
      if (rc > 0)
        msgb_put(msg, rc);
      /* perform actual processing of received data in msg */
    }
    return 0;
  }

  struct osmo_fd ofd;
  fd = socket(...);
  connect(fd, ...);
  osmo_fd_setup(&ofd, fd, OSMO_FD_READ, &my_fd_cb, ...);
  osmo_fd_register(&ofd);

Example: I/O using new osmo_io

  static int my_read_cb(struct osmo_io_fd *iofd, int res, struct msgb *msg)
  {
    /* perform actual processing of received data in msg */
  }

  fd = socket(...);
  connect(fd, ...);
  struct osmo_io_ops ops = {
    .read_cb = my_read_cb,
  };
  struct osmo_io_fd *iofd = osmo_iofd_setup(ctx, fd, ..., &ops);
  osmo_iofd_register(iofd, -1);
  ...
  osmo_select_main()

osmo_io backends

  • osmo_io is not io_uring specific, but generic I/O abstraction

  • default back-end uses good old poll()

  • alternate io_uring back-end can be selected via environment variable LIBOSMO_IO_BACKEND=IO_URING

  • this allows users to give io_uring a try and switch back in an instant if needed.

io_uring summary

io_uring …​

  • optimizes the kernel/userspace interface for [not just] networking I/O

  • drastically reduces the number of syscalls needed

  • still retains the same semantics

    • the operations are still read/write, send/recv, sendmsg/recvmsg, sendto/recvfrom, …​

    • you just enqueue them for asynchronous processing

But:

  • Your packets still go through a lot of kernel code before ever appearing in your application.

  • this costs a lot of CPU cycles

So what can we do?

Chapter 2: XDP, AF_XDP and eBPF

Kernel features for high-speed packet processing

  • XDP (Express Data Path)

  • AF_XDP (Express Data Path Address Family)

  • eBPF (extended Berkeley Packet Filter)

So-called solutions like DPDK

  • remove the entire NIC from operating system (Linux) control

  • memory-map buffers etc. to userspace

  • perform packet processing in userspace program

  • achieve very high performance

  • isn’t really Linux anymore

    • not possible to use any Linux network features

XDP (Express Data Path) to the rescue

  • XDP was introduced in Linux kernel 4.8; AF_XDP sockets followed in 4.18

  • XDP hooks very early into in-kernel packet processing, just after the NIC driver has received the packet

  • a small program decides what happens to the packet:

    • XDP_DROP: drop the packet, or

    • XDP_PASS: pass on via the normal [slow] network stack path, or

    • XDP_TX: transmit [modified] packet on the same NIC it was received on

    • XDP_REDIRECT: redirect packet to another NIC or to a user space AF_XDP socket

XDP in the network stack

Netfilter packet flow

(Image credits: Jan Engelhardt / CC-BY-SA)

AF_XDP (XDP sockets)

  • operates on raw packets (like AF_PACKET)

  • memory mapped Rx + Tx rings for zero-copy

  • single socket can be used for Rx + Tx

  • Rx and Tx descriptors on ring point to memory

  • A Tx descriptor can point to the same memory as an Rx descriptor → zero-copy forwarding

    Note

    AF_XDP sockets are bound to a single RX queue of your NIC!
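
A hedged sketch of the Rx side, using the xsk helper API from libxdp (umem and socket setup, as well as re-filling the fill ring, are assumed to happen elsewhere):

#include <xdp/xsk.h>		/* libxdp's AF_XDP helper API */
#include <linux/if_xdp.h>

/* sketch: drain up to 'budget' packets from an already set-up AF_XDP socket */
unsigned int xsk_rx_burst(struct xsk_ring_cons *rx, void *umem_area,
			  unsigned int budget)
{
	__u32 idx_rx = 0;
	unsigned int i, rcvd;

	/* how many Rx descriptors are ready, starting at idx_rx? */
	rcvd = xsk_ring_cons__peek(rx, budget, &idx_rx);

	for (i = 0; i < rcvd; i++) {
		const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx_rx + i);
		/* the descriptor points into the shared umem: zero-copy access */
		void *pkt = xsk_umem__get_data(umem_area, desc->addr);

		/* ... process desc->len bytes at 'pkt' ... */
		(void)pkt;
	}

	/* hand the consumed descriptors back; their umem frames are typically
	 * put back onto the fill ring afterwards */
	xsk_ring_cons__release(rx, rcvd);
	return rcvd;
}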

How to write that small program for decision-making?

  • traditionally, it would have needed to be a kernel module

  • however, these days, the kernel has the eBPF virtual machine

  • eBPF code can be loaded/unloaded into the kernel at runtime, no need to compile modules

eBPF in-kernel Virtual Machine

  • an evolved version of the traditional BPF (Berkeley Packet Filter)

    • if you didn’t know traditional BPF: it’s what e.g. tcpdump uses to compile the capture filter into bytecode, which then executes in-kernel and passes only matching packets via AF_PACKET to the tcpdump program in userspace

  • not tied specifically to networking

  • the in-kernel verifier ensures the program has no endless loops, no invalid memory accesses, etc.

  • in-kernel JIT to compile eBPF bytecode to native executable code

  • Instruction Set Architecture now standardized in the IETF BPF/eBPF working group

eBPF program types

  • XDP is just one of many use cases of eBPF

  • socket filtering, kprobe, tc scheduler (action, classifier), performance events,

  • tracing: function entry, exit, modify return values

  • cgroups, light-weight tunnels,

  • TCP stream parser (e.g. kernel-assisted TLS and KCM)

  • …​

Chances are high that your system uses BPF already: bpftool prog lists 23 programs on my laptop.

But let’s focus on XDP here!

eBPF maps

  • key-value store for sharing data between (eBPF programs in) kernel and userspace

  • several types like hash, array, radix-tree, bloom filter

  • also includes _percpu variants for multi-processor efficiency

eBPF and XDP

  • decision making program is compiled for the eBPF VM

    • can be C programs compiled via clang -target bpf

  • loaded into the kernel and registered at XDP hook

  • decision-making program can select which packets should go to AF_XDP sockets

  • another program (a normal userland program, written in any language you choose) handles the traffic at the AF_XDP socket (a userspace sketch follows the example programs below)

eBPF XDP example program (NOP)

// SPDX-License-Identifier: GPL-2.0

#define KBUILD_MODNAME "xdp_dummy"
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_dummy_prog(struct xdp_md *ctx)
{
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

eBPF XDP example program

/* SPDX-License-Identifier: GPL-2.0 */

#include <linux/bpf.h>

#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__type(key, __u32);
	__type(value, __u32);
	__uint(max_entries, 64);
} xsks_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__type(key, __u32);
	__type(value, __u32);
	__uint(max_entries, 64);
} xdp_stats_map SEC(".maps");

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
    int index = ctx->rx_queue_index;
    __u32 *pkt_count;

    pkt_count = bpf_map_lookup_elem(&xdp_stats_map, &index);
    if (pkt_count) {

        /* We pass every other packet */
        if ((*pkt_count)++ & 1)
            return XDP_PASS;
    }

    /* A set entry here means that the corresponding queue_id
     * has an active AF_XDP socket bound to it. */
    if (bpf_map_lookup_elem(&xsks_map, &index))
        return bpf_redirect_map(&xsks_map, index, 0);

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
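
A possible userspace counterpart, sketched with libbpf (program, map and section names match the example above; the object file name is made up and error handling is largely omitted): load the object, attach it to the interface and put the AF_XDP socket’s fd into xsks_map at the queue index it is bound to.

#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include <net/if.h>

/* sketch: attach the XDP program above and register an AF_XDP socket fd
 * (xsk_fd, created elsewhere) for RX queue 'queue_id' of interface 'ifname' */
int attach_and_register(const char *ifname, int queue_id, int xsk_fd)
{
	int ifindex = if_nametoindex(ifname);
	struct bpf_object *obj;
	int prog_fd, map_fd;

	obj = bpf_object__open_file("xdp_sock_prog.o", NULL);
	if (!obj || bpf_object__load(obj))
		return -1;

	prog_fd = bpf_program__fd(bpf_object__find_program_by_name(obj, "xdp_sock_prog"));
	map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, "xsks_map"));

	/* hook the program into the XDP path of the interface */
	if (bpf_xdp_attach(ifindex, prog_fd, 0, NULL))
		return -1;

	/* bpf_redirect_map() in the eBPF program will now deliver packets
	 * arriving on this RX queue to the AF_XDP socket */
	return bpf_map_update_elem(map_fd, &queue_id, &xsk_fd, 0);
}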

Applications / Use cases for AF_XDP

  • device driver for DPDK, VPP and related userland networking frameworks

  • load-balancers (e.g. dnsdist)

  • software switches (experimental in Open vSwitch)

  • DoS protection filters

  • Intrusion Detection systems (eg. Suricata)

  • eupf, an eBPF + AF_XDP accelerated 5G UPF (User Plane Function)

Approaches for XDP

  • direct kernel fast-path (keep packet in the kernel)

    • receive packet at XDP eBPF program

    • determine where to re-transmit it / what changes to apply

    • enqueue on the transmit NIC

  • AF_XDP path to userspace

    • slower than kernel fast-path

    • still very fast [for userspace]

In-kernel processing vs. AF_XDP

  • decision can be made on each packet

  • example:

    • bulk of data plane traffic stays in kernel

    • occasional exceptions are handled in userspace

Pretty much the same logic as we do in GTP tunnel driver

  • known packets processed in kernel

  • all other packets passed to userspace

Further Reading

EOF

End of File