Flow dissector in BPF

Posted at 2018-12-17

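Introduction

This is the day 17 article of the Linux Advent Calendar 2018.

I looked into a feature called Flow dissector in BPF³, which has been merged² into the net-next branch¹. It replaces the flow dissector, the kernel's facility for extracting packet header information, with a BPF program, which is a very natural application of BPF.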

Flow dissector

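The flow dissector extracts flow information from a packet and helps other code access it. Probably the best-known use is the skb_get_hash function (or flow_hash_from_keys), which computes a hash value from the flow information⁴.

However, the set of extractable fields has grown considerably, so it can also be used more generally to pull specific fields out of protocol headers rather than just flow information. For example, the TTL of the IPv4 header is among the extractable fields.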
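The hashing use case mentioned above can be sketched in user space as follows. This is only an illustration: struct toy_flow, toy_flow_hash, and toy_pick_queue are hypothetical names, and FNV-1a is a stand-in chosen for brevity, not the hash that the kernel uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 5-tuple; the kernel's struct flow_keys holds more fields. */
struct toy_flow {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  ip_proto;
};

/* FNV-1a over a byte range (a stand-in, not the kernel's hash function). */
static uint32_t toy_fnv1a(uint32_t h, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hash field by field so that struct padding never enters the hash. */
static uint32_t toy_flow_hash(const struct toy_flow *f)
{
    uint32_t h = 2166136261u;

    h = toy_fnv1a(h, &f->src_ip, sizeof(f->src_ip));
    h = toy_fnv1a(h, &f->dst_ip, sizeof(f->dst_ip));
    h = toy_fnv1a(h, &f->src_port, sizeof(f->src_port));
    h = toy_fnv1a(h, &f->dst_port, sizeof(f->dst_port));
    h = toy_fnv1a(h, &f->ip_proto, sizeof(f->ip_proto));
    return h;
}

/* Packets of the same flow always map to the same queue index. */
static unsigned int toy_pick_queue(const struct toy_flow *f,
                                   unsigned int nqueues)
{
    return toy_flow_hash(f) % nqueues;
}
```

Because the queue index depends only on the flow fields, two packets with the same 5-tuple always land on the same queue, which is what prevents reordering within a flow.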

Data structures

One of the flow dissector's main data structures is struct flow_keys, which stores the extracted data. There are others as well, such as struct flow_dissector_key_eth_addrs for Ethernet MAC addresses and struct flow_dissector_key_arp for ARP header information.

For example, struct flow_dissector_key_arp is defined as follows.

struct flow_dissector_key_arp {
	__u32 sip;
	__u32 tip;
	__u8 op;
	unsigned char sha[ETH_ALEN];
	unsigned char tha[ETH_ALEN];
};

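To see which protocols are covered, enum flow_dissector_key_id is a good place to look. Quite a lot can be extracted, including ICMP, VLAN, MPLS, and the inner protocol headers of encapsulated packets.

Another important data structure is struct flow_dissector, which specifies what to extract. You select one or more enum flow_dissector_key_id values, and the corresponding flow information is extracted.

That said, most users appear to extract a predefined combination of keys⁵. For now, the Flower classifier (NET_CLS_FLOWER) seems to be the only one that builds its flow_dissector flexibly.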
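One way to picture "selecting keys" is a bitmask of requested key IDs plus a per-key offset into a caller-defined container. The sketch below is a user-space toy model under that assumption; every toy_* name is hypothetical, and the kernel's own helpers dissector_uses_key and skb_flow_dissector_target play the corresponding roles in the real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key IDs; enum flow_dissector_key_id is much longer. */
enum toy_key_id {
    TOY_KEY_IPV4_ADDRS,
    TOY_KEY_PORTS,
    TOY_KEY_VLAN,
    TOY_KEY_MAX,
};

struct toy_ipv4_addrs { uint32_t src, dst; };
struct toy_ports      { uint16_t src, dst; };

/* Which keys to extract, and where each one lives in the container. */
struct toy_dissector {
    unsigned int   used_keys;            /* bit i set => key i requested */
    unsigned short offset[TOY_KEY_MAX];  /* byte offset into the container */
};

/* A caller-defined container with room only for the requested keys. */
struct toy_container {
    struct toy_ports      ports;
    struct toy_ipv4_addrs addrs;
};

/* Register a key and record where its data should be stored. */
static void toy_dissector_add(struct toy_dissector *d, enum toy_key_id id,
                              size_t off)
{
    d->used_keys |= 1u << id;
    d->offset[id] = (unsigned short)off;
}

/* Was this key requested? */
static int toy_uses_key(const struct toy_dissector *d, enum toy_key_id id)
{
    return (d->used_keys & (1u << id)) != 0;
}

/* Locate the storage for a requested key inside the container. */
static void *toy_target(const struct toy_dissector *d, enum toy_key_id id,
                        void *container)
{
    return (char *)container + d->offset[id];
}
```

With this layout the extraction code never needs to know the container's type: it checks toy_uses_key for each key and writes through the pointer returned by toy_target.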

Functions

The core of the facility is the __skb_flow_dissect function, which performs the actual data extraction.

bool __skb_flow_dissect(const struct sk_buff *skb,
                        struct flow_dissector *flow_dissector,
                        void *target_container,
                        void *data, __be16 proto, int nhoff, int hlen,
                        unsigned int flags)

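Covering only the arguments that matter here: __skb_flow_dissect takes the flow information out of skb as directed by flow_dissector and returns it in target_container.

For example, the code that extracts the VLAN tag looks like this. It is certainly not the kind of code I would want to write myself.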

		if (dissector_uses_key(flow_dissector, dissector_vlan)) {
			key_vlan = skb_flow_dissector_target(flow_dissector,
							     dissector_vlan,
							     target_container);

			if (!vlan) {
				key_vlan->vlan_id = skb_vlan_tag_get_id(skb);
				key_vlan->vlan_priority =
					(skb_vlan_tag_get_prio(skb) >> VLAN_PRIO_SHIFT);
			} else {
				key_vlan->vlan_id = ntohs(vlan->h_vlan_TCI) &
					VLAN_VID_MASK;
				key_vlan->vlan_priority =
					(ntohs(vlan->h_vlan_TCI) &
					 VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
			}
			key_vlan->vlan_tpid = saved_vlan_tpid;
		}
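The else branch above, which decodes the tag from the packet itself, can be reproduced in user space as a minimal sketch. The TOY_* masks carry the same values as the kernel's VLAN_PRIO_MASK, VLAN_PRIO_SHIFT, and VLAN_VID_MASK; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Same values as VLAN_PRIO_MASK, VLAN_PRIO_SHIFT, and VLAN_VID_MASK. */
#define TOY_VLAN_PRIO_MASK  0xe000u
#define TOY_VLAN_PRIO_SHIFT 13
#define TOY_VLAN_VID_MASK   0x0fffu

struct toy_vlan_key {
    uint16_t vlan_id;
    uint8_t  vlan_priority;
};

/* Read a big-endian 16-bit value, which is what ntohs() yields in the kernel. */
static uint16_t toy_read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Split the 16-bit TCI into the 12-bit VLAN ID and the 3-bit priority. */
static struct toy_vlan_key toy_parse_tci(const uint8_t *tci_bytes)
{
    uint16_t tci = toy_read_be16(tci_bytes);
    struct toy_vlan_key key;

    key.vlan_id = tci & TOY_VLAN_VID_MASK;
    key.vlan_priority =
        (uint8_t)((tci & TOY_VLAN_PRIO_MASK) >> TOY_VLAN_PRIO_SHIFT);
    return key;
}
```

For instance, the TCI bytes 0xa0 0x0a decode to VLAN ID 10 with priority 5.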

Flow dissector in BPF

Overview

To get an overview of the feature, let's look at the description in the merge commit.

This patch series hardens the RX stack by allowing flow dissection in BPF,
as previously discussed [1]. Because of the rigorous checks of the BPF
verifier, this provides significant security guarantees. In particular, the
BPF flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
read outside of packet bounds, because all memory accesses are checked.
Also, with BPF the administrator can decide which protocols to support,
reducing potential attack surface. Rarely encountered protocols can be
excluded from dissection and the program can be updated without kernel
recompile or reboot if a bug is discovered.

(snip)

Performance Evaluation:
The in-kernel implementation was compared against the demo program from
patch 4 using the test in patch 5 with IPv4/UDP traffic over 10 seconds.
	$perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
		-t 10

In-kernel Dissector:
	__skb_flow_dissect overhead: 2.12%
	Total Packets: 3,272,597 (from output of ./test_flow_dissector)

BPF Dissector:
	__skb_flow_dissect overhead: 1.63%
	Total Packets: 3,232,356 (from output of ./test_flow_dissector)

No-op BPF Dissector:
	__skb_flow_dissect overhead: 1.52%
	Total Packets: 3,330,635 (from output of ./test_flow_dissector)

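The description claims the following benefits of moving to BPF.

Safe, thanks to the BPF verifier
- It cannot fall into an infinite loop
  - A vulnerability that caused an infinite loop during header parsing existed in the past
- It cannot mistakenly read outside the packet
The set of parsed protocols can be changed dynamically to only those that are needed
- This reduces the attack surface
No kernel recompile or reboot is needed even when a bug is found

The description also includes performance results. They show that the __skb_flow_dissect overhead is smaller with the BPF dissector than with the in-kernel C implementation (2.12% → 1.63%), while the total packet counts are roughly the same.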

Commits

flow_dissector: implements flow dissector BPF hook

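The first commit implements the BPF hook point, and the fourth commit provides the protocol header parsing code as part of the selftests.

Code walkthrough

struct bpf_flow_keys is the data structure that represents flow information inside the BPF program. As a comparison with struct flow_keys shows, it does not seem to cover all of the flow information that the flow dissector supports.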

struct bpf_flow_keys {
        __u16   nhoff;
        __u16   thoff;
        __u16   addr_proto;                     /* ETH_P_* of valid addrs */
        __u8    is_frag;
        __u8    is_first_frag;
        __u8    is_encap;
        __u8    ip_proto;
        __be16  n_proto;
        __be16  sport;
        __be16  dport;
        union {
                struct {
                        __be32  ipv4_src;
                        __be32  ipv4_dst;
                };
                struct {
                        __u32   ipv6_src[4];    /* in6_addr; network order */
                        __u32   ipv6_dst[4];    /* in6_addr; network order */
                };
        };
};

BPF hook

The following code was added to __skb_flow_dissect.

	rcu_read_lock();
	attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
		       : NULL;
	if (attached) {
		/* Note that even though the const qualifier is discarded
		 * throughout the execution of the BPF program, all changes(the
		 * control block) are reverted after the BPF program returns.
		 * Therefore, __skb_flow_dissect does not alter the skb.
		 */
		struct bpf_flow_keys flow_keys = {};
		struct bpf_skb_data_end cb_saved;
		struct bpf_skb_data_end *cb;
		u32 result;

		cb = (struct bpf_skb_data_end *)skb->cb;

		/* Save Control Block */
		memcpy(&cb_saved, cb, sizeof(cb_saved));
		memset(cb, 0, sizeof(cb_saved));

		/* Pass parameters to the BPF program */
		cb->qdisc_cb.flow_keys = &flow_keys;
		flow_keys.nhoff = nhoff;

		bpf_compute_data_pointers((struct sk_buff *)skb);
		result = BPF_PROG_RUN(attached, skb);

		/* Restore state */
		memcpy(cb, &cb_saved, sizeof(cb_saved));

		__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
					 target_container);
		key_control->thoff = min_t(u16, key_control->thoff, skb->len);
		rcu_read_unlock();
		return result == BPF_OK;
	}
	rcu_read_unlock();

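Roughly, the code runs the attached BPF program with BPF_PROG_RUN, and then __skb_flow_bpf_to_target repacks the information collected in struct bpf_flow_keys into target_container.

It feels as if the feature was bolted onto the existing flow dissector code, and there still seems to be room for optimization.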
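The save, clear, run, restore handling of the control block in the hook can be sketched in user space like this. It is a simplified model: struct toy_skb, toy_prog, and toy_run_with_clean_cb are hypothetical, and the sketch saves the whole scratch area, whereas the kernel code above saves only sizeof(struct bpf_skb_data_end) bytes.

```c
#include <string.h>

#define TOY_CB_SIZE 48 /* size of the scratch area assumed for this sketch */

struct toy_skb {
    unsigned char cb[TOY_CB_SIZE];
};

typedef int (*toy_prog_fn)(struct toy_skb *skb);

/* Run prog with a zeroed scratch area, then put the original bytes back. */
static int toy_run_with_clean_cb(struct toy_skb *skb, toy_prog_fn prog)
{
    unsigned char saved[TOY_CB_SIZE];
    int ret;

    memcpy(saved, skb->cb, sizeof(saved)); /* save */
    memset(skb->cb, 0, sizeof(skb->cb));   /* clear */
    ret = prog(skb);                       /* prog may scribble on cb */
    memcpy(skb->cb, saved, sizeof(saved)); /* restore */
    return ret;
}

/* Hypothetical program: fails if it sees stale data, then dirties cb. */
static int toy_prog(struct toy_skb *skb)
{
    for (size_t i = 0; i < sizeof(skb->cb); i++) {
        if (skb->cb[i] != 0)
            return -1;
    }
    memset(skb->cb, 0xff, sizeof(skb->cb));
    return 0;
}
```

From the caller's point of view the scratch area is unchanged after the call, which is the property the comment in the kernel code relies on when it casts away const.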

BPF program

In the parser implemented by flow_dissector: implements eBPF parser, the dissect function is the entry point of the BPF program.

SEC("dissect")
int dissect(struct __sk_buff *skb)
{
        if (!skb->vlan_present)
                return parse_eth_proto(skb, skb->protocol);
        else
                return parse_eth_proto(skb, skb->vlan_proto);
}

parse_eth_proto first looks at the Ethernet protocol type and picks the jump target according to the layer 3 protocol.

/* Dispatches on ETHERTYPE */
static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
{
        struct bpf_flow_keys *keys = skb->flow_keys;

        keys->n_proto = proto;
        switch (proto) {
        case bpf_htons(ETH_P_IP):
                bpf_tail_call(skb, &jmp_table, IP);
                break;
        case bpf_htons(ETH_P_IPV6):
                bpf_tail_call(skb, &jmp_table, IPV6);
                break;
        case bpf_htons(ETH_P_MPLS_MC):
        case bpf_htons(ETH_P_MPLS_UC):
                bpf_tail_call(skb, &jmp_table, MPLS);
                break;
        case bpf_htons(ETH_P_8021Q):
        case bpf_htons(ETH_P_8021AD):
                bpf_tail_call(skb, &jmp_table, VLAN);
                break;
        default:
                /* Protocol not supported */
                return BPF_DROP;
        }

        return BPF_DROP;
}

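If you read this as an ordinary C program, every case appears to end in return BPF_DROP, but that is not what happens. bpf_tail_call is a helper for performing a so-called tail call: when it succeeds, execution jumps to the target program and never comes back, so the break after it is not reached. The trailing return BPF_DROP runs only when the tail call fails, for example because the jump table slot is empty (see: bpf: introduce bpf_tail_call() helper [LWN.net]).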
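The control flow can be emulated in plain C with a table of function pointers. This is only a model of the behavior described above, not how BPF implements it: a real tail call replaces the running program instead of calling a function, and all toy_* names and return values here are hypothetical.

```c
#include <stddef.h>

enum { TOY_OK = 0, TOY_DROP = 2 }; /* values are arbitrary in this sketch */

enum toy_slot { TOY_IP, TOY_IPV6, TOY_SLOT_MAX };

typedef int (*toy_prog_fn)(void);

static int toy_parse_ip(void)   { return TOY_OK; }
static int toy_parse_ipv6(void) { return TOY_OK; }

/* Stand-in for the jump table map; a NULL slot means nothing is loaded. */
static toy_prog_fn toy_jmp_table[TOY_SLOT_MAX] = {
    toy_parse_ip,
    toy_parse_ipv6,
};

/*
 * Emulates a tail call: on success the callee's result becomes the result of
 * the caller and the rest of the caller never runs. On failure (empty slot)
 * execution simply continues after the macro.
 */
#define TOY_TAIL_CALL(slot)                       \
    do {                                          \
        if (toy_jmp_table[slot])                  \
            return toy_jmp_table[slot]();         \
    } while (0)

static int toy_dispatch(unsigned int ethertype)
{
    switch (ethertype) {
    case 0x0800: /* ETH_P_IP */
        TOY_TAIL_CALL(TOY_IP);
        break; /* reached only when the tail call failed */
    case 0x86dd: /* ETH_P_IPV6 */
        TOY_TAIL_CALL(TOY_IPV6);
        break;
    default:
        return TOY_DROP; /* protocol not supported */
    }

    return TOY_DROP;
}
```

In this model, setting a slot of toy_jmp_table to NULL makes the matching protocol fall through to the drop path without touching toy_dispatch itself.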

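bpf_tail_call uses a jump table (jmp_table) built as a BPF map to jump to the parser function prepared for each protocol. For IPv4 it jumps to the PROG(IP) function below. As expected, it works through the header step by step (parse_ip_proto and everything after it are omitted).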

PROG(IP)(struct __sk_buff *skb)
{
        void *data_end = (void *)(long)skb->data_end;
        struct bpf_flow_keys *keys = skb->flow_keys;
        void *data = (void *)(long)skb->data;
        struct iphdr *iph, _iph;
        bool done = false;

        iph = bpf_flow_dissect_get_header(skb, sizeof(*iph), &_iph);
        if (!iph)
                return BPF_DROP;

        /* IP header cannot be smaller than 20 bytes */
        if (iph->ihl < 5)
                return BPF_DROP;

        keys->addr_proto = ETH_P_IP;
        keys->ipv4_src = iph->saddr;
        keys->ipv4_dst = iph->daddr;

        keys->nhoff += iph->ihl << 2;
        if (data + keys->nhoff > data_end)
                return BPF_DROP;

        if (iph->frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
                keys->is_frag = true;
                if (iph->frag_off & bpf_htons(IP_OFFSET))
                        /* From second fragment on, packets do not have headers
                         * we can parse.
                         */
                        done = true;
                else
                        keys->is_first_frag = true;
        }

        if (done)
                return BPF_OK;

        return parse_ip_proto(skb, iph->protocol);
}
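The same checks can be exercised in user space against a raw byte buffer. The following is a minimal sketch, not the selftest code: struct toy_keys is a cut-down, hypothetical counterpart of struct bpf_flow_keys, explicit length checks take the place of the data and data_end comparison, and ip_proto is stored here directly because parse_ip_proto is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_IP_MF     0x2000u /* "more fragments" flag */
#define TOY_IP_OFFSET 0x1fffu /* fragment offset mask */

enum toy_verdict { TOY_PARSE_OK, TOY_PARSE_DROP };

struct toy_keys {
    uint16_t nhoff;         /* offset of the header to parse next */
    uint8_t  is_frag;
    uint8_t  is_first_frag;
    uint8_t  ip_proto;
    uint32_t ipv4_src;      /* kept in network byte order */
    uint32_t ipv4_dst;      /* kept in network byte order */
};

static enum toy_verdict toy_parse_ipv4(const uint8_t *data, size_t len,
                                       struct toy_keys *keys)
{
    const uint8_t *iph;
    uint16_t frag_off;
    uint8_t ihl;

    /* The fixed 20-byte header must lie entirely inside the packet. */
    if ((size_t)keys->nhoff + 20 > len)
        return TOY_PARSE_DROP;
    iph = data + keys->nhoff;

    /* IP header cannot be smaller than 20 bytes (ihl counts 32-bit words). */
    ihl = iph[0] & 0x0f;
    if (ihl < 5)
        return TOY_PARSE_DROP;

    memcpy(&keys->ipv4_src, iph + 12, 4);
    memcpy(&keys->ipv4_dst, iph + 16, 4);
    keys->ip_proto = iph[9];

    /* Skip the header, including options, and re-check the bounds. */
    keys->nhoff = (uint16_t)(keys->nhoff + (ihl << 2));
    if (keys->nhoff > len)
        return TOY_PARSE_DROP;

    /* Any fragment sets is_frag; only offset 0 is the first fragment. */
    frag_off = (uint16_t)((iph[6] << 8) | iph[7]);
    if (frag_off & (TOY_IP_MF | TOY_IP_OFFSET)) {
        keys->is_frag = 1;
        if (!(frag_off & TOY_IP_OFFSET))
            keys->is_first_frag = 1;
    }

    return TOY_PARSE_OK;
}
```

A caller would set keys->nhoff to the offset of the IPv4 header (14 for an untagged Ethernet frame) before calling, mirroring how the hook passes nhoff to the BPF program.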

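Conclusion

I read through the code of Flow dissector in BPF, which implements flow information extraction as a BPF program. Packet header parsing is something BPF is good at, so running it in BPF strikes me as a sound feature.

What puzzled me was the part of the patch description that says "Also, with BPF the administrator can decide which protocols to support, reducing potential attack surface. Rarely encountered protocols can be excluded from dissection and the program can be updated without kernel recompile or reboot if a bug is discovered." I could not figure out how a system administrator would easily reduce the set of supported protocols. It may be possible without recompiling the BPF program, but my BPF skills were not up to working that out. If I have time, I would like to look into it a bit more...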

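¹ The branch that collects the networking features to be merged in the next release.
² That is, it should be included in the Linux release after next (4.21 or 5.0).
³ This article writes BPF, but of course it means eBPF.
⁴ When the network stack distributes packets across CPUs or across multiple queues, packets of the same flow should go to the same CPU/queue to prevent reordering. A hash value is computed from the flow information and used as the basis for the distribution.
⁵ The preset flow_dissector instances are set up at boot time by init_default_flow_dissectors.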
