Eval Generic netstack XDP patch
Authors: Jesper Dangaard Brouer
Version: 1.0.1
Date: 2017-04-24 Mon
Updated: 2017-06-08
Given XDP works at the driver level, developing and testing XDP programs requires access to specific NIC hardware… but this is about to change in kernel v4.12.
UPDATE (2017-06-08): The mentioned/evaluated patches have been accepted and will appear in kernel release v4.12
To ease developing and testing XDP programs, a generic netstack-XDP patch proposal (PATCH V3 and PATCH V4) has been posted. This allows attaching XDP programs to any net_device. If the driver doesn't support native XDP, the XDP eBPF program gets attached further inside the network stack. This is obviously slower and loses the XDP benefit of skipping the SKB allocation.
The generic netstack-XDP patchset is NOT targeted at high performance, but instead at making it easier to test and develop XDP programs.
That said, this does provide an excellent opportunity for comparing performance between NIC-level-XDP and netstack-XDP. It provides the ability to do what I call zoom-in-benchmarking of the network stack facilities that the NIC-XDP programs avoid, thus allowing us to quantify the cost of these facilities.
Special note for the KVM driver virtio_net: XDP support has been added to KVM via the virtio_net driver, but unfortunately it is a hassle to configure (given it requires disabling specific options, which are enabled by default).
Benchmark program
The XDP program used is called xdp_bench01_mem_access_cost and is available in the prototype kernel samples/bpf directory as xdp_bench01_mem_access_cost_kern.c and _user.c.
UPDATE (2017-06-08): The xdp_bench01_mem_access_cost program has gained an option called --skb-mode, which will force using "Generic XDP" even on interfaces that do support XDP natively. This is practical for doing the kind of comparison described in this document.
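For orientation, the core of such a benchmark program is small. The sketch below is my own illustration (not the actual xdp_bench01_mem_access_cost_kern.c source) of the two XDP_DROP variants being measured: dropping without touching packet data ("no_touch") versus dropping after reading the Ethernet header ("read"), which is what the --read option toggles:

/* Sketch of the two XDP_DROP variants: "no_touch" never dereferences
 * packet data, while "read" loads the Ethernet header and thereby
 * forces the first cache-line access. Illustrative only; the real
 * program lives in samples/bpf/. Compile with clang -target bpf.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("xdp_drop_no_touch")
int xdp_drop_no_touch(struct xdp_md *ctx)
{
        return XDP_DROP;                /* drop without reading packet data */
}

__section("xdp_drop_read")
int xdp_drop_read(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        if (data + sizeof(*eth) > data_end)
                return XDP_ABORTED;     /* verifier-required bounds check */

        if (eth->h_proto == 0xFFFF)     /* the read forces the cache-line miss */
                return XDP_ABORTED;     /* never true for real ethertypes */
        return XDP_DROP;
}

char _license[] __section("license") = "GPL";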
Baseline testing with NIC-level XDP
First establish a baseline for the performance of NIC-level XDP. This will serve as the baseline against which the patch is evaluated. The packet generator machine is running pktgen_sample03_burst_single_flow.sh, which implies these tests measure single-CPU RX performance, as the UDP flow will hit a single hardware RX-queue and thus only activate a single CPU.
Baseline with mlx5 on a Skylake CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz.
Network card (NIC) hardware: ConnectX-4 Dual 100Gbit/s, driver mlx5. Machines connected back-to-back with Ethernet flow-control disabled.
Dropping packets without touching the packet data (thus avoiding the cache-miss) has a huge effect on my system. The HW indicates via the PMU counter LLC-load-misses that DDIO is working on my system, but the L3-to-L1 cache-line miss is still causing the CPU to stall:
[jbrouer@skylake prototype-kernel]$
sudo ./xdp_bench01_mem_access_cost --action XDP_DROP --dev mlx5p2
XDP_action pps pps-human-readable mem
XDP_DROP 19851067 19,851,067 no_touch
XDP_DROP 19803663 19,803,663 no_touch (**used in examples**)
XDP_DROP 19795927 19,795,927 no_touch
XDP_DROP 19792161 19,792,161 no_touch
XDP_DROP 19792109 19,792,109 no_touch
I have previously posted patches for the mlx5 and mlx4 drivers that prefetch packet-data into L2 and avoid this cache stall, and with those I can basically achieve the same result as above, even when reading data. Mellanox have taken over these patches, but they are stalling on them, because on newer E5-26xx v4 CPUs this prefetch already happens in HW.
This is a more realistic XDP_DROP scenario where we touch packet data before dropping it (causing a cache miss from L3):
[jbrouer@skylake prototype-kernel]$
sudo ./xdp_bench01_mem_access_cost --action XDP_DROP --dev mlx5p2 --read
XDP_action pps pps-human-readable mem
XDP_DROP 11972515 11,972,515 read
XDP_DROP 12006685 12,006,685 read (**used in examples**)
XDP_DROP 12004640 12,004,640 read
XDP_DROP 11997837 11,997,837 read
XDP_DROP 11998538 11,998,538 read
^CInterrupted: Removing XDP program on ifindex:5 device:mlx5p2
An interesting observation and take-away from these two measurements is that this cache-miss costs approx 32 ns ((1/12006685 - 1/19803663) * 10^9).
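The pps-to-nanosecond conversions used throughout this document are easy to reproduce; here is a small helper sketch I have added for convenience, fed with the two "used in examples" measurements above:

/* Small helper (sketch): convert pps to nanoseconds per packet and
 * compute the per-packet cost difference.
 * Compile: gcc -O2 -o pps2ns pps2ns.c
 */
#include <stdio.h>

static double pps_to_ns(double pps)
{
        return 1e9 / pps;               /* nanoseconds spent per packet */
}

int main(void)
{
        double drop_no_touch = 19803663;        /* XDP_DROP, no_touch */
        double drop_read     = 12006685;        /* XDP_DROP, --read   */

        printf("no_touch       : %.1f ns/pkt\n", pps_to_ns(drop_no_touch));
        printf("read           : %.1f ns/pkt\n", pps_to_ns(drop_read));
        printf("cache-miss cost: %.1f ns\n",
               pps_to_ns(drop_read) - pps_to_ns(drop_no_touch));
        return 0;
}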
For the XDP_TX test to be correct, it is important to swap the MAC addresses, else the NIC HW will not transmit this to the wire (I verified this was actually TX'ed to the wire; a MAC-swap sketch is shown after the output below):
[jbrouer@skylake prototype-kernel]$
sudo ./xdp_bench01_mem_access_cost --action XDP_TX --dev mlx5p2 --read
XDP_action pps pps-human-readable mem
XDP_TX 10078899 10,078,899 read
XDP_TX 10109107 10,109,107 read
XDP_TX 10107393 10,107,393 read
XDP_TX 10107946 10,107,946 read
XDP_TX 10109020 10,109,020 read
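For reference, the MAC-address swap in an XDP_TX program typically looks like the minimal sketch below, written in the style of the kernel's samples/bpf examples (this is my illustration, not the actual xdp_bench01_mem_access_cost_kern.c source):

/* Sketch: swap src/dst MAC and bounce the frame back out the same
 * NIC with XDP_TX. Without the swap the NIC HW will not put the
 * frame on the wire, which would skew the benchmark.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("xdp_tx_swap")
int xdp_tx_swap_prog(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        __u8 tmp[ETH_ALEN];

        if (data + sizeof(*eth) > data_end)
                return XDP_ABORTED;     /* verifier-required bounds check */

        __builtin_memcpy(tmp,           eth->h_dest,   ETH_ALEN);
        __builtin_memcpy(eth->h_dest,   eth->h_source, ETH_ALEN);
        __builtin_memcpy(eth->h_source, tmp,           ETH_ALEN);

        return XDP_TX;                  /* transmit out the same interface */
}

char _license[] __section("license") = "GPL";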
Testing with network stack generic XDP
This test is based on PATCH V4, after adjusting the patch according to the email thread, and it was validated that XDP_TX can send packets on the wire.
Netstack XDP_DROP
As expected there is no difference between letting the XDP prog touch/read packet-data and "no_touch", because touching cannot be avoided, given the XDP/eBPF hook happens much later in the network stack. As can be seen in these benchmarks:
[jbrouer@skylake prototype-kernel]$
sudo ./xdp_bench01_mem_access_cost --action XDP_DROP --dev mlx5p2
XDP_action pps pps-human-readable mem
XDP_DROP 8438488 8,438,488 no_touch
XDP_DROP 8423788 8,423,788 no_touch
XDP_DROP 8425617 8,425,617 no_touch
XDP_DROP 8421396 8,421,396 no_touch
XDP_DROP 8432846 8,432,846 no_touch
^CInterrupted: Removing XDP program on ifindex:7 device:mlx5p2
The drop numbers are good for the netstack, but some distance from the 12,006,685 pps of XDP running in-the-NIC. Percentage-wise it looks like a big reduction of approx 30%. But the nanosec difference is "only" (1/12006685*10^9) - (1/8413417*10^9) = -35.57 ns.
[jbrouer@skylake prototype-kernel]$
sudo ./xdp_bench01_mem_access_cost --action XDP_DROP --dev mlx5p2 --read
XDP_action pps pps-human-readable mem
XDP_DROP 8415835 8,415,835 read
XDP_DROP 8413417 8,413,417 read
XDP_DROP 8236525 8,236,525 read
XDP_DROP 8410996 8,410,996 read
XDP_DROP 8412015 8,412,015 read
^CInterrupted: Removing XDP program on ifindex:7 device:mlx5p2
Do notice that reaching around 8Mpps is a very good result for the normal network stack, because 100Gbit/s with large MTU-size frames (1536 bytes due to overheads) corresponds to 8,138,020 pps ((100*10^9)/(1536*8)). The above test is with small 64-byte packets, and the generator sending 40Mpps (which can be tuned to 65Mpps).
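As a sanity check of the 8,138,020 pps figure, the packet rate at a given link speed and on-wire frame size can be computed with this small sketch (my own addition):

/* Sketch: packets per second at a given link speed for a given
 * on-wire frame size (here 1536 bytes including overheads).
 */
#include <stdio.h>

int main(void)
{
        double link_bps    = 100e9;     /* 100 Gbit/s */
        double frame_bytes = 1536;      /* MTU-size frame incl. overheads */
        double pps = link_bps / (frame_bytes * 8);

        printf("%.0f pps at %.0f Gbit/s with %.0f byte frames\n",
               pps, link_bps / 1e9, frame_bytes);
        return 0;
}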
The perf-stat below for this generic netstack-XDP_DROP test shows a high (2.01) insn per cycle, indicating that it is functioning fairly optimally, and we likely cannot find any "magic" trick, as the CPU does not seem to be stalling on anything:
$ sudo ~/perf stat -C7 -e L1-icache-load-misses -e cycles:k \
-e instructions:k -e cache-misses:k -e cache-references:k \
-e LLC-store-misses:k -e LLC-store -e LLC-load-misses:k \
-e LLC-load -r 4 sleep 1
Performance counter stats for 'CPU(s) 7' (4 runs):
349,830 L1-icache-load-misses ( +- 0.53% ) (33.31%)
3,989,134,732 cycles:k ( +- 0.06% ) (44.50%)
8,016,054,916 instructions:k # 2.01 insn per cycle (+- 0.02%) (55.62%)
31,843,544 cache-misses:k # 17.337 % of all cache refs (+- 0.04%) (66.71%)
183,671,576 cache-references:k ( +- 0.03% ) (66.71%)
1,190,204 LLC-store-misses ( +- 0.29% ) (66.71%)
17,376,723 LLC-store ( +- 0.04% ) (66.69%)
55,058 LLC-load-misses ( +- 0.07% ) (22.19%)
3,056,972 LLC-load ( +- 0.13% ) (22.19%)
Netstack XDP_TX
When testing XDP_TX it is important to verify that packets are actually transmitted. This is because the NIC HW can choose to drop invalid packets, which changes the performance profile and your results.
Generic netstack-XDP_TX was verified to actually hit the wire. The slowdown is higher than expected. Maybe we are stalling on the tailptr/doorbell update on TX???
[jbrouer@skylake prototype-kernel]$
sudo ./xdp_bench01_mem_access_cost --action XDP_TX --dev mlx5p2 --read
XDP_action pps pps-human-readable mem
XDP_TX 4577542 4,577,542 read
XDP_TX 4484903 4,484,903 read
XDP_TX 4571821 4,571,821 read
XDP_TX 4574512 4,574,512 read
XDP_TX 4574424 4,574,424 read (**use in examples**)
XDP_TX 4575712 4,575,712 read
XDP_TX 4505569 4,505,569 read
^CInterrupted: Removing XDP program on ifindex:7 device:mlx5p2
The perf-stat below for generic netstack-XDP_TX shows a lower (1.51) insn per cycle, indicating that the system is stalling on something:
$ sudo ~/perf stat -C7 -e L1-icache-load-misses -e cycles:k \
-e instructions:k -e cache-misses:k -e cache-references:k \
-e LLC-store-misses:k -e LLC-store -e LLC-load-misses:k \
-e LLC-load -r 4 sleep 1
Performance counter stats for 'CPU(s) 7' (4 runs):
518,261 L1-icache-load-misses ( +- 0.58% ) (33.30%)
3,989,223,247 cycles:k ( +- 0.01% ) (44.49%)
6,017,445,820 instructions:k # 1.51 insn per cycle (+- 0.31%) (55.62%)
26,931,778 cache-misses:k # 10.930 % of all cache refs (+- 0.05%) (66.71%)
246,406,110 cache-references:k ( +- 0.19% ) (66.71%)
1,317,850 LLC-store-misses ( +- 2.93% ) (66.71%)
30,028,771 LLC-store ( +- 0.88% ) (66.70%)
72,146 LLC-load-misses ( +- 0.22% ) (22.19%)
12,426,426 LLC-load ( +- 2.12% ) (22.19%)
Perf details for netstack XDP_TX
My first thought is that there is a high probability that this could be the tailptr/doorbell update. Looking at the perf report, something else lights up, which could still be the tailptr, as it stalls on the next lock operation:
Samples: 25K of event 'cycles', Event count (approx.): 25790301710
Overhead Symbol
+ 24.75% [k] mlx5e_handle_rx_cqe
+ 16.95% [k] __build_skb
+ 10.72% [k] mlx5e_xmit
+ 7.03% [k] build_skb
+ 5.31% [k] mlx5e_alloc_rx_wqe
+ 2.99% [k] kmem_cache_alloc
+ 2.65% [k] ___slab_alloc
+ 2.65% [k] _raw_spin_lock
+ 2.52% [k] bpf_prog_662b9cae761bf6ab
+ 2.37% [k] netif_receive_skb_internal
+ 1.92% [k] memcpy_erms
+ 1.73% [k] generic_xdp_tx
+ 1.69% [k] mlx5e_get_cqe
+ 1.40% [k] __netdev_pick_tx
+ 1.28% [k] __rcu_read_unlock
+ 1.19% [k] netdev_pick_tx
+ 1.02% [k] swiotlb_map_page
+ 1.00% [k] __cmpxchg_double_slab.isra.56
+ 0.99% [k] dev_gro_receive
+ 0.85% [k] __rcu_read_lock
+ 0.80% [k] napi_gro_receive
+ 0.79% [k] mlx5e_poll_rx_cq
+ 0.73% [k] mlx5e_post_rx_wqes
+ 0.71% [k] get_partial_node.isra.76
+ 0.70% [k] mlx5e_page_release
+ 0.62% [k] eth_type_trans
+ 0.56% [k] mlx5e_select_queue
0.49% [k] skb_gro_reset_offset
0.42% [k] skb_put
The packet rate of 4,574,424 pps translates to ~219 nanosec per packet (1/4574424*10^9).
The top contender is mlx5e_handle_rx_cqe() (24.75%), which initially didn't surprise me, given I know that this function (via inlining) will be the first to touch the packet (via is_first_ethertype_ip()), thus causing a cache-line miss. BUT something is wrong. Looking at perf-annotate, the cache-line miss is NOT occurring; instead 67.24% of the CPU time is spent on a refcnt increment (due to page_ref_inc(di->page) used for the page-recycle cache). Something is wrong, as 24.75% of 219 ns is 54 ns, which is too high even for an atomic refcnt inc. (Note: the cache-miss is actually avoided because the prefetch has time to work, due to this stall on the lock. Thus, removing the stall will bring back the cache-line stall.)
Inside __build_skb() (16.95%) there is 83.47% of the CPU time spent on "rep stos", which is clearing/memset-zeroing the SKB itself. Again something is wrong, as ((1/4574424*10^9)*(16.95/100)) = 37 ns is too high for clearing the SKB (time_bench_memset shows this optimally takes 10 ns).
Inside mlx5e_xmit() (10.72%) there is 17.96% spent on an sfence asm instruction. The cost of calling mlx5e_xmit(), (1/4574424*10^9)*(10.72/100) = 23.43 ns, might not be too off-target.
My guess is that this is caused by the tailptr/doorbell stall. By doing bulk/xmit_more we can likely reduce mlx5e_handle_rx_cqe (-12 ns as the cache-miss returns) and __build_skb (-27 ns). Thus, the performance target should lie around 5.6Mpps ((1/(218-12-27)*10^9) = 5586592).
Also notice that __cmpxchg_double_slab() shows that we are hitting the SLUB slow(er)-path.
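The what-if arithmetic behind the 5.6Mpps target can be reproduced with a small sketch (the -12 ns and -27 ns savings are the estimates from the analysis above):

/* Sketch: estimate the pps target if the two identified stalls were
 * removed (numbers taken from the XDP_TX analysis above).
 */
#include <stdio.h>

int main(void)
{
        double ns_per_pkt  = 218;       /* ~1e9 / 4574424 pps        */
        double save_rx_cqe = 12;        /* mlx5e_handle_rx_cqe stall */
        double save_build  = 27;        /* __build_skb memset stall  */
        double target_ns   = ns_per_pkt - save_rx_cqe - save_build;

        printf("target: %.0f ns/pkt => %.0f pps\n",
               target_ns, 1e9 / target_ns);
        return 0;
}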
Zooming into perf with Generic-netstack-XDP
Testing Generic-netstack-XDP_DROP again and looking closer at the perf reports. This will be interesting because we can deduce the cost of the different parts of the network stack, assuming there are no fake stalls due to the tailptr/doorbell (like in the XDP_TX case):
[jbrouer@skylake prototype-kernel]$
sudo ./xdp_bench01_mem_access_cost --action XDP_DROP --dev mlx5p2 --read
XDP_action pps pps-human-readable mem
XDP_DROP 8148835 8,148,835 read
XDP_DROP 8148972 8,148,972 read
XDP_DROP 8148962 8,148,962 read
XDP_DROP 8146856 8,146,856 read
XDP_DROP 8150026 8,150,026 read
XDP_DROP 8149734 8,149,734 read
XDP_DROP 8149646 8,149,646 read
For some unknown reason the Generic-XDP_DROP numbers are a bit lower than the numbers above. Using 8148972 pps (8,148,972) as our new baseline, the (averaged) cost per packet is 122.47 nanosec (1/8165032*10^9).
The difference to NIC-level-XDP is: (1/12006685*10^9)- (1/8148972*10^9) = -39.42 ns
Simply perf record for 30 sec, and find the CPU this was running on by adding --sort cpu to the report. The CPU output/column showed that NAPI was running on CPU 7:
sudo ~/perf record -aR -g sleep 30
sudo ~/perf report --no-children --sort cpu,comm,dso,symbol
Now we will drill down on CPU 7 and see what it is doing. We start with removing the "children" column, to view the overhead on a per-function basis.
I'm using this long perf report command to reduce the columns, print to stdout, and remove the call graph (I'll manually inspect the call-graph with the standard terminal user-interface (TUI)):
sudo ~/perf report --no-children --sort symbol \
--kallsyms=/proc/kallsyms -C7 --stdio -g none
Reduced output:
# Samples: 119K of event 'cycles'
# Event count (approx.): 119499252009
#
# Overhead Symbol
# ........ ..........................................
#
34.33% [k] mlx5e_handle_rx_cqe
10.36% [k] __build_skb
5.49% [k] build_skb
5.10% [k] page_frag_free
4.06% [k] bpf_prog_662b9cae761bf6ab
4.02% [k] kmem_cache_alloc
3.85% [k] netif_receive_skb_internal
3.72% [k] kmem_cache_free
3.69% [k] mlx5e_alloc_rx_wqe
2.91% [k] mlx5e_get_cqe
1.83% [k] napi_gro_receive
1.80% [k] __rcu_read_unlock
1.65% [k] skb_release_data
1.49% [k] dev_gro_receive
1.43% [k] skb_release_head_state
1.26% [k] mlx5e_post_rx_wqes
1.22% [k] mlx5e_page_release
1.21% [k] kfree_skb
1.19% [k] eth_type_trans
1.00% [k] __rcu_read_lock
0.84% [k] skb_release_all
0.83% [k] skb_free_head
0.81% [k] kfree_skbmem
0.80% [k] percpu_array_map_lookup_elem
0.79% [k] mlx5e_poll_rx_cq
0.79% [k] skb_put
0.77% [k] skb_gro_reset_offset
0.63% [k] swiotlb_sync_single
0.61% [k] swiotlb_sync_single_for_device
0.42% [k] swiotlb_sync_single_for_cpu
0.28% [k] net_rx_action
0.21% [k] bpf_map_lookup_elem
0.20% [k] mlx5e_napi_poll
0.11% [k] __do_softirq
0.06% [k] mlx5e_poll_tx_cq
0.02% [k] __raise_softirq_irqoff
Some memory observations are that we are hitting the fast path of the SLUB allocator (indicated by no function names from the slower path). The mlx5 driver page-recycler also has a 100% hit rate, verified by looking at ethtool -S stats and the mlx5 stat "cache_reuse", using my ethtool_stats.pl tool:
Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
Ethtool(mlx5p2) stat: 8179636 ( 8,179,636) <= rx3_cache_reuse /sec
Ethtool(mlx5p2) stat: 8179632 ( 8,179,632) <= rx3_packets /sec
Ethtool(mlx5p2) stat: 40657800 ( 40,657,800) <= rx_64_bytes_phy /sec
Ethtool(mlx5p2) stat: 490777805 ( 490,777,805) <= rx_bytes /sec
Ethtool(mlx5p2) stat: 2602103605 ( 2,602,103,605) <= rx_bytes_phy /sec
Ethtool(mlx5p2) stat: 8179636 ( 8,179,636) <= rx_cache_reuse /sec
Ethtool(mlx5p2) stat: 8179630 ( 8,179,630) <= rx_csum_complete /sec
Ethtool(mlx5p2) stat: 18736623 ( 18,736,623) <= rx_discards_phy /sec
Ethtool(mlx5p2) stat: 13741170 ( 13,741,170) <= rx_out_of_buffer /sec
Ethtool(mlx5p2) stat: 8179630 ( 8,179,630) <= rx_packets /sec
Ethtool(mlx5p2) stat: 40657861 ( 40,657,861) <= rx_packets_phy /sec
Ethtool(mlx5p2) stat: 2602122863 ( 2,602,122,863) <= rx_prio0_bytes /sec
Ethtool(mlx5p2) stat: 21921459 ( 21,921,459) <= rx_prio0_packets /sec
[...]
Knowing the cost per packet, 122.47 nanosec (1/8165032*10^9), we can extrapolate the ns used by each function call. Let's use a one-liner for calculating that for us:
sudo ~/perf report --no-children --sort symbol \
--kallsyms=/proc/kallsyms -C7 --stdio -g none | \
awk -F% 'BEGIN {base=(1/8165032*10^9)} \
/%/ {ns=base*($1/100); \
printf("%6.2f\% => %5.1f ns func:%s\n",$1,ns,$2);}'
Output:
34.33% => 42.0 ns func: [k] mlx5e_handle_rx_cqe
10.36% => 12.7 ns func: [k] __build_skb
5.49% => 6.7 ns func: [k] build_skb
5.10% => 6.2 ns func: [k] page_frag_free
4.06% => 5.0 ns func: [k] bpf_prog_662b9cae761bf6ab
4.02% => 4.9 ns func: [k] kmem_cache_alloc
3.85% => 4.7 ns func: [k] netif_receive_skb_internal
3.72% => 4.6 ns func: [k] kmem_cache_free
3.69% => 4.5 ns func: [k] mlx5e_alloc_rx_wqe
2.91% => 3.6 ns func: [k] mlx5e_get_cqe
1.83% => 2.2 ns func: [k] napi_gro_receive
1.80% => 2.2 ns func: [k] __rcu_read_unlock
1.65% => 2.0 ns func: [k] skb_release_data
1.49% => 1.8 ns func: [k] dev_gro_receive
1.43% => 1.8 ns func: [k] skb_release_head_state
1.26% => 1.5 ns func: [k] mlx5e_post_rx_wqes
1.22% => 1.5 ns func: [k] mlx5e_page_release
1.21% => 1.5 ns func: [k] kfree_skb
1.19% => 1.5 ns func: [k] eth_type_trans
1.00% => 1.2 ns func: [k] __rcu_read_lock
0.84% => 1.0 ns func: [k] skb_release_all
0.83% => 1.0 ns func: [k] skb_free_head
0.81% => 1.0 ns func: [k] kfree_skbmem
0.80% => 1.0 ns func: [k] percpu_array_map_lookup_elem
0.79% => 1.0 ns func: [k] mlx5e_poll_rx_cq
0.79% => 1.0 ns func: [k] skb_put
0.77% => 0.9 ns func: [k] skb_gro_reset_offset
0.63% => 0.8 ns func: [k] swiotlb_sync_single
0.61% => 0.7 ns func: [k] swiotlb_sync_single_for_device
0.42% => 0.5 ns func: [k] swiotlb_sync_single_for_cpu
0.28% => 0.3 ns func: [k] net_rx_action
0.21% => 0.3 ns func: [k] bpf_map_lookup_elem
0.20% => 0.2 ns func: [k] mlx5e_napi_poll
0.11% => 0.1 ns func: [k] __do_softirq
Top contender mlx5e_handle_rx_cqe
The top contender is mlx5e_handle_rx_cqe() in the driver code:
34.33% => 42.0 ns func: [k] mlx5e_handle_rx_cqe
When looking at the code/perf-annotate, do notice that several function calls have been inlined by the compiler. The thing that lights up (56.23% => 23.6 ns) in perf-annotate is touching/reading the packet data for the first time, which is reading the ethertype via is_first_ethertype_ip(), called via:
- which is called from mlx5e_handle_csum()
- which is called by mlx5e_build_rx_skb()
- which is called by mlx5e_complete_rx_cqe()
- which is called by mlx5e_handle_rx_cqe(), all inlined.
Notice this is_first_ethertype_ip() call is the reason why eth_type_trans() is not so hot in this driver.
Analyzing __build_skb and memset
The compiler chose not to inline __build_skb(), and what is primarily going on here is the memset clearing the SKB data, which gets optimized into a "rep stos" asm-operation, which is actually not optimal for objects of this size. Perf-annotate shows that 75.65% of the time of __build_skb() is spent on "rep stos %rax,%es:(%rdi)". Thus, extrapolating from the 12.7 ns, this corresponds to a cost of 9.6 ns (12.7*(75.65/100)).
Exactly how fast or slow this is depends very much on the CPU, but I've benchmarked different alternative approaches with time_bench_memset.c.
Memset benchmarks on this Skylake CPU show that a hand-optimized ASM-coded memset can reach 8 bytes per cycle, but this only saves approx 2.5 ns or 10 cycles. A more interesting approach would be if we could memset-clear a larger area, e.g. when bulk-allocating SKBs and detecting that they belong to the same page and are contiguous in memory. Benchmarks show that clearing larger areas is more efficient.
Table of memset "rep-stos" size vs bytes-per-cycle efficiency:
$ perl -ne 'while(/memset_(\d+) .* elem: (\d+) cycles/g)\
{my $bpc=$1/$2; \
printf("memset %5d bytes cost %4d cycles thus %4.1f bytes-per-cycle\n", \
$1, $2, $bpc);}' memset_test_dmesg
memset 32 bytes cost 4 cycles thus 8.0 bytes-per-cycle
memset 64 bytes cost 29 cycles thus 2.2 bytes-per-cycle
memset 128 bytes cost 29 cycles thus 4.4 bytes-per-cycle
memset 192 bytes cost 35 cycles thus 5.5 bytes-per-cycle
memset 199 bytes cost 35 cycles thus 5.7 bytes-per-cycle
memset 201 bytes cost 39 cycles thus 5.2 bytes-per-cycle
memset 204 bytes cost 40 cycles thus 5.1 bytes-per-cycle
memset 200 bytes cost 39 cycles thus 5.1 bytes-per-cycle
memset 208 bytes cost 39 cycles thus 5.3 bytes-per-cycle
memset 256 bytes cost 36 cycles thus 7.1 bytes-per-cycle
memset 512 bytes cost 40 cycles thus 12.8 bytes-per-cycle
memset 768 bytes cost 47 cycles thus 16.3 bytes-per-cycle
memset 1024 bytes cost 52 cycles thus 19.7 bytes-per-cycle
memset 2048 bytes cost 84 cycles thus 24.4 bytes-per-cycle
memset 4096 bytes cost 148 cycles thus 27.7 bytes-per-cycle
memset 8192 bytes cost 276 cycles thus 29.7 bytes-per-cycle
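For readers without the time_bench_memset.c kernel module, a rough user-space approximation of the same measurement can look like the sketch below (my own addition; results will differ somewhat from the in-kernel numbers):

/* Sketch: user-space approximation of the memset benchmark above.
 * Reports average ns per memset() call for a few object sizes.
 * Compile: gcc -O2 -o memset_bench memset_bench.c
 */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define LOOPS 10000000UL

static double bench_ns(char *buf, size_t size)
{
        struct timespec t0, t1;
        unsigned long i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < LOOPS; i++) {
                memset(buf, 0, size);
                /* compiler barrier: keep the memset from being optimized away */
                asm volatile("" : : "r" (buf) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / LOOPS;
}

int main(void)
{
        static char buf[8192];
        size_t sizes[] = { 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192 };
        size_t i;

        for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                printf("memset %5zu bytes: %6.2f ns/call\n",
                       sizes[i], bench_ns(buf, sizes[i]));
        return 0;
}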
I've already implemented the SLUB bulk-alloc API, and it could be extended with detecting whether objects are physically contiguous, allowing clearing of multiple objects at the same time. (Notice the SLUB alloc-side fast-path already delivers objects from the same page.)
Blaming the children
The nanosec numbers are getting so small that we might miss the effect of deep call chains. Thus, let's look at the perf report with the "children" column enabled:
Samples: 119K of event 'cycles', Event count (approx.): 119499252009
Children Self Symbol
+ 100.00% 0.00% [k] kthread
+ 100.00% 0.00% [k] ret_from_fork
+ 99.99% 0.01% [k] smpboot_thread_fn
+ 99.98% 0.01% [k] run_ksoftirqd
+ 99.94% 0.11% [k] __do_softirq
+ 99.78% 0.28% [k] net_rx_action
+ 99.41% 0.20% [k] mlx5e_napi_poll
+ 92.44% 0.79% [k] mlx5e_poll_rx_cq
+ 86.37% 34.33% [k] mlx5e_handle_rx_cqe
+ 29.40% 1.83% [k] napi_gro_receive
+ 24.50% 3.85% [k] netif_receive_skb_internal
+ 19.41% 5.49% [k] build_skb
+ 14.98% 1.21% [k] kfree_skb
+ 14.15% 10.36% [k] __build_skb
+ 9.43% 0.84% [k] skb_release_all
+ 6.97% 1.65% [k] skb_release_data
+ 5.38% 1.26% [k] mlx5e_post_rx_wqes
+ 5.10% 5.10% [k] page_frag_free
+ 4.86% 4.06% [k] bpf_prog_662b9cae761bf6ab
+ 4.30% 3.69% [k] mlx5e_alloc_rx_wqe
+ 4.30% 0.81% [k] kfree_skbmem
+ 4.02% 4.02% [k] kmem_cache_alloc
+ 3.72% 3.72% [k] kmem_cache_free
+ 2.91% 2.91% [k] mlx5e_get_cqe
Let's calculate the ns cost:
$ sudo ~/perf report --children --sort symbol \
--kallsyms=/proc/kallsyms -C7 --stdio -g none | \
awk -F% 'BEGIN {base=(1/8165032*10^9); \
print "Children => nanosec Self Symbol/fucntion\n";} \
/%/ {ns=base*($1/100); \
printf("%6.2f%s => %5.1f ns %s%s func:%s\n",$1,"%",ns,$2,"%",$3);}'
Children => nanosec Self Symbol/function
100.00% => 122.5 ns 0.00% func: [k] kthread
100.00% => 122.5 ns 0.00% func: [k] ret_from_fork
99.99% => 122.5 ns 0.01% func: [k] smpboot_thread_fn
99.98% => 122.4 ns 0.01% func: [k] run_ksoftirqd
99.94% => 122.4 ns 0.11% func: [k] __do_softirq
99.78% => 122.2 ns 0.28% func: [k] net_rx_action
99.41% => 121.8 ns 0.20% func: [k] mlx5e_napi_poll
92.44% => 113.2 ns 0.79% func: [k] mlx5e_poll_rx_cq
86.37% => 105.8 ns 34.33% func: [k] mlx5e_handle_rx_cqe
29.40% => 36.0 ns 1.83% func: [k] napi_gro_receive
24.50% => 30.0 ns 3.85% func: [k] netif_receive_skb_internal
19.41% => 23.8 ns 5.49% func: [k] build_skb
14.98% => 18.3 ns 1.21% func: [k] kfree_skb
14.15% => 17.3 ns 10.36% func: [k] __build_skb
9.43% => 11.5 ns 0.84% func: [k] skb_release_all
6.97% => 8.5 ns 1.65% func: [k] skb_release_data
5.38% => 6.6 ns 1.26% func: [k] mlx5e_post_rx_wqes
5.10% => 6.2 ns 5.10% func: [k] page_frag_free
4.86% => 6.0 ns 4.06% func: [k] bpf_prog_662b9cae761bf6ab
4.30% => 5.3 ns 3.69% func: [k] mlx5e_alloc_rx_wqe
4.30% => 5.3 ns 0.81% func: [k] kfree_skbmem
4.02% => 4.9 ns 4.02% func: [k] kmem_cache_alloc
3.72% => 4.6 ns 3.72% func: [k] kmem_cache_free
2.91% => 3.6 ns 2.91% func: [k] mlx5e_get_cqe
1.80% => 2.2 ns 1.80% func: [k] __rcu_read_unlock
1.49% => 1.8 ns 1.49% func: [k] dev_gro_receive
1.43% => 1.8 ns 1.43% func: [k] skb_release_head_state
1.22% => 1.5 ns 1.22% func: [k] mlx5e_page_release
1.19% => 1.5 ns 1.19% func: [k] eth_type_trans
1.00% => 1.2 ns 1.00% func: [k] __rcu_read_lock
0.84% => 1.0 ns 0.83% func: [k] skb_free_head
0.80% => 1.0 ns 0.80% func: [k] percpu_array_map_lookup_elem
0.79% => 1.0 ns 0.79% func: [k] skb_put
0.77% => 0.9 ns 0.77% func: [k] skb_gro_reset_offset
Interesting here is napi_gro_receive(), which is the base-call into the network stack: everything "under" this call costs 29.40% of the time, which translates to 36.0 ns. This 36 ns cost is interesting, as we calculated the difference to NIC-level-XDP to be 39 ns:
- The difference to NIC-level-XDP is:
- (1/12006685*10^9)- (1/8148972*10^9) = -39.42 ns
Freeing the SKB is summed up under kfree_skb() with 14.98% => 18.3 ns. In this case kfree_skb() should get attributed under napi_gro_receive(), due to the direct kfree_skb(skb) call in netif_receive_generic_xdp(). In other situations kfree_skb() happens during the DMA TX completion, but not here.
Creating, allocating and clearing the SKB all happens "under" the build_skb() call, which accounts for a collective 19.41% or 23.8 ns. The build_skb() call happens, in-driver, before calling napi_gro_receive().
Thus, one might be led to conclude that the overhead of the network stack is (23.8 ns + 36 ns) 59.8 ns, but something is not adding up, as this is higher than the calculated approx 40 ns difference to NIC-level-XDP.