Note: Since I originally added this, I realised I was running with a 16 MSI-X vector limit (the default is 8). The effect is that the memory allocations would be half the values below and, since the majority of the time is spent in ring setup/teardown, I'd expect the times to be a bit over half those stated. The benefits in terms of percentage improvement will be similar.
To address this, I've changed the point at which the DMA is allocated and freed: instead of allocating the rings for all groups up front in i40e_start(), each ring's DMA is now allocated and freed individually as i40e_ring_start() and i40e_ring_stop() are called.
Based on this, I have produced the tables below to illustrate how many calls are made into the DMA allocation and free routines, and how long the command-line utility takes to run each command.
In each case the tests were started with an unconfigured NIC.
Aggregates

| Command | Alloc (before) | Free (before) | Time (before) | Alloc (after) | Free (after) | Time (after) |
|---|---|---|---|---|---|---|
| create-aggr (2 i/fs) | 3672064 | | 37s | 114752 | | 1.2s |
| create-if aggr | | | 0.1s | 229504 | 229504 | 3.4s |
| create-addr | | | 5s | 114752 | | 1.6s |
| set linkprop mtu | 3672064 | 3672074 | 10min | 114752 | 229504 | 5.2s |
| delete-aggr | | 3672064 | 20s | | 114752 | 0.6s |
VNICs

| Command | Alloc (before) | Free (before) | Time (before) | Alloc (after) | Free (after) | Time (after) |
|---|---|---|---|---|---|---|
| create-vnic | 1836032 | | 16s | 114752 | | 0.5s |
| create-if | | | 0.1s | | | 0.1s |
| create-addr | | | 0.5s | | | 0.5s |
In the "Before" case, the large allocation is a one-off: subsequent VNICs created on the same link would not incur it again. That does not change the fact, however, that a single VNIC would allocate ~16GB of DMA memory (with jumbo frames).
i40e

| Command | Alloc (before) | Free (before) | Time (before) | Alloc (after) | Free (after) | Time (after) |
|---|---|---|---|---|---|---|
| create-if | 1836032 | | 20s | 172128 | 114752 | 2.3s |
| create-addr | | | 0.5s | 57376 | | 1.0s |
| delete-if | | 1836032 | 10.3s | | 114752 | 0.6s |
So, with a single aggr you will use 1GB instead of 32GB, and with a single VNIC 500MB instead of 16GB. Obviously the usage will grow with each additional aggr/VNIC/address group created, per the tables above, but in the most common scenarios there are significant memory savings.
Also note that these memory usage estimates are for a single interface; multiply them up for dual- or quad-port PCI adapters.
There are also some notable timing improvements from this change in combination with #12957.