Bringing MultiQueue to the Nanos Unikernel Network Stack

https://nanovms.com/dev/tutorials/bringing-multiqueue-to-nanos-unikernel-network-stack

We've made substantial changes to our networking stack. Some of those include numerous changes to our LWIP fork (which we're not even sure we can call LWIP anymore, as it's totally different).

This latest change allows multiple transmit and receive (tx/rx) queues. In short, this enables superior network performance when you have an instance with multiple vcpus.

Modern NICs, and by modern I mean ones in commodity servers from the past 10-15 years (you most assuredly have one on your system), have multiple rx/tx queues, which allows multiple threads to send and receive simultaneously. If you don't have multiple queues, only one thread can process incoming and outgoing network traffic. Generally speaking, you don't want more queues than the number of cores you have available.

So essentially the single queue model looks something like this:


    _______
    | nic |
    -------
_______  _______
| /|\ |  | rx  |
|  |  |  |  |  |
| tx  |  | \|/ |
-------  -------
    ________
    | cpu1 |
    --------
Single TX/RX Queue

and we made something that looks like this instead:


              _______
              | nic |
              -------
_______  _______    _______  _______
| /|\ |  | rx  |    | /|\ |  | rx  |
|  |  |  |  |  |    |  |  |  |  |  |
| tx  |  | \|/ |    | tx  |  | \|/ |
|_____|  |_____|    |_____|  |_____|
    ________            ________
    | cpu1 |            | cpu2 |
    --------            --------
Multiple TX/RX Queue

Just to be clear, Linux and other systems have had this for years, so this is more of a "Nanos is catching up" type of feature, but one that is important for scaling nonetheless. Again, if you are running on a t2.small it doesn't really matter.

I keep telling people that you can't just wave a magic wand and get superior performance. This is a prime example of the kind of work systems engineering entails. It doesn't really have anything to do with unikernels per se, but it's something you'll notice when comparing high-core-count Linux machines with your unikernel instance and wondering why they might be getting better throughput on the webservers.

So what is a queue, or, as it's more commonly called in the context of network interfaces, a ring, anyway?

One Ring to Rule Them All, One Ring to Find Them, One Ring to Bring Them All

Have you ever had to make a queue out of two stacks, or a stack out of two queues, on a whiteboard? You might remember that by building a queue from two stacks you inherently choose to make either enqueue or dequeue costly. You might be more familiar with the simple linked-list implementation. Like everything in software, there are many approaches. Well, there is another method, the ring buffer, and that is what these queues are.

One of the differences between a linked-list implementation and a ring buffer for a queue is space complexity, since the ring buffer has a fixed size. The tradeoff is that when the buffer is full you must either let the enqueue fail (e.g., drop packets in this case) or overwrite old data.

On the flip side, ring buffers are faster, especially when you know up front how much data you'll need to store at a given time.
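As a minimal sketch (plain Python, not the driver code), here is a fixed-capacity ring buffer that makes that full-buffer choice, drop versus overwrite, explicit:

```python
class RingBuffer:
    """Fixed-size ring queue. When full, either drop the new item
    or overwrite the oldest one, mirroring the packet-queue tradeoff."""

    def __init__(self, capacity, overwrite=False):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.overwrite = overwrite
        self.head = 0   # index of the next item to dequeue
        self.count = 0  # number of items currently stored

    def enqueue(self, item):
        if self.count == self.capacity:
            if not self.overwrite:
                return False  # "drop the packet"
            # advance head past the oldest item, freeing its slot
            self.head = (self.head + 1) % self.capacity
            self.count -= 1
        tail = (self.head + self.count) % self.capacity
        self.buf[tail] = item
        self.count += 1
        return True

    def dequeue(self):
        if self.count == 0:
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        return item
```

Note that enqueue and dequeue are both O(1) index arithmetic into preallocated storage, which is why NIC rings are built this way.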

Our Implementation

By default, the virtio-net driver uses as many queues as supported by the attached device. It is possible to override this behavior by specifying the "io-queues" configuration option in the manifest tuple corresponding to a given network interface. For example, the following snippet of an ops configuration file instructs the driver to use 2 queues for the first network interface:

"ManifestPassthrough": {
  "en1": {
    "io-queues": "2"
  }
}

Note: if you are testing locally with something like iperf, you'll want to ensure you have vhost enabled (you may need to 'modprobe vhost_net'). Vhost provides lower latency and much greater throughput. Why? It moves packets between guest and host via the host kernel, bypassing QEMU.
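As a sketch of what that looks like with QEMU directly (the interface name and queue count are illustrative; ops normally constructs this invocation for you), a tap backend with vhost and multiple queue pairs might be set up like so:

```shell
# Load the vhost-net module on the host if it isn't already present.
sudo modprobe vhost_net

# Illustrative QEMU flags (disk/kernel flags omitted): a tap backend
# with vhost=on and 4 queue pairs, plus a virtio-net device with
# multiqueue (mq=on) enabled. vectors is conventionally 2 * queues + 2:
# one MSI-X vector per rx and per tx queue, plus config and control.
qemu-system-x86_64 \
  -smp 4 -m 2G \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on,queues=4 \
  -device virtio-net-pci,netdev=net0,mq=on,vectors=10
```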

The number of queues used by the driver is always limited to the number of CPUs in the running instance (this behavior cannot be overridden by the "io-queues" option).

For optimization, each tx/rx queue is configured with an interrupt affinity such that different queues are served by different CPUs.

Locally, on the host you can see how many queues you have available by using ethtool like so:

eyberg@box:~$ ethtool -l eno2
Channel parameters for eno2:
Pre-set maximums:
RX:             8
TX:             8
Other:          n/a
Combined:       n/a
Current hardware settings:
RX:             7
TX:             4
Other:          n/a
Combined:       n/a

You can look at /proc/interrupts to see how they are being utilized by each thread (YMMV here depending on thread count):

eyberg@box:~$ cat /proc/interrupts | grep eno2 | awk '{print $28}'
eno2-0
eno2-1
eno2-2
eno2-3
eno2-4
eno2-5
eno2-6

Note: I purposely shortened the output here, but the columns in between are per-CPU interrupt counts.

You can then even watch traffic on each queue - this is useful to verify that you are indeed using what you think you are using:

watch -d -n 2 "ethtool -S eno2 | grep rx | grep packets | column"

Cloud Specific MultiQueue Settings

Now, outside of benchmarking, most of you probably don't have a strong reason to set all of this locally. So what happens when you deploy to the cloud?

The number of queues assigned on Google Cloud depends on the network interface type you are using. We support both virtio-net and gvNIC. If you are using virtio-net the equation is:

vcpu/number-of-nics

If you are using gvNIC (Google's in-house network adapter, which we support and which is used by the Arm instances) the default count is:

2(vcpu)/number-of-nics

Furthermore, virtio-net can have up to 32 queues, whereas gvNIC can only have up to 16.

On AWS, if you are using ENA it is 1 queue per vcpu, again up to a max of 32.
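Putting those defaults together, the per-NIC queue count works out roughly like this (a sketch of the formulas above with the caps applied; the actual count is of course also bounded by what the instance type supports):

```python
def default_queue_count(vcpus, nics, adapter):
    """Approximate default rx/tx queue count per NIC, per the formulas
    above. Caps: 32 for virtio-net and ENA, 16 for gvNIC."""
    if adapter == "virtio-net":   # GCP virtio-net: vcpus / number of NICs
        return min(vcpus // nics, 32)
    if adapter == "gvnic":        # GCP gvNIC: 2 * vcpus / number of NICs
        return min(2 * vcpus // nics, 16)
    if adapter == "ena":          # AWS ENA: 1 per vcpu
        return min(vcpus, 32)
    raise ValueError(f"unknown adapter: {adapter}")

print(default_queue_count(8, 2, "virtio-net"))  # 4
print(default_queue_count(8, 1, "gvnic"))       # 16
print(default_queue_count(48, 1, "ena"))        # 32
```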

So go get yourself some high vcpu instances, set your iperf cannons to stun and enjoy the new multi-queue support.
