Department of Computer Science and A.I

Tel : (356) 3290-2130

Fax : (356) 320529

System Software Research Group

Achieving Gigabit Performance on Programmable Ethernet Network Interface Cards

By Wallace Wadge



Abstract

The shift from Fast Ethernet (100Mbit/s) to Gigabit Ethernet (1000Mbit/s) did not result in the expected tenfold increase in bandwidth at the application level. In this dissertation we make use of programmable Ethernet network cards running at gigabit speeds to identify present bottlenecks and also propose two new techniques aimed at boosting application performance to near-gigabit levels whilst maintaining full compatibility with existing systems. Furthermore, we investigate the performance of the PCI bus and the throughput "on the line" when using different frame sizes.

Conventional Systems

In a conventional system, the Network Interface Card (NIC) performs a very small part of the work required, leaving the rest for the host to handle. A typical NIC makes use of interrupts to signal to the host that new data has been received from the network. It is then the responsibility of the OS to issue the appropriate calls to fetch the data from the correct memory location, perform further processing on it (such as TCP protocol handling), and eventually copy the payload (that is, the useful data) to the processes requesting it. This approach suffers from a number of problems, some of which are detailed below:

Interrupt Driven

The whole system is interrupt driven, which means that a context switch between kernel mode and user mode is required for each packet received from the network (some newer network cards are able to coalesce several packets before triggering an interrupt), resulting in unnecessary overhead. Under heavy load, the host may find itself flooded with interrupt requests.

Inappropriate resource accounting

The NIC has no knowledge of the underlying processes and will generate an interrupt as soon as it receives data. As a result, the processing time required to handle the incoming data may be charged to processes which are not involved in the underlying transfers.

Lack of load shedding

If the host is unable to cope with the influx of incoming packets, it has no alternative but to drop them. Unfortunately, dropping packets cannot occur without some further host processing, which may trigger receiver livelock (that is, a system doing nothing except drop incoming packets).

Checksum calculation

Research suggests that a host system may spend as much as 15% of its total processing time calculating the checksum of each packet.
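To make the per-packet cost concrete, the kind of computation involved is the standard Internet checksum (the 16-bit one's-complement sum used by IP, TCP and UDP). A minimal Python sketch, touching every byte of the packet as the host must:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum over 16-bit words (RFC 1071 style).
    Note that every byte of the packet must be read, which is why
    checksumming consumes a noticeable share of host CPU time."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF
```

This is exactly the kind of byte-touching loop that programmable NICs can take over from the host, as the standard Tigon firmware does with its on-the-fly TCP/IP checksumming.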

Multiple memory copies

The running user application must explicitly perform a system call in order to retrieve the data received from the network. This has the effect of copying the internally queued data into the application's address space and finally de-allocating the memory reserved for the copy. Such a step is logically redundant, since there is never any need to maintain multiple copies of the same data; furthermore, copying data around wastes considerable processing resources.

There are other drawbacks in the conventional system, but these alone are enough to warrant a completely different implementation scheme. Research has shown that, under heavy load, a host system may spend as much as 90% (and in some cases more) of its total processing time handling communication details, leaving a mere fraction of processing time for other processes. Throughput also drops to just 409 Mbits/s at the application level (when utilizing gigabit NICs), a far cry from the 1 Gbit/s the new Ethernet standard proposes.

Some vendors have proposed increasing the payload size of each Ethernet packet from the standard 1.5K up to 9K. Since the header length of each Ethernet packet is constant, longer packets carry proportionally less overhead. This is equivalent to travelling on a highway where only mid-sized cars are allowed: at each toll station, where thousands of cars are processed by an attendant, every car must wait an inordinate amount of time to get through, and as traffic increases, so does the delay. The wait could be dramatically reduced if the passengers could pass through the toll station on buses, which would reduce the number of vehicles processed and shorten the overall delay for each passenger. Several independent tests have demonstrated that performance improves considerably with such jumbo frames, while host CPU load drops significantly.
However, many in the industry (of special note, the IEEE) are reluctant to change the tried-and-tested Ethernet standard just yet. Most agree that if such a change were to take place, the result should no longer be called Ethernet. In this dissertation we identify the major bottlenecks and attempt to get the best of both worlds, that is, maintain the standard 1.5K packet size whilst boosting performance.

Background

Hardware tools

New NICs are starting to appear on the market which, unlike their conventional counterparts, can be programmed extensively to perform tasks normally reserved for the host. In this dissertation we make use of such a programmable network card (more specifically, the Alteon Tigon NIC). The NIC contains two RISC-based embedded processors together with 1 MB of SRAM. It also contains two independent DMA engines (one for moving data from the NIC to the host and the other for transfers in the opposite direction), as well as other hardware-assist features such as byte-steering logic that allows transfers to start and end on any byte. The NIC runs a program which is downloaded at initialization time; this program is referred to as the "firmware" to avoid confusion with the host driver. There is no need for any PROM swapping or special utilities to update this firmware; the host merely maps the firmware program to the correct memory locations. It is up to the firmware to control the hardware, including manipulating buffers and buffer descriptors, controlling the DMA engines, calculating checksums, generating interrupts to the host and instructing the hardware to transmit or receive in a given format. The standard firmware supplied also performs some useful work, such as calculating TCP/IP checksums on the fly. In this dissertation, we modify or extend the supplied firmware either to provide additional functionality or to perform investigative tests.

PCI Bus

The NICs available make use of the standard PCI bus to effect all transfers to and from the host. The PCI standard defines two possible bus speeds (33 MHz and 66 MHz) and two bus widths (32-bit and 64-bit), all of which are supported by the Alteon Tigon. In this dissertation, however, we make use of the standard 33 MHz/32-bit variety. An important point regarding the PCI bus is that each transfer incurs an overhead cost which is not proportional to the number of bytes transferred; in other words, larger transfers are more efficient. At this bus rate, the maximum throughput possible is about 125 MB/s, which is only just above our 1 Gbit/s target.

Host interaction

The standard firmware uses a model similar to the Virtual Interface Architecture (VIA) standard to communicate with the host. Whenever the host wishes to send or receive data to or from the NIC, it uses a system of message descriptors organized in a ring, using doorbells to signal their presence. Interrupts can be generated by the NIC's firmware, but we mask them off to improve performance, relying instead on polling. The NIC assumes a flat memory model and therefore does not take into consideration the page-swapping techniques in use by the OS. Furthermore, processes usually make use of virtual addresses rather than the physical addressing the NIC expects. To circumvent this problem, the bigphysarea kernel extension together with the consequetive (sic) driver is used to pin down a chunk of memory at boot time. The OS sees this area as just another I/O device and does not perform any page swapping on it. Upon initialization, this memory is mapped into user space, making it available to user-level processes. Since it is also very easy to calculate physical memory locations within this consecutive region, we are able to communicate with the NIC without invoking the OS at any stage. In fact, the OS does not feature in any of our tests (with the exception of setting the link up during initialization). This also enables zero-copy communication, with data transfers of order O(message_length). Zero-copy communication allows a process to avoid memory-to-memory copies, hence boosting performance; however, it comes at a price, since such memory is no longer hidden and protected from other processes. Various techniques have been proposed for better control of this type of memory.
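The descriptor-ring-with-doorbell scheme described above can be sketched in a few lines. This is a minimal illustrative model, not the Tigon's actual descriptor layout: the field names (`addr`, `length`, `flags`) and the doorbell-as-published-index convention are assumptions made for clarity.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    addr: int        # physical address of the message buffer (hypothetical fields)
    length: int
    flags: int = 0

class DescriptorRing:
    """Minimal VIA-style producer/consumer ring: the host posts descriptors
    and 'rings the doorbell' (publishes its write index); the NIC firmware
    polls the doorbell and consumes descriptors in order."""
    def __init__(self, size: int):
        self.slots = [None] * size
        self.size = size
        self.prod = 0       # host write index
        self.cons = 0       # NIC read index
        self.doorbell = 0   # last index published to the NIC

    def post(self, desc: Descriptor) -> bool:
        """Host side: enqueue a descriptor, failing if the ring is full."""
        if (self.prod + 1) % self.size == self.cons:
            return False                  # ring full; one slot kept empty
        self.slots[self.prod] = desc
        self.prod = (self.prod + 1) % self.size
        self.doorbell = self.prod         # ring the doorbell
        return True

    def consume(self):
        """NIC side: dequeue the next pending descriptor, or None."""
        if self.cons == self.doorbell:
            return None                   # nothing pending
        desc = self.slots[self.cons]
        self.cons = (self.cons + 1) % self.size
        return desc
```

Because both sides only ever read the other's index, no locks or OS involvement are needed, which is what allows the tests in this dissertation to bypass the kernel entirely.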

Results

Theoretical Limits

A number of key areas were investigated, each of which could be a potential bottleneck. Firstly however, we calculated the maximum theoretical Ethernet throughput which could be obtained when using different frame sizes. These are partially reproduced below:

Ethernet Theoretical Maximum Efficiency

True frame size (bytes)   Real payload (bytes)   Max efficiency (Mbits/s)
 792                       754                    952.0202
1048                      1010                    963.7405
1536                      1498                    975.2604
9096                      9058                    995.8223

In the above table, the true frame size refers to what is actually transmitted, that is, the entire packet together with the interpacket gap and preambles. The real payload refers to our "useful" data, that is, the data which will eventually be used by the processes. Finally, the maximum efficiency refers to the best throughput we can ever hope to achieve. This implies that the maximum throughput achievable with standard 1.5K Ethernet frames is 975.26 Mbits/s, and around 995.82 Mbits/s with the extended 9K frames. Having determined our maximum limits, we next verify whether the hardware can live up to its claims by testing the NIC-to-NIC line speed.

NIC to NIC line speed

Overheads are involved in transferring normal packets from the host to the NIC and vice versa. However, since we need to test the maximum line speed using different payload lengths, we modify the firmware to do nothing except send and receive packets at the fastest rate possible, the packets being generated by the NIC itself. The NIC provides a hardware timer register which may be read and set by the firmware. We make use of this register to keep track of the elapsed time in order to calculate the bandwidth obtained. This is necessary because, if we were to rely on the host to establish the elapsed time, the delay for the signal to reach the host would distort our results. The firmware also keeps track of the number of packets transmitted and received. The elapsed time and the packet counts are read by the host at fixed intervals, which then outputs the results.
Line speed throughput for different payload sizes

Payload length (bytes)   Bandwidth (Mbits/s)
1498                     975.0565
9054                     995.1155

Results show that the hardware is able to achieve our expected theoretical limits without incurring any significant overhead costs (hardware limitations do, however, dictate the clock accuracy of all our timings). The maximum packet size "on the line" is set at 64K; however, CRC error detection becomes less effective beyond 12K. Furthermore, larger packet sizes incur longer latencies as well as an increased chance of corruption. Nevertheless, in our tests we have verified the throughput obtained when utilizing large frames as well as normal-length packets.

PCI test results

A memory read or write requires the use of the PCI bus. To access any data, the hardware requires a transaction to be set up, specifying the address at which to start reading or writing together with the number of bytes to be transferred. This setting-up involves an initial startup cost. Results show that the rate at which the PCI bus transfers data from the host to the NIC is not necessarily equal to the rate in the other direction.
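The effect of this fixed startup cost can be captured by a simple model: each DMA transaction pays a constant setup time before data moves at the bus's peak rate, so small transfers waste a larger share of the bus. The setup time and peak rate below are illustrative assumptions, not measured values from the dissertation:

```python
def effective_bandwidth_mbits(payload_bytes: int,
                              setup_us: float,
                              peak_mbytes_s: float = 125.0) -> float:
    """Fixed-cost model of one PCI DMA transaction.  Every transfer pays
    a constant setup time (setup_us, an assumed figure) before moving
    data at the bus's peak rate, so effective bandwidth rises with
    transfer size."""
    transfer_us = payload_bytes / peak_mbytes_s   # bytes / (bytes per us)
    total_us = setup_us + transfer_us
    return payload_bytes * 8 / total_us           # bits per us == Mbits/s
```

With zero setup cost the model returns the full 1000 Mbits/s regardless of size; with any positive setup cost, a 9K transfer always beats a 1.5K one, which is the shape of the measured NIC-to-host results below.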

Host to NIC

In this test, we modify the firmware on the NIC to perform a number of DMA transfers (hence utilizing the PCI bus) in quick succession. The host software issues the initial command together with the appropriate parameters (the DMA transfer length and the number of times to perform the test). When the NIC's firmware receives the command, it starts driving the DMA engines, each time transferring a number of bytes from the host's memory to its internal memory. When all the necessary transfers are complete, the NIC sends an event to the host together with the elapsed time (as before, the NIC is responsible for establishing the time elapsed). The host then calculates the results after averaging over multiple transfers.

NIC to Host

This is similar to the test above, except that we reverse the transfer direction. Again, it is up to the NIC to perform most of the work. Since continuous host polling for the completion event would itself consume PCI bus cycles, the host is set to check for the results only after a number of seconds have elapsed. This does not affect the results, since the NIC will by then have completed the test, with the elapsed time taken and stored, ready for retrieval.

Conclusions of test

Results clearly indicate that performance drops considerably for smaller transfers. For example, looking at the NIC-to-host results, the proposed 9K frames achieve more than 38.6% higher bandwidth than standard 1.5K packets.
NIC to Host

Payload length (bytes)   Bandwidth (Mbits/s)
1512                     665.6390
9072                     922.8537

Standard firmware tests

The tests mentioned previously only measure the performance of the PCI bus; the overhead of parsing an Ethernet header and queueing it for transmission is not captured. The next tests investigate the maximum throughput possible when transmitting complete Ethernet packets from one side to the other. Since we test the transmission and reception routines separately, we modify the firmware either to generate the packets (as we did to test line speed) or to receive them. In other words, we create a setup to send packets from the host to the NIC at the other end, or from the NIC to the host as the final destination.

Host to Host

With the above tests complete, we are able to test the complete communication chain: transmitting from host to host. The figure below summarizes all our results so far (with standard 1.5K packets). The bars are ordered as follows: the Ethernet theoretical limit; the card-to-card throughput test; the throughput obtained transferring data from the host to the NIC and vice versa; transmitting or receiving from a NIC using standard firmware; and host-to-host transfers using standard firmware.

New Techniques

It is evident that the real bottleneck in the communication cycle is the PCI bus. Normally the NIC performs a DMA transfer of the same size as the received (or transmitted) packet. We modify this behavior by introducing two new techniques, which we call packet fragmentation and packet coalescing, for the transmitting and receiving sides respectively.

Packet fragmentation

We have shown that, since PCI transfers involve an initial startup cost, smaller transfers are less efficient than larger ones. In this technique, the host constructs a single large "chunk" containing a number of smaller (1.5K) packets. The firmware then breaks this chunk up into a number of fragments and transmits them on the network. Since the maximum message size which may be transferred over the PCI bus is 64K, the host may transmit up to 43 packets in one go. Message descriptor flags also make it possible for the standard system and the fragmentation technique to co-exist.

The figure below shows that performance has been boosted right up to the Ethernet theoretical limit, mirroring the results we obtain when using large payload sizes. The host-to-NIC segment shows the throughput obtained over the PCI bus when transferring messages of different transaction lengths. The second graph shows the number of bytes transferred (again over the same PCI bus) as the number of fragments per chunk is varied. Notice that while the PCI throughput is able to reach 1 Gbit/s, the throughput obtained by fragmentation tapers off as it hits the theoretical limit.
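The host/firmware split described above can be sketched as follows. The 2-byte length-prefix framing is an illustrative choice for delimiting packets within the chunk, not the dissertation's actual descriptor layout:

```python
FRAME_PAYLOAD = 1498      # standard Ethernet payload per packet
MAX_CHUNK = 64 * 1024     # largest single PCI transfer

def build_chunk(packets):
    """Host side: pack standard-sized packets into one contiguous chunk
    so that the PCI bus sees a single large DMA transfer instead of
    one small transfer per packet."""
    chunk = bytearray()
    for p in packets:
        if len(chunk) + 2 + len(p) > MAX_CHUNK:
            break                                  # chunk is full
        chunk += len(p).to_bytes(2, "big") + p     # 2-byte length prefix
    return bytes(chunk)

def fragment_chunk(chunk):
    """Firmware side: walk the chunk and recover the individual packets
    for transmission on the wire."""
    packets, off = [], 0
    while off < len(chunk):
        n = int.from_bytes(chunk[off:off + 2], "big")
        off += 2
        packets.append(chunk[off:off + n])
        off += n
    return packets
```

With 1498-byte payloads and a 2-byte prefix, 43 packets (64,500 bytes) fit under the 64K limit while a 44th would not, matching the "up to 43 packets in one go" figure above.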

Packet Coalescing

For the receiving side we adopt the same rationale but a different technique. In packet coalescing we delay transfers from the NIC to the host until a number of packets have been received. The host simply allocates a larger chunk of memory and sets a special flag in the message descriptor (again, this is done to allow both systems to co-exist). The NIC then "joins" a number of received packets together and transfers them when there is no more available buffer space or when a timeout expires. For example, if the host wishes to receive 42 packets of 1.5K in length at one go (the maximum possible), it first allocates enough consecutive memory to hold them all. Next, the NIC is informed that packet coalescing is required via a flag in the message descriptor.

As a general rule of thumb, the more packets coalesced, the better the performance, since larger PCI transfers are used; however, this increases latency. We therefore modify the firmware further so as to trigger an alarm (via the onboard timer register) after a host-specified amount of time, this acting as a signal to transfer the pending data. For example, if 3 packets were received and there is still space for 10 more, timer expiration ensures that the host still gets those initial 3 packets without waiting for the rest. This system therefore lets the host increase performance whilst controlling latency. Once again, results show that the Ethernet theoretical limit is achieved whilst maintaining full compatibility with existing systems. The figure below shows the effect of packet coalescing when coupled with packet fragmentation.
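The flush-on-full-or-timeout policy above can be sketched as a small state machine. This is an illustrative model of the firmware's behaviour: the `now` parameter stands in for reads of the NIC's onboard timer register, and the class interface is an assumption made for clarity.

```python
class PacketCoalescer:
    """NIC-side sketch of packet coalescing: buffer received packets and
    hand them to the host as one large transfer when either the buffer
    fills or a host-specified timeout expires."""
    def __init__(self, max_packets: int, timeout: float):
        self.max_packets = max_packets   # host buffer capacity in packets
        self.timeout = timeout           # host-specified latency bound
        self.pending = []
        self.first_arrival = None

    def receive(self, packet: bytes, now: float):
        """Buffer one packet; return the full batch once the buffer fills."""
        if not self.pending:
            self.first_arrival = now     # start the latency clock
        self.pending.append(packet)
        if len(self.pending) >= self.max_packets:
            return self._flush()
        return None

    def poll_timer(self, now: float):
        """On timer expiry, flush a partial batch so a quiet link does
        not leave packets stranded on the NIC."""
        if self.pending and now - self.first_arrival >= self.timeout:
            return self._flush()
        return None

    def _flush(self):
        batch, self.pending = self.pending, []
        return batch
```

Raising `max_packets` trades latency for PCI efficiency, while `timeout` bounds the worst-case delay; this is exactly the performance/latency control the host is given above.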

Conclusions

This dissertation has verified that the Ethernet theoretical limits can indeed be reached with existing hardware. Furthermore, we have located one of the chief bottlenecks in the entire communication path: the PCI bus. Two techniques have been devised and implemented on programmable NICs to demonstrate that, with some minor changes, performance can be boosted dramatically without resorting to new Ethernet standards requiring changes to all intermediary routers. The effect of varying frame sizes (including the extended jumbo frames) on the physical line and on the PCI bus was also analysed, and zero-copy techniques were implemented.