February 16/2002 -- revision 0.2.1:
February 10/2002 -- revision 0.2:
	some spell checking ;->
January 12/2002 -- revision 0.1
	This is still work in progress so may change.
	To keep up to date please watch this space.

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please

NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362       17     27622        7     6823
  128    758150     464364       21      9301       10     7738
  256    445632     774646       42     15507       21    12906
  512    232666     994445   241292     19147   241192     1062
 1024    119061    1000003   872519     19258   872511        0
 1440     85193    1000003   946576     19505   946569        0

"Ipps" stands for input packets per second.
"Tput" == packets out of total 1M that made it out.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
the load, the more times we couldn't clean up the rx ring.

When the NIC receives 890K packets/sec, only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from the table). So there is a
possibility that under low enough input, you get one poll call for each
input packet, caused by a single interrupt each time. And if the system
can't handle an interrupt per packet ratio of 1, then it will just have to

A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts or maybe events that send packets up
the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document is to move the receive processing only
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handler to
dev->poll() look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
	I) What is known as Clear-on-read (COR).
	When you read the status/event register, it clears everything!
	The natsemi and sunbmac NICs are known to do this.
	In this case your only choice is to move all to dev->poll().

	II) Clear-on-write (COW)
	 i) You clear the status by writing a 1 in the bit location you want.
		These are the majority of the NICs and work the best with NAPI.
		Put only receive events in dev->poll(); leave the rest in
		the old interrupt handler.
	 ii) Whatever you write in the status register clears everything ;->
		Can't seem to find any supported by Linux which do this. If
		someone knows such a chip, email us please.
		Move all to dev->poll().

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). We only get to know about such a packet
when the next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to more
discussion.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round robin format.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move all to dev->poll()). For
example link/MII and txcomplete continue functioning just the same old way.
This improves the latency of processing these events. It is also assumed that
the receive interrupt is the largest cause of noise. Note this might not
always be true
[according to Manfred Spraul, the winbond insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move all to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
	Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
	Puts the device in a state which allows for it to be added to the
	CPU polling list if it is up and running. You can look at this as
	the first half of netif_rx_schedule(dev) above; the second half
	being c) below.

c) __netif_rx_schedule(dev)
	Add the device to the poll list for this CPU, assuming that _prep above
	has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
	Called to reschedule polling for the device, specifically for some
	deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)

	Remove the interface from the CPU poll list: it must be in the poll list
	on the current CPU. This primitive is called by dev->poll() when
	it completes its work. The device cannot be out of the poll list at this
	call; if it is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets by the current CPU before yielding to the network
subsystem (so other devices can also get an opportunity to send to the stack).

dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
	Total number of packets cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver just sends,
if it can, the requested quantity of packets to the stack.

More on dev->poll() below, after the interrupt changes are explained.

2) registering dev->poll() method
=================================

dev->poll should be set in the dev->probe() method.

	/* two new additions */
	/* first register my poll method */
	/* next register my weight/quanta; can be overridden in /proc */
	dev->stop = my_close;

3) scheduling dev->poll()
=========================
This involves modifying the interrupt handler and the code
path which takes the packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt handler:

static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;

	int work_count = my_work_count;
	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE;	/* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED;	/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobufs) {
			make_rx_buffs_avail();
		}

		if (status & tx_related) {
			spin_lock(&tp->lock);
			/* process tx completions */
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

	} while (!(status & error) || more_work_to_be_done);
	return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------

static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE;	/* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED;	/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
/************************ start note *********************************/
		acknowledge_ints_ASAP();  /* don't ack rx and rxnobuff here */
/************************ end note *********************************/

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
/************************ start note *********************************/
		if ((status & rx_interrupt) || (status & rx_nobuffs)) {
			if (netif_rx_schedule_prep(dev)) {

				/* disable interrupts caused
				 * by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell system we have work to be done. */
				__netif_rx_schedule(dev);
			} else {
				printk("driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note *********************************/

		if (status & tx_related) {
			spin_lock(&tp->lock);
			/* process tx completions */
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/
	} while (!(status & error) || more_work_to_be_done(status));
/************************ end note *********************************/
	return IRQ_HANDLED;
}

---------------------------------------------------------------------

We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) b) a packet arriving and finding no DMA buffers available
(rxnobuff).
This also means acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI: in poll() and refill_rx_ring(),
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in the running state and
can be added to the core poll list. If we get a zero value,
we can _almost_ assume we are already added to the list (instead of not
running; the logic is based on the fact that you shouldn't get an interrupt
if not running). We rectify this by disabling rx and rxnobuf interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() may appear to have
disappeared. These functionalities are actually still around:
in fact, receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
==============================================

We need to convert the classical D Becker receive_packets(dev) to my_poll().

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by interrupt handler */
static void receive_packets (struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	int entry = tp->cur_rx % RX_RING_SIZE;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		unsigned int rx_size;
		unsigned int pkt_size;

		/* read size+status of next frame from DMA ring buffer */
		/* the numbers 16 and 4 are just examples */
		rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err (rx_status, dev, tp, ioaddr);
		}

		if (--rx_work_limit < 0)
			break;

		skb = dev_alloc_skb (pkt_size + 2);
		if (skb) {
			netif_rx (skb);
		} else {
			/* seems very driver specific ... some just pass
			   whatever is on the ring already. */
		}

		/* move to the next skb on the ring */
		entry = (++tp->cur_rx) % RX_RING_SIZE;
	}

	/* store current ring pointer state */

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring();
}

-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	int entry = tp->cur_rx % RX_RING_SIZE;
	/* maximum packets to send to the stack */
/************************ note note *********************************/
	int rx_work_limit = dev->quota;
	int received = 0;
/************************ end note note *********************************/
	do {	/* outer beginning loop starts here */

		clear_rx_status_register_bit();

		while (rx_ring_not_empty) {
			unsigned int rx_size;
			unsigned int pkt_size;

			/* read size+status of next frame from DMA ring buffer */
			/* the numbers 16 and 4 are just examples */
			rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
			rx_size = rx_status >> 16;
			pkt_size = rx_size - 4;

			if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
			    (!(rx_status & RxStatusOK))) {
				netdrv_rx_err (rx_status, dev, tp, ioaddr);
			}

/************************ note note *********************************/
			if (--rx_work_limit < 0) { /* we got packets, but no quota */
				/* store current ring pointer state */

				/* Refill the Rx ring buffers if they are needed */
				goto not_done;
			}
/********************** end note **********************************/

			skb = dev_alloc_skb (pkt_size + 2);
			if (skb) {
/************************ note note *********************************/
				netif_receive_skb (skb);
/********************** end note **********************************/
				received++;
			} else {
				/* seems very driver specific ... common is just pass
				   whatever is on the ring already. */
			}

			/* move to the next skb on the ring */
			entry = (++tp->cur_rx) % RX_RING_SIZE;
		}

		/* store current ring pointer state */

		/* Refill the Rx ring buffers if they are needed */
		refill_rx_ring(dev);

		/* no packets on ring; but new ones can arrive since we last
		   checked */
		status = read_interrupt_status_reg();
		if (rx status is not set) {
			/* If something arrives in this narrow window,
			   an interrupt will be generated */
			goto done;
		}
		/* done! at least that's what it looks like ;->
		   if new packets came in after our last check on status bits
		   they'll be caught by the while check and we go back and clear them
		   since we haven't exceeded our quota */
	} while (rx_status_is_set);

done:

/************************ note note *********************************/
	dev->quota -= received;
	*budget -= received;

	/* If RX ring is not full we are out of memory. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		goto oom;

	/* we are happy/done, no more packets on ring; put us back
	   to where we can start processing interrupts again */
	netif_rx_complete(dev);
	enable_rx_and_rxnobuf_ints();

	/* The last op happens after poll completion. Which means the following:
	 * 1. it can race with disabling irqs in irq handler (which are done to
	 *    schedule polls)
	 * 2. it can race with dis/enabling irqs in other poll threads
	 * 3. if an irq raised after the beginning of the outer beginning
	 *    loop (marked in the code above), it will be immediately
	 *    triggered here.
	 *
	 * Summarizing: the logic may result in some redundant irqs both
	 * due to races in masking and due to too late acking of already
	 * processed irqs. The good news: no events are ever lost.
	 */

	return 0;	/* done */

not_done:
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (!received)
		printk("received==0\n");

	dev->quota -= received;
	*budget -= received;
	return 1;	/* not_done */

oom:
	/* Start timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
	return 0;	/* we'll take it from here so tell core "done" */
}
/************************ End note note *********************************/

-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the oom condition.
5) A new outer do { } while loop has been added. This serves the purpose of
ensuring that if a new packet has come in after we are all set and done,
and we have not exceeded our quota, we continue sending packets up.

-----------------------------------------------------------
Poll timer code will need to do the following:

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If RX ring is not full we are still out of memory.
	   Restart the timer again. Else we re-add ourselves
	   to the master poll list.
	 */

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();
	else
		netif_rx_schedule(dev);	/* we are back on the poll list */

5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this; the top net layer takes
care of it.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC only send a pause packet when they run out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packet storm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note FC in itself is a good solution but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence it falls
under the same category as using NIC based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka lazy receiving that NAPI offers), and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but is not

APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated [assuming the status bit was not turned off].
Generally the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in tulip).
The corresponding interrupt bit (CSR7 in tulip) might be turned off (but
the CSR5 will continue to be turned on with new packet arrivals even if
we clear it the first time).
Very important is the fact that if we turn the interrupt bit on when
status is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done" and then went on to do a few other things, then when we enable
interrupts, there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
	do {
		while (ring_is_not_empty()) {
			if quota is exceeded: exit, no touching irq status/mask
		}
		/* No packets, but new can arrive while we are doing this */
		if (CSR5 is not set) {
			/* If something arrives in this narrow window here,
			 * where the comments are ;-> irq will be generated */
			unmask irqs;
			exit poll;
		}
	} while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you just finished grabbing all the packets from the rx ring .. you check if
the status bit says there are more packets just in ... it says none; you then
enable rx interrupts again; if a new packet just came in during this check,
we are counting that CSR5 will be set in that small window of opportunity
and that by re-enabling interrupts, we would actually trigger an interrupt
to register the new packet for processing.

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]

2) non-capable hardware

These do not generally respect level triggered IRQs. Normally,
irqs may be lost while being masked and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked
and re-enable polling (after seeing this new input).

restart_poll:
	while (ring_is_not_empty()) {
		if quota is exceeded: exit, not touching irq status/mask
	}
	enable_rx_interrupts()
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobufs()
		goto restart_poll
	}

Basically netif_rx_complete() removes us from the poll list, but because a
new packet might come in and, due to the race, never be caught,
we attempt to re-add ourselves to the poll list.

APPENDIX 3: Scheduling issues.
==============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt, by
putting them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It also has the effect that the priority of ksoftirqd
needs to be considered when running very CPU-intensive applications and
networking, to get the proper softirq/user balance. Increasing ksoftirqd
priority to 0 (eventually more) is reported to cure problems with low network
performance at high

Most used processes in a GIGE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

ftp://robur.slu.se/pub/Linux/net-development/NAPI/

--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik <jgarzik@pobox.com>