Documentation/networking/tcp.txt

TCP protocol
============

Last updated: 3 June 2017

Contents
========

- Congestion control
- How the new TCP output machine [nyi] works

Congestion control
==================

The following variables in the tcp_sock are used for congestion control:
snd_cwnd                The size of the congestion window.
snd_ssthresh            Slow start threshold. We are in slow start if
                        snd_cwnd is less than this.
snd_cwnd_cnt            A counter used to slow down the rate of increase
                        once we exceed the slow start threshold.
snd_cwnd_clamp          The maximum size that snd_cwnd can grow to.
snd_cwnd_stamp          Timestamp of when the congestion window was last
                        validated.
snd_cwnd_used           Used as a high-water mark for how much of the
                        congestion window is in use. It is used to adjust
                        snd_cwnd down when the link is limited by the
                        application rather than the network.
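
The interplay of snd_cwnd, snd_ssthresh, snd_cwnd_cnt and snd_cwnd_clamp can be
illustrated with a minimal Reno-style sketch. It is a user-space model, not the
kernel's implementation: struct cc_state and cc_on_ack() are hypothetical
names, and the window is counted in segments. In slow start the window grows by
one segment per acknowledged segment until it reaches snd_ssthresh; after that,
snd_cwnd_cnt accumulates acknowledged segments and snd_cwnd grows by one only
once a full window's worth has been acknowledged. snd_cwnd_clamp caps the
result.

```c
#include <stdint.h>

/* Hypothetical stand-in for the congestion control fields of tcp_sock.
 * All values are in segments. */
struct cc_state {
	uint32_t snd_cwnd;
	uint32_t snd_ssthresh;
	uint32_t snd_cwnd_cnt;
	uint32_t snd_cwnd_clamp;
};

/* Update the window for an ACK that newly acknowledges 'acked' segments.
 * Reno-style sketch only; not the kernel code. */
static void cc_on_ack(struct cc_state *s, uint32_t acked)
{
	if (s->snd_cwnd < s->snd_ssthresh) {
		/* Slow start: +1 segment per acked segment, stopping at ssthresh. */
		uint32_t room = s->snd_ssthresh - s->snd_cwnd;
		uint32_t inc = acked < room ? acked : room;

		s->snd_cwnd += inc;
		acked -= inc;	/* leftover credit goes to congestion avoidance */
	}

	/* Congestion avoidance: +1 segment per full window of acked segments. */
	s->snd_cwnd_cnt += acked;
	while (s->snd_cwnd_cnt >= s->snd_cwnd) {
		s->snd_cwnd_cnt -= s->snd_cwnd;
		s->snd_cwnd++;
	}

	/* Never grow past the clamp. */
	if (s->snd_cwnd > s->snd_cwnd_clamp)
		s->snd_cwnd = s->snd_cwnd_clamp;
}
```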

As of 2.6.13, Linux supports pluggable congestion control algorithms.
A congestion control mechanism can be registered through functions in
tcp_cong.c. The functions used by the congestion control mechanism are
registered by passing a tcp_congestion_ops struct to
tcp_register_congestion_control. At a minimum, the congestion control
mechanism must provide a valid name and must implement either the ssthresh,
cong_avoid and undo_cwnd hooks or the "omnipotent" cong_control hook.
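
The minimum requirement can be expressed as a small check. The sketch below is
a user-space model, not the kernel's struct tcp_congestion_ops: the struct
cc_ops type, its simplified hook signatures, CC_NAME_MAX and cc_ops_valid() are
all hypothetical. It encodes the rule as stated above: a non-empty name that
fits an assumed name buffer, plus either the three classic hooks or
cong_control.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CC_NAME_MAX 16	/* hypothetical name buffer size, including the NUL */

struct cc_sock;	/* opaque stand-in for the socket */

/* Hypothetical, simplified hook table; the real signatures differ. */
struct cc_ops {
	const char *name;
	uint32_t (*ssthresh)(struct cc_sock *sk);
	void (*cong_avoid)(struct cc_sock *sk, uint32_t ack, uint32_t acked);
	uint32_t (*undo_cwnd)(struct cc_sock *sk);
	void (*cong_control)(struct cc_sock *sk);
};

/* Returns true if 'ops' meets the minimum described in the text. */
static bool cc_ops_valid(const struct cc_ops *ops)
{
	bool classic, omnipotent;

	if (ops->name == NULL || ops->name[0] == '\0' ||
	    strlen(ops->name) >= CC_NAME_MAX)
		return false;

	classic = ops->ssthresh != NULL && ops->cong_avoid != NULL &&
		  ops->undo_cwnd != NULL;
	omnipotent = ops->cong_control != NULL;
	return classic || omnipotent;
}

/* Example module that supplies only the "omnipotent" hook. */
static void demo_cong_control(struct cc_sock *sk)
{
	(void)sk;	/* a real hook would update the window here */
}

static const struct cc_ops demo_ops = {
	.name = "demo",
	.cong_control = demo_cong_control,
};
```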

Private data for a congestion control mechanism is stored in tp->ca_priv.
tcp_ca(tp) returns a pointer to this space. This is preallocated space, so it
is important to check that your private data fits in it; alternatively, the
data can be allocated elsewhere and a pointer to it stored here.
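
One way to make that size check automatic is a compile-time assertion. The
following C11 sketch models the idea in user space; struct conn_model,
CA_PRIV_SIZE, ca_priv() and struct my_ca are hypothetical stand-ins for
tp->ca_priv and tcp_ca(tp), and the size chosen here is arbitrary rather than
the kernel's value.

```c
#include <stdint.h>
#include <string.h>

#define CA_PRIV_SIZE 64	/* hypothetical size of the preallocated area */

/* Hypothetical per-connection state with a fixed private area.
 * uint64_t elements keep the area 8-byte aligned. */
struct conn_model {
	uint32_t snd_cwnd;
	uint64_t ca_priv[CA_PRIV_SIZE / sizeof(uint64_t)];
};

/* Analogue of tcp_ca(tp): hand the algorithm its private space. */
static inline void *ca_priv(struct conn_model *c)
{
	return c->ca_priv;
}

/* Private state of a hypothetical algorithm. */
struct my_ca {
	uint32_t last_max_cwnd;
	uint32_t epoch_start;
	uint32_t cnt;
};

/* Compilation fails if the private data outgrows the preallocated space. */
_Static_assert(sizeof(struct my_ca) <= CA_PRIV_SIZE,
	       "struct my_ca does not fit in ca_priv");

static void my_ca_init(struct conn_model *c)
{
	struct my_ca *ca = ca_priv(c);

	memset(ca, 0, sizeof(*ca));
}
```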

There are currently three kinds of congestion control algorithms. The
simplest ones are derived from TCP Reno (HighSpeed, Scalable) and just
provide an alternative congestion window calculation. More complex ones,
like BIC, look at other events to provide better heuristics. There are
also round-trip-time based algorithms, like Vegas and Westwood+.
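
As an illustration of an "alternative congestion window calculation", the
sketch below contrasts the decrease applied on loss. Reno halves the window; a
Scalable-style rule backs off by only one eighth. The function names are
hypothetical, and the 1/8 factor is an assumption taken from the Scalable TCP
proposal rather than read from the kernel module; both keep a floor of two
segments.

```c
#include <stdint.h>

/* Reno-style: halve the window on loss, but keep at least 2 segments. */
static uint32_t reno_ssthresh(uint32_t cwnd)
{
	uint32_t half = cwnd >> 1;

	return half > 2 ? half : 2;
}

/* Scalable-style (assumed factor 1/8): shrink the window by cwnd/8,
 * again keeping at least 2 segments. */
static uint32_t scalable_ssthresh(uint32_t cwnd)
{
	uint32_t reduced = cwnd - (cwnd >> 3);

	return reduced > 2 ? reduced : 2;
}
```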

Good TCP congestion control is a complex problem because the algorithm
needs to maintain both fairness and performance. Please review current
research and RFCs before developing new modules.

The default congestion control mechanism is chosen based on the
DEFAULT_TCP_CONG Kconfig parameter. If you really want a particular default
value, you can set it using the sysctl net.ipv4.tcp_congestion_control. The
module will be autoloaded if needed and you will get the expected algorithm.
If you ask for an unknown congestion control algorithm, the sysctl attempt
will fail.
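
The sysctl is exposed as /proc/sys/net/ipv4/tcp_congestion_control, so a
program can read the current default as a plain string. The helper below is a
hypothetical sketch that reads the first line of such a file and strips the
trailing newline; it returns -1 if the file cannot be read (for example on a
non-Linux system). Changing the value requires privileges and is normally done
with sysctl -w, so the sketch only reads.

```c
#include <stdio.h>
#include <string.h>

#define CC_SYSCTL_PATH "/proc/sys/net/ipv4/tcp_congestion_control"

/* Read the first line of a sysctl file into buf (without the newline).
 * Returns 0 on success, -1 if the file cannot be opened or read. */
static int read_sysctl_string(const char *path, char *buf, size_t len)
{
	FILE *f;

	if (len == 0)
		return -1;
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, (int)len, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* drop the trailing newline */
	return 0;
}
```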

If you remove a TCP congestion control module, you will get the next
available one. Since Reno cannot be built as a module and cannot be
removed, it will always be available.
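
The fallback behaviour can be modelled with an ordered list whose head is the
default. This is a hypothetical user-space sketch (ca_list, register_ca(),
set_default_ca(), unregister_ca() and default_ca() are invented names), not the
kernel's bookkeeping: selecting a default moves it to the front, asking for an
unknown name fails, removing an entry promotes the next one, and "reno" is
built in and refuses removal.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_CA 8

/* Registered algorithms; index 0 is the default. "reno" is built in. */
static const char *ca_list[MAX_CA] = { "reno" };
static int ca_count = 1;

static int find_ca(const char *name)
{
	for (int i = 0; i < ca_count; i++)
		if (strcmp(ca_list[i], name) == 0)
			return i;
	return -1;
}

static const char *default_ca(void)
{
	return ca_list[0];
}

/* Loading a module adds its algorithm at the end of the list. */
static bool register_ca(const char *name)
{
	if (ca_count == MAX_CA || find_ca(name) >= 0)
		return false;
	ca_list[ca_count++] = name;
	return true;
}

/* Selecting a default moves it to the front; unknown names fail. */
static bool set_default_ca(const char *name)
{
	int i = find_ca(name);
	const char *chosen;

	if (i < 0)
		return false;
	chosen = ca_list[i];
	memmove(&ca_list[1], &ca_list[0], (size_t)i * sizeof(ca_list[0]));
	ca_list[0] = chosen;
	return true;
}

/* Removing a module drops its entry; the next one becomes the default. */
static bool unregister_ca(const char *name)
{
	int i = find_ca(name);

	if (i < 0 || strcmp(name, "reno") == 0)
		return false;	/* unknown, or the built-in fallback */
	memmove(&ca_list[i], &ca_list[i + 1],
		(size_t)(ca_count - i - 1) * sizeof(ca_list[0]));
	ca_count--;
	return true;
}
```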

How the new TCP output machine [nyi] works
==========================================

Data is kept on a single queue. The skb->users flag tells us whether the
frame is one that has already been queued. To add a frame, we put it on the
end of the queue. ACK processing walks down the list from the start.
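
A minimal sketch of such a single queue, assuming frames are identified by
sequence ranges: new frames are appended at the tail, and an incoming
cumulative ACK releases frames from the head for as long as they are fully
acknowledged. struct frame, struct txq and the helper names are hypothetical,
and sequence number wraparound is ignored for brevity.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queued frame covering sequence numbers [seq, end_seq). */
struct frame {
	uint32_t seq;
	uint32_t end_seq;
	struct frame *next;
};

/* Single output queue: frames are added at the tail, ACKed from the head. */
struct txq {
	struct frame *head;
	struct frame *tail;
};

/* Append a frame at the end of the queue. */
static void txq_add(struct txq *q, struct frame *f)
{
	f->next = NULL;
	if (q->tail != NULL)
		q->tail->next = f;
	else
		q->head = f;
	q->tail = f;
}

/* Walk from the start, freeing every frame fully covered by 'ack'.
 * Returns the number of frames released. Wraparound is ignored. */
static unsigned int txq_ack(struct txq *q, uint32_t ack)
{
	unsigned int n = 0;

	while (q->head != NULL && q->head->end_seq <= ack) {
		struct frame *done = q->head;

		q->head = done->next;
		free(done);
		n++;
	}
	if (q->head == NULL)
		q->tail = NULL;
	return n;
}
```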

We keep a set of control flags:

        sk->tcp_pend_event

                TCP_PEND_ACK                    ACK needed
                TCP_ACK_NOW                     ACK needed now
                TCP_WINDOW                      Window update check
                TCP_WINZERO                     Zero window probing

        sk->transmit_queue              Start of the transmission queue
        sk->transmit_new                First new (not yet sent) frame
        sk->transmit_end                Where to add frames

        sk->tcp_last_tx_ack             Last ACK seen
        sk->tcp_dup_ack                 Duplicate ACK count for fast retransmit
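
The last two fields support fast retransmit. The sketch below counts repeated
ACKs for the same sequence number and signals a retransmission on the third
duplicate, the threshold used by standard TCP (RFC 5681). The struct ack_state
type and on_ack_seq() are hypothetical and only mirror tcp_last_tx_ack and
tcp_dup_ack; a real duplicate-ACK test also checks that the ACK carries no data
and does not change the advertised window.

```c
#include <stdbool.h>
#include <stdint.h>

#define DUPACK_THRESHOLD 3	/* duplicate ACKs that trigger fast retransmit */

/* Hypothetical mirror of sk->tcp_last_tx_ack and sk->tcp_dup_ack. */
struct ack_state {
	uint32_t last_ack;
	unsigned int dup_ack;
};

/* Process an incoming cumulative ACK number. Returns true exactly when
 * the duplicate count reaches the threshold, i.e. when the frame at
 * 'last_ack' should be fast-retransmitted. */
static bool on_ack_seq(struct ack_state *st, uint32_t ack)
{
	if ((int32_t)(ack - st->last_ack) < 0)
		return false;	/* stale ACK: ignore (wraparound-safe compare) */

	if (ack == st->last_ack) {
		st->dup_ack++;
		return st->dup_ack == DUPACK_THRESHOLD;
	}

	/* New data acknowledged: remember it and reset the counter. */
	st->last_ack = ack;
	st->dup_ack = 0;
	return false;
}
```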

Frames are queued for output by tcp_write. We do our best to send the frames
off immediately if possible; otherwise we queue them and compute the body
checksum during the copy.
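
Computing the checksum "in the copy" means summing the payload while it is
being copied, so the data is only touched once. The sketch below does this for
the 16-bit ones'-complement Internet checksum (RFC 1071) in plain C; csum_copy()
and csum_fold() are hypothetical names, not kernel helpers, and bytes are
summed as big-endian 16-bit words regardless of host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 'len' bytes from src to dst and add the data, taken as big-endian
 * 16-bit words (odd trailing byte padded with zero), to the running sum. */
static uint64_t csum_copy(uint8_t *dst, const uint8_t *src, size_t len,
			  uint64_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		sum += (uint64_t)src[i] << 8 | src[i + 1];
	}
	if (i < len) {		/* odd length: pad the last byte with zero */
		dst[i] = src[i];
		sum += (uint64_t)src[i] << 8;
	}
	return sum;
}

/* Fold the running sum to 16 bits with end-around carry and complement it.
 * Header words may be added to 'sum' before folding. */
static uint16_t csum_fold(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Because the running sum is only ever added to, header words can be summed in
later and folded once, which is what allows headers to be added after the body
has already been copied and summed.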

When a write is done, we try to clear any pending events and piggyback them.
If the window is full, we queue full-sized frames. On the first timeout in a
zero window, we split such a frame.
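
Piggybacking means that a pending ACK or window update rides on the next
outgoing data frame instead of being sent as a separate segment. A minimal
sketch, assuming the sk->tcp_pend_event flags are bits in a mask: the bit
values, struct out_frame and build_data_frame() are hypothetical, since the
document does not define them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for the sk->tcp_pend_event flags. */
enum {
	TCP_PEND_ACK = 1u << 0,	/* ACK needed */
	TCP_ACK_NOW  = 1u << 1,	/* ACK needed now */
	TCP_WINDOW   = 1u << 2,	/* window update check */
	TCP_WINZERO  = 1u << 3,	/* zero window probing */
};

/* Hypothetical summary of what an outgoing data frame carries. */
struct out_frame {
	bool ack;	/* carries the pending ACK */
	bool window;	/* carries a fresh window advertisement */
};

/* Build a data frame and fold pending ACK/window events into it,
 * clearing them so no separate segment is needed. */
static struct out_frame build_data_frame(uint32_t *pend_event)
{
	struct out_frame f = { false, false };

	if (*pend_event & (TCP_PEND_ACK | TCP_ACK_NOW)) {
		f.ack = true;
		*pend_event &= ~(uint32_t)(TCP_PEND_ACK | TCP_ACK_NOW);
	}
	if (*pend_event & TCP_WINDOW) {
		f.window = true;
		*pend_event &= ~(uint32_t)TCP_WINDOW;
	}
	return f;	/* TCP_WINZERO is left for the probe timer */
}
```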

On a timer, we walk the retransmit list to send any retransmissions and update
the backoff timers, etc. A change of the route table stamp causes the header
to be rebuilt and recomputed. We add any new TCP-level headers and complete
the checksum again before sending.
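
The backoff update can be sketched as the usual exponential backoff: each
timeout doubles the retransmission timeout up to a ceiling, and an ACK for new
data resets it. The names and millisecond constants below are assumptions for
illustration; the doubling follows RFC 6298, and the 120 second cap mirrors
the value Linux uses for TCP_RTO_MAX.

```c
#include <stdint.h>

#define RTO_MAX_MS 120000u	/* assumed cap: 120 s */

/* Hypothetical per-connection retransmit timer state. */
struct rtx_timer {
	uint32_t base_rto_ms;	/* RTO computed from RTT samples */
	unsigned int backoff;	/* number of consecutive timeouts */
};

/* Current timeout: base RTO doubled once per backoff step, capped. */
static uint32_t rtx_timeout_ms(const struct rtx_timer *t)
{
	uint64_t rto = t->base_rto_ms;

	for (unsigned int i = 0; i < t->backoff && rto < RTO_MAX_MS; i++)
		rto <<= 1;
	return rto < RTO_MAX_MS ? (uint32_t)rto : RTO_MAX_MS;
}

/* Timer fired: the caller retransmits, then the timeout backs off. */
static uint32_t on_rtx_timeout(struct rtx_timer *t)
{
	t->backoff++;
	return rtx_timeout_ms(t);
}

/* New data acknowledged: forget the backoff. */
static void on_new_ack(struct rtx_timer *t)
{
	t->backoff = 0;
}
```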