docs/debug.adoc

   1 = NTP Debugging Techniques
   2 include::include-html.ad[]
   3
   4 [cols="10%,90%",frame="none",grid="none",style="verse"]
   5 |==============================
   6 |image:pic/pogo.gif[]|
   7 {millshome}pictures.html[from 'Pogo', Walt Kelly]
   8
   9 We make house calls and bring our own bugs.
  10
  11 |==============================
  12
  13 == More Help
  14
  15 include::includes/install.adoc[]
  16
  17 '''''
  18
  19 == Initial Startup
  20
  21 This page discusses +ntpd+ program monitoring and debugging techniques
  22 using the link:ntpq.html[+ntpq+ - standard NTP query program], either
  23 on the local server or from a remote machine. The +ntpq+ program
  24 implements the management functions specified in the NTP specification
  25 https://tools.ietf.org/html/rfc5905[RFC 5905].  In addition, the
  26 program can be used to send remote configuration commands to the
  27 server.
  28
  29 The +ntpd+ daemon can operate in two modes, depending on the presence
  30 of the +-n+ command-line option. Without the option the daemon
  31 detaches from the controlling terminal and proceeds autonomously. With
  32 one or more +-d+ options the daemon generates special trace output
  33 useful for debugging. In general, interpretation of this output
  34 requires reference to the sources. However, a single +-d+ does produce
  35 only mildly cryptic output and can be very useful in finding problems
  36 with configuration and network troubles.
  37
  38 Any firewall in the path needs to allow NTP traffic.
  39 The server side of +ntpd+ listens on UDP port 123.  The client side
  40 also sends from port 123, although not all NTP implementations do,
  41 and +ntpq+ sends from system-assigned ports.
  42 If you are running an link:authentic.html#nts[NTS] server, you also
  43 need to allow TCP port 4460.
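
One rough way to confirm that UDP port 123 is open end to end is to send a client-mode packet by hand. The following is a minimal Python sketch under stated assumptions, not part of the NTP tools: the names +build_client_request+, +parse_reply+ and +probe+ and the 2 s timeout are illustrative, and only IPv4 is handled. It builds the 48-byte header laid out in RFC 5905 and reads the leap indicator and stratum from any reply.

```python
import socket

NTP_PORT = 123      # UDP port the server side listens on
PACKET_LEN = 48     # NTP header without extension fields or MAC (RFC 5905)


def build_client_request() -> bytes:
    """Return a minimal NTPv4 client-mode packet (all other fields zero)."""
    # First octet: LI=0 (2 bits), VN=4 (3 bits), Mode=3 client (3 bits).
    first = (0 << 6) | (4 << 3) | 3
    return bytes([first]) + bytes(PACKET_LEN - 1)


def parse_reply(data: bytes) -> dict:
    """Extract leap indicator, version, mode and stratum from a reply."""
    if len(data) < PACKET_LEN:
        raise ValueError("short NTP packet")
    return {
        "leap": data[0] >> 6,
        "version": (data[0] >> 3) & 0x7,
        "mode": data[0] & 0x7,
        "stratum": data[1],
    }


def probe(host: str, timeout: float = 2.0) -> dict:
    """Send one request from a system-assigned port and parse the reply.

    A timeout (raised as an OSError subclass) means nothing came back,
    which may point at a firewall or at a server that is not responding.
    IPv4 only; this is an assumption of the sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_client_request(), (host, NTP_PORT))
        data, _addr = sock.recvfrom(512)
    return parse_reply(data)
```

Calling +probe("server.example")+ with a real server name in place of the placeholder returns the parsed header fields; a +leap+ value of 3 in the reply indicates an unsynchronized server.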
  44
  45 Other problems are apparent in the system log, which ordinarily shows
  46 the startup banner, some cryptic initialization data and the computed
  47 precision value. Event messages at startup and during regular operation
  48 are sent to the optional +protostats+ monitor file, as described on the
  49 link:decode.html[Event Messages and Status Words] page. These and other
  50 error messages are sent to the system log, as described on the
  51 link:msyslog.html[+ntpd+ System Log Messages] page. In real emergencies
  52 the daemon will send a terminal error message to the system log and then
  53 cease operation.
  54
  55 The next most common problem is incorrect DNS names. Check that each DNS
  56 name used in the configuration file exists and that the address responds
  57 to the Unix +ping+ command. The Unix +traceroute+
  58 utility can be used to verify a partial or complete path exists. Most
  59 problems reported to the NTP newsgroup are not NTP problems, but
  60 problems with the network or firewall configuration.
  61
  62 If you use GPS, and your time is off by about 19.6 years (1024 weeks),
  63 you may have been bitten by the GPS week number rollover (WNRO) bug.
  64 Please see link:rollover.html[Rollover issues in time sources].
  65
  66 == Verifying Correct Operation
  67
  68 Unless using the +iburst+ option, the client normally takes a few
  69 minutes to synchronize to a server. If the client time at startup
  70 happens to be more than 1000 s distant from NTP time, the daemon exits
  71 with a message to the system log directing the operator to manually set
  72 the time within 1000 s and restart. If the time is less than 1000 s but
  73 more than 128 ms distant, a step correction occurs and the daemon
  74 restarts automatically.
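
The startup behavior can be summarized as a small decision function. This is an illustrative Python sketch, not the daemon's code: it assumes the default 128 ms step threshold and 1000 s limit, and the function name and the "slew" label for small offsets are choices made here.

```python
STEP_THRESHOLD_S = 0.128     # default step threshold, 128 ms
PANIC_THRESHOLD_S = 1000.0   # default limit beyond which the daemon exits


def startup_action(offset_s: float) -> str:
    """Classify the initial offset (in seconds) from NTP time."""
    magnitude = abs(offset_s)
    if magnitude > PANIC_THRESHOLD_S:
        return "exit"    # operator must set the time manually and restart
    if magnitude > STEP_THRESHOLD_S:
        return "step"    # clock is stepped and the daemon restarts
    return "slew"        # small offset, corrected gradually
```

For example, +startup_action(-5.0)+ returns +"step"+ and +startup_action(1500)+ returns +"exit"+.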
  75
  76 When started for the first time with no frequency file (usually
  77 +ntp.drift+) present, the daemon enters a special mode to calibrate the
  78 frequency. This takes 900 s, during which the time is not disciplined. When
  79 calibration is complete, the daemon creates the frequency file and enters
  80 normal mode to amortize whatever residual offset remains.
  81
  82 The +ntpq+ commands +pe+, +as+ and +rv+ are normally sufficient to
  83 verify correct operation and assess nominal performance. The
  84 link:ntpq.html#pe[+pe+] command displays a list showing the DNS name or
  85 IP address for each association along with selected status and
  86 statistics variables. The first character in each line is the tally
  87 code, which shows which associations are candidates to set the system
  88 clock and of these which one is the system peer. The encoding is shown
  89 in the +select+ field of the link:decode.html#peer[peer status word].
  90
  91 The link:ntpq.html#as[+as+] command displays a list of associations and
  92 association identifiers. Note the +condition+ column, which reflects the
  93 tally code. The link:ntpq.html#rv[+rv+] command displays the
  94 link:ntpq.html#system[system variables] billboard, including the
  95 link:decode.html#sys[system status word]. The
  96 link:ntpq.html#rv[+rv assocID+] command, where +assocID+ is the
  97 association ID, displays the link:ntpq.html#peer[peer variables]
  98 billboard, including the link:decode.html#peer[peer status word]. Note
  99 that, except for explicit calendar dates, times are in milliseconds and
 100 frequencies are in parts-per-million (ppm).
 101
 102 A detailed explanation of the system, peer and clock variables in the
 103 billboards is beyond the scope of this page; however, a comprehensive
 104 explanation for each one is in the NTPv4 protocol specification. The
 105 following observations will be useful in debugging and monitoring.
 106
 107 1.  The server has successfully synchronized to its sources if the
 108 +leap+ peer variable has a value other than 3 (11b). The client has
 109 successfully synchronized to the server when the +leap+ system variable
 110 has a value other than 3.
 111 2.  The +reach+ peer variable is an 8-bit shift register displayed in
 112 octal format. When a valid packet is received, the rightmost bit is lit.
 113 When a packet is sent, the register is shifted left one bit with 0
 114 replacing the rightmost bit. If the +reach+ value is nonzero, the server
 115 is reachable; otherwise, it is unreachable. Note that, even if all
 116 servers become unreachable, the system continues to show valid time to
 117 dependent applications.
 118 3.  A useful indicator of miscellaneous problems is the +flash+ peer
 119 variable, which shows the result of 13 sanity tests. It contains the
 120 link:decode.html#flash[flash status word] bits, commonly called
 121 flashers, which display the current errors for the association. These
 122 bits should all be zero for a valid server.
 123 4.  The three peer variables +filtdelay+, +filtoffset+ and +filtdisp+
 124 show the delay, offset and jitter statistics for each of the last eight
 125 measurement rounds. These statistics and their trends are valuable
 126 performance indicators for the server, client and the network. For
 127 instance, large fluctuations in delay and jitter suggest network
 128 congestion. Missing clock filter stages suggest packet losses in the
 129 network.
 130 5.  The synchronization distance, defined as one-half the delay plus the
 131 dispersion, represents the maximum error statistic. The jitter
 132 represents the expected error statistic. The maximum error and expected
 133 error calculated from the peer variables represent the quality metric
 134 for the server. The maximum error and expected error calculated from the
 135 system variables represent the quality metric for the client. If the
 136 root synchronization distance for any server exceeds 1.5 s, called the
 137 select threshold, the server is considered invalid.
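
Observations 2 and 5 lend themselves to a short worked example. The Python sketch below is illustrative only: the function names are hypothetical, one call to +update_reach+ stands for one poll (a packet sent, then possibly a valid reply), and the distance helpers take seconds, whereas the +ntpq+ billboards show milliseconds.

```python
SELECT_THRESHOLD_S = 1.5   # select threshold for root synchronization distance


def update_reach(reach: int, got_reply: bool) -> int:
    """Advance the 8-bit reach shift register by one poll."""
    reach = (reach << 1) & 0xFF   # packet sent: shift left, 0 enters on the right
    if got_reply:
        reach |= 1                # valid packet received: light the rightmost bit
    return reach


def sync_distance(delay_s: float, dispersion_s: float) -> float:
    """One-half the delay plus the dispersion (maximum error statistic)."""
    return delay_s / 2 + dispersion_s


def is_selectable(root_delay_s: float, root_dispersion_s: float) -> bool:
    """A server whose root distance exceeds the threshold is invalid."""
    return sync_distance(root_delay_s, root_dispersion_s) <= SELECT_THRESHOLD_S


reach = 0
for _ in range(8):
    reach = update_reach(reach, True)
print(format(reach, "o"))                        # 377: eight replies in a row
print(format(update_reach(reach, False), "o"))   # 376: the latest poll was missed
```

A +reach+ value that decays from 377 toward 0 therefore indicates a run of unanswered polls, with the most recent poll in the rightmost bit.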
 138
 139 Sometimes the time distribution of errors can be revealing. It's a
 140 good idea to look occasionally at the plots produced by
 141 link:ntpviz.html[ntpviz].
 142
 143 == Large Frequency Errors
 144
 145 The frequency tolerance of computer clock oscillators varies widely,
 146 sometimes above 500 ppm. While the daemon can handle frequency errors up
 147 to 500 ppm, or 43 seconds per day, values much above 100 ppm reduce the
 148 headroom, especially at the lowest poll intervals. To determine the
 149 particular oscillator frequency, start +ntpd+ using the +noselect+
 150 option with the +server+ configuration command.
 151
 152 Record the time of day and offset displayed by the +ntpq+
 153 link:ntpq.html#peer[+peer+] command. Wait for an hour or so and record the
 154 time of day and offset. Calculate the frequency as the offset difference
 155 divided by the time difference. If the frequency offset is much above 100 ppm,
 156 the link:ntpfrob.html[{ntpfrobman}] program might be useful to adjust the
 157 kernel clock frequency below that value. For systems that do not support
 158 this program, this might be done using a command in the system startup
 159 file.
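
The arithmetic behind these figures is easy to check. The following Python sketch uses hypothetical function names and made-up sample readings; it assumes offsets in milliseconds, as displayed by +ntpq+, and elapsed time in seconds. The sign of the result depends on the offset convention, so only the magnitude is compared with the 100 ppm guideline.

```python
SECONDS_PER_DAY = 86_400


def ppm_to_seconds_per_day(ppm: float) -> float:
    """Time error accumulated in one day by a given frequency error."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def frequency_ppm(offset1_ms: float, offset2_ms: float, elapsed_s: float) -> float:
    """Offset difference divided by time difference, expressed in ppm."""
    drift_s = (offset2_ms - offset1_ms) * 1e-3
    return drift_s / elapsed_s * 1e6


# 500 ppm is about 43.2 seconds per day.
print(round(ppm_to_seconds_per_day(500), 1))

# Illustrative readings: the offset moves from 10 ms to 370 ms in one hour,
# which is 360 ms per 3600 s, or about 100 ppm.
print(round(frequency_ppm(10.0, 370.0, 3600.0), 1))
```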
 160
 161 == Access Controls
 162
 163 Provisions are included in +ntpd+ for access controls which deflect
 164 unwanted traffic from selected hosts or networks. The controls described
 165 on the link:accopt.html[Access Control Options] page include detailed packet
 166 filter operations based on source address and address mask. Normally,
 167 filtered packets are dropped without notice other than to increment
 168 tally counters. However, the server can be configured to send a
 169 "kiss-o'-death" (KoD) packet to the client either when explicitly
 170 configured or when cryptographic authentication fails for some reason.
 171 In that case the client association is permanently disabled, the access denied bit
 172 (BOGON4) is set in the flash variable and a message is sent to the system
 173 log.
 174
 175 The access control provisions include a limit on the packet rate from a
 176 host or network. If incoming packets exceed the limit, they are dropped
 177 and a KoD is sent to the source. If this occurs after the client
 178 association has synchronized, the association is not disabled, but a
 179 message is sent to the system log. See the link:accopt.html[Access
 180 Control Options] page for further information.
 181
 182 == Large Delay Variations
 183
 184 In some reported scenarios an access line may show low to moderate
 185 network delays during some period of the day and moderate to high delays
 186 during other periods. Often the delay on one direction of transmission
 187 dominates, which can result in large time offset errors, sometimes in
 188 the range up to a few seconds. It is not usually convenient to run
 189 +ntpd+ throughout the day in such scenarios, since this could result in
 190 several time steps, especially if the condition persists for greater
 191 than the stepout threshold.
 192
 193 Specific provisions have been built into +ntpd+ to cope with these
 194 problems. The scheme is called "huff-'n-puff" and is described on the
 195 link:miscopt.html[Miscellaneous Options] page. An alternative approach
 196 in such scenarios is first to calibrate the local clock frequency error
 197 by running +ntpd+ in continuous mode during the quiet interval and let
 198 it write the frequency to the +ntp.drift+ file. Then, run +ntpd -q+ from
 199 a cron job each day at some time in the quiet interval. In systems with
 200 the nanokernel or microkernel performance enhancements, including
 201 Solaris, Tru64, Linux and FreeBSD, the kernel continuously disciplines
 202 the frequency so that the residual correction produced by +ntpd+ is
 203 usually less than a few milliseconds.
 204
 205 == Cryptographic Authentication
 206
 207 Reliable source authentication requires the use of symmetric key or
 208 public key cryptography, as described on the
 209 link:authopt.html[Authentication Options] page. In symmetric key
 210 cryptography, servers and clients share session keys contained in a
 211 secret key file. In public key cryptography, the server has a private
 212 key, never shared, and a public key with unrestricted distribution.
 213 Symmetric keys can be produced by the link:ntpkeygen.html[+ntpkeygen+] program.
 214
 215 Problems with symmetric key authentication are usually due to mismatched
 216 keys or improper use of the +trustedkey+ command. A simple way to check
 217 for problems is to use the trace facility, which is enabled using the
 218 +ntpd -d+ command line. As each packet is received a trace line is
 219 displayed which shows the authentication status in the +auth+ field. A
 220 status of 1 indicates the packet was successfully authenticated; otherwise
 221 it has failed.
 222
 223 == Debugging Checklist
 224
 225 If the +ntpq+ program does not show that messages are being
 226 received by the daemon, or if received messages do not result in
 227 correct synchronization, verify the following:
 228
 229 1.  Check the system log for +ntpd+ messages about configuration errors,
 230 name-lookup failures or initialization problems. Common system log
 231 messages are summarized on the link:msyslog.html[+ntpd+ System Log
 232 Messages] page.  If you specify a log file, be sure to check in
 233 your main syslog file (and be sure it logs entries from ntpd) since
 234 some of the errors are logged before it switches to the specified
 235 log file.
 236
 237 2. Check to be sure that only one copy of +ntpd+ is running.
 238
 239 3.  Verify using +ping+ or other utility that packets actually do make
 240 the round trip between the client and server. Verify using +dig+,
 241 +nslookup+ or other utility that the DNS server names do exist and
 242 resolve to valid Internet addresses. Be aware that ICMP (ping) packets
 243 may be firewalled or filtered anywhere in the path. Ping failure does not
 244 necessarily mean that the client and server cannot exchange NTP's
 245 UDP traffic.
 246
 247 4.  Check that the remote NTP server is up and running. The usual
 248 evidence that it is not is a +Connection refused+ message.
 249
 250 5.  Using the +ntpq+ program, verify that the packets received and
 251 packets sent counters are incrementing. If the sent counter does not
 252 increment and the configuration file includes configured servers,
 253 something may be wrong in the host network or interface configuration.
 254 If the sent counter does increment, but the received counter does not
 255 increment, something may be wrong in the network or the server NTP
 256 daemon may not be running or the server itself may be down or not
 257 responding.
 258
 259 6.  If both the sent and received counters do increment, but the +reach+
 260 values in the peers billboard with +ntpq+ continue to show zero,
 261 received packets are probably being discarded for some reason. If this
 262 is the case, the cause should be evident from the +flash+ variable as
 263 discussed above and on the +ntpq+ page. It could be that the server has
 264 disabled access for the client address, in which case the +refid+ field
 265 in the +ntpq+ peers billboard will show a kiss code. See
 266 link:decode.html#kiss[Kiss Codes] for a full list of the codes and their
 267 meanings.
 268
 269 7.  If the +reach+ values in the peers billboard show the servers are
 270 alive and responding, note the tattletale symbols at the left margin,
 271 which indicate the status of each server resulting from the various
 272 grooming and mitigation algorithms. The interpretation of these symbols
 273 is discussed on the +ntpq+ page. After a few minutes of operation, one
 274 or another of the reachable server candidates should show a * tattletale
 275 symbol. If this doesn't happen, the intersection algorithm, which
 276 classifies the servers as truechimers or falsetickers, may be unable to
 277 find a majority of truechimers among the server population.
 278
 279 8.  If all else fails, see the FAQ and/or the discussion and briefings
 280 at the project website.
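
Steps 5 to 7 of the checklist amount to a short decision sequence, sketched below in Python. This is only an illustration: the function and parameter names are hypothetical, and the inputs are assumed to have been read by hand from two successive +ntpq+ displays (counter increases, the +reach+ value, and whether any line carries the * symbol).

```python
def triage(sent_delta: int, received_delta: int, reach: int,
           has_sys_peer: bool) -> str:
    """Map checklist observations to the next thing to examine."""
    if sent_delta == 0:
        # Step 5: nothing is leaving the host.
        return "check host network or interface configuration"
    if received_delta == 0:
        # Step 5: requests leave but no replies arrive.
        return "check network path, firewall and remote server"
    if reach == 0:
        # Step 6: replies arrive but are being discarded.
        return "inspect flash variable and refid kiss code"
    if not has_sys_peer:
        # Step 7: servers respond but none has been selected.
        return "intersection algorithm may lack a majority of truechimers"
    return "ok"


print(triage(sent_delta=12, received_delta=0, reach=0, has_sys_peer=False))
```

The order of the tests mirrors the order of the checklist, so the first failing observation determines the suggestion.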
 281
 282 '''''
 283
 284 include::includes/footer.adoc[]