This HOWTO describes how to set up a loadbalanced, redundant and
distributed network monitoring system using op5 Monitor. Note that
Merlin is a part of op5 Monitor. Non-customers will have to adjust
paths etc. used throughout this guide in order to be able to use it.
Replacing "op5 Monitor" with "nagios + merlin" is a good start for
those of you venturing into the unknown without the aid of our rather
excellent support services. For those wishing to configure only
distributed monitoring, or only a loadbalanced or redundant setup,
this is still a good guide.
The guide will assume that we're installing two redundant and
loadbalanced master servers ("yoda" and "obi1"), with three
poller servers, two of which are peered with each other. The single
poller will be designated "solo". The peered pollers will be "luke"
and "leya". We'll assume that each poller has its name in DNS and
can be looked up that way. 'solo' will be responsible for monitoring
the hostgroup 'hyperdrive'. 'luke' and 'leya' will share responsibility
for monitoring the hostgroups 'theforce' and 'tattoine'.
With this setup, communications will go like this:

    yoda:  obi1, luke, leya, solo
    obi1:  yoda, luke, leya, solo
    luke:  yoda, obi1, leya
    leya:  yoda, obi1, luke
    solo:  yoda, obi1
The following needs to be in place for this HOWTO to be usable, but
how to obtain or set them up is outside the scope of this article.

* Make sure you have the passwords for the root accounts on all the
  servers intended to be part of the monitoring network. These will
  be needed in order to configure merlin.
* Open the firewalls for port 15551 (merlin's default port) and 22.
  Both ends will attempt to connect on port 15551, so it's ok if only
  one side of the intended connection can connect to the other. For
  port 22, it's a little bit more complicated: in order to get the
  full shebang of features, both ends will need to be able to initiate
  connections with the other. It's possible to get away with not
  allowing pollers to initiate connections to the master server, but
  certain recovery operations will then not be possible.
* op5 Monitor needs to be installed on all systems intended to be
  part of the monitoring network.
Included in op5 Monitor is the 'mon' helper. mon is a nifty little
tool designed to help with configuring, managing and verifying the
status of a distributed Merlin installation.

Its usage is quite simple: mon <category> <command>
Just type mon and you'll get a list of all available categories and
commands. Some commands lack a category and are runnable all by
themselves, such as 'stop', 'start' and 'restart', which take care
of stopping, starting and restarting monitor and merlin using the
proper shutdown and startup sequences.
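For example, both of these forms are used later in this guide:

    mon restart           # category-less command
    mon node list         # 'node' category, 'list' command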
We'll soon see exactly how useful that little helper is.
Step 1 - Configure Merlin on one of the master systems
-------------------------------------------------------
The aforementioned 'mon' helper has a 'node' category. This is useful
for manipulating the configured nodes in merlin's configuration file.

We'll start with configuring Merlin properly on 'yoda'. The commands
to do so will look like this (yes, there's a typo in there):

    mon node add obi1 type=peer
    mon node add luke type=poller hostgroup=theforce,tattoine
    # type=poller is the default, so we don't have to spell it out
    mon node add leya hostgroup=theforce,tattoine
    mon node add solo hostgroup=hyperride

The 'node' category also has a 'remove' command, so when we notice
the typo we made when adding the poller 'solo', we can fix it by
removing the faulty node and adding it again:

    mon node remove solo
    mon node add solo type=poller hostgroup=hyperdrive
You may verify that you've done things right in a couple of different
ways. 'mon node list' lists nodes. It accepts a --type= argument, so
if we want to list all pollers and peers, we can run:

    mon node list --type=poller,peer

This, in conjunction with 'mon node show <name>', is excellent for
checking that the configuration ended up the way you intended.
The contents of the configuration file, which by default resides in
/opt/monitor/op5/merlin/merlin.conf, should now include a peer entry
for 'obi1' plus one poller entry each for 'luke' and 'leya' (both with
hostgroup = theforce,tattoine) and for 'solo' (with
hostgroup = hyperdrive).
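As a rough sketch, each generated stanza typically looks like the
following; the 'hostgroup' values match what we added above, while
the 'address' and 'port' lines shown here are assumptions based on
the DNS names and the default port mentioned earlier ('leya' and
'solo' follow the same pattern as 'luke'):

    peer obi1 {
        address = obi1
        port = 15551
    }

    poller luke {
        address = luke
        port = 15551
        hostgroup = theforce,tattoine
    }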
Step 2 - Distribute ssh-keys
----------------------------
The 'mon' helper has an 'sshkey' category. The commands in it let you
push ssh keys to and fetch them from remote destinations. Running:

    mon sshkey push --all

will append your ~/.ssh/id_rsa.pub file to the authorized_keys file
for the root and monitor users on all configured remote nodes.
If you don't have a public keyfile, one will be generated for you.
Please note that if you generate a keyfile yourself, it must not have
a password, or configuration synchronization will fail to work.
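If you prefer generating the key yourself, something like this creates
a passwordless RSA key in the default location (adjust the path if your
setup differs):

    ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa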
So far we've set up one-way communication: 'yoda' can now talk to
all the other systems without having to use passwords. In order to
fetch all the keys from the remote systems, we'll use the following
command:

    mon sshkey fetch --all

This will fetch all the relevant keys into ~/.ssh/authorized_keys.
So every node can talk to 'yoda', and 'yoda' can talk to every
other node. That's great, but 'luke' and 'leya' need to be able
to talk to each other as well, and all the pollers need to be
able to talk to 'obi1'. Since we have all keys except our own in
~/.ssh/authorized_keys, we can simply amend it with the key we
generated earlier and distribute the resulting file to every node.
Or we can do what we just did for 'yoda' on all the other nodes:
simply wait until we're done configuring merlin on all nodes,
then log in to them and run the 'mon sshkey push --all' and
'mon sshkey fetch --all' commands there too.
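One way of doing that last part, once merlin is configured on every
node, is a simple loop over our example hosts from 'yoda':

    for node in obi1 luke leya solo; do
        ssh $node 'mon sshkey push --all && mon sshkey fetch --all'
    done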
You can verify that this works by running:

    mon node ctrl -- 'echo hostname is $(hostname)'

Step 3 - Configure Merlin on the remote systems
-----------------------------------------------
I sneakily introduced the 'node ctrl' command in the last section.
This time we'll use it rather heavily, along with the 'node add'
command, which will run on the remote systems.

First we add ourselves and 'obi1' as masters to all pollers:

    mon node ctrl --type=poller -- mon node add yoda type=master
    mon node ctrl --type=poller -- mon node add obi1 type=master
With that, the poller 'solo' is actually fully configured already.

Then we add ourselves as a peer to all our peers (just 'obi1' really,
but in case you build larger networks, this will work better):

    mon node ctrl --type=peer -- mon node add yoda type=peer

Then we add all pollers to 'obi1':

    mon node ctrl obi1 -- mon node add luke hostgroup=theforce,tattoine
    mon node ctrl obi1 -- mon node add leya hostgroup=theforce,tattoine
    mon node ctrl obi1 -- mon node add solo hostgroup=hyperdrive

And finally we add 'luke' and 'leya' as peers to each other:

    mon node ctrl leya -- mon node add luke type=peer
    mon node ctrl luke -- mon node add leya type=peer

After this, 'solo' has 'yoda' and 'obi1' configured as masters in its
config file; 'luke' and 'leya' each have the two masters plus each
other as a peer; and 'obi1' mirrors 'yoda', with 'yoda' as a peer and
poller entries for 'luke' and 'leya' (hostgroup = theforce,tattoine)
and for 'solo' (hostgroup = hyperdrive).
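As an illustration of the result on 'solo' (with the same assumed
address/port values as before), its merlin.conf would contain
something like:

    master yoda {
        address = yoda
        port = 15551
    }

    master obi1 {
        address = obi1
        port = 15551
    }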
Step 4 - Verifying configuration and ssh-key setup
--------------------------------------------------
Now that we have merlin configured properly on all five nodes,
we can use a recursive version of the 'node ctrl' command to make
sure ssh works properly from every system to all the systems it
needs to talk to. Try pasting this into the console:
    mon node ctrl -- 'echo "On $(hostname)"; hostname | mon node ctrl -- '\''from=$(cat); echo "@$(hostname) from $from"'\'''

Hard to follow? I agree, but it should produce something like this:

    On obi1
    @yoda from obi1
    @luke from obi1
    @leya from obi1
    @solo from obi1
    On luke
    @yoda from luke
    @obi1 from luke
    @leya from luke

and so on for 'leya' and 'solo'.
If it does, that means the ssh keys are properly installed, at least
for the root user(s). If the command seems to hang somewhere in the
middle, a password prompt is most likely waiting on one of the hops,
so you'll need to revisit the sshkey configuration and again make sure
that every node that should be able to talk to other nodes actually
can reach the nodes it's supposed to.

(XXX; Test this and make sure it actually works like this)

Step 5 - Configuring Nagios
---------------------------
Handling object configuration is very much out of scope for this
article, but there are a few rules (most of them are actually more
like guidelines, but things will be confusing if you don't follow
them, so please do) one needs to adhere to in order for Merlin to
work properly:
* Each host that is a member of a hostgroup used to distribute work
  from a master to a poller should never be a member of a hostgroup
  that is used to distribute work to a different poller. In our
  case, that means that any host that is a member of either 'theforce'
  or 'tattoine' shouldn't also be a member of 'hyperdrive' (see the
  example after these rules).

* Two peers absolutely must have identical object configuration.
  This is due to the way loadbalancing works in Merlin. In our case,
  that means that since 'luke' is responsible for 'theforce' and
  'tattoine', its peer 'leya' must also be responsible for exactly
  those two hostgroups, and no others.
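To illustrate the first rule, a host that should be routed to 'solo'
would be a member of 'hyperdrive' only; the host name, address and
template below are made up for the example:

    define host {
        use         default-host    ; hypothetical template
        host_name   falcon01
        address     192.0.2.10
        hostgroups  hyperdrive      ; not also in 'theforce' or 'tattoine'
    }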
That's basically it. It's possible to circumvent these rules, but
if you do, you're on your own. No tools currently exist to enforce
them, and Merlin won't complain if you suddenly add another poller
that's responsible for 'tattoine' and 'hyperdrive', even though
such a configuration is obviously broken in light of the rules above.

Step 6 - Synchronization configuration
--------------------------------------
Configuration synchronization will be a bit easier for you if you
move all of monitor's object configuration files to a cfg_dir
instead of using the default layout of mixing object configuration
with Nagios' main configuration file and other assorted files. This
is especially true for pollers and matters far less for the masters.
The quick and easy way to set it up so that it works like that is
by running the following commands from 'yoda':

    dir=/opt/monitor/etc/oconf
    conf=/opt/monitor/etc/nagios.cfg
    mon node ctrl -- sed -i /^cfg_file=/d $conf
    mon node ctrl -- sed -i /^log_file=/acfg_dir=$dir $conf
    mon node ctrl -- mkdir -m 775 $dir
    mon node ctrl -- chown monitor:apache $dir

If you then run:

    mon node ctrl -- mon oconf hash

you should get a list of 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
as output from all nodes. That means the pollers now have an empty
object configuration, which is just the way we like it, since we'll
be pushing configuration from one of our two peered masters to all
the pollers.

(The "da39" hash is what you get from sha1 when you don't feed it any
input at all.)
In merlin, you can configure a script that takes care of syncing
configuration. This script should also restart monitor on the
receiving ends when it's done sending configuration. In the
Merlin world, this is handled by a single command that gets run
once when we detect that we have a newer configuration than any
of our peers or pollers.

That command is 'mon oconf push'. It takes no arguments at all.
It does parse merlin.conf though, and creates complete configuration
files for all the pollers, which by default get sent to
/opt/monitor/etc/oconf/from-master.cfg on each respective poller,
which is then restarted. Again by default, it will also send the
entire /opt/monitor/etc directory to all its peers, using
rsync --delete to make sure all systems are fully synced. Currently
though, only changes to the object config trigger a full sync, so
perhaps there's room for improvement there.
Config sync is configured either globally, via an object_config
compound in the daemon section of the config file, or via the same
kind of object_config compound inside a node definition, if one wants
to override how one system syncs to another. It could look something
like this, for instance:

    object_config {
        # the command to run when one or more peers or
        # pollers have older configuration than we do
        push = mon oconf push

        # the command to run when one or more masters or
        # peers have a newer configuration than we do
        #pull = some random command
    }

    peer obi1 {
        object_config {
            # the command to run when obi1 has older config than
            # we do; overrides the global command
            push = rsync -aovtr --delete /opt/monitor/etc obi1:/opt/monitor
            # the command to run when obi1 has newer config than we do
            #pull = some random command
        }
    }
* The 'pull' thing is highly untested and I'm unsure how it would
  work if one node tries to pull from another while that other node
  is pushing at the same time. Care should be taken to avoid such
  situations.

* The only supported scenario is to have the master with the most
  recently changed configuration push that config to its peers and
  pollers. This *will* create avalanche pushing if one uses peered
  pollers that in turn have pollers themselves, since all peered
  pollers that in turn have pollers will try to push at the same
  time. Due to this, more than 2 tiers is currently not supported
  officially, although it works just fine for everything else in
  our test setups.
* Config pushing from master to poller requires the objects.cache
  file in order to split the config for each poller. Since config
  pushes should always be initiated by a running Merlin anyways,
  this isn't much of a problem once you've done the first push and
  everything is up and running, but when first setting up the system
  it will be tricksy to get things to run smoothly.

The object_config compounds can contain whatever variables you
like without Merlin complaining about them, and the variables get
passed along to the configured command in its environment, prefixed
with OCONF_, like this:

    OCONF_PUSH=mon oconf push
    OCONF_WHATEVER_YOU_NAMED_YOUR_VARIABLE=somevalue
so you can quite easily add some other scripted solution to support
your needs. 'mon oconf push' happens to use two such private variables,
namely 'source' and 'dest'. 'source' is really only used when pushing
configuration to peers, and 'dest' is what we end up using as the
target when pushing the configuration. So if you want your peer sync
to only send the /opt/monitor/etc/oconf directory we created earlier,
you can quite easily set that up by configuring your peer thus:

    peer obi1 {
        object_config {
            push = mon oconf push
            source = /opt/monitor/etc/oconf
            dest = /opt/monitor/etc
        }
    }

The 'oconf push' command uses another command internally to create
the per-poller configuration files. That one you can run without
interfering with anything. In our case, it would print something
like this:

    Created /var/cache/merlin/config/luke with 1154 objects for hostgroup
    list 'theforce,tattoine'
    Created /var/cache/merlin/config/leya with 1154 objects for hostgroup
    list 'theforce,tattoine'
    Created /var/cache/merlin/config/solo with 652 objects for hostgroup
    list 'hyperdrive'
You can inspect the files thus created and see if they seem to fit
your criteria. Note that they will be rather large, since templates
aren't sent to the poller nodes.

Step 7 - Starting the distributed system
----------------------------------------
Once you've inspected the configuration and you like what you see,
it's time to activate it and get some monitoring going on. Run the
following sequence of commands when you're ready:

    mon restart; sleep 3; mon oconf push

This should send configuration to all the pollers and peers and then
attempt to restart monitor and merlin on those nodes. Pushing config
to masters is not yet supported, although scripting it wouldn't be
too hard for those who are interested. Do see the notes about 'pull'
above before attempting anything like that.
Step 8 - Verifying that it works
--------------------------------
The first thing to do is to run:

    mon node status

It will quite quickly become apparent that this little helper is
awesome for finding problems in your merlin setup. It connects to
the database, grabs the currently running nodes and prints a lot
of information about them, such as whether they're active, when they
last sent a program_status_data event, how many checks they're
doing and what their check latency is. If, for some reason,
one node has crashed or is otherwise unable to communicate with
the node you're looking from, you'll find that out quite quickly
using this little helper.

Filing a bug report without including output from a run of this
program on all nodes is a hanging offense. You have been warned.
Step 9 - Finding out why it doesn't
-----------------------------------
These are some general guidelines for troubleshooting certain issues
in Merlin. Troubleshooting involves digging through logfiles, running
small helper programs and generally just tinkering around, trying to
figure out what happened, what's happening and what will happen if
you do this or that. Most of it is stuff that has come up during
beta-testing or that has been problematic in the past. If new common
problems arise, I'll add more recipes to this little guide.

Problem: Loadbalancing seems to have stopped working even though
all my peers are ACTIVE and were last seen alive "3s ago"
It can sometimes look like that if you check the output of

    mon node status

right after having restarted Merlin or Monitor. Most of the time, it's
because the peers switched peer-id's and are now slowly taking over
each other's checks. If the problem doesn't resolve itself, with the
number of checks performed by each peer converging on an equal split,
something else has gone wrong and a more thorough investigation is
necessary.
Checking if all peers have the same configuration is the first step:

    mon node ctrl --type=peer -- mon oconf hash
    mon node ctrl --type=peer -- sha1sum /opt/monitor/var/objects.cache

If they do, you might have run into the intermittent error that
some users have seen with peers in a loadbalanced setup. Restarting
the affected systems usually restores them to good working order:

    mon node ctrl --type=peer -- mon restart
Problem: Logfiles are flooded with messages about 'nulling OOB pointer'

Merlin uses a highly efficient binary protocol to transfer events
between module and daemon and across the network to other nodes.
The way the codec works makes it not-really-but-almost impossible
to support network nodes with a different wordsize or byte order.
That is, 32-bit and 64-bit systems can't talk to each other, and
servers running i386-type CPUs can't communicate with PowerPC or
other big-endian machines. PDP-11s won't work with anything but
other PDP-11s; they'll do just fine with each other though.

Since merlin-0.9.0-beta5, merlin detects when a node with a different
wordsize, byte order or object structure version connects, and warns
about such incompatibilities. grep the logfiles for 'FATAL.*compat'
and you should see if that's the problem.
If it isn't, and it's the module logfile that holds all the messages,
you've almost certainly hit a compatibility problem with Nagios, or
a concurrency issue related to threading. There shouldn't really be
any compatibility problems, since Merlin will unload itself if the
version of Nagios that loads it has a different object structure
version than we're expecting, but I suppose weirder things have
happened than a random malfunction in a piece of software.

Problem: Database isn't being updated

Inserting events into the database is the job of the daemon.
Information about its problems can be found in the daemon
logfile, /opt/monitor/op5/merlin/daemon.log. If no "query failed"
messages can be found there, check the neb.log file to see if
the module is sending anything, and look for disconnected peers
and pollers.
Problem: 'mon node status' shows one or more nodes as 'INACTIVE'

Check that merlin and monitor are running on the remote systems.

    mon node ctrl $node -- pidof monitor
    mon node ctrl $node -- pidof merlind
    mon node ctrl $node -- mon node status

If they're not, you've found the symptom, so check the logfiles
or look for corefiles on those systems. If they are running, check
for connection attempts in the daemon log:

    grep -i connect /opt/monitor/op5/merlin/logs/daemon.log

If you see a lot of connection attempts to the INACTIVE node and
no "connection refused", it's almost certainly a firewall issue.
If you do see "connection refused", it's almost certainly due to
either merlind not running or a misconfiguration.
Problem: I've found a corefile

Goodie. Now do something useful with it, pretty, pretty please.
Running the 'file' command on the corefile will tell you which
program was run to create it, so then you can run:

    # gdb -q /path/to/$offending_program core
If the corefile came from monitor, take extra care to point gdb at
the actual monitor binary, since the backtrace will otherwise include
a lot of "unresolved symbols", which basically means that it's
completely useless and I would still have to re-do it. At least if
the core was caused by a module, which is something we need to know.
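For reference, a minimal gdb session for grabbing such a trace could
look like this ('bt full' prints the backtrace with local variables;
the program path is whatever the 'file' command reported):

    # gdb -q /path/to/$offending_program core
    (gdb) bt full
    (gdb) quit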
Send me the output of both the "file" command and the backtrace
that gdb prints, along with the corefile. This is basically everything
I do when I get a corefile anyways, but for me to be able to do it
when you send me only the corefile, I'd have to install the exact
same version of merlin, monitor and possibly a lot of system libraries
too. That's extremely cumbersome, so go the extra half-inch and grab
a backtrace while you're at it. Bugs will get fixed a billion times
faster that way.

If the trace looks like this:

    #0  0x0022c402 in ?? ()
    #1  0x0062e116 in ?? ()
    #2  0x0806f4fb in ?? ()
    #3  0x080566cc in ?? ()

that means that either the stack has been overwritten (a bug can cause
this, and it's bad), or that there are only unresolved symbols in there.
Either way, it's fairly useless in that state, but I'll still want it,
since any clue is better than no clue.
Problem: 'mon node status' claims one node hasn't been alive for a
very long time

There should be a timestamp stating when it was last active. grep
for that timestamp in Merlin's logfiles and in nagios.log. Start with
daemon.log on the system where you ran 'mon node status' and look
for disconnect messages. You'll have to check the logs on both
systems to find the most likely cause.

To look through nagios.log you can use:

    mon log show --start=$when-20 --end=$when+20

although you'll have to calculate the start and end values manually,
since that command right there isn't valid shell or anything.
'mon log show' has more filtering options than shown here, which
helps since grep'ing can be tricky without knowing what to look for.
Problem: Reports show wrong uptime/downtime/whatever

In 99% of all cases this is due to missing data in the report_data
table. It now resides in the merlin database, as opposed to the
monitor_reports database where it used to be.

If you can find a particular period in the logs that happens to be
broken, it's not that hard to repair, although doing so will take
time and you have to shut down Monitor in order to pull it off.
If the database is anything but huge, it's definitely easiest to
just truncate the report_data table and recreate it from scratch.
The following sequence of commands *should* take care of doing
that, but it's been a while since I wrote them and I haven't got
anything to test with.

    mon node ctrl -- mon stop
    mon log import --fetch
The 'log' category of commands is useful for importing data from
remote sites, though. Snoop around a bit and see what you can find.
'sortmerge' will thrash the disks quite a lot and use a ton of
memory, but 'import' is the only one which can be potentially
dangerous. Use the --no-sql option for a dry-run first if you're
unsure.
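For example, assuming --no-sql combines with the import invocation
shown above, a dry-run could look like:

    mon log import --no-sql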
If the report_data table *is* huge and shutting down monitor
while repairs are under way is not an option, there may be other
solutions to try, but they are all situational and more or less
ad hoc.

Problem: Merlin's config sync destroyed my configuration!

If it happened on a poller system, that's by design. Nothing to
do and nothing to try. Just re-do your work and make sure you
don't do it in a file that Merlin will overwrite occasionally.
If it was on a peer system, it might be possible to save it.
Sneak a peek in /var/cache/merlin/backup and see if you can
find your config files there.
Problem: X happened and it's not a listed problem here

Perhaps it's by design, and then again it might not be. If you
think it's wrong, check the logfiles (all three of them) on
all systems involved and look for anomalies. When that's done
and you still haven't found anything, remain calm and write a
concise report stating what you did, what you expected to
happen and what actually happened. Feel free to include logfiles
and such as well, since I'll almost certainly want to look at
them anyway.
Problem: My peered pollers are acting up!

It's possible that the pollers are trying to push their config
to the master server. With the default sync command, a push goes
to all nodes at the same time, so one peered poller can end up
pushing to the other poller, which resets that poller and causes
it to try to push its configuration to the master; since it too
pushes globally, and by default only to pollers and peers, config
gets pushed back to the first node, which is then restarted, and
so on, etc, etc.
To fix it, it's usually enough to add an empty object_config
compound to all your configured master nodes. This removes the
config sync command for the master nodes, so the pollers won't
try to push configuration to them.
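Assuming the master entries live in each poller's merlin.conf, the
empty compound might look like this sketch (the address line is an
assumption):

    master yoda {
        address = yoda

        object_config {
        }
    }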
Problem: My extra files/plugins/whatever aren't being synced!

In order to sync files outside of the object configuration, one can
add an extra compound to the node configuration of the node one wants
to sync paths to. The relevant part of the config looks something
like this:

    hostgroups = hyperdrive
        /path/to/file/to/sync = yes
        /path/to/file2/to/sync = /path/on/remote/system

This will cause /path/to/file/to/sync to be sent to the same path on
the remote system, and the file /path/to/file2/to/sync to be sent to
/path/on/remote/system on the remote system. It should be possible to
list directories and not only files, but that is untested.
Caveat 1: This is not well tested, but what sparse tests I've done
indicate that it works.
Caveat 2: Peers normally send all of /opt/monitor/etc to each other,
so for those no extra configuration should be necessary in a normal
setup.
Caveat 3: Only the sha1 checksum of the object config, coupled with
the timestamp of the same, is used to determine which file to sync
where. There's no check to prevent overwriting a newer version of
the file on the receiving end.
Caveat 4: This will run as the same user as the merlin daemon. I
have no idea if file ownership and permissions will be preserved.

If caveat 3 bites your ass, check in /var/cache/merlin/backups (or
/var/cache/merlin/backup) for the original files.

This is sort-of supported as of merlin-1.1.7, but see caveat 1.
Problem: After a restart, Ninja hangs/is empty for a loong time!

If you have a large system, you should be using ocimp instead of
import.php to run the initial import of your configuration. It's
well over 10 times as fast and will make the waiting period that
much shorter. This will also most likely help if you're seeing an
empty UI from time to time. It makes the largest difference in large
environments, of course, but even smaller ones should benefit from
it. ocimp was considered "stable beta" as of merlin-1.1.8-beta2.
Problem: One of my nodes can't connect to the others!

On the node that can't connect to the other nodes, you need to add
a node-option to prevent it from attempting to connect to the nodes
it can't reach. This is useful if you know there is a firewall that
you won't be allowing new connections through in one direction, and
especially if you're using a tripwire-rigged firewall. An example
config would look like this:

    master behind-firewall-we-cant-connect-through {
        address = master.behind.firewall.example.com

This will cause the merlin daemon on this node to never attempt to
connect to master.behind.firewall.example.com. Normally, both nodes
attempt to connect at startup and then every so often for as long as
the connection is down.
This feature was introduced in Merlin 1.1.0.
Problem: My poller is monitoring a network behind a firewall which
the master can't see through!

There's a node-option you can use on the master node which only
works when configuring poller nodes. Here's what it would look
like:

    poller watching-network-behind-firewall-where-master-cant-see {
        address = admin-net-poller.example.com

This feature was introduced in Merlin 0.9.1.
Problem: A peer has crashed hard in my network, but the other peer
isn't taking over the checks!

If you have no pollers in your merlin network and you're running
merlin 1.1.14-p5 or earlier, you've most likely been hit by a
fairly ancient bug in the Merlin daemon, where node timeouts were
only checked when there were pollers attached to the network.
This was fixed in 7f2fc8f9850735ab549d3bbfa987b09af4cc1a6d, which
is part of all releases after 1.1.14 (i.e. 1.1.15-beta1 and up).
Problem: Nodes keep timing out even though I know they're still
connected. They just send data very rarely and slowly!

As of 1.1.14-p8 (ae67657bc1bd84beeac70759df5d87a2aac30fb2), you
can set the data_timeout option for your nodes and have the
merlin daemon grok it to mean how often a particular node must
send data to still be considered alive and well. The default
value is pulse_interval * 2, so 30 seconds in total. You can
increase it a lot, or set it to 0 to avoid the checking entirely,
but eventually the system TCP timeout will kick in and kill the
connection anyway.
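A sketch of what that could look like in merlin.conf; the address
line and the value 90 are just example assumptions:

    poller solo {
        address = solo
        data_timeout = 90
    }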
Problem: I want feature X!

Problem: Make feature Y work like this!