This HOWTO describes how to set up a loadbalanced, redundant and
distributed network monitoring system using op5 Monitor. Note that
Merlin is a part of op5 Monitor. Non-customers will have to adjust
paths etc. used throughout this guide in order to be able to use it.
Replacing "op5 Monitor" with "nagios + merlin" is a good start for
those of you venturing into the unknown without the aid of our rather
excellent support services. For those wishing to configure only
distributed monitoring, or only a loadbalanced or redundant setup,
this is still a good guide.
The guide will assume that we're installing two redundant and
loadbalanced master servers ("yoda" and "obi1"), with three
poller servers, two of which are peered with each other. The single
poller will be designated "solo". The peered pollers will be "luke"
and "leya". We'll assume that each poller has its name in DNS and
can be looked up that way. 'solo' will be responsible for monitoring
the hostgroup 'hyperdrive'. 'luke' and 'leya' will share responsibility
for monitoring the hostgroups 'theforce' and 'tattoine'.
With this setup, communications will go like this:

    yoda:  obi1, luke, leya, solo
    obi1:  yoda, luke, leya, solo
    luke:  yoda, obi1, leya
    leya:  yoda, obi1, luke
    solo:  yoda, obi1
The following needs to be in place for this HOWTO to be usable, but
how to obtain or set them up is outside the scope of this article.

* Make sure you have the passwords for the root accounts on all the
  servers intended to be part of the monitoring network. These will
  be needed in order to configure merlin.
* Open the firewalls for port 15551 (merlin's default port) and 22.
  Both ends will attempt to connect on port 15551, so it's ok if only
  one side of the intended connection can connect to the other. For
  port 22, it's a little bit more complicated: in order to get the
  full shebang of features, both ends will need to be able to initiate
  connections with the other. It's possible to get away with not
  allowing pollers to initiate connections to the master server, but
  certain recovery operations will then not be possible.
* op5 Monitor needs to be installed on all systems intended to be
  part of the monitoring network.
Included in op5 Monitor is the 'mon' helper. mon is a nifty little
tool designed to help with configuring, managing and verifying the
status of a distributed Merlin installation.

Its usage is quite simple: mon <category> <command>
Just type mon and you'll get a list of all available categories and
commands. Some commands lack a category and are runnable all by
themselves, such as 'stop', 'start' and 'restart', which take care
of stopping, starting and restarting monitor and merlin using the
proper shutdown and startup sequences.
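For example, both of these forms are used later in this guide:

    mon restart           # category-less command
    mon node list         # 'node' category, 'list' command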
We'll soon see exactly how useful that little helper is.
Step 1 - Configure Merlin on one of the master systems
-------------------------------------------------------
The aforementioned 'mon' helper has a 'node' category. This is useful
for manipulating the configured nodes in merlin's configuration file.

We'll start with configuring Merlin properly on 'yoda'. The commands
to do so will look like this (yes, there's a typo in there):

    mon node add obi1 type=peer
    mon node add luke type=poller hostgroup=theforce,tattoine
    # type=poller is the default, so we don't have to spell it out
    mon node add leya hostgroup=theforce,tattoine
    mon node add solo hostgroup=hyperride

The 'node' category also has a 'remove' command, so when we notice
the typo we made when adding the poller 'solo', we can fix it by
removing the faulty node and adding it again:

    mon node remove solo
    mon node add solo type=poller hostgroup=hyperdrive
You may verify that you've done things right in a couple of different
ways. 'mon node list' lists nodes. It accepts a --type= argument, so
if we want to list all pollers and peers, we can run:

    mon node list --type=poller,peer

This, in conjunction with 'mon node show <name>', is excellent for
checking that the configuration ended up the way you intended.
The contents of the configuration file, which by default resides in
/opt/monitor/op5/merlin/merlin.conf, should now include a peer entry
for 'obi1' plus one poller entry each for 'luke' and 'leya' (both with
hostgroup = theforce,tattoine) and for 'solo' (with
hostgroup = hyperdrive).
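As a rough sketch, each generated stanza typically looks like the
following; the 'hostgroup' values match what we added above, while
the 'address' and 'port' lines shown here are assumptions based on
the DNS names and the default port mentioned earlier ('leya' and
'solo' follow the same pattern as 'luke'):

    peer obi1 {
        address = obi1
        port = 15551
    }

    poller luke {
        address = luke
        port = 15551
        hostgroup = theforce,tattoine
    }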
Step 2 - Distribute ssh-keys
----------------------------
The 'mon' helper has an 'sshkey' category. The commands in it let you
push ssh keys to and fetch them from remote destinations. Running:

    mon sshkey push --all

will append your ~/.ssh/id_rsa.pub file to the authorized_keys file
for the root and monitor users on all configured remote nodes.
If you don't have a public keyfile, one will be generated for you.
Please note that if you generate a keyfile yourself, it must not have
a password, or configuration synchronization will fail to work.
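If you prefer generating the key yourself, something like this creates
a passwordless RSA key in the default location (adjust the path if your
setup differs):

    ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa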
So far we've set up one-way communication: 'yoda' can now talk to
all the other systems without having to use passwords. In order to
fetch all the keys from the remote systems, we'll use the following
command:

    mon sshkey fetch --all

This will fetch all the relevant keys into ~/.ssh/authorized_keys.
So every node can talk to 'yoda', and 'yoda' can talk to every
other node. That's great, but 'luke' and 'leya' need to be able
to talk to each other as well, and all the pollers need to be
able to talk to 'obi1'. Since we have all keys except our own in
~/.ssh/authorized_keys, we can simply amend it with the key we
generated earlier and distribute the resulting file to every node.
Or we can do what we just did for 'yoda' on all the other nodes:
simply wait until we're done configuring merlin on all nodes,
then log in to them and run the 'mon sshkey push --all' and
'mon sshkey fetch --all' commands there too.
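One way of doing that last part, once merlin is configured on every
node, is a simple loop over our example hosts from 'yoda':

    for node in obi1 luke leya solo; do
        ssh $node 'mon sshkey push --all && mon sshkey fetch --all'
    done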
You can verify that this works by running:

    mon node ctrl -- 'echo hostname is $(hostname)'

Step 3 - Configure Merlin on the remote systems
-----------------------------------------------
I sneakily introduced the 'node ctrl' command in the last section.
This time we'll use it rather heavily, along with the 'node add'
command, which will run on the remote systems.

First we add ourselves and 'obi1' as masters to all pollers:

    mon node ctrl --type=poller -- mon node add yoda type=master
    mon node ctrl --type=poller -- mon node add obi1 type=master
With that, the poller 'solo' is actually fully configured already.

Then we add ourselves as a peer to all our peers (just 'obi1' really,
but in case you build larger networks, this will work better):

    mon node ctrl --type=peer -- mon node add yoda type=peer

Then we add all pollers to 'obi1':

    mon node ctrl obi1 -- mon node add luke hostgroup=theforce,tattoine
    mon node ctrl obi1 -- mon node add leya hostgroup=theforce,tattoine
    mon node ctrl obi1 -- mon node add solo hostgroup=hyperdrive

And finally we add 'luke' and 'leya' as peers to each other:

    mon node ctrl leya -- mon node add luke type=peer
    mon node ctrl luke -- mon node add leya type=peer

After this, 'solo' has 'yoda' and 'obi1' configured as masters in its
config file; 'luke' and 'leya' each have the two masters plus each
other as a peer; and 'obi1' mirrors 'yoda', with 'yoda' as a peer and
poller entries for 'luke' and 'leya' (hostgroup = theforce,tattoine)
and for 'solo' (hostgroup = hyperdrive).
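As an illustration of the result on 'solo' (with the same assumed
address/port values as before), its merlin.conf would contain
something like:

    master yoda {
        address = yoda
        port = 15551
    }

    master obi1 {
        address = obi1
        port = 15551
    }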
Step 4 - Verifying configuration and ssh-key setup
--------------------------------------------------
Now that we have merlin configured properly on all five nodes,
we can use a recursive version of the 'node ctrl' command to make
sure ssh works properly from every system to all the systems it
needs to talk to. Try pasting this into the console:
    mon node ctrl -- 'echo "On $(hostname)"; hostname | mon node ctrl -- '\''from=$(cat); echo "@$(hostname) from $from"'\'''

Hard to follow? I agree, but it should produce something like this:

    On obi1
    @yoda from obi1
    @luke from obi1
    @leya from obi1
    @solo from obi1
    On luke
    @yoda from luke
    @obi1 from luke
    @leya from luke

and so on for 'leya' and 'solo'.
If it does, that means the ssh keys are properly installed, at least
for the root user(s). If the command seems to hang somewhere in the
middle, a password prompt is most likely waiting on one of the hops,
so you'll need to revisit the sshkey configuration and again make sure
that every node that should be able to talk to other nodes actually
can reach the nodes it's supposed to.

(XXX; Test this and make sure it actually works like this)

Step 5 - Configuring Nagios
---------------------------
Handling object configuration is very much out of scope for this
article, but there are a few rules (most of them are actually more
like guidelines, but things will be confusing if you don't follow
them, so please do) one needs to adhere to in order for Merlin to
work properly:
* Each host that is a member of a hostgroup used to distribute work
  from a master to a poller should never be a member of a hostgroup
  that is used to distribute work to a different poller. In our
  case, that means that any host that is a member of either 'theforce'
  or 'tattoine' shouldn't also be a member of 'hyperdrive' (see the
  example after these rules).

* Two peers absolutely must have identical object configuration.
  This is due to the way loadbalancing works in Merlin. In our case,
  that means that since 'luke' is responsible for 'theforce' and
  'tattoine', its peer 'leya' must also be responsible for exactly
  those two hostgroups, and no others.
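To illustrate the first rule, a host that should be routed to 'solo'
would be a member of 'hyperdrive' only; the host name, address and
template below are made up for the example:

    define host {
        use         default-host    ; hypothetical template
        host_name   falcon01
        address     192.0.2.10
        hostgroups  hyperdrive      ; not also in 'theforce' or 'tattoine'
    }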
That's basically it. It's possible to circumvent these rules, but
if you do, you're on your own. No tools currently exist to enforce
them, and Merlin won't complain if you suddenly add another poller
that's responsible for 'tattoine' and 'hyperdrive', even though
such a configuration is obviously broken in light of the rules above.

Step 6 - Synchronization configuration
--------------------------------------
Configuration synchronization will be a bit easier for you if you
move all of monitor's object configuration files to a cfg_dir
instead of using the default layout of mixing object configuration
with Nagios' main configuration file and other assorted files. This
is especially true for pollers and matters far less for the masters.
The quick and easy way to set it up so that it works like that is
by running the following commands from 'yoda':

    dir=/opt/monitor/etc/oconf
    conf=/opt/monitor/etc/nagios.cfg
    mon node ctrl -- sed -i /^cfg_file=/d $conf
    mon node ctrl -- sed -i /^log_file=/acfg_dir=$dir $conf
    mon node ctrl -- mkdir -m 775 $dir
    mon node ctrl -- chown monitor:apache $dir

If you then run:

    mon node ctrl -- mon oconf hash

you should get a list of 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
as output from all nodes. That means the pollers now have an empty
object configuration, which is just the way we like it, since we'll
be pushing configuration from one of our two peered masters to all
the pollers.

(The "da39" hash is what you get from sha1 when you don't feed it any
input at all.)
In merlin, you can configure a script that takes care of syncing
configuration. This script should also restart monitor on the
receiving ends when it's done sending configuration. In the
Merlin world, this is handled by a single command that gets run
once when we detect that we have a newer configuration than any
of our peers or pollers.

That command is 'mon oconf push'. It takes no arguments at all.
It does parse merlin.conf though, and creates complete configuration
files for all the pollers, which by default get sent to
/opt/monitor/etc/oconf/from-master.cfg on each respective poller,
which is then restarted. Again by default, it will also send the
entire /opt/monitor/etc directory to all its peers, using
rsync --delete to make sure all systems are fully synced. Currently
though, only changes to the object config trigger a full sync, so
perhaps there's room for improvement there.
Config sync is configured either globally, via an object_config
compound in the daemon section of the config file, or via the same
kind of object_config compound inside a node definition, if one wants
to override how one system syncs to another. It could look something
like this, for instance:

    object_config {
        # the command to run when one or more peers or
        # pollers have older configuration than we do
        push = mon oconf push

        # the command to run when one or more masters or
        # peers have a newer configuration than we do
        #pull = some random command
    }

    peer obi1 {
        object_config {
            # the command to run when obi1 has older config than
            # we do; overrides the global command
            push = rsync -aovtr --delete /opt/monitor/etc obi1:/opt/monitor
            # the command to run when obi1 has newer config than we do
            #pull = some random command
        }
    }
* The 'pull' thing is highly untested and I'm unsure how it would
  work if one node tries to pull from another while that other node
  is pushing at the same time. Care should be taken to avoid such
  situations.

* The only supported scenario is to have the master with the most
  recently changed configuration push that config to its peers and
  pollers. This *will* create avalanche pushing if one uses peered
  pollers that in turn have pollers themselves, since all peered
  pollers that in turn have pollers will try to push at the same
  time. Due to this, more than 2 tiers is currently not supported
  officially, although it works just fine for everything else in
  our test setups.
* Config pushing from master to poller requires the objects.cache
  file in order to split the config for each poller. Since config
  pushes should always be initiated by a running Merlin anyways,
  this isn't much of a problem once you've done the first push and
  everything is up and running, but when first setting up the system
  it will be tricksy to get things to run smoothly.

The object_config compounds can contain whatever variables you
like without Merlin complaining about them, and the variables get
passed along to the configured command in its environment, prefixed
with OCONF_, like this:

    OCONF_PUSH=mon oconf push
    OCONF_WHATEVER_YOU_NAMED_YOUR_VARIABLE=somevalue
so you can quite easily add some other scripted solution to support
your needs. 'mon oconf push' happens to use two such private variables,
namely 'source' and 'dest'. 'source' is really only used when pushing
configuration to peers, and 'dest' is what we end up using as the
target when pushing the configuration. So if you want your peer sync
to only send the /opt/monitor/etc/oconf directory we created earlier,
you can quite easily set that up by configuring your peer thus:

    peer obi1 {
        object_config {
            push = mon oconf push
            source = /opt/monitor/etc/oconf
            dest = /opt/monitor/etc
        }
    }

The 'oconf push' command uses another command internally to create
the per-poller configuration files. That one you can run without
interfering with anything. In our case, it would print something
like this:

    Created /var/cache/merlin/config/luke with 1154 objects for hostgroup
    list 'theforce,tattoine'
    Created /var/cache/merlin/config/leya with 1154 objects for hostgroup
    list 'theforce,tattoine'
    Created /var/cache/merlin/config/solo with 652 objects for hostgroup
    list 'hyperdrive'
You can inspect the files thus created and see if they seem to fit
your criteria. Note that they will be rather large, since templates
aren't sent to the poller nodes.

Step 7 - Starting the distributed system
----------------------------------------
Once you've inspected the configuration and you like what you see,
it's time to activate it and get some monitoring going on. Run the
following sequence of commands when you're ready:

    mon restart; sleep 3; mon oconf push

This should send configuration to all the pollers and peers and then
attempt to restart monitor and merlin on those nodes. Pushing config
to masters is not yet supported, although scripting it wouldn't be
too hard for those who are interested. Do see the notes about 'pull'
above before attempting anything like that.
Step 8 - Verifying that it works
--------------------------------
The first thing to do is to run:

    mon node status

It will quite quickly become apparent that this little helper is
awesome for finding problems in your merlin setup. It connects to
the database, grabs the currently running nodes and prints a lot
of information about them, such as whether they're active, when they
last sent a program_status_data event, how many checks they're
doing and what their check latency is. If, for some reason,
one node has crashed or is otherwise unable to communicate with
the node you're looking from, you'll find that out quite quickly
using this little helper.

Filing a bug report without including output from a run of this
program on all nodes is a hanging offense. You have been warned.
Step 9 - Finding out why it doesn't
-----------------------------------
These are some general guidelines for troubleshooting certain issues
in Merlin. Troubleshooting involves digging through logfiles, running
small helper programs and generally just tinkering around, trying to
figure out what happened, what's happening and what will happen if
you do this or that. Most of it is stuff that has come up during
beta-testing or that has been problematic in the past. If new common
problems arise, I'll add more recipes to this little guide.

Problem: Loadbalancing seems to have stopped working even though
all my peers are ACTIVE and were last seen alive "3s ago"
It can sometimes look like that if you check the output of

    mon node status

right after having restarted Merlin or Monitor. Most of the time, it's
because the peers switched peer-id's and are now slowly taking over
each other's checks. If the problem doesn't resolve itself, with the
number of checks performed by each peer converging on an equal split,
something else has gone wrong and a more thorough investigation is
necessary.
Checking if all peers have the same configuration is the first step:

    mon node ctrl --type=peer -- mon oconf hash
    mon node ctrl --type=peer -- sha1sum /opt/monitor/var/objects.cache

If they do, you might have run into the intermittent error that
some users have seen with peers in a loadbalanced setup. Restarting
the affected systems usually restores them to good working order:

    mon node ctrl --type=peer -- mon restart
Problem: Logfiles are flooded with messages about 'nulling OOB pointer'

Merlin uses a highly efficient binary protocol to transfer events
between module and daemon and across the network to other nodes.
The way the codec works makes it not-really-but-almost impossible
to support network nodes with a different wordsize or byte order.
That is, 32-bit and 64-bit systems can't talk to each other, and
servers running i386-type CPUs can't communicate with PowerPC or
other big-endian machines. PDP-11s won't work with anything but
other PDP-11s; they'll do just fine with each other though.

Since merlin-0.9.0-beta5, merlin detects when a node with a different
wordsize, byte order or object structure version connects, and warns
about such incompatibilities. grep the logfiles for 'FATAL.*compat'
and you should see if that's the problem.
If it isn't, and it's the module logfile that holds all the messages,
you've almost certainly hit a compatibility problem with Nagios, or
a concurrency issue related to threading. There shouldn't really be
any compatibility problems, since Merlin will unload itself if the
version of Nagios that loads it has a different object structure
version than we're expecting, but I suppose weirder things have
happened than a random malfunction in a piece of software.

Problem: Database isn't being updated

Inserting events into the database is the job of the daemon.
Information about its problems can be found in the daemon
logfile, /opt/monitor/op5/merlin/daemon.log. If no "query failed"
messages can be found there, check the neb.log file to see if
the module is sending anything, and look for disconnected peers
and pollers.
Problem: 'mon node status' shows one or more nodes as 'INACTIVE'

Check that merlin and monitor are running on the remote systems.

    mon node ctrl $node -- pidof monitor
    mon node ctrl $node -- pidof merlind
    mon node ctrl $node -- mon node status

If they're not, you've found the symptom, so check the logfiles
or look for corefiles on those systems. If they are running, check
for connection attempts in the daemon log:

    grep -i connect /opt/monitor/op5/merlin/logs/daemon.log

If you see a lot of connection attempts to the INACTIVE node and
no "connection refused", it's almost certainly a firewall issue.
If you do see "connection refused", it's almost certainly due to
either merlind not running or a misconfiguration.
Problem: I've found a corefile

Goodie. Now do something useful with it, pretty, pretty please.
Running the 'file' command on the corefile will tell you which
program was run to create it, so then you can run:

    # gdb -q /path/to/$offending_program core
If the corefile came from monitor, take extra care to point gdb at
the actual monitor binary, since the backtrace will otherwise include
a lot of "unresolved symbols", which basically means that it's
completely useless and I would still have to re-do it. At least if
the core was caused by a module, which is something we need to know.
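For reference, a minimal gdb session for grabbing such a trace could
look like this ('bt full' prints the backtrace with local variables;
the program path is whatever the 'file' command reported):

    # gdb -q /path/to/$offending_program core
    (gdb) bt full
    (gdb) quit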
Send me the output of both the "file" command and the backtrace
that gdb prints, along with the corefile. This is basically everything
I do when I get a corefile anyways, but for me to be able to do it
when you send me only the corefile, I'd have to install the exact
same version of merlin, monitor and possibly a lot of system libraries
too. That's extremely cumbersome, so go the extra half-inch and grab
a backtrace while you're at it. Bugs will get fixed a billion times
faster that way.

If the trace looks like this:

    #0  0x0022c402 in ?? ()
    #1  0x0062e116 in ?? ()
    #2  0x0806f4fb in ?? ()
    #3  0x080566cc in ?? ()

that means that either the stack has been overwritten (a bug can cause
this, and it's bad), or that there are only unresolved symbols in there.
Either way, it's fairly useless in that state, but I'll still want it,
since any clue is better than no clue.
Problem: 'mon node status' claims one node hasn't been alive for a
very long time

There should be a timestamp stating when it was last active. grep
for that timestamp in Merlin's logfiles and in nagios.log. Start with
daemon.log on the system where you ran 'mon node status' and look
for disconnect messages. You'll have to check the logs on both
systems to find the most likely cause.

To look through nagios.log you can use:

    mon log show --start=$when-20 --end=$when+20

although you'll have to calculate the start and end values manually,
since that command right there isn't valid shell or anything.
'mon log show' has more filtering options than shown here, which
helps since grep'ing can be tricky without knowing what to look for.
Problem: Reports show wrong uptime/downtime/whatever

In 99% of all cases this is due to missing data in the report_data
table. It now resides in the merlin database, as opposed to the
monitor_reports database where it used to be.

If you can find a particular period in the logs that happens to be
broken, it's not that hard to repair, although doing so will take
time and you have to shut down Monitor in order to pull it off.
If the database is anything but huge, it's definitely easiest to
just truncate the report_data table and recreate it from scratch.
The following sequence of commands *should* take care of doing
that, but it's been a while since I wrote them and I haven't got
anything to test with.

    mon node ctrl -- mon stop
    mon log import --fetch
The 'log' category of commands is useful for importing data from
remote sites, though. Snoop around a bit and see what you can find.
'sortmerge' will thrash the disks quite a lot and use a ton of
memory, but 'import' is the only one which can be potentially
dangerous. Use the --no-sql option for a dry-run first if you're
unsure.
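For example, assuming --no-sql combines with the import invocation
shown above, a dry-run could look like:

    mon log import --no-sql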
If the report_data table *is* huge and shutting down monitor
while repairs are under way is not an option, there may be other
solutions to try, but they are all situational and more or less
ad hoc.

Problem: Merlin's config sync destroyed my configuration!

If it happened on a poller system, that's by design. Nothing to
do and nothing to try. Just re-do your work and make sure you
don't do it in a file that Merlin will overwrite occasionally.
If it was on a peer system, it might be possible to save it.
Sneak a peek in /var/cache/merlin/backup and see if you can
find your config files there.
Problem: X happened and it's not a listed problem here

Perhaps it's by design, and then again it might not be. If you
think it's wrong, check the logfiles (all three of them) on
all systems involved and look for anomalies. When that's done
and you still haven't found anything, remain calm and write a
concise report stating what you did, what you expected to
happen and what actually happened. Feel free to include logfiles
and such as well, since I'll almost certainly want to look at
them anyway.
Problem: My peered pollers are acting up!

It's possible that the pollers are trying to push their config
to the master server. With the default sync command, a push goes
to all nodes at the same time, so one peered poller can end up
pushing to the other poller, which resets that poller and causes
it to try to push its configuration to the master; since it too
pushes globally, and by default only to pollers and peers, config
gets pushed back to the first node, which is then restarted, and
so on, etc, etc.
To fix it, it's usually enough to add an empty object_config
compound to all your configured master nodes. This removes the
config sync command for the master nodes, so the pollers won't
try to push configuration to them.
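Assuming the master entries live in each poller's merlin.conf, the
empty compound might look like this sketch (the address line is an
assumption):

    master yoda {
        address = yoda

        object_config {
        }
    }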
Problem: My extra files/plugins/whatever aren't being synced!

In order to sync files outside of the object configuration, one can
add an extra compound to the node configuration of the node one wants
to sync paths to. The relevant part of the config looks something
like this:

    hostgroups = hyperdrive
        /path/to/file/to/sync = yes
        /path/to/file2/to/sync = /path/on/remote/system

This will cause /path/to/file/to/sync to be sent to the same path on
the remote system, and the file /path/to/file2/to/sync to be sent to
/path/on/remote/system on the remote system. It should be possible to
list directories and not only files, but that is untested.
Caveat 1: This is not well tested, but what sparse tests I've done
indicate that it works.
Caveat 2: Peers normally send all of /opt/monitor/etc to each other,
so for those no extra configuration should be necessary in a normal
setup.
Caveat 3: Only the sha1 checksum of the object config, coupled with
the timestamp of the same, is used to determine which file to sync
where. There's no check to prevent overwriting a newer version of
the file on the receiving end.
Caveat 4: This will run as the same user as the merlin daemon. I
have no idea if file ownership and permissions will be preserved.

If caveat 3 bites your ass, check in /var/cache/merlin/backups (or
/var/cache/merlin/backup) for the original files.

This is sort-of supported as of merlin-1.1.7, but see caveat 1.
Problem: After a restart, Ninja hangs/is empty for a loong time!

If you have a large system, you should be using ocimp instead of
import.php to run the initial import of your configuration. It's
well over 10 times as fast and will make the waiting period that
much shorter. This will also most likely help if you're seeing an
empty UI from time to time. It makes the largest difference in large
environments, of course, but even smaller ones should benefit from
it. ocimp was considered "stable beta" as of merlin-1.1.8-beta2.
Problem: One of my nodes can't connect to the others!

On the node that can't connect to the other nodes, you need to add
a node-option to prevent it from attempting to connect to the nodes
it can't reach. This is useful if you know there is a firewall that
you won't be allowing new connections through in one direction, and
especially if you're using a tripwire-rigged firewall. An example
config would look like this:

    master behind-firewall-we-cant-connect-through {
        address = master.behind.firewall.example.com

This will cause the merlin daemon on this node to never attempt to
connect to master.behind.firewall.example.com. Normally, both nodes
attempt to connect at startup and then every so often for as long as
the connection is down.
This feature was introduced in Merlin 1.1.0.
Problem: My poller is monitoring a network behind a firewall which
the master can't see through!

There's a node-option you can use on the master node which only
works when configuring poller nodes. Here's what it would look
like:

    poller watching-network-behind-firewall-where-master-cant-see {
        address = admin-net-poller.example.com

This feature was introduced in Merlin 0.9.1.
Problem: A peer has crashed hard in my network, but the other peer
isn't taking over the checks!

If you have no pollers in your merlin network and you're running
merlin 1.1.14-p5 or earlier, you've most likely been hit by a
fairly ancient bug in the Merlin daemon, where node timeouts were
only checked when there were pollers attached to the network.
This was fixed in 7f2fc8f9850735ab549d3bbfa987b09af4cc1a6d, which
is part of all releases after 1.1.14 (i.e. 1.1.15-beta1 and up).
Problem: Nodes keep timing out even though I know they're still
connected. They just send data very rarely and slowly!

As of 1.1.14-p8 (ae67657bc1bd84beeac70759df5d87a2aac30fb2), you
can set the data_timeout option for your nodes and have the
merlin daemon grok it to mean how often a particular node must
send data to still be considered alive and well. The default
value is pulse_interval * 2, so 30 seconds in total. You can
increase it a lot, or set it to 0 to avoid the checking entirely,
but eventually the system TCP timeout will kick in and kill the
connection anyway.
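A sketch of what that could look like in merlin.conf; the address
line and the value 90 are just example assumptions:

    poller solo {
        address = solo
        data_timeout = 90
    }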
Problem: I want feature X!

Problem: Make feature Y work like this!