By Ivan Skytte Jørgensen
Copy of blog post I made for Privacore/Findx on 2017-05-05
I recently had to rework how the search engine instances keep track of each other, and ended up developing a stand-alone tool.
The search engine instances need to know which instances are alive and which are dead so they know which instances to ask and wait for sub-answers from and which to skip and find other instances for. The old mechanism had each instance periodically "ping" each other instance, exchanging information such as "yes, I’m not dead" as well as some auxiliary information such as repair mode, configuration checksums, etc.
That works reasonably well until you start having lots of instances. Eg. with 500 instances and a ping interval of 1 second you have each instance producing 499 requests and getting 499 responses back, as well as having to answer 499 incoming pings. It doesn’t scale that well. The old mechanism had a way to mitigate this by having a limit of how many in-flight ping requests it could have. But that slows down the propagation of information noticeably. The way to avoid the fully meshed communication between 500 instances is to use a hierarchy / tree, but changing the logic to that didn’t look feasible so I started looking for moving the whole logic out of the search engine and instead have a separate (3rd-party) component for that.
I did check if someone had made a lightweight heartbeat/keepalive thing but couldn’t find any that didn’t pull in unnecessary baggage. There were no need for some of the full-blown HA systems (MC ServiceGuard, HACMP, Sun cluster, …). I didn’t need the additional requirements of a dedicated heartbeat network segment, quorum disks, and monitoring-supervisor-hypervisor-metavisor-failover-thing. Some of the not-heavy-weight components were tied to specific problems, eg. keepalived is tied to VRRP and isn’t really suitable for application support. Others were meant for controlling a frontend (eg http load-balancer) and tell it which backends are working. What I really needed was a peer-oriented no-master keepalive-network and only needed weak guarantees and flexibility in its configuration and use.
I simply couldn’t find such a piece of software.
I chose to make the first version in python which may surprise some people, but handling 20-40 small packets per second is not a problem in Python. I also aimed for zero configuration. It should just work out of the box. It does that – as long as you only have one network. When you have more than one network you do need extra configuration either for the tool or in your routers to allow multicast or broadcast.
I went for a natural hierarchy: one monitor per host, and all instances on that host talk to the monitor on that host. The monitors are fully meshed. The full mesh of monitors won’t scale infinitely but it will scale to 50 hosts without problems.
I just finished making the search engine use the new heartbeat tool and the results so far are very promising. Much faster detection of alive instances, much quicker exchange of auxiliary information, and seems to be more reliable.
The only "gotcha" I have found so far is if you have two test environments on the same network the the monitor instances will discover each other and believe all instances are part of the same cluster; unless you configure it otherwise.
Of course, the heartbeat tool is open source: https://github.com/privacore/vagus