Load-balancing with "sticky" selection

In computer networks, load-balancers, and sometimes in internal work schedulers, a mechanism is needed to distribute load to one of multiple backend servers. Three good examples are:

- An IP-based load balancer that, based on the user's IP address, selects a server among a whole farm of servers.
- A front-end to a cluster of application servers.
- A RADIUS proxy with multiple backend servers.

There are many ways to do this, but most of them have problems that lead to increased hardware requirements, less flexibility, and occasional outages.

The most widely known selection mechanisms that do not examine attributes of the work item are:

- First available backend. The first available backend is selected.
- Round-robin. The distributor selects backend A, then B, then C, then A, then B, then C, ...

The major problem with those mechanisms is that the distribution does not exploit the fact that multiple requests can benefit from being processed by the same backend. For example, when a web site serves content specific to a user, or uses user-specific settings, it is normally most efficient to let the same backend handle all of that user's requests, because the backend only has to look up the user in the database for the first request - subsequent requests can use cached lookups.

The most widely known selection mechanisms that do examine attributes of the work item are:

- Static selection. Which users go to which servers is statically configured and does not take into account that one or more backends could be down. E.g. users 1-999 go to backend A, users 1000-1999 go to backend B, users 2000-2499 go to backend C, and users 2500-2999 go to backend D.
- Dynamic selection. The distributor adjusts the backend assignment whenever a backend becomes available/unavailable, with no regard to the current assignment.

The problem with static selection is that when a backend is down, the users assigned to that backend cannot be served. The advantage is that the mechanism allows for unequal load capacity of the backends.

The problem with dynamic selection is that when the backend server pool changes, the assignments are essentially reshuffled and any cached lookups are lost, causing lower performance.

These problems can easily be solved, but I have seen many places where one of the problematic mechanisms above has been used for reasons unknown.

The mechanism/algorithm below has these primary goals:

- Select a backend for each possible input value. The mechanism must not be limited to a subset of the possible selection-attribute values, e.g. an IP-address range.
- Support a non-fixed number of backends. The mechanism must support a dynamic pool of backends where backends appear and disappear.
- Support backends that do not have equal capacity. The mechanism must support backends of unequal capacity and must distribute the load accordingly.

The mechanism has these additional goals:

- Low memory usage. The mechanism must not keep track of the backend assignment of each possible or active selection-attribute value. E.g. it must not use 2^32 internal elements to support the full IPv4 address range.
- High speed. The mechanism should use only negligible processing power compared to e.g. a system call or forwarding a UDP packet.
- Avoid generating "spikes" in redistributions. When the load capacity of a backend increases, the mechanism must redistribute the load gradually so the backend does not get a spike of new requests. When the load capacity of a backend decreases, the mechanism must gradually move load to other backends so the other backends do not get a spike of new requests.

The important thing to do about the potentially large input key space (e.g. IPv4 = 2^32) is to cut the key space into groups and then keep track of which backend each group is assigned to. The mechanism also has to remember which backends exist and what their estimated load capacity currently is.

The primary operation of the mechanism is to map an input key to a backend ID. Given the data structures above this operation is easy and fast:

1: calculate the key group of the key, e.g. take the low 5 bits of the key
2: look up the key group in the keygroup-to-backend assignment

The mechanism also has these "housekeeping" operations:

- Register a backend / make a backend appear. The backend is registered with a specified load capacity but nothing else is done. Key group assignment is only done when the application decides that there is time to consider redistribution.
- Deregister a backend / make a backend disappear. The key groups assigned to the backend must first be distributed to the remaining backends - a key group must never point to a non-existing backend. The backend can then be deregistered.
- Update the estimated load capacity of a backend. The backend registration is modified, and the mechanism notes that it may be worth considering redistributing the assignments.
- Redistribute the load. If backends have been registered/deregistered or have had their capacity estimate changed since the load was last evened out, then it is worth considering whether the load can be evened out; otherwise this operation is a no-op. To avoid spikes only one key group is moved per operation. This makes it possible for the application to decide how gradually the load change should be done.
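To make the data structures and operations above concrete, here is a minimal C++11 sketch of such a distributor. It is not the example implementation whose performance is quoted at the end of this article; apart from selectNode(), every name is made up here, the number of key groups is fixed at 32 (the low 5 bits of the key), and capacity estimates are assumed to be positive integers.

#include <cstdint>
#include <map>

// Illustrative sketch: keys are mapped to 32 key groups, and each key group
// is assigned to exactly one backend ID.
class Distributor {
public:
    static constexpr int kKeyGroups = 32;            // one group per value of the low 5 bits

    Distributor() {
        for (int& a : assignment_) a = -1;           // -1 = no backend assigned yet
    }

    // Primary operation: map an input key to a backend ID.
    int selectNode(uint32_t key) const {
        return assignment_[key & (kKeyGroups - 1)];  // 1: key group, 2: table lookup
    }

    // Register a backend with an estimated load capacity (> 0). Existing key
    // groups are only moved to it later, one at a time, by redistribute().
    void registerBackend(int id, unsigned capacity) {
        capacity_[id] = capacity;
        for (int& a : assignment_)
            if (a < 0) a = id;                       // the first backend gets everything
    }

    // Update the estimated load capacity of an already registered backend.
    void updateCapacity(int id, unsigned capacity) { capacity_[id] = capacity; }

    // Deregister a backend: first hand its key groups to the remaining
    // backends - a key group must never point to a non-existing backend.
    void deregisterBackend(int id) {
        capacity_.erase(id);
        for (int& a : assignment_)
            if (a == id) a = capacity_.empty() ? -1 : leastLoaded();
    }

    // Move at most one key group from the backend with the highest load per
    // unit of capacity to the one with the lowest. The application calls this
    // as often as it wants the load to shift; it is a no-op when the
    // assignment is already as even as the capacities allow.
    bool redistribute() {
        if (capacity_.size() < 2) return false;
        const int from = mostLoaded();
        const int to = leastLoaded();
        if (from == to ||
            (groupsOf(to) + 1) / double(capacity_.at(to)) >=
                groupsOf(from) / double(capacity_.at(from)))
            return false;                            // moving a group would not even things out
        for (int& a : assignment_)
            if (a == from) { a = to; return true; }
        return false;
    }

private:
    int groupsOf(int id) const {                     // key groups currently assigned to 'id'
        int n = 0;
        for (int a : assignment_) n += (a == id);
        return n;
    }
    double relativeLoad(int id) const {
        return groupsOf(id) / double(capacity_.at(id));
    }
    int mostLoaded() const {
        int best = capacity_.begin()->first;
        for (const auto& b : capacity_)
            if (relativeLoad(b.first) > relativeLoad(best)) best = b.first;
        return best;
    }
    int leastLoaded() const {
        int best = capacity_.begin()->first;
        for (const auto& b : capacity_)
            if (relativeLoad(b.first) < relativeLoad(best)) best = b.first;
        return best;
    }

    int assignment_[kKeyGroups];                     // key group -> backend ID
    std::map<int, unsigned> capacity_;               // backend ID -> capacity estimate
};

The flat key-group table is what keeps selectNode() cheap: a mask and one indexed load, independent of how many backends exist. The housekeeping operations can afford to be slower, since they run only when the backend pool or the capacity estimates change.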
Load assignment versus actual load

The mechanism does not take into account how much load is actually generated by the selection, because the mechanism does not know how expensive each work item is. This can lead to some regressions if an abnormally high percentage of a single key group generates requests that are disproportionately computationally expensive, leading to uneven actual load. The only way to combat this is to identify which types of requests are expensive and use a separate distributor with finer granularity to spread that load across the backends.

Total capacity versus spare capacity

The assumption of the mechanism is that the load capacity estimates are absolute and not relative, meaning that they represent the potential capacity available for the work, ignoring the load currently used by the work; e.g. idle CPU time is a bad estimate. In truth, the distributor is technically not a load-balancing algorithm but a load-sharing algorithm.

Performance of the example implementation

The C++ implementation on an Intel E6850 (3 GHz), compiled with GCC 4 and -O3, achieves approximately 500,000,000 selectNode() operations per second. If the algorithm were used for distributing even minimum-size ethernet frames (IEEE 802.3: ~37 octets on-wire) it could saturate a 148 Gbit/s link (500,000,000 selections/s x 37 octets x 8 bits/octet = ~148 Gbit/s).
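For readers who want to reproduce a number of that kind, a micro-benchmark could look roughly like the sketch below. It reuses the hypothetical Distributor class from the earlier sketch (not the example implementation measured above), and the backend IDs and capacities are arbitrary.

#include <chrono>
#include <cstdint>
#include <cstdio>

// Assumes the Distributor sketch shown earlier is available in the same
// translation unit; this is not the benchmark that produced the quoted figure.
int main() {
    Distributor d;
    d.registerBackend(1, 100);
    d.registerBackend(2, 100);
    d.registerBackend(3, 50);
    for (int i = 0; i < 1000; ++i) d.redistribute();  // let the assignment settle

    const uint32_t iterations = 500000000;            // 5e8 selections
    uint64_t sink = 0;                                // keeps the loop from being optimized away
    auto start = std::chrono::steady_clock::now();
    for (uint32_t key = 0; key < iterations; ++key)
        sink += d.selectNode(key);
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("%.0f selectNode()/s (sink=%llu)\n",
                iterations / seconds, (unsigned long long)sink);
    return 0;
}

Note that such a tight loop measures little more than the mask and the table lookup, which is exactly the point: the selection path should cost a handful of instructions.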