All modern (since 2007) server hardware uses a Non-Uniform Memory Access (NUMA) architecture. Under this architecture, each processor socket has local RAM that it can access faster than RAM attached to other sockets.
For this reason we strongly recommend running multiple Rhino nodes on multi-socket hardware, and using NUMA binding. The optimum number of nodes should be determined by performance testing, but a good starting rule of thumb is one node per socket for CPU/memory-bound applications. For I/O-bound applications, performance testing must be done to determine the optimum node count; in this case the optimum may exceed the number of sockets.
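The socket count and NUMA layout of a host can be inspected with the numactl utility on Linux, which is a reasonable starting point for choosing the node count. For example:

    # Show the NUMA topology: node count, the CPUs in each node,
    # and the amount of memory local to each node
    numactl --hardware

On a typical 4-socket machine this reports four NUMA nodes, suggesting a starting configuration of four Rhino nodes for a CPU/memory-bound workload.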
Performance effects of NUMA
Internal performance testing shows that on a 2-socket machine NUMA may be safely ignored, although binding does offer small improvements to maximum and 99th-percentile latencies.
For larger machines (4 sockets and up), the NUMA architecture cannot be ignored: it was not possible to size a Rhino node such that it exploited all sockets without crashing or exhibiting unacceptable latencies under load.
Linux Scheduling
Running multiple Rhino nodes on a multi-socket server should, in theory, be sufficient for production Rhino: the default scheduling policy on all supported operating systems attempts to keep the threads of one process on the same CPU and to balance load equally amongst CPUs. However, under low cluster load and during cluster startup this behaviour is not reliable, and it may not remain stable over time with daily load cycles.
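One way to observe this behaviour on a running system is to check where a node's memory has actually been allocated, for example with numastat (the process name used here is an assumption about how the Rhino JVM appears in the process list on your system):

    # Per-NUMA-node memory breakdown for running Java processes
    numastat -p java

If significant memory for one process is spread across several nodes, the default scheduler placement is not keeping that node local to a single socket.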
Using NUMA binding tools to restrict each Rhino node to a single CPU (using that CPU's local memory) guarantees that nodes never migrate between CPUs, and is considered safer in a production environment where sudden performance changes are undesirable.
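As an illustrative sketch, each Rhino node can be pinned to one socket and its local memory with numactl. The start script and node IDs shown here are placeholders, not part of a standard installation; substitute the start command and node IDs used in your deployment:

    # Bind Rhino node 101 to NUMA node 0: run its threads on socket 0's CPUs
    # and allocate its memory from socket 0's local RAM
    numactl --cpunodebind=0 --membind=0 ./start-rhino.sh -n 101

    # Bind Rhino node 102 to NUMA node 1
    numactl --cpunodebind=1 --membind=1 ./start-rhino.sh -n 102

Using --membind rather than --localalloc makes the memory policy strict, so allocations never silently spill onto a remote node; the trade-off is that allocation fails if the local node's memory is exhausted, so each node's heap must fit within one socket's RAM.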