Monday, November 25, 2013

How to make an efficient Scalable Key Lookup engine of MySQL Cluster

MySQL Cluster have all the ingridients to be designed as a very scalable and extremely efficient key lookup engine for the Cloud. As we have shown in earlier entries of my blog we've been able to scale MySQL Cluster 7.2 to handle 72 million key lookups per second or 4.3 billion key lookups per minute. This benchmark was actually limited by a limitation of the NDB API nodes to handle more than about 300k lookups per second and so with a maximum of 255 nodes we got to around 72 million per second in total. However in MySQL Cluster 7.3 we have removed this limitation, in addition we have also enabled scaling to even bigger data nodes, so it should be possible now to reach even higher numbers.

The aim of this blog is however not to give any new benchmark results, rather it is providing details about how the benchmark program works and how this benchmark program architecture can be used to design an efficient scalable key lookup data store.

To obtain best possible performance we want to ensure that the data node can operate as efficiently as possible. This is done by ensuring that a connection to the data node sends many key lookups bundled together. Operating on individual key lookups is possible of course, but as usual it is more efficient to operate on bigger entities than one key lookup at a time. To provide this we use a concept we call Executor Thread. The Executor Thread will only execute key lookups aimed for a certain data node. So this means that the number of Executor Threads will be a multiple of the number of data nodes (there could be more than one thread per data node if necessary). The Executor Thread will receive key lookups from an internal queue handled by the application program (in our case the flexAsynch benchmark program). The key lookups are prepared by the Definer Threads. The Definer Thread will receive a key lookup aimed for any data node, it will take this key lookup and calculate the receiving data node for this key lookup (there is API calls in the NDB API to handle this). Based on this calculation the Definer Thread will put the key lookup in the queue of the proper Executor Thread.

The architecture before the Definer Thread is dependent on the application. In the figure provided here we have shown one possible architecture where we have one receive thread that receives a flow of messages from somewhere, to process those messages we need to interpret the packets and process them, this could entail one or more key lookups. In the figure we have assumed there is one key lookup per message and that the Executor Thread can format the packet back to the sender based on the information in the internal key lookup order.

So the important part of the architecture is the Executor Thread that handles messages to one data node based on an internal data structure that defines one key lookup and defines how to process the response (this thread should do as little work as possible to ensure it can focus on communication with the data node). There should also be a Definer Thread that prepares the key lookup request and puts the request in the queue of the proper Executor Thread. The Definer Thread could also do other things and there could be few or many Definer Threads in the architecture.

So how does flexAsynch work, in this case we don't have any input traffic, we generate the key lookups in the Definer Threads. The Definer Thread has a very simple operation. It starts by preparing a set of key lookups to any data node. For each of those key lookups it puts the request in the queue of the proper Executor Thread. After placing all requests in a queue, it starts waiting for all operations to complete. After all requests have received their response from the Executor Threads we simply continue with the next batch.

The operation of the Executor Thread is also very simple, it gets the current set of key lookups waiting in queue, it prepares those for execution of the NDB API. It sends off all the operations to the data node. When the data node have executed all of the operations it reports the result back to the Definer Threads and updates some benchmark statistics, then it continues with the next batch of key lookups.

So the operation of an efficient key lookup data store is not difficult at all. To make it scale one can then add up to 48 data nodes per cluster (each is capable of handling more than 5 million key lookups per second of around 100 byte in size). Each cluster can handle a total of 255 nodes in total. Obviously it is also straightforward to operate more than one cluster to scale even further.

The benchmark code exists in storage/ndb/test/ndbapi/flexAsynch.cpp, the interesting code exists in the NEW module here (it also contains a lot of legacy code for old variants of the flexAsynch benchmark).

Tuesday, November 19, 2013

MySQL Cluster run-time environment: Part 3: Configuration recommendations

Binding threads to CPUs in the MySQL Cluster data nodes can have great benefits. So what about hyperthreading, should we use all CPU threads, or only 1 CPU thread per CPU core? The recommendation differs actually. For the most part it is beneficial to use hyperthreading. In most thread types it gives about 40% higher performance with 2 CPUs using hyperthreading compared to 1 CPU not using hyperthreading. There are a few cases where it might be beneficial to not use hyperthreading though.

The first example is for LDM threads. Using hyperthreading means we will increase the number of partitions of a table by a factor of two. In many cases this isn't beneficial, particularly when the number of LDM threads is high. I tried with using 24 LDM threads with hyperthreading on 12 CPU cores and compared it to 12 LDM threads on 12 CPU cores. This case didn't benefit from using hyperthreading of LDM threads. However if the number of LDM threads is low it would probably still pay off, so going from 2 to 4 LDM threads is still beneficial, probably also going from 4 to 8. But going from 8 to 16 is less likely to be beneficial.

I have tested send and recv threads with and without hyperthreading, my benchmarks have always improved by using hyperthreading. It has always been better to use hyperthreading. The same conclusion I've seen with tc threads. Obviously if the main thread or the rep thread for some reason becomes the major bottleneck, then it makes sense to remove use of hyperthreading here.

Avoiding use of hyperthreading can be simply done by not configuring any threads to use the second CPU on each of the CPU cores we want to avoid hyperthreading on. As an example we will configure a machine with one data node, the machine have 4 sockets with 24 cores and 48 CPU threads. The CPUs 0-5 represent the CPU threads on socket 0 and core 0-5 and thread 0, CPUs 24-29 represents socket 0 and core 0-5 and thread 1. So if we want to configure with 12 LDM threads not using hyperthreading here we could use the config:


In this configuration the LDM threads will use CPUs 0-11 which covers all cores on socket 0 and 1. No other thread is configured to use any CPU thread in those CPU sockets. The OS might still decide to use some of them, but we have left a number of empty CPUs that the OS hopefully discovers as idle and uses those to schedule OS activities. We can actually do even better than this, there is a boot config variable in Linux whereby one can specify which CPUs the OS is allowed to use. Similarly there is a similar config variable for irqbalance to ensure interrupts are not scheduled on any CPU used by the MySQL Cluster data node. The tc, send and recv thread are scheduled on any the CPU threads on socket 2 and the main, rep, wd, and io threads are using 2 cores on socket 3. The OS and other processes aren't blocked from using other CPUs, but will most likely be scheduled on any of those free CPUs with no activity on them.

The first question that one starts in specifying the MySQL Cluster run-time environment would be to decide on how many ldm threads one should use. This is dependent on how many CPUs that are accessible to the data nodes. So assuming we have access to 24 CPU threads with hyperthreading. In this case it would be natural to start by using the number of ldm threads set to 6 and not use hyperthreading. A natural initial start is to use half of the available CPU cores for ldm threads. Next one assigns about a quarter of the ldm CPU resources to tc threads, in this case we land at 3 tc threads. Next one assigns a similar number of CPUs to send and recv threads. Then one assigns the rest of the CPUs to main, rep, io and wd threads. This should give a fair amount of resources available also to the OS.

After this initial assignment it is a good idea to run some benchmark which is close to the workload of your application to see whether the config works well. For this test run one should use cpubind for all thread types to ensure that we know how much CPU resources each thread type consumes (can be easily derived looking at top with per-CPU load mode, using cpubind). First check whether ldm threads is a bottleneck, if it is then check if it is possible to increase to the next level. In this example this would mean going to 8 ldm threads then using 2/3 of the CPU resources. If this isn't possible then just make sure that the rest of the thread types have a fair amount of CPU resources to avoid any unneeded bottlenecks.

If the bottleneck isn't the ldm thread, then assign more resources to this thread type and in most cases by removing resources from non-ldm threads. There could be cases where less than half of the CPU resources are needed by the ldm threads, but I would deem those as very unusual. Given that the actual database processing is done in the ldm threads, it would be an exceptional case if other threads consume more than half of the resources.

Always remember to update the NoOfFragmentLogParts variable if changing the number of ldm threads.

After a few trials we have most likely found a decent configuration. After finding this configuration we can also consider where to use cpuset and if any threads should use the realtime or spintime variables.

So next question is when to use cpubind, cpuset, realtime and spintime.

Both cpubind and cpuset is about locking threads to an individual CPU or a set of CPUs. We could actually consider even using no cpubind and cpuset as a form of CPU locking. We are locking the threads to the set of available CPUs in the OS. Given that we might be running on a virtual OS this might actually already be a subset of the existing subset of the existing CPUs. To make any configuration of CPUs using cpubind/cpuset one has to have knowledge of how cpu ids maps to CPU sockets and CPU cores.

So the default configuration not using any cpubind/cpuset is to allow the OS to schedule the thread onto any of the available CPUs. The OS scheduler is optimised towards an interactive environment where processes need to react to human interaction. It does also a good job of server environments where a fairly high number of threads compete for a small number of CPUs. It does also do a decent job of handling server environments where there are only a handful of threads competing for a number of CPUs.

The type of threads that the normal OS schedulers have most problems to handle are long-running threads that consume a lot of CPU resources. In particular threads that run more or less constantly. What happens is that the OS scheduler eventually downgrades their priority such that other processes are given a chance to execute, since the process still wants to execute the OS now searches for a free CPU to use. This is often successful, the problem is however that this means that we migrate the thread onto a new CPU. This happens many hundreds of times per second for those busy threads. The effect is that the thread comes to a new CPU where it has no data or instructions cached in the CPU caches. So this means that a migrated thread will spend quite a lot of time to warm up CPU caches before it can run as efficiently as it could before the migration. In addition this requires more bandwidth on the memory bus which in some cases can become a bottleneck.

In the above case it is better to stay with the same CPU even if a new job is scheduled, this is the case if we can be sure that the new process isn't one more long-running thread. So this means that in order to optimise the MySQL Cluster run-time environment we need to be in control over all usages of the CPU on the machine we're using. In many cases we colocate data nodes and MySQL Server processes. We can control placement of MySQL Server processes using numactl or taskset that can be applied when starting the process or using pid when the process is already started. This ensures that the MySQL Server process is never scheduled outside the set of CPUs we gave it access to through the taskset/numactl process. This is how I control the environment when running any MySQL Cluster benchmark. Similarly I also control the benchmark processes (sysbench/dbt2 client processes/flexAsynch...).

In this manner I am certain that no application process is using the CPUs provided for execution of the data node threads. So the only threads that will execute on the CPUs are either OS kernel threads or interrupt handling. Even this can be controlled by using a special boot option isolcpus and list the cpu numbers that the OS is allowed to use. When setting this variable only the listed cpus are using a normal scheduler handling with the possibility to migrate CPUs. Usage of the rest of the CPUs can only be invoked if the application uses a locking call such as controlled by cpuset/cpubind or taskset/numactl. The final thing to control is the execution of interrupts. This is normally handled by the irqbalance process and this can be configured to avoid a bitmap specified by the irqbalance configuration variable IRQBALANCE_BANNED_CPUS. So it is possible to create a completely compartmentalized load with interrupts on certain CPUs, OS and other applications on another set of CPUs, a set of CPUs for data node, a set of CPUs for MySQL Server and a set of CPUs for any other consuming application process. Providing this is a combination of boot options, irqbalance configuration, MySQL Cluster configuration and finally using taskset/numactl on certain processes.

So when to use cpubind and when to cpuset. cpubind should mainly be used on threads that can consume up to 100% of the CPUs. This would mainly be the ldm threads normally, but can definitely be the case also for other threads dependent on the workload and the configuration. So one way of configuring MySQL Cluster environments is to use cpubind for ldm threads and put the tc, send and recv threads in a common cpuset. The nice thing with this configuration is that we can run with more threads than really necessary, so e.g. we can run with 4 tc threads, 4 send threads and 4 recv threads and put these into a cpuset consisting of e.g. 8 CPUs. In this manner the  OS can easily handle if there is a certain thread types that requires more resources for a while.

So configuring a MySQL Cluster run-time environment is about making use of static scheduling for the most critical resources and to use the flexibility of the OS scheduling to handle the less critical resources. A good design philosophy for all environments like this is to design the run-time environment with a well-known bottleneck. For MySQL Cluster data nodes we recommend to always make the ldm threads the bottleneck. The nice thing about this is that it makes it easier to understand how to handle overload and reason around it. As an example if the ldm thread are overloaded and tc threads have available resources we can ensure that the tc threads handle sending error messages about overload without even contacting the ldm threads. This can be achieved by decreasing available resources in ldm threads through configurations at least for scan operations.

Hopefully these recommendations will help you find the optimal configuration and still a safe configuration. Default configurations will work for most installations, but to get the last 10-100% performance out of the system one might need to dive a bit deeper into the configuration of the MySQL Cluster run-time environment.

Next configuration item to consider is the realtime setting. This can now be set on each thread type. Traffic queries normally execute arriving in the recv thread, sent to the tc or the ldm threads and then the reply is sent through the send thread. main thread is mainly involved in the meta-data operation which rarely are time-critical. rep threads can have a high load, but they are not part of any critical paths except for asynchronous replication to other clusters. The io thread is only involved in time-critical operations if disk data is used in MySQL Cluster. The wd threads are different, obviously it is important that the watchdog thread gets an opportunity to execute every now and then. So if other threads are using realtime it's a good idea to use realtime also on the wd thread type. recv, send and tc threads are usually doing small jobs and thus realtime scheduling might be beneficial for those threads to cut the response time. For ldm threads that execute close to 100% of the time it's debatable whether realtime is such a good idea. The OS cannot handle realtime threads executing on realtime priority for a long time, so we have implemented protection for this in the data nodes. This ensures that we decrease the priority to normal user priority even for realtime threads if they execute for too long. It's rarely any benefits of throughput to use the realtime configuration for threads. It's mainly intended to enable less variation on response time.

Another use case for realtime is when there is a mix of data node threads and other application threads where the application threads have lower priority. In this case we ensure that the data node threads gets prioritised access to CPUs before the other application threads gets access.

The final configuration item to consider is the spintime. This means that the data node thread will execute for a bit longer before entering sleep mode. So if the recv thread or any other thread sends a new message to the thread in this spintime we decrease the wake up time. The only case where this can increase throughput is if the spinning thread is the bottleneck of the data node and the spinning doesn't steal resources from any other critical thread. The main usage of spintime would be as a tool to improve response time in an environment with plentiful of CPU resources. It is important to consider that the use of spintime will increase use of CPU resources for those threads it is set on. It only applies to the thread types ldm, tc, main, rep, send and recv threads.  One should in most cases avoid mixing spintime and realtime settings on the same thread.

MySQL Cluster run-time environment: Part 2: How to configure

The first selection in selecting run-time environment is done by selecting the binary. The original NDB run-time environment was a single-threaded environment where we executed everything except IO threads in a single thread. This environment actually still exists and is selected by running with the binary ndbd. The new multithreaded environment is selected by running with the binary ndbmtd. The ndbmtd can clearly scale far beyond ndbd, but ndbd can still have advantages in environments which has a low load and needs optimisation on latency. Since ndbd does receive on socket, execution of signals and send of signals in the same thread it can have shorter latency at the cost of scalability. Most of the rest of this description discusses ndbmtd configuration. The ThreadConfig variable doesn't apply to running with ndbd since ndbd uses hard-coded thread configuation consisting of 1 main thread.

When configuring the MySQL Cluster run-time environment the most important variable is ThreadConfig. One can alternatively also use MaxNoOfExecutionThreads and LockExecuteThreadToCPU as we will show below. There is also LockMaintThreadsToCPU that can be used to bind IO threads to a specific CPU, it is now recommended to do this using ThreadConfig for the locking of io threads instead since it has more options on how to do this. RealtimeScheduler and SchedulerSpinTimer can be used to set real-time scheduling and spin time on threads. If ThreadConfig is also used then these variables are only used as default settings that can be overridden by the ThreadConfig setting. SchedulerExecutionTimer is not applicable when running with ndbmtd. It is only applicable when running with ndbd.

LockPagesInMemory is part of the MySQL Cluster run-time environment and can have a heavy impact on response times. Setting this hasn't changed and its setting is independent on the rest of the settings discussed in this blog.

NoOfFragmentLogParts is an important variable that should be set equal to number of ldm threads if number of ldm threads is larger than 4, it cannot be set to anything smaller than 4. One can set NoOfFragmentLogParts larger than number of ldm threads, there is no advantage of this that I can think of. It can be set to the following values: 4, 6, 8, 12, 16, 24, 32.

The types of threads we have are ldm, tc, main, rep, send, recv, io and wd threads. These were discussed in part 1 of this blog. main and rep have a fixed amount of threads equal to one. The wd threads are always three with specific roles, however from a configuration point of view the number of wd threads is one. ldm, tc, send and recv threads have a configurable amount of threads. There can be 1,2,4,6,8,12,16,24 or 32 ldm threads. There can be anywhere between 1 to 32 tc threads and anywhere between 1 and 16 send threads and likewise between 1 and 16 recv threads. The number of ldm, tc, send, recv, main, rep and wd threads are fixed once the configuration is given, when the ndbmtd process is started these threads are started and not stopped until the process is stopped. The io threads are different, the io threads are handled by a dynamic pool of threads.

The io thread pool is controlled by two variables, the InitialNoOfOpenFiles which gives the initial amount of io threads (one thread handles one open file). The second variable is MaxNoOfOpenFiles that specifies the maximum of io threads that will be created. DiskIoThreadPool can be used to set the number of parallel accesses to disk data files, it doesn't affect the number of io threads. It only affects that more than one thread at a time can access a disk data file.

The preferred manner of configuring most of this environment is by using the ThreadConfig variable. This variable accepts a string and the string is parsed. I will start by giving an example.


In this example we see that for each thread type we can specify the following information, count, this is the number of threads of this type, count must be 1 for main and rep and also for io and wd. For io and wd threads the count is actually ignored since IO threads are handled with a dynamic pool and wd threads are 3 specific threads (WatchDog, SocketServer and SocketClient thread).

cpubind can be given for all thread types, cpubind is a comma separated list of CPUs where one can also use ‘-‘ to indicate an array of CPUs. So e.g. 0-3 is equivalent to 0,1,2-3. Threads are assigned to CPUs in order, so for the ldm's in the above config the first ldm thread will be bound to CPU 0, the second ldm thread to CPU 1, the next to CPU 2 and the last ldm thread to CPU 3. CPU 3 here is the CPU ordering imposed by the OS. On Linux one can get details on this by running 'cat /proc/cpuinfo', this will provide exact information about the CPU Socket the CPU belongs to, the CPU core the CPU belongs and which of the hyperthreads it runs, it will also list a bunch of other things. One must have at least one CPU per thread assigned here, otherwise an error will occur, having more CPUs specified is ok, but only the first ones provided will actually be used. cpubind always means that the thread can only execute on the specified CPU.

Setting realtime=1 means that the thread will be using the real-time scheduler of the OS. To protect the OS from starvation we will yield in this case if we execute for too long with real-time priority and allow other threads to get some work done as well. As can be seen from the example we can use real-time setting on a subset of the threads. The global configuration variable RealtimeScheduler is still used and sets the default value for a thread, so if this is set to 1, then we need to set realtime=0 on a thread to ensure we don't get real-time scheduling on it. Use of the real-time scheduler is mainly intended to lower the variance of the latency. Using it isn't likely to improve performance except possibly in a highly loaded system where other processes are not requiring real-time scheduling. Given that the OS often expects real-time threads to execute for short time with high requirements on short response time, it is not recommended to mix this configuration option with spintime on the same thread.

Setting spintime=50 means that we will spend at least 50 microseconds of executing signals and waiting for more signals before setting the thread to sleep again. This parameter is mainly intended as well to improve our latency, by keeping the threads ready to go, we can react faster to a query. We cannot set the spintime higher than 500 microseconds. It will consume considerably more CPU resources setting spintime and provides a tiny bit of extra throughput.

In the latest version of MySQL Cluster 7.x we also introduced the ability to bind threads a set of CPUs. Each CPU can only belong to one CPU set, on Solaris a cpu set is exclusive also to usage of the data node. In Linux it only sets the whereabouts of the MySQL Cluster data node threads. Other threads can still be scheduled on the same CPUs.

As an example we could use the following config:

In this example we have the same amount of threads as in the previous example, but we divided the threads into two CPU sets. The first CPU set covers CPU 0-3 and is only used by the 4 LDM threads. The other CPU set covers CPU 4-7 and takes care of the remaining threads in the data node. In this manner we can arbitrarily configure scheduling domains. cpuset have the advantage that the OS gets some liberty to do dynamic scheduling, the default config with no CPU sets or CPU bindings means that we have one CPU set consisting of all CPUs. By using cpuset and cpubind we can mix usage of OS scheduler and our own fixed scheduler. Usually we want the threads with most load to have their own CPUs, for other threads that have more variable load it makes sense to use cpusets such that the OS can dynamically schedule the threads dependent on the current load of the system.

The thread types that have variable number of threads is currently ldm, tc, send and recv threads. In 7.3.3 we have extended the support of threads such that we can have 32 ldm threads, 32 TC threads, 16 send threads and 16 recv threads. TC, send and recv can have an arbitrary number and it would work also to change the ThreadConfig variable, at least after an initial node restart. However the number of LDM threads has more restrictions, first of all it cannot be changed other than after starting up a completely new cluster. Also we only allow for the following number of LDM threads, 1,2,4,6,8,12,16,24 and 32. A very important part of the config which is easy to forget is that if we increase the number of LDM threads beyond 4, then we also need to set NoOfFragmentLogParts to at least the same as the number of LDM threads. Normally one would simply set this variable to the same value as the number of LDM threads.

A final note on the usage of ThreadConfig is that one can even divide one thread type into several groups, so one could write ldm={cpubind=0},ldm={cpubind=1},ldm={cpubind=2},ldm={cpubind=3) as equivalent to ldm={count=4,cpubind=0-3}.

If one wants to avoid discovering the best run-time configuration one can also still use the MaxNoOfExecutionThreads variable together with LockExecuteThreadToCPU to get a simpler, but not as flexible to configure the run-time environment. The number of ldm, tc, send and recv threads is in this case dependent on MaxNoOfExecutionThreads through a lookup in a table which is found in the code (there is a program to generate this table).

Here is the table (it is found in mt_thr_config.cpp, both the table and the program generating it). Each entry in the table have five entries, the first one is the value of the MaxNoOfExecutionThreads, the next four ones are number of ldm threads, number of tc threads, number of send threads and finally number of recv threads.

  static const struct entry
    Uint32 M;
    Uint32 lqh;
    Uint32 tc;
    Uint32 send;
    Uint32 recv;
  } table[] = {
    { 9, 4, 2, 0, 1 },
    { 10, 4, 2, 1, 1 },
    { 11, 4, 3, 1, 1 },
    { 12, 6, 2, 1, 1 },
    { 13, 6, 3, 1, 1 },
    { 14, 6, 3, 1, 2 },
    { 15, 6, 3, 2, 2 },
    { 16, 8, 3, 1, 2 },
    { 17, 8, 4, 1, 2 },
    { 18, 8, 4, 2, 2 },
    { 19, 8, 5, 2, 2 },
    { 20, 8, 5, 2, 3 },
    { 21, 8, 5, 3, 3 },
    { 22, 8, 6, 3, 3 },
    { 23, 8, 7, 3, 3 },
    { 24, 12, 5, 2, 3 },
    { 25, 12, 6, 2, 3 },
    { 26, 12, 6, 3, 3 },
    { 27, 12, 7, 3, 3 },
    { 28, 12, 7, 3, 4 },
    { 29, 12, 8, 3, 4 },
    { 30, 12, 8, 4, 4 },
    { 31, 12, 9, 4, 4 },
    { 32, 16, 8, 3, 3 },
    { 33, 16, 8, 3, 4 },
    { 34, 16, 8, 4, 4 },
    { 35, 16, 9, 4, 4 },
    { 36, 16, 10, 4, 4 },
    { 37, 16, 10, 4, 5 },
    { 38, 16, 11, 4, 5 },
    { 39, 16, 11, 5, 5 },
    { 40, 16, 12, 5, 5 },
    { 41, 16, 12, 5, 6 },
    { 42, 16, 13, 5, 6 },
    { 43, 16, 13, 6, 6 },
    { 44, 16, 14, 6, 6 },
    { 45, 16, 14, 6, 7 },
    { 46, 16, 15, 6, 7 },
    { 47, 16, 15, 7, 7 },
    { 48, 24, 12, 5, 5 },
    { 49, 24, 12, 5, 6 },
    { 50, 24, 13, 5, 6 },
    { 51, 24, 13, 6, 6 },
    { 52, 24, 14, 6, 6 },
    { 53, 24, 14, 6, 7 },
    { 54, 24, 15, 6, 7 },
    { 55, 24, 15, 7, 7 },
    { 56, 24, 16, 7, 7 },
    { 57, 24, 16, 7, 8 },
    { 58, 24, 17, 7, 8 },
    { 59, 24, 17, 8, 8 },
    { 60, 24, 18, 8, 8 },
    { 61, 24, 18, 8, 9 },
    { 62, 24, 19, 8, 9 },
    { 63, 24, 19, 9, 9 },
    { 64, 32, 16, 7, 7 },
    { 65, 32, 16, 7, 8 },
    { 66, 32, 17, 7, 8 },
    { 67, 32, 17, 8, 8 },
    { 68, 32, 18, 8, 8 },
    { 69, 32, 18, 8, 9 },
    { 70, 32, 19, 8, 9 },
    { 71, 32, 20, 8, 9 },
    { 72, 32, 20, 8, 10}

MySQL Cluster run-time environment, Part 1

The run-time environment in MySQL Cluster have always been geared towards high throughput, but also at having many features providing real-time access to the data in MySQL Cluster. Real-time features have included the possibility to lock the memory such that no swapping occurs. It has also included the ability to control placement of the many threads introduced in the MySQL Cluster 7.x series.

In the latest version of MySQL Cluster 7.x we have improved these real-time features and enabled a higher level of control over how those features are configured in the MySQL Cluster data nodes.

Given the high flexibility of configuring MySQL Cluster 7.x we have moved away from global config variables affecting all threads and instead moved towards using the ThreadConfig variable to specify the configured behavior of MySQL Cluster data nodes. The global variables are still in use, but they only set the default value, this value can be overridden by the settings in ThreadConfig.

Given that the ThreadConfig variable has such a key role in configuring the run-time environment of the MySQL Cluster data nodes, it is a good idea to spread more light on what can be configured now in the latest versions of MySQL Cluster 7.x. To assist users in this, I've written a series of three blogs where this is the first, and this blog introduces the concepts. The second blog gives information on what syntax to use and which configuration variables to use. Finally the third blog gives recommendations on which configuration setting to use in a number of different situations.

So what can be controlled using the ThreadConfig variable. We can control all threads that are created as part of running a multithreaded data node. There is only one exception to this rule and this is the IO threads which are handled by a dynamic pool of threads. We can still set some properties of these IO threads, but there are other configuration variables available to set the properties of the pool size, initial allocation of threads for IO.

We have the following thread types in a data node. We have ldm threads, these are the most central threads that runs the actual local database processing. Each thread takes care of its own partition of the data. We have a set of tc threads that takes care of transaction coordination. Each query to the data nodes starts with a message to a tc thread, this thread type also executes join execution pushed down to the data nodes from the MySQL Server. The main thread runs a large amount of code to handle control of system restarts, meta-data operations, control over checkpointing and some other management operations. It is used very lightly in normal operation. The rep thread receives update triggers from the ldm threads and is responsible for transporting those events to the proper API node. This thread is heavily used if asynchronous replication is used, otherwise it's not used much at all.

We have also enabled the possibility to split out all TCP/IP traffic from the other threads, this includes that we have threads taking care of all send calls to other data nodes and API nodes and we have threads taking care of receive calls on sockets and preparing those for execution of the other threads. It is still possible to configure without specific send threads, in this case the sending will happen from the ldm, tc, main and rep threads directly.

One way of understanding how these different threads use the CPU is to assign them to different CPUs and watch in top how much of each CPU that is used. I will briefly give some guidelines on what effects how much a certain thread type is used.

The ldm threads is the most common bottleneck in a data node. Each ldm thread handles the hash index, the ordered indexes and the actual data of a hash partition of a certain table. This means that most of the time the various ldm threads have similar load on the CPU. However if there are certain rows that are very hot, then some level of imbalance could happen between the CPUs. There are really three main activities going on in the ldm threads that are worth mentioning for normal operation. These are primary key lookups (using the hash index on a table), scans of a table (there are a couple of variants on how to scan a table) and there is checkpointing activity. The checkpoint activity is a background activity that comes and goes and its load is dependent on configuration settings. The cost of the primary key lookup is a fixed lookup cost and then there is a variable cost dependent on the amount of data read in the row and sent to the API. The same is true for scans. Each CPU handling a ldm thread could normally handle in the order of a few hundred thousand lookups or one or two million scanned rows (with delivered data, one can scan even faster if the row wasn't selected to be sent to the API). One might wonder how unique key lookups come into play, they are simply two primary key lookups, the first on the unique index and the second on the table itself. So seen from the ldm threads point of view there is no unique key lookup, this is a concept handled through transaction coordinator in the tc thread and local triggers in the ldm thread.

The tc thread is involved in every lookup and scan as well, the amount of work done in the CPU is completely dependent on the type of queries. Obviously if the ldm thread spends little time on each request, then the tc thread is more involved and spends more CPU per ldm thread. For most activities one would need about 1 tc thread per 4 ldm threads. There are however exceptions when there are many long scan queries then less tc threads are needed (e.g. sysbench), when there are many short primary key lookups, then there might be a need for more tc threads (e.g. flexAsynch). Also with complex queries the tc threads are involved in that they handle parts of the join processing.

The main thread is usually lightly loaded and hardly ever requires a full CPU, the load on the rep thread can be high if one uses asynchronous replication, otherwise it's negligible load on this thread. The IO threads normally also have negligible load, the exception to this is where one have configured IO with compression in which case the compression and decompression happens in the IO threads. This causes a high load on each IO thread.

We have also introduced a new category of threads, the wd threads, these are actually 3 specific threads. It is the watchdog thread, the socket server thread and the socket client thread. The watch dog thread wakes up once every 100 milliseconds and does some brief checks to ensure that no data node thread has been stuck. The socket server thread listens to some sockets and handles the connection setup phase of new sockets. The data node does very small amount of socket setups normally since sockets stay up until nodes are stopped. The socket client thread is only involved at connection setup phase of sockets, it handles the client side of a connection setup. It effectively calls connect on a socket to see if it has become connected. Neither of those threads will consume any sizeable part of a CPU. Previously these threads were handled as part of the io threads. From a configuration point of view we manage these threads as one.

The send threads can have a substantial load dependent on the amount of traffic sent from a data node, but also dependent on how many nodes we communicate with. There is always a higher cost of communication if only small traffic goes on between two nodes. This is due to the fact that networking cost has a fixed overhead per message sent, so with small amount of traffic there will be less data per message and thus a higher overhead. A send thread can handle messages to any other node, so when a message is to be sent we choose the first send thread which is not currently used. We always search in the same order, so this means that the first send thread will always be most loaded and the last send thread will be the least loaded.

The receive threads are organized such that each socket is handled by one receive thread, there is one socket per node we are communicating with. So this means that the load of a receive thread is dependent on the activity on its sockets. The sockets communicating with API nodes are all similar and activity depends on what happens on the application side. The communication between data nodes is for the most part very small, there is one exception here and this is the socket between two nodes in the same node group. Here there is a lot of traffic between the nodes when we have write transactions, each write transaction means that there is transaction coordination messages going between the data nodes in the same node group. Here it is also important to recall that on the API side one can have several API nodes per application process. In particular the MySQL Server has a configuration variable called --ndb-cluster-connection-pool that sets the number of API nodes that the MySQL Server can use to communicate with the data nodes. Each API node have a separate socket to each of the data nodes in the cluster.

So armed with this information we can easily do tests to see how loaded the various threads are, this will help us to understand if we need to change our configuration to remove some bottleneck.

To understand how the  various thread types interact for the most common execution of a primary key lookup and a scan operation, we have added a few figures describing this.

The first figure shows how a primary key read operation flows between the threads in a data node.

The second figure shows how a scan operation flows between the threads in a data node.

The third figure shows a more detailed view on what goes in a scan operation, the protocol looks the same independent of scan type.