Tuesday, November 19, 2013

MySQL Cluster run-time environment, Part 1

The run-time environment in MySQL Cluster has always been geared towards high throughput, but it also provides many features for real-time access to the data in MySQL Cluster. Real-time features have included the possibility to lock the memory such that no swapping occurs. They have also included the ability to control the placement of the many threads introduced in the MySQL Cluster 7.x series.

In the latest version of MySQL Cluster 7.x we have improved these real-time features and enabled a higher level of control over how they are configured in the MySQL Cluster data nodes.

Given the high flexibility of configuring MySQL Cluster 7.x, we have moved away from global configuration variables affecting all threads and instead moved towards using the ThreadConfig variable to specify the configured behavior of the MySQL Cluster data nodes. The global variables are still in use, but they only set the default values; these values can be overridden by the settings in ThreadConfig.
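As a sketch of this override behavior (the exact syntax is the topic of the second blog in this series, so treat the counts and property names here as illustrative and check them against the reference manual for your version), a config.ini data node section might combine a global default with per-thread overrides:

```ini
[ndbd default]
# Global defaults: these apply unless overridden per thread.
LockPagesInMainMemory=1   # lock the memory so no swapping occurs
RealtimeScheduler=0       # default: no real-time scheduling

# Per-thread settings in ThreadConfig override the global defaults,
# e.g. real-time scheduling enabled only for the ldm threads.
ThreadConfig=ldm={count=4,realtime=1},tc={count=1},main={count=1},rep={count=1}
```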

Given that the ThreadConfig variable has such a key role in configuring the run-time environment of the MySQL Cluster data nodes, it is a good idea to shed more light on what can be configured in the latest versions of MySQL Cluster 7.x. To assist users in this, I've written a series of three blogs of which this is the first; this blog introduces the concepts. The second blog describes the syntax and the configuration variables to use. Finally, the third blog gives recommendations on which configuration settings to use in a number of different situations.

So what can be controlled using the ThreadConfig variable? We can control all threads that are created as part of running a multithreaded data node. There is only one exception to this rule, and that is the IO threads, which are handled by a dynamic pool of threads. We can still set some properties of these IO threads, but separate configuration variables control the size of the pool and the initial allocation of threads for IO.
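To make the IO thread exception concrete, a sketch might look as follows (DiskIOThreadPool is the variable I have in mind for the pool size; the values are illustrative, and as noted the pool itself is not sized through ThreadConfig):

```ini
[ndbd default]
# The dynamic pool of IO threads is sized with its own variable,
# not through ThreadConfig.
DiskIOThreadPool=2

# ThreadConfig can still set properties of the io threads,
# such as which CPU they run on.
ThreadConfig=ldm={count=4},tc={count=1},io={cpubind=8}
```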

We have the following thread types in a data node. The ldm threads are the most central threads; they run the actual local database processing, and each thread takes care of its own partition of the data. We have a set of tc threads that take care of transaction coordination. Each query to the data nodes starts with a message to a tc thread; this thread type also executes join processing pushed down to the data nodes from the MySQL Server. The main thread runs a large amount of code to handle control of system restarts, meta-data operations, control over checkpointing and some other management operations. It is used very lightly in normal operation. The rep thread receives update triggers from the ldm threads and is responsible for transporting those events to the proper API node. This thread is heavily used if asynchronous replication is used; otherwise it's not used much at all.

We have also enabled the possibility to split all TCP/IP traffic out of the other threads. This means that we have threads taking care of all send calls to other data nodes and API nodes, and threads taking care of receive calls on sockets and preparing the received messages for execution by the other threads. It is still possible to configure without specific send threads, in which case the sending happens directly from the ldm, tc, main and rep threads.
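The two alternatives above can be sketched like this (thread counts illustrative only):

```ini
# With dedicated send and receive threads: the block threads hand
# off all sending to the send threads.
ThreadConfig=ldm={count=4},tc={count=1},main={count=1},rep={count=1},send={count=2},recv={count=2}

# Without send threads: the ldm, tc, main and rep threads send
# directly on the sockets themselves.
ThreadConfig=ldm={count=4},tc={count=1},main={count=1},rep={count=1},recv={count=2}
```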

One way of understanding how these different threads use the CPU is to assign them to different CPUs and watch in top how much of each CPU is used. I will briefly give some guidelines on what affects how heavily a certain thread type is used.

The ldm threads are the most common bottleneck in a data node. Each ldm thread handles the hash index, the ordered indexes and the actual data of a hash partition of a certain table. This means that most of the time the various ldm threads have a similar load on the CPU. However, if certain rows are very hot, then some level of imbalance can occur between the CPUs. There are really three main activities going on in the ldm threads that are worth mentioning for normal operation: primary key lookups (using the hash index on a table), scans of a table (there are a couple of variants on how to scan a table) and checkpointing activity. The checkpoint activity is a background activity that comes and goes, and its load is dependent on configuration settings. The cost of a primary key lookup is a fixed lookup cost plus a variable cost dependent on the amount of data read from the row and sent to the API. The same is true for scans. Each CPU handling an ldm thread can normally handle on the order of a few hundred thousand lookups or one to two million scanned rows (with delivered data; one can scan even faster if the rows aren't selected to be sent to the API). One might wonder how unique key lookups come into play: they are simply two primary key lookups, the first on the unique index and the second on the table itself. So seen from the ldm thread's point of view there is no unique key lookup; this is a concept handled through the transaction coordinator in the tc thread and local triggers in the ldm thread.

The tc thread is involved in every lookup and scan as well; the amount of CPU work done is completely dependent on the type of queries. Obviously, if the ldm threads spend little time on each request, then the tc thread is more involved and spends more CPU per ldm thread. For most activities one needs about 1 tc thread per 4 ldm threads. There are however exceptions: when there are many long scan queries, fewer tc threads are needed (e.g. sysbench); when there are many short primary key lookups, there might be a need for more tc threads (e.g. flexAsynch). Also, with complex queries the tc threads are involved since they handle parts of the join processing.
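Following the rough guideline of 1 tc thread per 4 ldm threads, a configuration for a 16-CPU data node might be sketched as below. The CPU numbers and counts are purely illustrative; the syntax details belong to the second blog in this series.

```ini
# 8 ldm threads and 2 tc threads (1 tc per 4 ldm), each thread type
# bound to its own CPUs to make per-thread load visible in top.
ThreadConfig=ldm={count=8,cpubind=0,1,2,3,4,5,6,7},tc={count=2,cpubind=8,9},main={cpubind=10},rep={cpubind=11},send={count=2,cpubind=12,13},recv={count=2,cpubind=14,15}
```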

The main thread is usually lightly loaded and hardly ever requires a full CPU. The load on the rep thread can be high if one uses asynchronous replication; otherwise the load on this thread is negligible. The IO threads normally also have negligible load; the exception is when one has configured IO with compression, in which case the compression and decompression happen in the IO threads. This causes a high load on each IO thread.

We have also introduced a new category of threads, the wd threads; these are actually 3 specific threads: the watchdog thread, the socket server thread and the socket client thread. The watchdog thread wakes up once every 100 milliseconds and does some brief checks to ensure that no data node thread has been stuck. The socket server thread listens on some sockets and handles the connection setup phase of new sockets. The data node normally does a very small amount of socket setup since sockets stay up until nodes are stopped. The socket client thread is only involved in the connection setup phase of sockets; it handles the client side of a connection setup, effectively calling connect on a socket to see if it has become connected. None of these threads will consume any sizeable part of a CPU. Previously these threads were handled as part of the io threads. From a configuration point of view we manage these three threads as one.
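Since these three threads are managed as one from a configuration point of view, pinning them might look like the sketch below. I use the type name wd here as in the text above; verify the exact type name against the reference manual for your version, and treat the CPU number as illustrative.

```ini
# All three wd threads (watchdog, socket server, socket client)
# configured as a single thread type.
ThreadConfig=ldm={count=4},tc={count=1},main={count=1},rep={count=1},wd={cpubind=15}
```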

The send threads can have a substantial load depending on the amount of traffic sent from a data node, but also on how many nodes we communicate with. Communication is always more expensive per byte when only a small amount of traffic goes between two nodes. This is due to the fact that networking has a fixed overhead per message sent, so with a small amount of traffic there will be less data per message and thus a higher relative overhead. A send thread can handle messages to any other node, so when a message is to be sent we choose the first send thread that is not currently in use. We always search in the same order, which means that the first send thread will always be the most loaded and the last send thread the least loaded.

The receive threads are organized such that each socket is handled by one receive thread; there is one socket per node we are communicating with. This means that the load of a receive thread depends on the activity on its sockets. The sockets communicating with API nodes are all similar, and their activity depends on what happens on the application side. The communication between data nodes is for the most part very small; the one exception is the socket between two nodes in the same node group. There is a lot of traffic between these nodes when we have write transactions, since each write transaction means that transaction coordination messages go between the data nodes in the same node group. It is also important to recall that on the API side one can have several API nodes per application process. In particular, the MySQL Server has a configuration variable called --ndb-cluster-connection-pool that sets the number of API nodes the MySQL Server uses to communicate with the data nodes. Each API node has a separate socket to each of the data nodes in the cluster.
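The connection pool mentioned above is set on the MySQL Server side, in my.cnf; a sketch (the value 4 is illustrative):

```ini
[mysqld]
ndbcluster
# This MySQL Server uses 4 API node slots; each one gets its own
# node ID and its own socket to every data node in the cluster.
ndb-cluster-connection-pool=4
```

Note that each extra connection in the pool consumes one of the limited number of node IDs in the cluster.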

So armed with this information we can easily do tests to see how loaded the various threads are; this will help us understand whether we need to change our configuration to remove some bottleneck.

To understand how the various thread types interact for the most common executions of a primary key lookup and a scan operation, we have added a few figures describing this.

The first figure shows how a primary key read operation flows between the threads in a data node.



The second figure shows how a scan operation flows between the threads in a data node.



The third figure shows a more detailed view of what goes on in a scan operation; the protocol looks the same independent of scan type.


2 comments:

vishnurao said...

had a question.

based on the comment:

"This benchmark was actually limited by a limitation of the NDB API nodes to handle more than about 300k lookups per second and so with a maximum of 255 nodes we got to around 72 million per second in total. However in MySQL Cluster 7.3 we have removed this limitation, "

What is the limitation you were referring to ?

in 7.3, can the number of data nodes go beyond 48?
is the restriction of 255 nodes (max 48 data nodes) still there?

On the sideline wanted to know what is maximum data set (on disk) that each data node can handle.

Thanking you.
with regards,
vishnu rao

Mikael Ronstrom said...

The limitation I refer to is that we can have no more than 255 nodes in a cluster. This limit still exists in 7.3. What has improved is that one API node can handle much more traffic since we have removed many bottlenecks in the API.

The max number of data nodes is still 48 in 7.3 and the max number of nodes is still 255 in 7.3 as well. So no changes of those limits in 7.3.

I don't know the limit for the maximum on-disk data set each data node can handle. I presume it will mainly be limited by the file system limits of the OS you run. There are no specific limitations in our disk data implementation that I am aware of.