Saturday, April 09, 2011

MySQL Cluster performance aspects

MySQL Cluster was designed for high performance from the very beginning. To achieve high performance one has to understand many aspects of computing. As an example the protocol is very important. In the original design work in 1994 we had a master thesis student build a prototype using a protocol which was based on BER encoding and other standard parts of many telecom protocols. After seeing the code in this prototype which was several thousands of lines of code just to handle the protocol, I realised that this type of protocol will simply cost too much on both the client side as well as on the server side. So this type of prototypes in early design work is extremely useful since it would have been very difficult to change this protocol once we started down the path of developing the Data Server.

Based on this work we instead opted for a protocol where almost everything in the protocol was of fixed size and entirely based on sending 32-bit words. We didn't want a protocol which transferred bytes to avoid the extra computational complexity this would require. So the NDB protocol which is used for query processing uses a message called TCKEYREQ, this message has about 10 32-bit words describing various fixed parameters such as TableId, ConnectionId, PartitionId and so forth. There is also a 32-bit word that contains a set of bits that is used to interpret the message. Actually reading this protocol can be done, completely avoiding branches since the bits can be used to address the proper words in the protocol message through some arithmetic. The only branching needed happens in taking care of keys and the actual query information which is of variable size.

The next important component of performance is the execution model. The MySQL Cluster Data nodes uses an execution model which is extremely well suited for modern CPUs. The Data nodes uses a set of threads, where each thread implements its own little OS with a scheduler. All communication inside the data nodes is based on messages. From a SW point of view the code to receive internal messages is exactly the same as the handling of messages arriving over the network. When sending a message it's the address which defines the type of message. The address contains three parts, the node id, the thread id and the module id (block number in the code). If the message is sent to a module with the same node id and thread id as the sending thread, then the message is an internal message and it will be sent by putting the message in the local message buffer, if the node id is the same but the thread id differs, then the message will be sent to another thread. The communication between threads is extremely efficient based on shared memory communication and this code is using the most efficient ways to communicate based on the HW and the OS. Finally when the node id differs, the message is sent as a network packet over to another data node or client node. There is a TCP/IP link between all nodes (fully connected mesh) and the data node will use mechanisms to ensure that the packets sent contains as many messages as possible without sacrificing latency (the user can affect the acceptable latency through a config parameter).

Given this model it means that a thread can be actively executing thousands of queries without any need of doing any context switches. This is one reason why MySQL Cluster benefits greatly when threads are locked to certain CPU cores and there is no contention from other programs to use these CPU cores. The data nodes have their own local OS and thus work extremely efficiently when the OS scheduler stays out of the way.

This particular model of executing where each thread of execution executes until it decides to send a message (the unit of execution is always execution of a message) was very popular in the 70s because of its efficiency. It was replaced by the time-sharing model given the simplicity of the time-sharing model. When designing MySQL Cluster we decided that a Data Server to handle millions of queries per second has more requirements on the efficiency of execution compared to the requirements of the simplicity of the design. Another great benefit of this execution model is that as the load on the Data Server increases, the throughput also grows. This is so since the execution threads will execute for longer time before they will look at the sockets for incoming traffic, this means that more messages will be gathered every time and thus the cost of each message byte decreases, the same happens with sending messages that as the number of messages to execute per round grows, the more data will be sent on each send call and thus decreasing the cost of each sent message byte.

The design is extremely modular even though its using a more complex execution model. Each module can only communicate with other modules using messages and the modules share no data. Thus if an error occurs in a module it's either due to bugs in this model or due to bad input data to the module. To debug the data node we trace every important branch, every message executed with it's data. This means that if a crash occurs we have very detailed information about how the crash occurred including the last thousand or so branches taken in the code and a few thousand of the last messages executed in the data node.

The final aspect of performance is the actual implementation of the database algorithms. To cover this in one blog message is obviously not possible but it covers an efficient design of data structures (we implement a hash based index and an ordered index), efficient implementation of the actual record storage with an efficient data structure to contain the record (includes capabilities to handle variable sized data and handling NULLable fields in a storage efficient manner and even being able to add fields to a record by usage of dynamic fields which are NULL when not present in the record). It includes an efficient model for recovery and finally an efficient model for transaction handling. In all of those aspects MySQL Cluster have added additional innovation to the world of databases with a particular focus on the performance aspects.

There is actually one more important part of the performance of MySQL Cluster and this is the programming API on the client side. I will discuss this in my next blog.

No comments: