Wednesday, January 18, 2017

MySQL Cluster up and running in less than 4 minutes

This blog is full of graphics.
A while ago I decided to try out the MySQL Cluster Auto Installer.
I have my own scripts and tools to work with MySQL Cluster so
I don't normally need to use it. But I wanted to know what it could
do and could not do. So I decided to take it for a spin.

I was actually positively surprised. It was very quick to get up and running.
Naturally, as with any graphical tool, it will only get you to a point; if it runs
into issues they can be hard to discover. But there are ways to debug it, and
naturally you have access to all MySQL log files as well as all the
NDB log files.

My personal takeaway is that the MySQL Cluster Auto Installer is a very good
tool for developing applications towards MySQL Cluster. For a production
installation I would probably want a bit more control over things and would
most likely write up some scripts for it. Also in a DevOps environment there
are other tools that can come in handy. But for a developer that wants to
develop an NDB API application, a MySQL application or any other
type of application on MySQL Cluster it seems a very good tool for getting
MySQL Cluster up and running and shutting it down again when done.

We are working on some improvements to the Auto Installer, so if you have
opinions or bug reports referring to the Auto Installer please let us
know.

When I tried things out I used MySQL Cluster 7.5.4 and it worked flawlessly
on Mac OS X. On Windows there were some issues because my Windows installation
uses Swedish, and Windows uses the cp1252 character set which isn't compatible
with UTF-8. So I had to fix the conversion of some messages from Windows.
This bug fix is available in MySQL Cluster 7.5.5 that was recently launched.

The nice thing about the tool being a Python program was that to fix the bug
one could simply edit the script and run it again.

In this blog I will show you each step needed to get to a running MySQL Cluster
installation. I was personally able to perform the entire installation, definition of
the cluster, deploying the cluster and starting the cluster and waiting for this to
complete in 3 minutes and 50 seconds. Most likely for a newbie it will take a bit
longer, but with this blog as an aid it should hopefully proceed very quickly.

This blog shows how to do this on my development machine which is running
Mac OS X. I also tested it on Windows and the steps are almost the same although
the Windows installer looks a bit different to the Mac OS X installer.

It should be similarly easy to do this on Linux and Solaris.

My personal next step is to try out how it works to do a similar thing with
multiple machines. I have some Intel NUCs and some laptops running Linux that
should be possible to control from the Auto Installer in a similar fashion.

Developing a MySQL Cluster application definitely benefits from having a few
small servers in the background, and Intel NUCs are a nice variant of this
that comes in at a very reasonable price.

After that I will also test the same thing but using some virtual machines in
the Cloud.

Description of the cluster you get up and running

After completing the below steps you will have MySQL Cluster up and running.
This will give you 2 MySQL Servers to access. Both of those MySQL Servers
can be used to both read and write all data. If you create a table in one MySQL
Server it will be present also in the other MySQL Server.

Some things such as views, triggers and functions are still per MySQL Server,
so you need to define them on each server.

So it is very easy to load balance your application towards those MySQL Servers
since they are all equal. The same would be true even if you had 100 MySQL
Servers. The actual data is stored in the data nodes but most of the processing
is done in the MySQL Servers.

You can also use the same setup to execute NDB API applications using a
low-level interface to MySQL Cluster that will in many cases have 10x
better performance but obviously also a higher development cost.

Another nice tool to develop applications with high performance using
the NDB API more directly is using ClusterJ which is a native Java
API towards MySQL Cluster. This interface is very easy to use
and uses an object-relational mapping.

Description of the steps

To download MySQL Cluster the easiest way is actually, in my opinion, to google it.
Searching for MySQL Cluster Download will get you to the MySQL Cluster
download page quickly.

I am running this demonstration on a Mac OS X computer, so the download page
will automatically send me to the download page for MySQL Cluster on Mac OS X.
I personally prefer the Mac installer image (.dmg file). The download of this will take
some time since it is about 400 MBytes. I am based in Sweden where 100 Mbit per
second download speeds are normal, so for me this takes about 1 minute.

After clicking Download I also need to click on one more page to get the download started.

While waiting for the download to complete I get a progress bar on the download on Mac OS X.

Once the download completes I open up the Downloads folder and click on the most
recent download. This will start up the installation process. The window below
then pops up and I double-click on the package symbol.

Next I get a message about what I am doing with some references to further information.
I simply click Continue here.

Next the GPL license is presented to me, I click Continue.

I have to click Agree on that I agree to the usage terms described in the
GPL license.

Next, before the actual installation starts, I am presented with the fact that this
will consume 1.6 GByte of my disk space. I click Install to start the installation.

On Mac OS X any new software install requires administrative privileges. So I get a
login window where I have to enter my password to authenticate that I really want this
software installed. After typing the password I click Install Software.

After a short time the install is complete and I can click Close to finish the
installation.
Now we have managed to install MySQL Cluster. The next step is now to
start up the MySQL Cluster Auto Installer. This is a bit more involved but still
not that difficult. On Windows this can be started with a simple double-click
on the setup.bat file in the bin directory. On Mac OS X you need to start up
a terminal window to start the Web Server that drives the Auto Installer.

To start a terminal you go into the Launchpad and click on the symbol below.

Then click on the Terminal symbol and a terminal window will appear.

In the terminal window you change the directory to the bin directory of the
MySQL Cluster installation as can be seen in the window below. In this
directory you execute the Auto Installer's Python program.

This Python program starts a web server on port 8081; if this port is already
in use it will try 8082, 8083 and so forth up to 8100 before it gives up.
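The port fallback logic is simple enough to sketch in a few lines of Python (a simplified illustration of the behaviour described above, not the actual Auto Installer code):

```python
def pick_port(is_free, first=8081, last=8100):
    """Return the first free port in [first, last], or None if all are taken."""
    for port in range(first, last + 1):
        if is_free(port):
            return port
    return None

# Simulate ports 8081-8083 already being in use.
busy = {8081, 8082, 8083}
print(pick_port(lambda p: p not in busy))  # prints 8084
```

The real script binds its web server to the chosen port; here the availability check is abstracted into the is_free callback.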

Immediately after this web server has started it will launch a window in your
default browser.

In my case this is a Safari browser. I get a warning message that this isn't the
most well-tested browser; however, I have had no issues with using the
Safari browser, so I simply click Close here.

Now I get to the starting window of the MySQL Cluster Auto Installer.
I am doing my first installation so I will click on Create New Cluster.

I now get to the Define Cluster page. I define the hosts to use. Here I will
only run the tests on localhost, so only one host is used.
I will use the simple testing variant that will use a part of my memory but not all of it.
If I run this on Windows I should also deselect the use of SSH unless I want to
follow the instructions on how to use SSH on Windows. This is documented in the
MySQL Cluster Auto Installer manual.

When I am finished on the page I click Next.

Now the next page is pretty cool. The Web server has discovered where I have my
installation, it has found out which OS I am using, it has knowledge of how much
memory I have, it knows how many CPU cores I have and it proposes an installation
directory that I in this case will accept.

When I run this command on my Windows box I usually move the installation
directory to the D: disk instead of the C: since the C: disk is pretty full on my
Windows box.

Now we get to define the processes. The default in this case is one management server,
2 data nodes, 2 MySQL Servers and 3 API node slots (which can be used to execute
various NDB tools or NDB API applications).

I see no specific reason to change so I click Next immediately.

Next I come to a page where I can edit the configuration before launching the start
of the cluster. On each node I can define Node Id, Data directory and Hostname if
I want to change those from the default.

If I click Show advanced configuration options and then click on Data layer I get
a chance to edit the MySQL Cluster configuration file. Here I can edit a multitude
of configuration parameters for MySQL Cluster although not all of them.

After I have finalised setting up the configuration I click Next. This brings
me to the Deploy and Start page.

I start by Deploying the cluster to get a chance to manually edit the configuration
files as well. If I don't care to edit those I can click Deploy and start cluster
which will also start the cluster immediately after deploying.

I only clicked Deploy cluster.

When the deployment of the cluster is completed I get a completed window,
simply click Close.

Now I go back to my terminal window and check what Deploy did.
In the Data Directory of node 49 (the management server) we find the
config.ini file. If I want to edit the MySQL Cluster Configuration file
further I can do that in this file before starting the cluster.

Next in the MySQL Server I have the my.cnf file and a couple of prepared
databases. If I want to edit the MySQL Server configuration further I can
do this in the my.cnf file.

There are no files in the Data Directory of the data nodes. These are created at
the initial start of those nodes.

Now when I am done with the preparation I click Deploy and start cluster to
get the cluster started.

This presents me with a progress bar. It starts by starting the management
server. This goes so quickly that it wasn't possible to catch it.

Next it starts the data nodes. This takes a bit of time, in my case about 30 seconds.

Next it starts the second data node.

Finally, after starting all data nodes it will start the MySQL Servers one at a time.

After starting the MySQL Servers the startup is complete. We click Close on the
information popup and then we have a page that also presents the state of each
node in the cluster based on what the management server sees.

Finally we can see the files now created in the management server and the Data
directory of the data nodes.

We now have a cluster with 2 data nodes, 2 MySQL Servers and 1 NDB
management server up and running.

You can access the MySQL Servers on ports 3306 and 3307 as any
MySQL Server and start performing any tests you want to run against
your new MySQL Cluster installation.

Monday, November 07, 2016

MySQL Cluster and real-time requirements

This blog gives some background to the decisions made when designing the
storage engine NDB Cluster, used in MySQL Cluster, around how to support
real-time requirements (or as I sometimes refer to it, predictable response
time requirements).

Requirement analysis

When analysing the requirements for NDB Cluster based on its usage in telecom
databases two things were important. The first requirement is that we need to
be able to respond to queries within a few milliseconds (today even down to
tens of microseconds). The second requirement is that we need to do this while
at the same time supporting a mix of simple traffic queries combined with a
number of more complex queries running at the same time.

The first requirement was the main requirement that led to NDB Cluster using a
main memory storage model with durability on disk using a REDO log and
various checkpoints. Today we also support storing non-indexed columns on
disk in combination with columns stored in main memory.

Potential solutions

The second requirement was a bit harder to handle. In an extremely large
environment with many CPUs it can be solved by allowing the
traffic queries and management queries to run on different CPUs. This model will
however not work at all in a confined environment with only 1-2 CPUs, and it is
hard to put to work even in a large environment since the usage of the
management queries will come and go quickly.

The next potential solution is to simply leave the problem to the OS. Modern OSs
use a time-sharing model. However, each time quantum is fairly long
compared to our requirement of responding within parts of a millisecond.
So this model won't work very well either.

Yet another possibility would be to use a real-time operating system, but this would
marginalise the product too much.

Most DBMSs today use the OS to handle the requirements on response times. So as an
example, if one uses MySQL/InnoDB and sends various queries to the MySQL Server,
some traffic queries and some management queries, MySQL will use different threads
for each query. MySQL will deliver good throughput even in the context of very
varying workloads since the OS will use time-sharing to fairly split the CPU usage
amongst the various threads. However it will not be able to handle response time
requirements of parts of a millisecond with a mixed load of simple and complex
queries.

So when designing NDB Cluster we wanted to avoid this problem. NDB was designed
within Ericsson. In Ericsson a real-time telecom switch, the AXE, had been developed
in the 70s. The AXE is still in popular use today and new versions of it are still
being developed. AXE had a solution to this problem, built around a message
passing machine.

I spent a good deal of the 90s developing a virtual machine for AXE called AXE VM
that later turned into a real product called APZ VM (APZ is the name of the CPU
subsystem in the AXE). This virtual machine was able to execute on any machine.
The AXE VM used a model where execution was handled as execution of signals. A
signal is simply a message; the message contains an address label, a signal number
and data of various sizes. A signal is executed inside a block; a block is a
self-contained module that owns all its data, and the only manner to get to the
data in the block is through sending a signal to the block.
So effectively the AXE VM implemented a real-time operating system inside a normal
operating system such as Windows, Linux, Solaris or Mac OS X.

The AXE VM also had a lot of handling of the language used in AXE, called PLEX. This
is no longer present in NDB, but NDB is still implemented using signals and blocks.
The blocks are implemented in C++; in AXE VM it was possible to have such
blocks too, and they were called simulated blocks. In NDB all blocks are nowadays
simulated blocks.
How does NDB solve the real-time problem

So how does this model enable response times down to parts of a millisecond even
in a highly loaded system? First of all it is important to state that NDB does
handle this. We have very demanding customers in the telecom, networking and
financial sectors, and lately also in the storage world, that expect to run
complex transactions involving tens of different key lookups and scan queries and
that expect these transactions to complete within a few milliseconds even at
90-95% load in the system.

As an example, in the financial sector missing the deadline might mean that you miss
the opportunity to buy or sell some stock equity in real-time trading. In the telecom
sector your telephone call setup and other telco services depend on immediate
responses to complex transactions.

At the same time these systems also need to ensure that they can analyse the data
in real-time, these queries obviously have less demanding response time
requirements, but they are not allowed to impact the response time of the traffic queries.

The virtual machine model implements this by using a design technique where each
signal is only allowed to execute for a few microseconds. A typical key lookup
query in modern CPUs takes less than two microseconds to execute. Scanning a table
is divided up into scanning a few rows at a time where each such scan takes less
than ten microseconds. All other maintenance work to handle restarts, node failures,
aborts, creating new tables and so forth is similarly implemented with the same
requirements on signal execution.

So what this means is that a typical traffic transaction is normally handled by one
key lookup or a short scan query and then the response is sent back to the API node.
A transaction consists of a number of such interactions, normally on the order of
tens of such queries. This means that each interaction needs to complete within
100-200 microseconds in order to handle response times of a few milliseconds
for the transaction.
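The arithmetic above can be spelled out; the concrete numbers here (20 interactions at 150 microseconds each) are just example values picked from the ranges mentioned in the text:

```python
interactions = 20          # "on the order of tens of such queries"
per_interaction_us = 150   # each interaction completes within 100-200 microseconds
total_ms = interactions * per_interaction_us / 1000
print(total_ms)  # prints 3.0 -> within "a few milliseconds"
```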

NDB can handle this response time requirement even when 20-30 messages are
queued up ahead of the message, given that each message will only take on the order
of 1-2 microseconds to execute. So most of the time is still spent in the transporter
layer sending and receiving the message.

A complex query will execute in this model by being split into many small signal
executions. Each time a signal is completed it will put itself back into the queue
of signals and wait for its next turn.
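This cooperative scheduling model can be illustrated with a toy Python sketch (purely illustrative, not NDB code): a complex query splits itself into short signal executions and re-enqueues itself after each one, so a traffic query waiting in the queue only waits for at most one short slice, never for the whole complex query.

```python
from collections import deque

queue = deque()
log = []

def traffic_query(name):
    # A traffic query is a single short signal execution.
    def run():
        log.append(name)
    return run

def complex_query(name, slices):
    # A complex query executes one short slice at a time and
    # re-enqueues itself until all slices are done.
    def run(remaining=slices):
        log.append(f"{name}[{slices - remaining + 1}]")
        if remaining > 1:
            queue.append(lambda: run(remaining - 1))
    return run

queue.append(complex_query("scan", 3))
queue.append(traffic_query("lookup1"))

while queue:
    queue.popleft()()

print(log)  # the lookup runs after one scan slice, not after the whole scan
```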

So traffic queries will always have the ability to meet strict requirements on
response time. Another nice thing with this model is that it will adapt to
varying workloads within a few microseconds. So if there are currently no traffic
queries to execute, then the complex query will get the CPU to itself since its
next signal will execute immediately after being put on the queue.

Handling memory allocation

One more important factor in ensuring that NDB can always operate in an optimal
manner and deliver the expected throughput is that we control memory. All the
memory is allocated at startup; this means that we cannot get into a situation where
we oversubscribe the main memory of the machine. NDB even has a number of
config parameters to ensure that the memory used by NDB data nodes is never
paged out.

Locking of CPUs

One more manner of ensuring that NDB always operates in an optimal manner is to
control the placement of threads onto different CPUs.

Behaviour of NDB at high load

There is one more very important aspect of this model. As load increases two
things happen. First, we execute more and more signals every time we have
received a set of signals. This means that the overhead to collect each
signal decreases. Second, executing larger and larger sets of signals means
that we send larger and larger packets. This means that the cost per packet
decreases. Thus NDB data nodes actually execute more and more efficiently
as load increases. This is a very important characteristic that avoids many
overload problems.
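The effect described above can be made concrete with a small sketch. Assume a fixed overhead per received packet and a small execution cost per signal (both numbers are made up for illustration); the overhead per signal then shrinks as more signals travel in each packet:

```python
def cost_per_signal_us(signals_per_packet,
                       packet_overhead_us=10.0,
                       per_signal_us=2.0):
    # The fixed packet overhead is amortised over all signals in the packet.
    return packet_overhead_us / signals_per_packet + per_signal_us

low_load = cost_per_signal_us(1)    # one signal per packet: 12.0 us each
high_load = cost_per_signal_us(20)  # twenty signals per packet: 2.5 us each
print(low_load, high_load)
```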

Building a mixed environment for traffic and management queries

Finally, the separation of Data Server and Query Server functionality makes it
possible to use different Query Servers for traffic queries than the ones used
for complex queries. So in the MySQL Cluster model this means that you can
use one set of MySQL Servers in the cluster to handle short real-time queries
and a different set of MySQL Servers to handle complex queries.
Thus MySQL Cluster can handle real-time requirements in a proper configuration
of the cluster even when operating using SQL queries.


The interface to the Data Server is as you can now see implemented on top of
signals, the most common ones are TCKEYREQ that implements all types of
key lookups using the primary key and SCAN_TABREQ that implements all types
of scan queries (also including join queries that have been pushed down to
data nodes). There is a protocol to carry these signals that currently uses
TCP/IP sockets but historically has also used SCI, Infiniband
and shared memory transporters.

So the separation of Data Server and Query Server functionality might mean
that MySQL Cluster has a slightly longer minimum response time compared to
a local storage engine in MySQL, but MySQL Cluster will continue to deliver
low and predictable response times even with varying workloads and even
when executing at very high loads.

One experiment that was done when developing pushdown join functionality
showed that the performance of those pushed down joins was the same
when executing in an otherwise idle cluster as when executing in a cluster
that performed 50,000 update queries per second.

NDB has been designed such that with some work of configuring it properly
it can be extremely reliable in delivering predictable response times. At the
same time we're working hard to make it easier and easier to configure also
when you don't want to control every bell and whistle. One step in this direction
is the introduction of the ability to read also from backup replicas and the
adaptive control of which threads help out in sending.

Friday, November 04, 2016

Separation of Query Server and Data Server

In MySQL we have separated the Query Server and the Data Server
functionality. The Query Server is what takes care of handling the
SQL queries and maps those into calls down to the Data Server.
The API to the Data Server in MySQL is the storage engine API.

Almost all DBMSs have a similar separation between Data Server and Query
Server functionality.

When I performed my research work, which later led to the development of
NDB Cluster (now MySQL Cluster), an important part of the
research was to handle the separation of the Data Server and
the Query Server.

As part of my research we looked deeply into the next generation mobile
networks and their use of network databases. From these studies it was
obvious that the traffic part of the applications almost always made very
simple queries, mostly key lookups and in some cases slightly more complex
queries were used.

At the same time there are also management applications in the telecom
network; these applications will often use more complex queries and will
almost always use some sort of standard access method such as SQL.

Normally the traffic queries have strict requirements on response times
whereas the management applications have much less strict requirements
on response times.

So it was clear that there was a need for both a fast path to the data
as well as standard APIs used for more complex queries and for data
management.
From this it was clear that it was desirable to have a clearly defined
Data Server API in a telecom DBMS to handle most of the traffic queries.

Now the next question that came up was the placement of the API. There were
a number of alternatives. One method was to have an API such that the
application can execute direct function calls into the Data Server. A
number of databases use this approach. The reason for avoiding this
approach is the high availability requirement. If the application gets
a wrong pointer and writes some data out of place, then it can change
the data inside the DBMS. Another problem with this approach is that it
becomes very difficult to manage in a shared nothing architecture since
the application will have to be colocated with its data for the benefits
to be of any value.

So choosing a network-based protocol as the Data Server API was a choice
to ensure the highest level of reliability of the DBMS and its applications.
We have a clear place where we can check the correctness of the API calls,
and only through this API can the application change the data in the Data
Server.
Another reason that made it natural to choose a network protocol as API was
that the development of technologies for low-latency, high-bandwidth networking
had started already in the 1990s. The first versions of NDB Cluster actually had
an SCI transporter as the main transporter, which ensured that communication
between computers could happen in microseconds. TCP/IP sockets have since
replaced it, since SCI and Infiniband now also have support for TCP/IP sockets
that is more or less as fast as direct use of SCI and Infiniband.

One more reason for using a network protocol as API is that it enables us to
build very scalable Data Servers.

To make it easier to program applications we developed the C++ NDB API that
is used to access the NDB Data Server protocol.

The marriage between MySQL and NDB Cluster was a natural one since NDB Cluster
had mainly focused on the Data Server parts and by connecting to the MySQL
Server we had a natural implementation of the Query Server functionality.

The requirements on extremely fast failover times in telecom DBMSs made it
necessary to implement NDB Cluster as a shared nothing DBMS. So effectively
the Data Server API has support for storing relational tables in a shared
nothing architecture. The methods available in the NDB Data Server API are
methods for key-value access for read and write, and scan access using full
table scans and ordered index scans. In 7.2 we also added some functionality
to push down join execution into the Data Server API.

To decrease the amount of interaction we also added an interpreter to the
NDB Data Server; this can be used for simple pushdown of filters, for
simple update operations (such as incrementing a value), and it can
handle LIKE filters.
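The point of the interpreter is to cut down on interaction: a filter evaluated inside the data node means only matching rows travel back to the API side. A hypothetical Python sketch of the difference in rows shipped over the network (the table and filter are invented for the example):

```python
rows = [{"id": i, "status": "active" if i % 10 == 0 else "idle"}
        for i in range(1000)]

# Without pushdown: every row crosses the network, filtering happens client-side.
shipped_without = len(rows)
matches = [r for r in rows if r["status"] == "active"]

# With an interpreted filter pushed down: only matching rows are shipped.
shipped_with = len(matches)

print(shipped_without, shipped_with)  # prints 1000 100
```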

So what have the benefits and disadvantages of these architectural
choices been over the years?

One advantage is that MySQL Cluster can be used for many different things.

One important use case is what it was designed for: there are many good
examples of applications written against any of the direct NDB APIs to
serve telecom applications, financial applications and web applications
while still being able to access the same data through an SQL interface
in the MySQL Server. These applications take advantage of the
performance edge that makes it possible to scale applications to
as much as hundreds of millions of operations per second.

Another category in popular use with MySQL Cluster is to implement an
LDAP server as the Query Server on top of the NDB Data Server. This would
have been very difficult using a Query Server API since the Query Server
adds a very significant overhead to simple requests.

The latest example of use cases is to use the Data Server to implement
a scalable file system. This has been done by HopsFS, replacing the
Name Node in Hadoop HDFS with a set of Name Nodes that use a
set of NDB Data Servers to store the actual metadata. Most people that
hear that such an architecture is built on something with MySQL in the
name will immediately think of the overhead of using an SQL interface to
implement a file system. But obviously it isn't the SQL interface that
is used; it is implemented directly on top of the Java implementation
of the NDB API, ClusterJ. Personally I also have a hobby project, which
I have played around with every now and then for the last 10 years, that will
implement a Linux filesystem on top of NDB Cluster using a new NDB API,
a new NDB management server and the FUSE API to implement the
file system in userspace.

The final category obviously is the use of MySQL Cluster with the
SQL interface. There are many applications that use this to get the
high availability of MySQL Cluster while still using the SQL interface.

The disadvantage is obviously that DBMSs that have a more direct
API between the Query Server and the Data Server benefit from
not having to go over a network API to access their data.
With NDB Cluster you pay this extra cost to get higher availability,
more flexible access to your data and higher scalability.

We have worked very hard since the inception of NDB Cluster into MySQL
to ensure that the performance of SQL queries is as close as possible
to the colocated Query Server and Data Server APIs. We've gotten very
close and we are working on getting even closer.

At the same time by separating the Data Server and the Query Server we
have made it possible to work on parallelising some queries in an easy
manner. This makes the gap much smaller and for many complex queries
NDB will even outperform local storage engines.

Thursday, October 20, 2016

Read any replica in MySQL Cluster 7.5

MySQL Cluster uses a distributed architecture where we have one primary
replica and one or more backup replicas of all data in the cluster.
In previous versions we have not supported reading from the backup
replicas. The reason for this comes from the transaction algorithm in
the NDB storage engine. When the transaction has committed we have
unlocked the rows in the primary replica, but the locks in the backup
replica are not released until the complete phase.

The impact of this is that we could potentially perform an update
of a row and then immediately read it and not see our own write. To
avoid this problem we have avoided reading from the backup replica.

If it is important to read the backup replica then we need to ensure
that transactions are not reported back to the user until the complete
phase has finished.
When we decided to support read from backup replicas we considered how
to handle this. One manner would be to simply have a flag in the
updating transactions that signals that we want the response after the
complete phase instead of after the commit phase.

We decided that this approach would be possible, but we opted for a
safer approach where it is a table property that defines if tables
can be used for reading the backup replica or not.

Given that there is a latency cost in waiting until the complete phase
is done, this feature is not enabled for all tables. In 7.5 we decided
that the default is that read from backup replicas isn't enabled.

There are a number of ways to enable read from backup replicas.

Let's start with the case when you haven't created the table yet.
In this case the first manner to handle it is to create a table
with the read backup flag set. This is done by using a special
flag set in the COMMENT section of the CREATE TABLE statement.
We run a parser on the COMMENT section where we look for the
NDB_TABLE keyword, and after that we look for an equal sign
followed by a new keyword READ_BACKUP which can be set to 1 or
0. It is also possible to set several table properties in one
comment; in this case the properties are separated by commas.
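Putting the syntax described above together, a CREATE TABLE statement with the read backup flag set looks like the following (the table and column names are just examples):

```sql
CREATE TABLE t1 (
  id INT PRIMARY KEY,
  val VARCHAR(100)
) ENGINE=NDB
  COMMENT="NDB_TABLE=READ_BACKUP=1";
```

Several properties can be combined with commas, for example COMMENT="NDB_TABLE=READ_BACKUP=1,NOLOGGING=1", NOLOGGING being another NDB_TABLE property.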

The above feature is a good approach when you want reads from backup
replicas for some tables, but not for all tables. A more likely
approach is to set the new configuration variable ndb-read-backup
in the MySQL Server, which ensures that all tables get created with
the read backup flag set.

So basically it is necessary to decide if you want to focus on
read performance or low latency of updating transactions. Given
that NDB is very often used for high write loads we have kept the
default to not set the read backup flag. But many applications
are read-focused, so in this case it makes sense to set the
config variable ndb_read_backup to 1.
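As a sketch, enabling this in the MySQL Server configuration file is a single line (assuming a my.cnf with a [mysqld] section):

```ini
[mysqld]
ndb-read-backup=1
```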

There is one more case: already existing tables
that have been upgraded from an earlier MySQL Cluster version.
In this case it is possible to change those tables to set the
read backup flag online.

For an existing table the read backup flag can be set using an ALTER TABLE
statement with the same comment syntax. This runs as an online change that can
happen concurrently with reading and writing transactions.
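For a table called t1, such an online ALTER TABLE can be written as follows; specifying ALGORITHM=INPLACE makes the statement fail rather than fall back to a copying change if the online path were not available:

```sql
ALTER TABLE t1 ALGORITHM=INPLACE, COMMENT="NDB_TABLE=READ_BACKUP=1";
```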

Now, with the read backup flag set on your tables, it is
possible to localize your application. Assume that you have the
following MySQL Cluster setup.

Now we want to ensure that the MySQL Server 1 always uses the NDB
data node 1 and vice versa for reading. If the MySQL Server 1 and
the NDB data node 1 are using the same hostname or ip address we
will automatically detect that they are on the same host. So in
this case we will always prefer to send reads to the same host.
If they are on the same host but use different ip addresses then
we can tell MySQL Server 1 that its closest data node
neighbour is the node id of NDB data node 1. This is done by
setting the MySQL configuration variable
ndb-data-node-neighbour to this node id.
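As a sketch, if NDB data node 1 has node id 1, MySQL Server 1 would get the following in its my.cnf (the node id value is an example):

```ini
[mysqld]
ndb-data-node-neighbour=1
```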

So with these settings we are able to more or less create a read
optimised architecture using MySQL Cluster.

Many people who use MySQL/InnoDB and try out MySQL Cluster
will most likely expect the behaviour provided by the read
backup flag. So users of MySQL Cluster who come from the
MySQL world should most likely set the ndb-read-backup flag
to get the expected behaviour.