Tuesday, May 01, 2018

Linux configuration for MySQL Cluster

NDB Cluster was designed from the ground up for real-time operations.
It has its origins in the telecom industry, where predictability of performance
and latency is absolutely critical. In addition, telecom vendors compete
fiercely on the price of their products.

This makes it important to get the most performance out of each
telecom server. Increasing performance by 10% means that you need 10%
less network equipment. When many servers are shipped this is a substantial
cost saving, well worth spending valuable engineering time to achieve.

Thus if you are using, or are planning to use, MySQL Cluster on many servers,
it is a good idea to spend some time ensuring that you achieve optimal
performance. If you are operating MySQL Cluster on just a few servers, the
payback on your time investment might not be as high.

Another reason to take a deep interest in performance of MySQL Cluster is
of course that you want to learn more about Linux and MySQL Cluster.

In this blog post I will mainly discuss one specific optimisation that can have
a large impact on your performance: placing the execution threads of
MySQL Cluster and Linux onto the proper CPUs.

In my book MySQL Cluster 7.5 inside and out I devote three chapters to this
topic: one on the usage of hyperthreading in data nodes, one on the
configuration of CPU locking for data nodes in NDB, and one on configuring
Linux interrupt handling.

In this blog post I will mainly discuss the Linux configuration.

NDB is a distributed DBMS where all operations pass through the network;
network operations are thus an essential part of setting up MySQL Cluster.

So how does a network packet arrive from the wire into MySQL Cluster and
out again?

The first thing that happens is that the network card receives a packet. It then
needs to inform a CPU about this packet so that the Linux TCP/IP code can
handle it.

The Linux TCP/IP processing is divided into three parts. The first part handles
the HW interrupt from the network card, the second part handles the
soft interrupt, and the final part is a function call interrupt from the soft
interrupt to the device driver, which performs most of the interrupt handling.

The function call interrupt part was introduced in Linux 2.6.35, thus in older
Linux kernels this concept doesn't exist.

Modern Linux kernels also use the NAPI mechanism. This means that the
HW interrupt is disabled while we process a set of HW interrupts.
This mechanism avoids overloading the CPUs with interrupt handling and
takes care of interrupts in batches at high load.

Now, Linux interrupt setup is highly configurable, and the defaults depend
among other things on the Linux distribution.

There are three main areas that can be configured for Linux. These are
RSS (Receive Side Scaling), RPS (Receive Packet Steering) and
RFS (Receive Flow Steering).

RSS handles the setup of the HW interrupt and the soft interrupt. Often the
default is to spread these interrupts across all CPUs. This is usually not a
good idea for MySQL Cluster, since there are CPUs we want to protect as
much as possible from all OS activity (in particular the CPUs running
the LDM threads).
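To make this concrete, here is a sketch in Python of how the NIC interrupts could be pinned to CPUs that do not run LDM threads. The device name eth0, the CPU mask and the parsing of /proc/interrupts are assumptions for the example, not recommendations; the script must run as root.

import re

IRQ_CPU_MASK = "3"  # hex CPU mask: CPUs 0 and 1 (an example choice)

def nic_irqs(dev):
    # Collect all IRQ numbers belonging to the NIC from /proc/interrupts
    irqs = []
    with open("/proc/interrupts") as f:
        for line in f:
            if dev in line:
                m = re.match(r"\s*(\d+):", line)
                if m:
                    irqs.append(int(m.group(1)))
    return irqs

for irq in nic_irqs("eth0"):
    # The hex mask decides which CPUs may take the HW interrupt
    # (and thus also the soft interrupt) for this IRQ
    with open("/proc/irq/%d/smp_affinity" % irq, "w") as f:
        f.write(IRQ_CPU_MASK)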

RPS is the mechanism introduced in Linux 2.6.35; with it one can configure
the CPUs where the function call interrupts are handled.

Finally, RFS tries to ensure that the function call interrupt is executed
on the same CPU that the application calls recv from. In NDB data nodes
this happens in the recv threads. In a MySQL Server the communication
from the data nodes arrives in the NDB API receive threads (these can be
configured to be locked to a specific CPU). The communication from MySQL
clients arrives in the connection threads where the queries are executed, so
these are spread across all CPUs the MySQL Server is executing on.
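As an illustration, the sketch below configures RPS and RFS through sysfs and procfs. The device name eth0, the CPU mask and the flow table size are assumptions for the example; run as root.

import glob

RPS_CPU_MASK = "3"         # hex mask: CPUs 0 and 1 (an example choice)
SOCK_FLOW_ENTRIES = 32768  # global RFS flow table size (an example)

# Enable RFS globally
with open("/proc/sys/net/core/rps_sock_flow_entries", "w") as f:
    f.write(str(SOCK_FLOW_ENTRIES))

rx_queues = glob.glob("/sys/class/net/eth0/queues/rx-*")
for q in rx_queues:
    # RPS: the CPUs allowed to run the function call interrupt
    with open(q + "/rps_cpus", "w") as f:
        f.write(RPS_CPU_MASK)
    # RFS: per-queue flow count, the global table split over the queues
    with open(q + "/rps_flow_cnt", "w") as f:
        f.write(str(SOCK_FLOW_ENTRIES // len(rx_queues)))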

When the NDB data node has processed the messages arriving from the network,
it will send responses back onto the network. In the NDB data nodes this
sending happens either in all the threads, in dedicated send threads, or in a
combination of the two.

Linux also has the possibility to configure transmit interrupts, through
something called XPS (Transmit Packet Steering). The optimal behaviour is
achieved if the transmit interrupts are handled by the same CPU that the send
call is made from.
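A corresponding XPS sketch follows; the device name and the queue-to-CPU mapping are again pure assumptions for illustration.

import glob

tx_queues = sorted(glob.glob("/sys/class/net/eth0/queues/tx-*"))
for i, q in enumerate(tx_queues):
    # Assumption for the example: the threads doing send calls run on
    # CPUs 0-3, so map transmit queue i to CPU (i mod 4)
    mask = 1 << (i % 4)
    with open(q + "/xps_cpus", "w") as f:
        f.write("%x" % mask)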

Configuring Linux and MySQL Cluster together can have a major positive
impact on performance. To achieve the best performance it is important to
consider the flow of data through the servers to process the database requests
from NDB clients (e.g. the MySQL Server).

Saturday, March 17, 2018

NDB Cluster and disk columns

NDB is mainly an in-memory database. However, we also have the possibility to
store non-indexed columns on disk. This data uses a page cache, as in any
other normal disk-based DBMS.

With the increases in memory sizes one could think that disk data becomes
less important for MySQL Cluster. Interestingly, the opposite is true.

The reason is again the HW development. NDB is designed with predictable
latency as a very basic requirement. In the past disks meant hard drives, and
the access time to a hard disk was several milliseconds at best. Given that our
requirement was to handle complex transactions within 10 milliseconds, disk
data storage was out of the question.

Modern HW is completely different: it uses SSD devices, first attached through
the SATA interface, which enabled up to around 500 MBytes per second and
a few thousand IO operations per second (IOPS). The second step was the
introduction of SSD devices on the PCI bus. This lifted performance to more
than 1 GByte per second. These devices are extremely small and still very powerful.
I have an Intel NUC at home that has two of those devices.

Thus the performance difference between disk storage and RAM has decreased.

The next step on the way was to change the storage protocol and introduce NVMe
devices. These still use the same HW, but use a new standard designed for
the new type of storage devices. With those devices we now have the ability to
execute millions of IOPS on a standard server box, with access times of a few
tens of microseconds.

For NDB this means that this HW fits very well into the NDB architecture.
While developing the Partial LCP algorithm we also did a lot of work on
improving our disk data implementation. We see more and more people using
disk data columns in NDB.

The next step is even more interesting: it will bring storage onto the memory bus,
with access times of around one microsecond. For NDB this storage can be treated
as memory to start with, making it possible to soon have multiple TBytes of memory
in standard boxes.

Thus HW development is making the NDB engine more and more interesting to use.

One notable example that uses disk data columns in NDB is HopsFS. They use the
disk data columns to store small files in the metadata server of the HopsFS
implementation of the Hadoop HDFS NameNode. This means much faster
access to small files. The tests they did showed that they could handle hundreds
of thousands of file reads and writes per second, even using fairly standard SSD
disks on the servers.

The implementation of disk data in NDB is done such that each row can have three
parts. The fixed memory part is accessed quickly using a row id. The variable
sized part is accessed through a pointer from the fixed size part.

The disk columns are also accessed through a reference in the fixed size part. This
reference is an 8-byte value containing the page id and page index of the disk
columns.
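To illustrate this layout, here is a small Python sketch. The names are invented for the illustration; the real row format in the NDB source code is of course far more compact.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DiskRef:
    page_id: int     # which disk page holds the disk columns of the row
    page_index: int  # index of the row within that page

@dataclass
class Row:
    fixed_columns: bytes            # fixed part, located via the row id
    var_part_ref: Optional[int]     # pointer to the variable sized part
    disk_ref: Optional[DiskRef]     # 8-byte reference to the disk columns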

Before we can access those pages we go through a page cache. The page cache is
built on caching techniques that were state of the art a few years ago.

The idea is quite simple. The page cache uses a normal hot page queue. Pages are
brought up in this queue when they are accessed. A single access will bring a page
up, but to stay more permanently in the page cache a page has to be accessed
several times.

Now each page is represented in those queues by a page state record. The basis
of the page cache algorithm is that a page can be represented in a page state
record even if the page is not in the page cache.

NDB has a configuration variable called DiskPageBufferEntries, which by default
is set to 10. It is the multiplier for how many more pages have page state
records than fit in the page cache.

So for example, if we have set DiskPageBufferMemory to 10 GByte and
DiskPageBufferEntries to 10, we will have page state records covering
100 GBytes of pages in the queues. Thus even when a page is paged out we keep
it in the list, and we can therefore see patterns of reuse that are longer than
the page cache we have access to. With the factor of 10, the page state records
take up about 3% of the size of the page cache itself. Thus the benefit of the
extra knowledge about page usage patterns comes at a fairly low cost. The factor
of 10 is configurable.
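As a back-of-the-envelope check of the 3% figure, here is the arithmetic in Python. The 32 kByte page size and the roughly 100 bytes per page state record are assumptions made for the example.

PAGE_SIZE = 32 * 1024          # assumed disk data page size
STATE_RECORD_SIZE = 100        # assumed size of a page state record

disk_page_buffer_memory = 10 * 1024**3   # DiskPageBufferMemory = 10 GByte
disk_page_buffer_entries = 10            # DiskPageBufferEntries = 10

pages_in_cache = disk_page_buffer_memory // PAGE_SIZE        # ~330,000
state_records = pages_in_cache * disk_page_buffer_entries    # ~3.3 million
overhead = state_records * STATE_RECORD_SIZE                 # ~330 MBytes
print("overhead: %.1f%% of the page cache"
      % (100.0 * overhead / disk_page_buffer_memory))        # ~3%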

Many cloud servers come equipped with hundreds of GBytes of memory (some even
TBytes) and can also store a number of TBytes on NVMe devices. NDB is well
suited for those modern machines, and MySQL Cluster 7.6 has been designed to
be suitable for this new generation of HW.

Friday, March 16, 2018

Discovering rows that have been updated since last checkpoint

One important problem that requires a solution is to decide whether
a row has been updated since the last checkpoint or not.

Most implementations use some kind of mechanism that requires extra
memory resources and/or CPU resources to handle this.

NDB uses the fact that each row is already stamped with a timestamp.
The timestamp is what we call a global checkpoint id (GCI). A new global
checkpoint is created about once every 2 seconds (this can be made faster or
slower through configuration).

Since the timestamp granularity is one global checkpoint, we will slightly
overestimate the number of rows written since the last checkpoint, but with
checkpoints taking a few minutes, the extra overhead of this is only around 1%.

When we scan rows we check the global checkpoint id of the row; if
it is bigger than the global checkpoint that the last checkpoint fully
covered, we write the row as changed since the last checkpoint. We also
have the same information at the page level, so we can check
the page header and very quickly scan past an entire page if it hasn't been
updated since the last checkpoint.
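The idea can be sketched as follows; this is an illustration of the scan described above, not the actual NDB code.

def changed_rows(pages, last_lcp_gci):
    # Yield the rows that must be written by this checkpoint
    for page in pages:
        if page.max_gci <= last_lcp_gci:
            continue              # fast path: skip the entire page
        for row in page.rows:
            if row.gci > last_lcp_gci:
                yield row         # changed since the last checkpoint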

The same type of scanning is also used to bring a restarting node in
sync with the live node. This algorithm has been present in NDB since
MySQL 5.1.

Partial LCPs and Read-only tables

In MySQL Cluster 7.5 we use Complete Checkpoints. In MySQL Cluster 7.6
we implement an approach where we only checkpoint a part of the database
in each checkpoint.

A special case is a checkpoint of a table partition where no changes
at all have happened since the last checkpoint. For this case we implemented
a special optimisation such that it is not necessary to checkpoint anything
at all for the table partition. It is only necessary to write a new LCP
control file, which is 4 kBytes in size for each table partition (this can
grow to 8 kBytes if recovery will require more than 980 checkpoints).

This means that if your database contains a large set of read-only tables,
there will be no need to checkpoint those tables at all. This optimisation
is used even when EnablePartialLcp is set to false.

Partial LCPs and disk space

One of the main objectives of the new Partial LCP algorithm in MySQL
Cluster 7.6 is to keep up with the development of modern HW.

I have already described in previous blogs how Partial LCP can nicely handle
database sizes of 10 TBytes of memory with a very modest load on the disk
devices.

Now modern HW has shifted from using hard drives to using SSDs.

The original approach in NDB assumed that the checkpoints and
REDO logs are stored on hard drives. In MySQL Cluster 7.5 the
disk space required for the REDO log is a bit more than the
DataMemory size. The reason is that we want to survive also when
loading massive amounts of data.

In MySQL Cluster 7.5 we cannot remove any checkpoint files until
a checkpoint is fully completed. This means that we require around
4x the memory size of disk space for REDO logs and checkpoints.

With hard drives this is not a problem at all. As an example, my
development box has 32 GBytes of memory and 2 TBytes of disk
space, thus 64x more disk space than memory space.

With modern servers this size difference between memory and
disks is decreasing. For example many cloud VMs only have
a bit more than 2x the disk size compared to the memory size.

So one goal of MySQL Cluster 7.6 is to fit in much less disk
space.

The aim is to solve this with a three-pronged approach.

1) Partial LCP means that we can execute the checkpoints much
faster. Since REDO logs only need to be kept for around two
checkpoints, this means a significant decrease in the size requirements
for the REDO logs. The aim is to need only around 10% of the memory
size of disk space for the REDO logs. This work is not completed
in 7.6.4. As usual there are no guarantees when this work will be
completed.

2) Using Partial LCP we can throw away old LCP files as soon
as we have created a new recoverable LCP for the table partition.
Thus it is no longer necessary to store 2 LCPs on disk. At the
same time there is some overhead related to Partial LCPs. With the
default setting this overhead is 50% plus a bit more; thus we should
always fit within about 1.6x the memory size.

It is possible to set EnablePartialLcp to false; in this case all
checkpoints will be Complete Checkpoints. This means more
writes to disk for checkpoints, but it decreases the storage
space to around the same as the memory size.

3) Setting CompressedLCP to 1, we can decrease LCP storage
by another factor of 2-3x (usually around 2.7x). This feature has
existed for a long time in NDB.

Thus it should be possible to significantly decrease the requirements
on storage space when running NDB using MySQL Cluster 7.6.
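To make the savings concrete, here is the arithmetic in Python using the factors quoted above. The DataMemory of 512 GBytes is an assumed example.

data_memory = 512.0                # GBytes, an assumed example

# MySQL Cluster 7.5: REDO log plus checkpoints need around 4x DataMemory
space_75 = 4.0 * data_memory       # ~2048 GBytes

# MySQL Cluster 7.6 with EnablePartialLcp=true and default RecoveryWork:
redo_76 = 0.1 * data_memory        # the aim: REDO at ~10% of DataMemory
lcp_76 = 1.6 * data_memory         # checkpoints within ~1.6x DataMemory
space_76 = redo_76 + lcp_76        # ~870 GBytes

# Adding CompressedLCP=1 (usually around 2.7x compression of the LCPs)
space_76_compressed = redo_76 + lcp_76 / 2.7   # ~355 GBytes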

NDB Checkpoints vs Disk-based checkpoints

At one point in the development of the Partial LCP algorithm I started
wondering how it compares to an approach that uses standard
page-based checkpointing, as happens in a traditional disk-based DBMS.

The scenario I considered was a case where one has 10 TBytes of memory in
the system. In this case the likelihood of updating the same row twice
during a checkpoint is very small (except for hotspot rows, of course).

With a row size of 250 bytes there will be 40 billion rows in the database.
I often assume that the checkpoint takes about 5 minutes in my modelling.
This means that even with 1 million writes per second, less than 1% of the
data in the database is updated during a checkpoint.

A more normal update speed would be e.g. 100,000 writes per second.
In this case each checkpoint will write 25 MBytes of changed rows per second,
and thus about 7.5 GBytes is written in 5 minutes. This represents the
Delta, the changed records.

In addition, in the Partial LCP implementation a part of the database
must also be fully written. In this case we have written 0.075% of the database
since the last checkpoint. The default setting thus requires that we select
a number of parts ensuring that at least 0.15% of the database is
fully written. Thus 3 of the 2048 parts will be written, and the
total checkpoint size will be around 22.5 GBytes. To write this in 5 minutes
we need to write 75 MBytes per second of checkpoint data.

Now let us do the same experiment with a traditional disk-based
DBMS. We assume a standard page size of 8 kBytes, which means
that the page cache will hold around 1.25 billion pages. With 100,000
updates per second we will update about 30M pages during a checkpoint,
thus around 2.5% of the pages. It is therefore very unlikely that any
large number of pages receive more than one write.

Thus a checkpoint must write each and every one of those 30M pages.
This means a total checkpoint size of 240 GBytes, so the
DBMS must write around 800 MBytes of checkpoint data per second,
more than 10x the amount NDB will write using the Partial LCP
algorithm.
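The two calculations above can be written down as executable arithmetic; all inputs are the assumptions stated in the text.

db_size = 10 * 10**12         # 10 TBytes of memory
row_size = 250                # bytes per row
writes_per_sec = 100000
lcp_seconds = 5 * 60          # a 5 minute checkpoint

delta = writes_per_sec * lcp_seconds * row_size    # 7.5 GBytes changed
# With the default RecoveryWork the total checkpoint writes are around
# 3x the size of the changed rows (the Delta plus the full parts)
ndb_lcp = 3 * delta                                # ~22.5 GBytes
ndb_rate = ndb_lcp / lcp_seconds                   # ~75 MBytes per second

# Traditional page-based checkpointing with 8 kByte pages, where nearly
# every update dirties a page of its own
page_size = 8 * 1024
dirty_pages = writes_per_sec * lcp_seconds         # ~30M pages
page_lcp = dirty_pages * page_size                 # ~240 GBytes
page_rate = page_lcp / lcp_seconds                 # ~800 MBytes per second

print("ratio: %.1fx" % (page_lcp / ndb_lcp))       # more than 10x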

In NDB it is possible to write even less during a checkpoint by setting
the configuration parameter RecoveryWork higher. Setting it to
its maximum value of 100 means, in the above calculation, that we
only need to write 15 GBytes per checkpoint, and thus the checkpoint
speed is 50 MBytes per second.

The drawback of this setting is that we increase the work at recovery
instead. With the setting 100, the overhead stored on disk is 2x, and this
overhead translates into the same overhead at recovery time. The
default setting is 50% overhead, both in storage and at recovery.

It is possible to set it lower as well, down to 25. In this case we will
write more: in the example we would write 37.5 GBytes and thus
125 MBytes per second, still more than 6x better than the disk-based
DBMS. In this case the overhead in storage is 25%, and similarly
the overhead at recovery.

Although the overhead for restoring checkpoints is higher using
Partial LCP, recovery will be a lot faster in 7.6. Recovery
includes running one LCP as part of the restart, and this LCP
can be 100x faster than executing an LCP in 7.5.
Thus recovery will often be significantly faster using 7.6.

Also the disk storage for LCPs is decreased in 7.6.

NDB Checkpoints and research on In-Memory Databases

I just read an article called Low-Overhead Asynchronous Checkpointing in
Main-Memory Database Systems. It was mentioned in a course on Database
Systems at Carnegie Mellon University.

In MySQL Cluster 7.6.4 we released a new variant of our checkpointing, designed
for modern HW with TBytes of main memory. I think studying this implementation
will be very worthwhile, both for users of NDB and for researchers in DBMS
implementations. It implements a new class of checkpoint algorithms that is
currently a research topic in the database research community.

It was interesting to compare our approach that I called Partial LCP with approaches
taken by other commercial in-memory databases and with the approach presented
in the paper.

LCP is Local CheckPoint which is the name we use for our checkpoint protocol
in NDB.

The course presents a number of ideal properties of a checkpoint implementation.

The first property is that it doesn't slow down regular transaction processing.

In the case of NDB we execute checkpoints at a steady pace which consumes
around 5-10% of the available CPU resources. This will decrease even more with
the implementation in 7.6.

The second is that it doesn't introduce any latency spikes.

NDB checkpointing, both new and old, executes in steps of at most 10-20
microseconds, so there will be extremely small impact on the latency of
transactions due to checkpointing.

The third property is that it doesn't require excessive memory overhead.

NDB checkpointing consumes a configurable buffer in each database thread. The
ideal size of this buffer is around 1 MByte. In addition we have a REDO log
buffer that is usually a bit bigger than that. That is all there is to it; there
is no extra memory space needed for checkpointing rows. The checkpointing
performs a normal scan of the rows and copies the memory content to the buffer,
and as soon as the buffer is full it writes it to disk using sequential disk
writes.

It is fair to say that NDB does a good job in handling those ideal properties.

The course presents two variants called fuzzy checkpoints and consistent checkpoints.
The course defines fuzzy checkpoints as a checkpoint that can write uncommitted
data. I would normally use the term fuzzy checkpoint to mean that the checkpoint
is not consistent at a database level, but can still be consistent on a row basis.

Actually, NDB is a mix of the definitions provided in the course material. It is a
consistent checkpoint for each row, but different rows can be consistent at very
different points in time. So on a row basis NDB is consistent, but at the database
level the checkpoint is fuzzy. Thus to perform recovery one needs to restore the
checkpoint and then apply the REDO log to arrive at a consistent database.

Next the course presents two variants called Complete Checkpoints and Delta
Checkpoints. Complete Checkpoint means that the entire database is written in
each checkpoint. Delta Checkpoint means that only changes are written in a
checkpoint.

This is where MySQL Cluster 7.6 differs from 7.5. 7.5 uses a Complete Checkpoint
scheme. 7.6 uses a Partial Checkpoint scheme.

In my view the NDB variant is a third variant which is neither a Complete nor a
Delta Checkpoint. Partial means that it writes the Delta, that is, all changes
since the last checkpoint, but it also writes a Complete Checkpoint for a part
of the database, hence the name Partial Checkpoint. It is thus similar to an
incremental backup scheme.

NDB can divide the database into up to 2048 parts. Each checkpoint can write
0 parts (only if no changes occurred in the table partition since the last
checkpoint), 1 part if the number of writes is very small, all 2048 parts
if almost all rows have been updated, or anywhere between 1 and 2048 parts
based on how many rows were updated since the last checkpoint.

Almost all commercial In-Memory DBMSs still use a complete checkpoint scheme.
As we move towards TBytes of memory this is no longer a plausible approach.

The NDB approach means that we can perform a checkpoint in a few minutes
even in a system with 16 TBytes of memory where we need to write about
8 GBytes plus the changes since the last checkpoint.

Thus NDB takes the step into a new world of massively large In-Memory DBMSs
with the introduction of MySQL Cluster 7.6 and its new Partial LCP implementation.

My new book "MySQL Cluster 7.5 inside and out" describes the LCP
implementation in 7.5. Descriptions of the Partial LCP can be found in my blogs,
and there are also some very detailed descriptions in the source code itself,
among other things a 10-page proof that the algorithm actually works :)

The nice thing with the Partial LCP approach in NDB is that it requires no
more work after writing the checkpoint. There is no need to merge checkpoints;
this happens automatically at recovery. There is some amount of overhead in
that the checkpoints can contain some rows in multiple checkpoints, and thus
some amount of overhead at recovery. We calculate the number of parts to use
based on the amount of changes. We even implemented an LCP simulator that
calculates the overhead while inserting and deleting large amounts of rows;
it has been used to find the proper settings for the algorithm's configurable
parameters.


Wednesday, February 28, 2018

UPDATE: MySQL Cluster 7.5 inside and out

Publishing a book internationally turned out to be a bit more complex than I
originally thought. There are therefore three different ways to order the book
MySQL Cluster 7.5 inside and out.

The E-book, which is now available world-wide.
The paperback version, which is also now available world-wide.
Finally the bound version, which is available in the Nordic countries,
Germany and Switzerland.

The original idea was to publish it as an E-book and as a bound book.
Given that the book is 640 pages long I felt that I wanted a bound book
to ensure that I can read the book a lot. I've got a few copies of the bound
book at home and I have it on my desk all the time, together with Jesper
and Mikiya's Pro MySQL NDB Cluster book.

As it turned out, the printer only had international agreements to
print paperback books with figures in black and white (the bound
version has color figures). To ensure that the book is available
world-wide I decided to also publish a paperback version.

So for example at the UK/US Amazon's bookshop the versions available are
the E-book and the paperback version.

Personally I still prefer the bound version. I discovered that a German
internet site offers international delivery. So if you want to buy the bound
version of the book you can use this site: Hugendubel.de.

If you have any comments on the book, or any errata, you can publish a comment
on this blog. I will also publish comments on this blog every now and then when
I discover any errors.

Feel free to also provide ideas for future inclusion in possible future editions of
this book.

Monday, February 12, 2018

Adaptive algorithms in NDB and in cars

The world is brimming with the ideas of self-driving cars and all sorts of
other concepts where computers are supposed to take care of
decision making.

This makes me a bit worried, for one because I simply like to drive and
would not want a computer to interfere with my driving. I am already
quite irritated by the many automatic things in cars that don't really work
when winter is upon us in Sweden :)

Anyway, this post is not about that; it is more about the general
problem of designing adaptive algorithms.

I've been designing NDB software for more than 20 years. During the
course of these years I have learned a bit about what is optimal
when executing NDB. Most of the software I write today is about
putting this knowledge into the NDB software itself.

There is a trend in databases today to automate configuration handling
in the DBMS. In NDB we started this trend in MySQL Cluster 7.4
when we implemented a "top" facility inside the NDB data nodes.
At the same time we also started keeping track of lags in writing to disk.

We used this knowledge to design an adaptive algorithm that changes
the speed of writing local checkpoints based on the current CPU usage
and IO lag.

We moved on in 7.5 and implemented an adaptive algorithm to control
from where sending will happen. This algorithm is also based on
keeping track of CPU usage in the threads in the data node.

The new partial LCP algorithm is also highly adaptive where it decides
how incremental the LCP should be based on the writing in the
database.

There is also work ongoing on some adaptiveness in the NDB API
where some threads will start up to assist the receive thread in the NDB
API when it gets overloaded.

There is even more work ongoing to ensure that the checkpoint speed
adapts also to conditions where we are starting to run out of REDO log.

Now the idea of adaptive algorithms is clearly a good idea, but, and there
is a big but, there are two problems with ANY adaptive algorithm.

The first problem is oscillation. Adaptive algorithms work by changing
the environment based on input from the environment. When you look
at an adaptive algorithm that works, it is actually quite impressive. By
merely reacting to the proper conditions in the input you can get a
system that quickly adapts to any working condition and finds a new
optimal spot to work in.

My original study at the university was mathematical statistics.
One important fact in most mathematical statistics is that you
have stable states and you have transient states.

An adaptive algorithm will work fine as long as the frequency of
changes in the environment is not faster than the time it takes to
find a new stable state.

As an example, the algorithms in NDB mostly take decisions to change
the environment about once per second. One important way to make those
adaptive algorithms better at adapting is to not change the controls
too much. If one bases the decision on what to do in the next second
only on the last second, the adaptive algorithm is quite likely to
self-oscillate.

Thus it is important to build some inertia into the adaptive
algorithm. This protects the algorithm from going wild,
but it doesn't make it adapt to conditions that change more
quickly than the change frequency. Adaptive algorithms
cannot handle that.

So this is the first problem: to ensure that the adaptive
algorithm is quick enough to handle the changing environment,
but not so quick that it starts to self-oscillate.
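The inertia idea can be illustrated with a generic sketch (not the actual NDB code): smooth the measured load over several seconds before adjusting the control, so that a single noisy second cannot swing the system into oscillation.

class AdaptiveRate:
    def __init__(self, min_rate, max_rate, alpha=0.2):
        self.rate = min_rate
        self.min_rate, self.max_rate = min_rate, max_rate
        self.alpha = alpha            # low alpha means more inertia
        self.load = 0.0               # smoothed CPU load estimate

    def tick(self, measured_load):
        # Called about once per second; blend the new measurement into
        # the smoothed estimate instead of reacting to it directly
        self.load += self.alpha * (measured_load - self.load)
        if self.load > 0.9:           # overloaded: back off gently
            self.rate = max(self.min_rate, self.rate * 0.9)
        elif self.load < 0.7:         # headroom: speed up slowly
            self.rate = min(self.max_rate, self.rate * 1.05)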

The second problem is when two adaptive algorithms
crash into each other. As an example, in NDB we have a
problem when the CPU load is extremely high due to
application activity while we are at the same time
coming close to the limit of the REDO log. In this case
we have two adaptive algorithms in conflict: one wants
to decrease the checkpoint speed to give room for the
application activity, while the other wants to speed up the
checkpoint activity to avoid running out of REDO log.

Now in a car the stakes are higher: human lives are involved.
A self-driving car will have to solve almost the same problem
when the driver has decided on the speed he wants
to travel while at the same time the control system of the car
sees dangers coming up ahead. These dangers could be other
cars, cliffs or any other thing.

Sometimes cars even have to make decisions on whether
their own passengers should survive or whether the bystander
should survive.

So the software of a self-driving car, and any other
self-controlling software, has two big problems
to solve.

1) How often should I take input from the environment and
decide to change the controller parameters?
2) How should I handle conflicting requirements?

Failure in handling 1) will lead to self-oscillating
behaviour and failure in handling 2) will lead to
crashes.

So hopefully any developer of self-driving cars has read up
a lot on adaptive algorithms and knows exactly when the
algorithm is safe and when it isn't.

Personally I always feel a bit uneasy about any adaptive
algorithm since I know that it is almost impossible to
predict exactly how it is going to behave in all situations.

The mathematics involved in understanding adaptive
algorithms requires a lot of understanding of differential
equations.

Thursday, February 08, 2018

Content of MySQL Cluster 7.5 inside and out

Here is a link to the Book content in the new book MySQL Cluster 7.5 inside and out.

MySQL Cluster 7.5 inside and out

A new book on MySQL Cluster is out. It is called MySQL Cluster 7.5 inside and out.
It is on its way out to all internet book stores. Currently it is accessible on
adlibris.se, on BoD's bookshop and now also on cdon.com. These are all in Swedish,
but with Google Translate that should be possible to overcome. It will hit most
other international book stores within a week or less.

It is currently available for order as a printed book; it will become available
as an eBook in 1-2 weeks. The printed book is in bound format, since I wanted to
make it possible to read it frequently. It is 640 pages long.

I started development on NDB as my Ph.D research. The first years I collected
requirements and tried to understand how database internals work. In 1996 the
development started. I wrote my Ph.D thesis in 1998; it stated most of the ideas
used in the early versions of the NDB implementation.

The idea of writing a book about MySQL Cluster has come up for me every now
and then for more than 10 years. However, all the time I felt it was more
important to focus on developing one more feature.

In 2015 I decided that it was more important to write down a description of the
features in NDB, so in 2016 I started writing this book. As usual with book
projects, it took a lot longer than expected.

At about the same time, Jesper Wisborg Krogh and Mikiya Okuno also started
writing a book about MySQL Cluster. It is called Pro MySQL NDB Cluster.

So the good news is that we now have two very good books about MySQL Cluster.
Jesper and Mikiya's book is written from the perspective of the DBA that has
decided to use NDB.

My book explains why NDB was developed, it describes a great number of applications
where it fits in. It compares it to other clustering solutions for MySQL.

I wanted to make sure that the first step to install and get started with MySQL Cluster
isn't a showstopper, so I described in some detail how to install it and get up and running
on various platforms. This includes a chapter on MySQL Cluster and Docker. Later
there is also a chapter on using NDB in the cloud.

Next it goes through NDB from an SQL point of view and describes all the things
that are good to understand when working with MySQL Cluster. It goes through
the direct NDB APIs (the C++ API, the Java API and the Node.js API) and how to
import and export data to and from NDB.

It explains the various ways you can replicate between clusters using MySQL
Cluster, why those solutions exist and what problems they are trying to solve.

We have developed quite a few ndbinfo tables that can be used to gain an
understanding of the cluster in operation. These tables, and their purposes,
are explained.

Next I dive into some internals, describing the software architecture, the
message flows and the restarts in NDB. I provide some advice on how to
optimise restart times.

Next I dive deep into the configuration of MySQL Cluster, both the cluster
configuration and the MySQL servers. I provide detailed descriptions of how
to configure for optimal performance, as well as details on the memory impact
of many configuration parameters. The configuration chapters include detailed
descriptions of how to set up an optimal execution environment for NDB,
including how to set up the Linux infrastructure for optimal performance.

I also go through the testing frameworks that we make use of, including a
detailed description of the benchmark framework I have developed for more than
10 years, called dbt2-0.37.50, which can be used to run benchmarks with
sysbench, DBT2 and flexAsynch.

Finally the history of MySQL Cluster is provided.


Wednesday, January 31, 2018

Partial LCP in MySQL Cluster 7.6.4

Today MySQL Cluster 7.6.4 DMR is out. This new version contains some very interesting
new developments in the area of checkpointing.

When I developed the original NDB algorithms in the late 90s, the computer I had
access to had 2 CPUs and 1 GByte of memory. At the time we were aiming at 10,000
updates per second.

So with average row sizes of 100 bytes this meant that we changed 1 MByte per
second. The local checkpoint algorithm was designed to execute about once every
5 minutes.

This meant that at high load most of the database would be changed between
checkpoints, so the checkpointing algorithm writes the entire database. This
means that a lot of updates are merged together in each checkpoint.

This algorithm has been used now in NDB for 20 years and it still works fine.

Now HW is developing in a number of ways.

1) Memory is getting larger and larger. Today it is not uncommon to find machines
with TBytes of memory.

2) The ratio between available disk space and memory size is decreasing.

In the past it was not uncommon to have 100 times as much disk space as memory
space. With SSDs this factor has decreased significantly. Particularly on the
servers where NDB resides, this factor has decreased to around 2-3 in many cases.

In addition, the development of persistent memory is ongoing; this is likely to
make memory sizes jump by another 4x or so. This means that even tens of TBytes
in a machine is likely to be common.

When starting the development of the new recovery algorithm in NDB two years
ago, the requirement was thus to implement a new recovery algorithm that can
handle main memory sizes of up to at least 16 TBytes, with disk sizes that are
about 2x the memory size.

These requirements lead to the following conclusions:
1) We need to implement some form of incremental checkpoint algorithm.
2) We can only maintain one copy of the data on disk.
3) We must have the ability to use REDO logs that are much smaller than the memory size.

Currently in NDB a checkpoint is not completed until all data has been written.
This means that we must have disk space for checkpoints of up to at least 2x
the memory size.

During massive inserts (e.g. importing data), it is necessary to have a very large REDO log
to ensure that we don't run out of REDO log during import of a database.

These requirements are ok if there is sufficient disk space, but we wanted to make sure
that we don't rely on large disk spaces in the future.

In addition we wanted to reengineer the LCP algorithm to take decisions locally
and not rely on all nodes participating in each decision about LCPs. This means
that we can now perform checkpoints locally in a node during restart without
affecting LCPs in other nodes. This is particularly interesting for initial node
restarts, where a new node will take a very long time to execute an LCP whereas
the live nodes can perform LCPs in a very short time.

There were two main possible implementations of incremental checkpoints.
1) Use a standard page cache implementation also for main memory.
This would mean that we would store two pages on disk for each page in main
memory, writing each page such that we always keep the old page until the LCP
is completed.

2) A partial LCP where a part of the rows are fully checkpointed and the rest
of the checkpoint covers only the changed rows.

I analysed the two algorithms and concluded that the standard page cache
algorithm writes far too much when rows are small.

When the memory size is TBytes in size, the likelihood of one page receiving
more than one write during a checkpoint is small; thus each row change will
lead to one page being written in the LCP.

With a row size of 100 bytes and a page size of 32 kBytes, this would waste
more than 300x the needed disk bandwidth.

In addition it would still require 2x the disk space.

So the choice was made to go with the partial LCP variant. Interestingly, the
analysis of the standard page cache algorithm points to a problem for all
disk-based DBMSs: the growth to larger page caches means that more
and more disk bandwidth is spent on writing checkpoints.

So here is a short description of how the partial LCP algorithm works.

1) For each table partition we keep one or two LCP control files. This file
is normally 4 kBytes in size (it can be 8 kBytes in some cases).
It is used at recovery to know which checkpoint files to use in recovery,
and at the next LCP to know which checkpoint files to write.

2) We keep track of the number of rows in a table partition and of the number
of row changes since the last LCP was started on the table partition. These
two numbers are used to decide how many parts of the table partition to fully
checkpoint.

If the number of changes is 0, we only write a new control file.
If there are changes, we write at least 1 part, and at most a full
local checkpoint that writes all 2048 parts.

3) The table partition is divided into parts based on the row id. We use the
page part of the row id to decide which part a row id belongs to.

4) The number of parts to write is derived using some mathematical formulas.
As it turns out there is an interesting relation here.
If we write fewer parts fully, the work at recovery increases and the
size of all checkpoints increases, but at the same time the amount of
writes to disk decreases.
With more parts written per checkpoint we increase the amount of
writes to disk per checkpoint, but we decrease the checkpoint size
and the amount of work at recovery.

We decided to make the choice here configurable in a new configuration
parameter called RecoveryWork. This can be set between 25 and 100 and
defaults to 50.

At 50 the checkpoint size will be around 1.6 times the data size. The
amount of checkpoint writes to do will be around 3 times the size of
the changed rows.

Setting it to 25 means that the checkpoint size will be around 1.35 times
the data size. The checkpoint writes will be around 4 times the size of
the changed rows.

Setting it to 100 means that the checkpoint size will be around 2.1 times
the data size and the checkpoint writes will be around 2 times the size
of the changed rows.

Thus there is an exponential dependency on the amount of checkpoint
writes required to achieve the minimum restart time.

We have selected a balanced approach as the default setting.
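The quoted ratios can be captured in a small calculator sketch; the table simply encodes the numbers quoted above, while the real formulas are documented in the NDB source code (see the comments in Backup.cpp mentioned below).

# RecoveryWork: (checkpoint size / data size, checkpoint writes / changed rows)
QUOTED_FACTORS = {
    25:  (1.35, 4.0),
    50:  (1.6,  3.0),   # the default, balanced setting
    100: (2.1,  2.0),
}

def lcp_estimates(recovery_work, data_size, changed_bytes):
    size_factor, write_factor = QUOTED_FACTORS[recovery_work]
    return data_size * size_factor, changed_bytes * write_factor

# Example: 1024 GBytes of data with 10 GBytes changed since the last LCP
print(lcp_estimates(50, 1024.0, 10.0))   # -> (1638.4, 30.0) GBytes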

It is also possible to set EnablePartialLcp to 0. In this case we always
write full checkpoints if any row changed, which means that the checkpoint
size will be equal to the data size. In this case it isn't possible to use a
small REDO log, since the REDO log must cover all the writes performed while
a full LCP completes. In this setup the REDO log should be 2x the data size
to ensure that we survive even a large import of a database.

The above calculations are based on the assumption that the amount of writes
per LCP is very small compared to the data size. The code contains large
comments in Backup.cpp that explain this in even more detail.

There is an additional 12.5% in the checkpoint size due to the fact that
we only delete whole files and a full checkpoint writes 8 files. In the worst
case we might have to keep a file where only 1 part is still relevant and the
rest is no longer needed; thus 1/8 could in the worst case be wasted space.
Normally this would not be the case, but we want to ensure that we can keep
the disk space within limits all the time, even in the worst case.




MySQL Cluster 7.6.4 is out

MySQL Cluster 7.6.4 DMR is out.

This new version contains a number of goodies.

1) The local checkpoint algorithm has been rewritten
The new checkpointing is designed to scale to at least 16 TBytes of DataMemory
Checkpoints will be much faster, which decreases recovery times significantly
Table fragments that have not been updated will not need any new checkpoints written
Checkpoint size on disk is significantly decreased

2) MySQL Cluster Configurator (MCC, Auto Installer)
MCC is significantly improved, particularly for setups where you
have external computers, either on-premise or in the cloud.

3) New cloud feature
In a cloud with availability domains/zones there can be a
10x difference between the latency inside an AD and the latency between
ADs. To run MySQL Cluster in a cloud region with synchronous
replication between ADs, one can now configure nodes with a
LocationDomainId. This LocationDomainId will be used to
ensure that the transaction coordinator is placed in the same AD and
that we always prefer reading data from our own AD if possible.

4) New ODirectSyncFlag
When using ODirect, there are a number of file systems that ensure that writes
are also synched to disk. If the user knows that he is working in such an
environment, setting ODirectSyncFlag can improve disk write speeds by
around 2-3x. This is particularly interesting when using hard drives.

5) Changed default behaviour of restart configuration
We changed BuildIndexThreads from 0 to 128 to improve the speed of
index rebuilds. We added a new configuration setting to specify which
CPUs can be used for index rebuilds.

We increased batch sizes (and made them configurable) to improve
performance of unique index creation and online add node and some
other algorithms.

We changed the default algorithm for initial node restart to rebuild indexes
in a special phase.

All these changes can lead to a very significant reduction in restore times.

6) Many improvements to our parallel query implementation (pushdown join,
SPJ). The improvement depends on the query, but in our special benchmark
query we have cut the latency of query execution almost in half.

7) Parallel UNDO log applier for disk columns
The phase where we apply the UNDO log is now fully parallelised over all
LDM threads. For a scenario with 4 LDM threads we've seen a speedup of
5x for the UNDO log applier phase (this log is only used to undo changes in
pages containing disk columns during a restart).

8) Bug fixes
We have continued our quality improvements to ensure that each new version
is even more stable compared to the previous one.