Monday, February 12, 2018

Adaptive algorithms in NDB and in cars

The world is brimming with the ideas of self-driving cars and all sorts of
other concepts where computers are supposed to take care of
decision making.

This makes me a bit worried, for one because I simply like to drive and
would not want a computer to interfere with my driving. I am already
quite irritated by many automatic things in cars that don't really work
when winter is upon us in Sweden :)

Anyways, this post is not about that; it is more about the general
problem of designing adaptive algorithms.

I've been designing NDB software for more than 20 years. During the
course of these years I have learned a bit about what is optimal
when executing NDB. Most of the software I write today is about
putting this knowledge into the NDB software itself.

There is a trend in databases today to automate configuration handling
in the DBMS. In NDB we started down this path in MySQL Cluster 7.4
when we implemented a "top" facility inside the NDB data nodes to
track CPU usage per thread.
At the same time we also keep track of lags in writing to disk.

We used this knowledge to design an adaptive algorithm that changes
the speed of writing local checkpoints based on the current CPU usage
and IO lag.
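To make the idea concrete, here is a minimal sketch of such a controller.
It is not the actual NDB implementation; the class name, thresholds and
speed limits are invented for the example:

// Illustrative sketch only: adjust the local checkpoint write speed about
// once per second based on measured CPU usage and IO lag. The names,
// thresholds and limits are invented and do not match the NDB source code.
#include <algorithm>

struct Measurements
{
  double cpu_usage;      // fraction 0.0 - 1.0 used by the data node threads
  double io_lag_seconds; // how far behind the disk writes currently are
};

class CheckpointSpeedController
{
public:
  // Called roughly once per second with fresh measurements.
  void adapt(const Measurements &m)
  {
    if (m.cpu_usage > 0.90 || m.io_lag_seconds > 2.0)
    {
      // System under pressure: back off checkpoint writes.
      m_speed_mbyte_per_sec = std::max(m_min_speed, m_speed_mbyte_per_sec * 0.8);
    }
    else if (m.cpu_usage < 0.60 && m.io_lag_seconds < 0.5)
    {
      // Plenty of headroom: speed checkpoints up again.
      m_speed_mbyte_per_sec = std::min(m_max_speed, m_speed_mbyte_per_sec * 1.1);
    }
    // Otherwise keep the current speed.
  }
  double speed() const { return m_speed_mbyte_per_sec; }

private:
  double m_speed_mbyte_per_sec = 20.0;
  double m_min_speed = 5.0;
  double m_max_speed = 200.0;
};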

We moved on in 7.5 and implemented an adaptive algorithm to control
from which threads network sending happens. This algorithm is also based
on keeping track of CPU usage in the threads of the data node.

The new partial LCP algorithm is also highly adaptive: it decides
how incremental the LCP should be based on the amount of writes in the
database.

There is also ongoing work on some adaptiveness in the NDB API,
where extra threads will start up to assist the receive thread in the NDB
API when it gets overloaded.

There is even more ongoing work to ensure that the checkpoint speed
also adapts to conditions where we are starting to run out of REDO log.

Now adaptive algorithms are clearly a good idea, but, and this
is a big but, there are two problems with ANY adaptive algorithm.

The first problem is oscillation. Adaptive algorithms work by changing
the environment based on input from the environment. When you look
at an adaptive algorithm that works, it is actually quite impressive. By
merely adjusting the controls based on the input you can get a
system that quickly adapts to any working condition and finds a new
optimal spot to work in.

My original study at the university was mathematical statistics.
One important concept in mathematical statistics is that you
have stable states and you have transient states.

An adaptive algorithm will work fine as long as the frequency of
changes in the environment is not faster than the time it takes to
find a new stable state.

As an example, most of the adaptive algorithms in NDB take
decisions to change the environment about once per second.
One important thing that makes those adaptive algorithms better
at adapting is to not change the controls too much at a time. If one
bases the decision on what to do in the next second only on the last
second, the adaptive algorithm is quite likely to
self-oscillate.

Thus it is important to build some inertia into the adaptive
algorithm. This protects the algorithm from going wild.
But it doesn't make it adapt to conditions that change
quicker than the decision frequency. Adaptive algorithms
cannot handle that.
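One simple way to build in such inertia, shown here as a sketch and not
as the actual NDB code, is to base each decision on an exponentially
smoothed average of the measurements instead of only on the last second:

// Illustrative sketch: exponential smoothing gives the controller inertia,
// so one noisy second does not swing the control output back and forth.
class SmoothedLoad
{
public:
  explicit SmoothedLoad(double alpha) : m_alpha(alpha) {}

  // Feed one new per-second sample, get back the smoothed value that the
  // adaptive algorithm should base its decision on.
  double update(double sample)
  {
    if (m_first)
    {
      m_value = sample;
      m_first = false;
    }
    else
    {
      // e.g. alpha = 0.2: the new sample contributes 20%, history 80%.
      m_value = m_alpha * sample + (1.0 - m_alpha) * m_value;
    }
    return m_value;
  }

private:
  double m_alpha;
  double m_value = 0.0;
  bool m_first = true;
};

The smaller alpha is, the more inertia the controller gets, and the slower
it reacts to real changes in the environment, which is exactly the
trade-off described above.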

So this is the first problem: to ensure that the adaptive
algorithm changes quickly enough to handle the
changing environment, but not so quickly that it starts to
self-oscillate.

The second problem is when two adaptive algorithms
crash into each other. As an example, in NDB we have a
problem when CPU load is extremely high due to
application activity while at the same time we are
coming close to the limit of the REDO log. In this case
we have two adaptive algorithms that conflict: one wants
to decrease the checkpoint speed to leave room for the
application activity, while the other wants to increase the
checkpoint speed to avoid running out of REDO log.
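One way (of several) to resolve such a conflict is to give one of the
algorithms priority. A hypothetical sketch, not the actual NDB resolution,
where the REDO log concern always wins could look like this:

// Illustrative sketch: let the REDO log controller act as a floor under
// whatever speed the CPU controller picks, since running out of REDO log
// would stop all write transactions.
#include <algorithm>

double resolve_checkpoint_speed(double cpu_based_speed,
                                double redo_based_minimum_speed)
{
  return std::max(cpu_based_speed, redo_based_minimum_speed);
}

The point is that the resolution has to be an explicit design decision;
two controllers pulling in opposite directions without a rule like this
will fight each other.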

Now in a car the stakes are higher, since human lives are involved.
A self-driving car will have to solve almost the same problem
when the driver has decided on the speed he wants
to travel while at the same time the control system of the car sees
dangers coming up ahead. These dangers could be other
cars, cliffs or anything else.

Sometimes cars even have to make a decision on whether
their own passengers should survive or whether a bystander
should survive.

So the software of a self-driving car, and any other
self-controlling software, has two big problems
to solve.

1) How often should I take input from the environment and
decide to change the controller parameters?
2) How should I handle conflicting requirements?

Failure in handling 1) will lead to self-oscillating
behaviour and failure to handle 2) will lead to
crashes.

So hopefully any developer of self-driving cars has read up
a lot on adaptive algorithms and knows exactly when the
algorithm is safe and when it isn't.

Personally I always feel a bit uneasy about any adaptive
algorithm since I know that it is almost impossible to
predict exactly how it is going to behave in all situations.

The mathematics involved in understanding adaptive
algorithms requires a lot of understanding of differential
equations.

Thursday, February 08, 2018

Content of MySQL Cluster 7.5 inside and out

Here is a link to the Book content in the new book MySQL Cluster 7.5 inside and out.

MySQL Cluster 7.5 inside and out

A new book on MySQL Cluster is out. It is called MySQL Cluster 7.5 inside and out.
It is on its way out to all internet sites. Currently it is accessible on adlibris.se, on
BoD's bookshop and now also on cdon.com. These sites are all in Swedish, but with Google
Translate that should be possible to overcome. It will hit most other international book
stores within a week or less.

It is currently available for order as a printed book; it will become available as an
eBook in 1-2 weeks. The printed book is in bound format since I wanted it to stand up
to frequent reading; it is 640 pages long.

I started development of NDB as my Ph.D research. In the first years I collected requirements
and tried to understand how database internals work. In 1996 the development started.
In 1998 I wrote my Ph.D thesis, which described most of the ideas used in the early versions of
the NDB implementation.

The idea of writing a book about MySQL Cluster has come up for me every now
and then for more than 10 years. However, each time I felt it was more important
to focus on one more feature to develop.

In 2015 I decided that it was more important to write down a description of the features in
NDB. So in 2016 I started writing this book. As usual with book projects, it took a lot longer
than expected.

At about the same time Jesper Wisborg Krogh and Mikiya Okuno also started writing a
book about MySQL Cluster. It is called Pro MySQL NDB Cluster.

So the good news is that we now have two very good books about MySQL Cluster.
Jesper and Mikiya's book is written from the perspective of the DBA who has
decided to use NDB.

My book explains why NDB was developed and describes a great number of applications
where it fits in. It also compares NDB to other clustering solutions for MySQL.

I wanted to make sure that the first step to install and get started with MySQL Cluster
isn't a showstopper, so I described in some detail how to install it and get up and running
on various platforms. This includes a chapter on MySQL Cluster and Docker. Later
there is also a chapter on using NDB in the cloud.

Next it goes through NDB from an SQL point of view and describes all the things that
are good to understand when working with MySQL Cluster. It goes through the direct
NDB APIs (C++ API, Java API and a Node.js API). It goes through how to import
and export data to/from NDB.

It explains the various ways you can replicate between clusters using MySQL Cluster.
It also explains why those solutions exist and what problems they are trying to solve.

We have developed quite a few ndbinfo tables that can be used to gain an understanding
of the cluster in operation. These tables and their purposes are explained.

Next I dive into some internals, describing the software architecture, the message flows
and the restarts in NDB. I provide some advice on how to optimise restart times.

Next I dive deep into the configuration of MySQL Cluster, both the cluster configuration
and the MySQL servers. I provide detailed descriptions of how to configure for optimal
performance. I also provide details on the memory impact of many configuration parameters.
The configuration chapters include detailed descriptions of how to set up an optimal
execution environment for NDB; this includes details on how to set up the Linux
infrastructure for optimal performance.

Finally I go through the testing frameworks that we make use of. In detail I go through
the benchmark framework called dbt2-0.37.50 that I have developed over more than 10 years,
which can be used to benchmark with sysbench, DBT2 and flexAsynch.

Last, the history of MySQL Cluster is provided.


Wednesday, January 31, 2018

Partial LCP in MySQL Cluster 7.6.4

Today MySQL Cluster 7.6.4 DMR is out. This new version contains some very interesting
new developments in the area of checkpointing.

When I developed the original NDB algorithms in the late 90s, the computer I had access to
had 2 CPUs and 1 GByte of memory. At the time we were aiming at 10,000 updates per second.

So with average row sizes of 100 bytes this meant that we changed 1 MByte per second.
The local checkpoint algorithm was designed to be executed about once per 5 minutes.

This meant that at high loads most of the database would be changed between two
checkpoints (1 MByte per second for 300 seconds is around 300 MBytes of changes).
So the checkpointing algorithm writes the entire database, which means that a lot of
updates are merged together in the checkpoint.

This algorithm has been used now in NDB for 20 years and it still works fine.

Now HW is developing in a number of ways.

1) Memory is getting larger and larger. Today it is not uncommon to find machines
with TBytes of memory.

2) The ratio between available disk space and memory size is decreasing.

In the past it was not uncommon to have 100 times as much disk space as memory space.
With SSDs this factor has decreased significantly. Particularly in the servers where
NDB resides, this factor has decreased to around 2-3 in many cases.

In addition, the development of persistent memory is ongoing; this is likely to cause
memory sizes to jump by another 4x or so. This means that even tens of TBytes
in a machine is likely to become common.

When starting the development of the new recovery algorithm in NDB 2 years ago, the
requirement was thus to implement a new recovery algorithm that can handle
main memory sizes of at least 16 TBytes with disk sizes that are
only about 2x the memory size.

These requirements lead to the following conclusions:
1) We need to implement some form of incremental checkpoint algorithm.
2) We can only maintain one copy of the data on disk.
3) We must have the ability to use REDO logs that are much smaller than the memory size.

Currently in NDB a checkpoint is not completed until all data has been written. This
means that we must have disk space for up to at least 2x the memory size for
checkpoints.

During massive inserts (e.g. importing data), it is necessary to have a very large REDO log
to ensure that we don't run out of REDO log during import of a database.

These requirements are ok if there is sufficient disk space, but we wanted to make sure
that we don't rely on large amounts of disk space in the future.

In addition, we wanted to reengineer the LCP algorithm to take decisions locally and not
rely on all nodes participating in each decision about LCPs. This means that we can now
perform checkpoints locally in a node during restart without affecting LCPs in other
nodes. This is particularly interesting for initial node restarts, where a new node will
take a very long time to execute an LCP whereas the live nodes can perform LCPs in
a very short time.

There were two main possible implementations for incremental checkpoints:

1) Use a standard page cache implementation also for main memory.
This would mean that we would store two pages for each page in main memory,
writing each page such that we always keep the old page until the LCP
is completed.

2) A partial LCP where a part of the rows are fully checkpointed and for the
rest only the changed rows are checkpointed.

I analysed the two algorithms and concluded that the standard page cache
algorithm writes far too much when rows are small.

When the memory is TBytes in size, the likelihood of one page receiving more
than one write between checkpoints is small; thus each row change will lead to one
full page written in the LCP.

With a row size of 100 bytes and a page size of 32 kBytes this would waste
more than 300x the necessary disk bandwidth.
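The factor follows directly from the page-to-row ratio, as a rough
back-of-the-envelope calculation:

  32 kBytes per page / 100 bytes per changed row = 32768 / 100 ≈ 328

so each changed row drags more than 300 times its own size onto disk.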

In addition it would still require 2x the disk space.

So the choice was made to go with the partial LCP variant. Interestingly, the
problem found in the analysis of the standard page cache algorithm applies to all
disk-based DBMSs: the growth to larger and larger page caches will mean that more
and more disk bandwidth is spent on writing checkpoints.

So here is a short description of how the partial LCP algorithm works.

1) For each table partition we keep one or two LCP control files. This file
is normally 4 kBytes in size (it can be 8 kBytes in some cases).
This file is used at recovery to know which checkpoint files to use;
it is also used at the next LCP to know which checkpoint files to write.

2) We keep track of the number of rows in a table partition and of the number
of row changes since the last LCP was started on the table partition.
These two numbers are used to decide how many parts
of the table partition to fully checkpoint.

If the number of changes is 0, we only write a new control file.
If there are changes we will write at least 1 part and at most a full
local checkpoint that writes all 2048 parts.

3) The table partition is divided into parts based on the row id. We use the
page part of the row id to decide which part a row belongs to.

4) The number of parts to write fully is decided by a set of mathematical formulas.
As it turns out there is an interesting relation here.
If we write fewer parts fully, the work at recovery increases and the total
size of all checkpoints increases, but at the same time the amount of
writes to disk decreases.
With more parts written fully per checkpoint we increase the amount of
writes to disk per checkpoint, but we decrease the total checkpoint size
and the amount of work at recovery (see the sketch after this list).
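As a rough illustration of points 2) to 4) above, the bookkeeping for one
table partition could be sketched like this. The real logic and the exact
formulas live in Backup.cpp; the names and the simple proportional rule
below are invented for this example:

// Illustrative sketch of the partial LCP bookkeeping for one table
// partition. The mapping of rows to parts and the rule for how many of
// the 2048 parts to write fully are simplified; the real formulas are in
// Backup.cpp.
#include <cstdint>
#include <algorithm>

static const uint32_t NUM_PARTS = 2048;

// The page part of the row id decides which of the 2048 parts a row
// belongs to.
uint32_t part_of_row(uint32_t page_id)
{
  return page_id % NUM_PARTS;
}

// Decide how many parts to checkpoint fully based on how much of the
// partition changed since the last LCP. Invented proportional rule:
// no changes -> 0 parts (only a new control file is written),
// otherwise at least 1 part and at most all 2048 parts.
uint32_t parts_to_write_fully(uint64_t rows, uint64_t changed_rows)
{
  if (changed_rows == 0 || rows == 0)
    return 0;
  double ratio = double(changed_rows) / double(rows);
  uint32_t parts = uint32_t(ratio * NUM_PARTS);
  return std::min(NUM_PARTS, std::max<uint32_t>(1, parts));
}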

We decided to make the choice here configurable in a new configuration
parameter called RecoveryWork. This can be set between 25 and 100 and
defaults to 50.

At 50 the checkpoint size will be around 1.6 times the data size. The
amount of checkpoint writes to do will be around 3 times the size of
the changed rows.

Setting it to 25 means that the checkpoint size will be around 1.35 times
the data size. The checkpoint writes will be around 4 times the size of
the changed rows.

Setting it to 100 means that the checkpoint size will be around 2.1 times
the data size and the checkpoint writes will be around 2 times the size
of the changed rows.
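For what it is worth, the three checkpoint sizes above happen to fit a
simple linear rule. This is only a fit to the stated numbers, not the
formula used in the code (which is documented in Backup.cpp):

  checkpoint size ≈ data size * (1.1 + RecoveryWork / 100)

  RecoveryWork = 25:  1.1 + 0.25 = 1.35x the data size
  RecoveryWork = 50:  1.1 + 0.50 = 1.6x the data size
  RecoveryWork = 100: 1.1 + 1.00 = 2.1x the data size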

Thus there is an exponential dependency: the amount of checkpoint
writes required grows quickly as one pushes towards the minimum restart time.

We have selected a balanced approach as the default setting.

It is also possible to set EnablePartialLcp to 0. In this case we always
write full checkpoints if any row changed. This means that the checkpoint
size will be equal to the data size. In this case it isn't possible to use a
small REDO log, since the REDO log must be able to cover the time it takes
to complete a full LCP. In this setup the REDO log should
be 2x the data size to ensure that we survive even a
large import of a database.
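In the cluster configuration file this is controlled per data node. A
minimal example, using the default values described above:

[ndbd default]
# Partial LCPs are used by default in 7.6; set to 0 to always write full
# checkpoints when any row has changed.
EnablePartialLcp=1
# Trade checkpoint size and write amount against recovery work,
# allowed range 25-100, default 50.
RecoveryWork=50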

The above calculations are based on the assumption that the amount of writes
per LCP is very small compared to the data size.
The code contains large comments in Backup.cpp that explain this in
even more detail.

There is an additional 12.5% in the checkpoint size due to the fact that
we only delete whole files and a full checkpoint writes 8 files. In the worst
case we might have to keep a file where only 1 part is still relevant
and the rest is no longer needed; this means that 1/8 of the space could in the
worst case be wasted. Normally this would not be the case, but we want
to ensure that we can keep the disk space within limits at all times, even
in the worst case.




MySQL Cluster 7.6.4 is out

MySQL Cluster 7.6.4 DMR is out.

This new version contains a number of goodies.

1) Local checkpoint algorithm has been rewritten
The new checkpointing is designed to scale to DataMemory sizes of at least 16 TBytes
Checkpoints will be much faster, which decreases recovery times significantly
Table fragments that are not updated will not need any new checkpoints written
Checkpoint size on disk is significantly decreased

2) MySQL Cluster Configurator (MCC, Auto Installer)
MCC is significantly improved, particularly for setups where you
have external computers either on-premises or in the cloud.

3) New cloud feature
In the cloud with availability domains/zones it is possible to have a
10x difference between the latency inside an AD and the latency between
ADs. To run MySQL Cluster in a cloud region with synchronous
replication between ADs one can now configure nodes with a
LocationDomainId. This LocationDomainId will be used to
ensure that the transaction coordinator is placed in the same AD and
that we always prefer reading data from our own AD if possible
(a sketch of such a configuration is shown at the end of this post).

4) New ODirectSyncFlag
When using ODirect there are a number of file systems that ensure that writes
are also synched to disk. If the user knows that he is working in such an
environment, setting ODirectSyncFlag can improve disk write speeds by
around 2-3x. This is particularly interesting when using hard drives.

5) Changed default behaviour of restart configuration
We changed the default of BuildIndexThreads from 0 to 128 to improve the speed of
index rebuilds. We added a new configuration setting to specify which
CPUs can be used for index rebuilds.

We increased batch sizes (and made them configurable) to improve
performance of unique index creation and online add node and some
other algorithms.

We changed the default algorithm for initial node restart to rebuild indexes
in a special phase.

All these changes can lead to a very significant reduction in restore times.

6) Many improvements to our parallel query implementation (pushdown join,
SPJ). The improvement depends on the queries, but for our special benchmark
query we have cut the query execution latency almost in half.

7) Parallel UNDO log applier for disk columns
The phase where we apply the UNDO log is now fully parallelised over all
LDM threads. For a scenario with 4 LDM threads we've seen a speed up of
5x for the UNDO log applier phase (this is only used to UNDO changes in
pages containing disk columns during a restart).

8) Bug fixes
We have continued our quality improvements to ensure that each new version
is even more stable compared to the previous one.
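
As a footnote to point 3) above, here is a hedged sketch of how
LocationDomainId might be set for three data nodes spread over three
availability domains. The node ids and host names are placeholders:

[ndbd]
NodeId=1
HostName=host-in-ad-1
LocationDomainId=1

[ndbd]
NodeId=2
HostName=host-in-ad-2
LocationDomainId=2

[ndbd]
NodeId=3
HostName=host-in-ad-3
LocationDomainId=3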