Tuesday, March 10, 2015

LCP Pausing module in MySQL Cluster 7.4

A new feature that assists in making node restart much faster is
the new PAUSE LCP protocol. This is an excerpt from the MySQL
Cluster 7.4 source code. There is also a fair amount of new
comments in the 7.4 source code which are only valid in the code
context.

This module contains code that executes for the purpose of pausing
LCP reporting to our meta data for a short time while we are copying the
meta data to a new starting node.

In order to better understand the handling of the LCP protocol we will
describe the LCP protocol, this includes both the old and the new protocol.

The LCP protocol is controlled by the DIH in the master node.
When an LCP has been completed we will immediately start checking for
the need for a new LCP to be started.

The first step here is to ensure that we have had sufficient activity in
the cluster to necessitate an LCP to be executed again.

To check this we send TCGETOPSIZEREQ to all DBTCs in the cluster. This
will gather in an estimate of how much writes we've had in the cluster
since the last LCP was started. There are also various ways to ensure
that we start an LCP immediately if so needed.

If the activity was sufficient we will start the LCP. Before starting the LCP
we will calculate a number of GCI values that are important, oldest restorable
GCI and so forth. Next we will send TC_CLOPSIZEREQ to all DBTCs in
the cluster to clear the activity counter in DBTC as preparation for the next
LCP start.

In the old way we will then grab a mutex on the fragment info, this
mutex will be held until the LCP is completed. The mutex is held in
the master node, in a master takeover the mutex needs to be taken
also in the new master node. Since all LCPs goes through the master
node this has the same effect as a distributed mutex on the fragment
info.

In the new way we will start the LCP immediately without grabbing
the mutex.

The first step in starting is to calculate the set of LQHs involved in
the LCP and the set of DIHs involved in the LCP. A node is involved in
the LCP in DIH if it has had the meta data copied to it. It will
participate in an LCP in LQH if the data has been restored and we're
ready to perform a full LCP.

Next we update to the new LCP id of the new LCP.

The next step is performed in the master node by walking through all
fragment replicas of all active tables to see how much of the REDO log
we can cut away when starting the new LCP. At the first order of a
LCP of a fragment in an LDM instance we will set the new log tail in
that LDM instance.

After calculating the new GCI values and setting the LCP id we will
synchronize this information with all other nodes in the cluster.
This information will also be synchronized to the file system in
the Sysfile. This file is where all restarts start by looking at
the state of the our database on files.
The COPY_GCIREQ signal is used to distribute this message.

When all nodes have synchronized this information to disk and confirmed
this to the master then we are ready to start sending orders to perform
the individual checkpoints of the fragment replicas.

The next step is that we want to set the tables to be involved in the
LCP. At this point we want to ensure that the same set of tables is
calculated in all nodes. To ensure this we grab the mutex that ensures
no tables are able to commit their CREATE TABLE statements until we are
done with this step.
This is started by the signal START_LCP_REQ. This signal also contains
list of nodes involved in the LCP both for LQH and DIH.

CREATE TABLE can create new tables prior to this point  which we will
include, and that's ok as they cannot possibly affect the new redo tail
position. DROP TABLE can drop tables prior to this point, which could
remove the need to maintain some old redo, but that will be handled in
the following LCP.

Each table to execute the LCP on is marked with a proper state in the
variable tabLcpStatus. Also each fragment replica to execute the LCP
on is marked with true in the lcpOngoingFlag and we set the number of
replicas to perform LCP on per fragment as well.

These preparatory steps are done in a synchronized manner, so all nodes
have received information about the COPY_GCIREQ and now all nodes have
heard the START_LCP_REQ signals. So in a master takeover we can ask all
nodes about their LCP state and we can derive if we sent the COPY_GCIREQ
to all nodes and similarly we can derive if we sent and completed the
START_LCP_REQ step. To derive this requires all nodes to have heard of
those signals, not just one of them since a crash can occur in the
middle of signal sending.

In a master takeover if we haven't completed the COPY_GCIREQ step then
we can start the next LCP from the beginning again. If COPY_GCIREQ has
been completed but not the START_LCP_REQ, then we can restart the
START_LCP_REQ step. Finally if the START_LCP_REQ has been completed
then we know that the execution of checkpoints on individual fragment
replicas is ongoing. Obviously in a master take over we should ensure
that the processing of START_LCP_REQ is completed before we report
back our state to the master node to ensure that we make the master
takeover handling as simple as possible.

So now that we know exactly which tables and fragment replicas to
checkpoint it is time to start the actual checkpoint phase.

The master node will send LCP_FRAG_ORD to DBLQH for each of the fragment
replicas to execute the LCP on.

In the old way there was a queue of such LCP_FRAG_ORD with limited size in
DBDIH (queue size was 2 in 7.3 and earlier and 128 in 7.4 versions).
Also DBLQH had a queue for LCP_FRAG_ORDs, in 7.3 this was 2 in size and
in early versions of 7.4 it was 64.

In the new version we can send LCP_FRAG_ORD to LQH as before, LQH has an
infinite queue size (it simply stores the LCP_FRAG_ORD on the fragment
record, so there is no limit to the queue size since all fragments can
be in the queue). In addition at master takeover we also support receiving
the same order two or more times. By ensuring that we keep track of that
we already received a LCP_FRAG_ORD on a fragment we can also easily discard
LCP_FRAG_ORDs that we already received.

These features mean that LQH can process a Local Checkpoint without much
interaction with DIH / DIH Master, which enables simplifications at DIH
and DIH Master in later versions. In principle we could send off all
LCP_FRAG_ORDs immediately if we like and more or less turn the LDM
instances into independent LCP execution engines. This is a step in the
direction of more local control in LQH over LCP execution.

When all LCP_FRAG_ORD have been sent, then a special LCP_FRAG_ORD
is sent to all participating LQH nodes. This signal has the flag lastFragmentFlag
set, it doesn't contain any fragment to checkpoint, it is only a flag that
indicates that we've sent the last LCP_FRAG_ORD.

LQH will execute orders to execute LCP on a fragment in the order they are
received. As a fragment is completing its LCP it will generate a new message
LCP_FRAG_REP. This message is broadcasted to all participating DIHs. First
the message is sent from DBLQH to the local DIH. Finally the local DIH will
broadcast it to all participating DIHs.

This new Pausing LCP module is involved here by being able to queue also
LCP_FRAG_REP before they are broadcast to the participating DIHs. They are
queued on the fragment replica records in the local DIH and thus we have
no limits on the queue size.

This allows the DIH Master state to be stabilised as necessary during an
LCP, removing the need in some cases to wait for an LCP to complete before
performing some other activity.

When LQH have executed all the LCP_FRAG_ORDs and have received the
last fragment flag, then the LDM will perform a number of activities to
complete the local checkpoint. These activities is mostly used by the
disk data tables.

After all these activities have completed the LQH will send
LCP_COMPLETE_REP to the local DIH. The local DIH will broadcast it to all
participating DIHs.

When all LQHs have sent all LCP_FRAG_REP and it has also sent the
LCP_COMPLETE_REP, then the LCP is completed. So a node that has seen
LCP_COMPLETE_REP from all nodes participating in the LCP knows that
it has received all the LCP_FRAG_REP for the LCP.

In a master takeover in the old way we could not resend the LCP_FRAG_ORD
to the LQH again. To avoid this we used an extra master takeover
protocol EMPTY_LCP_REQ. This protocol ensures that all LQHs have completed
the queues and that all LCP_FRAG_REPs have been sent to all participating
DIHs and likewise with the LCP_COMPLETE_REP such that the new master has
a precise view of which fragment replicas have completed the LCP execution
so far.

Thus when the master takeover is completed we know that each DIH has received
all the LCP_FRAG_REP for which an LCP_FRAG_ORD have been sent and also
all LCP_COMPLETE_REP that have been produced. This means that we are now
ready to restart the process of sending LCP_FRAG_ORD again.

The problem with this approach is that can consume a very long time to
execute the entire LCP fragment queue in LQH if the queue size increases
(increased from 2 to 64 going from 7.3 to 7.4) and the size of the
fragments also increase. So the master takeover can take a substantial
time in this case.

So the new manner is to allow for the LQH to get LCP_FRAG_ORD and also
the special last LCP_FRAG_ORD several times with the same LCP id and
discard those that it receives for a second time. In this manner we can
simply restart sending the LCP_FRAG_ORD from the beginning. When we are
done with this we can start checking for completion of the LCP in the
normal way.

When the master has sent the last special LCP_FRAG_ORD and these have been
received by the receiving nodes, then the master will actually itself not
do anything more to execute the LCP. The non-master nodes will however send
LCP_COMPLETE_REP to the master node. So this means that a new LCP won't
start until all participating DIHs have completed the processing of the
last LCP.

So effectively taking over as master in this phase doesn't really require
any specific work other than redirecting the LCP_COMPLETE_REP from the
non-masters to the new master. If it has already been sent it should be
seen in the response to the MASTER_LCPREQ from the node. So after
receiving the last MASTER_LCPCONF we have information enough about whether
we need to send more LCP_FRAG_ORDs or not.

We can still optimise the sending of LCP_FRAG_ORD a little bit by avoiding
to send LCP_FRAG_ORD to a fragment replica where we have already received
a LCP_FRAG_REP for it. It would be possible to avoid sending extra
LCP_FRAG_ORDs in various ways, but it doesn't really cost much, LCP_FRAG_ORD
is a small signal and the number of signals sent is limited to the number
of fragment replicas. So this would make sense if we have to support
extremely large clusters and extremely many tables in combination.

As this description shows some interesting places to test master failures
are:
1) Master failure while clearing TC counters (TC_CLOPSIZEREQ).
2) Master failure while distributing COPY_GCIREQ.
3) Master failure while distributing START_LCP_REQ
4) Master failure while processing the LCP and sending LCP_FRAG_ORDs
4.1) Before any LCP_FRAG_REP received
4.2) After receiving many LCP_FRAG_REPs, but not all
4.3) After receiving all LCP_FRAG_REPs, but not all LCP_COMPLETE_REPs
4.4) After receiving all LCP_FRAG_REPs, and all LCP_COMPLETE_REPs.

While distributing above can be interpreted as one test case of before
distributing, one in the middle of distributing and one when all
responses have been received.

It is also important to similarly test PAUSE_LCP_REQ handling in all of
the above states. This can be handled by inserting an ERROR_INSERT that
effectively stops the process to copy meta data at some point and then
setting some variable that triggers the copying of meta data to continue
at a state that we wanted to accomplish.

No comments: