Source code interpretation: MySQL 8.0 InnoDB lock-free design log system

Source code interpretation: MySQL 8.0 InnoDB lock-free design log system

about the author

Zhang Yongxiang, the current NetEase Cloud RDS developer, continues to pay attention to the field of MySQL and database operation and maintenance, is good at MySQL operation and maintenance, and knows ID: Yannangui.


An important new feature in MySQL 8.0 is the reconstruction of the Redo Log subsystem. By introducing two new data structures recent_written and recent_closed, the previous two hotspot locks are removed: log_sys_t::mutex and log_sys_t::flush_order_mutex.


This lock-free reconstruction allows different threads to write to redo_log_buffer in parallel, but this brings about the problem that log_buffer is no longer written in the order of LSN growth, and the dirty pages in flush_list are no longer strictly guaranteed The problem of increasing order of LSN.


This article will introduce the refactoring of log_buffer related code in MySQL 8.0, and introduce the solution to the problem introduced by concurrent writing to log_buffer.


1. Overview of MySQL Redo Log System


Redo Log, also known as WAL (Write Ahead Log), is the key to the InnoDB storage engine to achieve transaction persistence.


In the InnoDB storage engine, the transaction execution process is divided into one MTR (Mini TRansaction), and each MTR changes to the data page during the execution process will generate a corresponding log, this log is Redo Log. When the transaction is committed, as long as the Redo Log is guaranteed to be persisted, the persistence of the transaction can be guaranteed.


Because Redo Log writes files sequentially during the persistence process, the cost of persisting Redo Log is much less than that of persisting data pages. Therefore, under normal circumstances, the persistence of data pages lags far behind Redo Log.


Each Redo Log has a corresponding LSN (Log Sequence Number). At the same time, the data page will also record the LSN of the Redo Log that has modified the data page . When the data page is persisted to the disk, this is no longer needed The Redo log before the LSN recorded in the data page. This LSN is called Checkpoint.


When recovering from a failure, you only need to reapply the Redo Log after Checkpoint to get all the data pages that were not persisted before the instance Crash.


The InnoDB storage engine maintains a global Redo Log Buffer in memory to cache changes to Redo Log. When mtr is submitted, it will copy the local log generated during the execution of mtr to the global Redo Log Buffer , and add mtr Data pages modified during execution (called dirty pages) are added to the flush list in a global queue. 


The InnoDB storage engine will place the logs in the Redo Log Buffer to disk according to different strategies , or flush the dirty pages in the flush list and advance the checkpoint.


In the process of dirty page placement and Checkpoint advancement, it is necessary to strictly ensure the order of the Redo log being placed first and then flushing the dirty pages. Before MySQL 8, the InnoDB storage engine strictly guaranteed that the order in which MTR was written to the Redo Log Buffer was increased in LSN The order of, and the dirty pages in the flush list are sorted in ascending LSN order.


When multiple threads write to Redo Log Buffer and flush list concurrently , this constraint is achieved through two global locks log_sys_t::mutex and log_sys_t::flush_order_mutex.


2. the submission process of MTR in MySQL 5.7


In MySQL 5.7, the operations of Redo Log writing into the global Redo Log Buffer and adding dirty pages to the flush list are completed in the commit phase of mtr. The simplified code is:



There is a picture in the official MySQL blog that shows this process well:



3. the lock-free design in MySQL 8


As can be seen from the above code, when multiple MTRs are submitted concurrently, these MTRs are actually completed serially from the local log Copy redo to the global Redo Log Buffer and add Dirty Page to the Flush list. The serial operation here is the bottleneck of the entire MTR submission process. If it can be changed to parallelism, it will definitely improve the efficiency of MTR submission.


However, serialized submission can strictly guarantee the continuity of Redo Log and the increment of Page modified LSN in the flush list. These two constraints make the behavior of flushing Redo Log and dirty pages to disk very simple. Just write the contents of the Redo Log Buffer to the file in order, flush the dirty pages into the table space in the order of the flush list, and advance the Checkpoint.


When MTR is no longer submitted in a serial manner, the following problems will need to be resolved:


  • The serial copy of the local log of MTR to the global Redo Log Buffer can ensure that the log of each MTR is continuous and will not be divided in the Redo Log Buffer . When copying logs in parallel, additional means are needed to ensure that the mtr logs are still continuous after being copied to the Redo Log Buffer . MySQL 8.0 atom using a global variable log_t :: sn copy data before the MTR Redo Log Buffer reservation required in good position, so that the parallel data to the copy Redo Log Buffer when it will not interfere with each other.

  • Since multiple MTRs copy data to the Redo Log Buffer in parallel , some MTR copy is bound to be faster, and some MTR copy is slower. At this time , there may be holes in the Redo Log Buffer , so a method is needed to determine the Redo. What content in the Log Buffer can be written to the file. The new data structure Link_buf introduced in MySQL 8.0 solves this problem.

  • Adding dirty pages to the flush list in parallel will break the monotonic constraint of the LSN corresponding to each data page in the flush list. If the dirty pages are still placed in the order in the flush list, how to determine the location of the Checkpoint?


The following article will discuss the above three issues separately:


1. MTR copy log to Redo Log Buffer without lock


In MySQL 8.0, the submission part of MTR can be represented by the following pseudo code:



Compared with the code in 5.7, the most obvious difference is the removal of log_sys->mutex lock and log_sys->flush_order_mutex lock, and the key to achieve Redo Log lock-free is log_buffer_reserve(*log_sys, len) This function, the key is The code has only two sentences:



As you can see, here is an atomic operation std::atomic<uint64>.fetch_add(log_len) to pre-allocate space in the global Redo Log Buffer before Copy Redo to achieve parallel writing without conflict.


2. Log Buffer hole problem


The pre-allocation method can make multiple MTR non-conflicting copy data to Redo Log Buffer , but because some threads are faster and some threads are slower, it will inevitably cause the Redo Log Buffer hole problem, which makes Redo Log Buffer flush to disk The behavior becomes complicated.



As shown in the figure above, the first and third threads in the Redo Log Buffer have completed the writing of the Redo Log , and the second thread is writing to the Redo Log Buffer . At this time, the Redo of the three threads cannot be combined. Placed. A data structure Link_buf was introduced in MySQL 8.0 to solve this problem.


Link_buf is actually a fixed-length array, and ensures that the update of each element of the array is atomic, and reuses the released space in a circular manner.


Link_buf is used to assist in expressing the use of other data structures. In Link_buf, if the value corresponding to an index position i is non-zero value n, it means that the data structure of the Link_buf auxiliary mark, starting from i and the following n elements have been occupied . At the same time, Link_buf maintains a variable M to represent the current maximum reachable LSN. The structure diagram of Link_buf is as follows:



At the interface level, Link_buf actually defines 3 effective behaviors:



Redo Log Buffer maintains two Link_buf type variables recent_written and recent_closed to maintain the modification information of Redo Log Buffer and flush list.


For redo log buffer, the corresponding relationship between buffer usage and recent_written is shown in the figure below:



The variable buf_ready_for_write_lsn maintains the maximum LSN value that can guarantee no holes, which is the result of recent_written->tail(). The Redo Log before this can be safely persisted to disk.


When the data in the first hole position is successfully written, the mtr that writes the data will update the recent_written internal state as shown in the following figure by calling log.recent_written.add_link(start_lsn, end_lsn):



This part of the code is in the log_buffer_write_completed method of the file.


Each time the recent_written is modified, a separate thread log_writer will be triggered to scan the recent_written backwards and update the value of buf_ready_for_write_lsn (call the recent_written->advance_tail() method). The log_writer thread is actually the thread that performs log writing to the file. The interior of the recent_written variable scanned by the log_writer thread is shown in the following figure:



This solves the void problem caused by concurrent writes to log_buffer by MTR. Through the newly introduced Link_buf type data structure, it is very convenient to know which part of the Redo Log can execute the operation of writing to the disk.


More details about placing orders


In MySQL 8, the redo log placement process is completed by two independent threads, log_writer and log_flusher, the former is responsible for writing the data in the Redo Log Buffer to the OS Cache, and the latter is responsible for non-stop execution of fsync operations The data in the OS Cache is actually written to the disk.


The two threads are synchronized through a global atomic variable log_t::write_lsn, write_lsn represents the largest LSN of the Redo log currently written to the OS Cache.



The placement of the redo log in the log buffer does not need to be concerned by the user thread. The user thread only needs to wait for the notification of log_writer or log_flusher according to the different behaviors defined by innodb_flush_log_at_trx_commit when the transaction is submitted.


After the log_writer thread monitors that recent_written has been modified, the redo log in log_buffer that is larger than log_t::write_lsn and smaller than buf_ready_for_write_lsn is flushed to the OS Cache, and log_t::write_lsn is updated.


The log_flusher thread calls fsync() once after listening to the update of write_lsn and updates flushed_to_disk_lsn. This variable holds the value of the latest fsync to the file.



In this design mode, the user thread is only responsible for writing the log to the log_buffer, and the flushing and placing of the log is completely asynchronous. According to the different behaviors defined by innodb_flush_log_at_trx_commit, the user thread needs to wait for the log to be written into the operating system cache or when the transaction is committed. Disk.


Before 8.0, the user thread triggered fsync or waited for the first submitted thread to execute fsync (Group Commit behavior). In MySQL 8.0, the user thread only needs to wait for the flushed_to_disk_lsn to be large enough.



In 8.0, a fragmented message queue is used to notify user threads. For example, if the user thread needs to wait for flushed_to_disk_lsn >= X, it will join the message queue to which X belongs. Fragmentation can effectively reduce the loss of message synchronization and the number of threads that need to be notified at one time.



In 8.0, the waiting user thread is notified by the background thread log_flush_notifier. The synchronization relationship between the user thread, log_writer, log_flusher, and log_flush_notifier is as follows.



In 8.0, in order to prevent the user thread from being awakened immediately after falling into the waiting state, the user thread will spin to check the waiting condition before waiting. Two new Dynamic Variables are added in 8.0: innodb_log_spin_cpu_abs_lwm and innodb_log_spin_cpu_pct_hwm to control the water level of the CPU when performing a spin operation, so as to avoid the spin operation taking up too much CPU.


3. Flush list concurrency control and check point advancement


Going back to the code submitted by the MTR above, you can see that after writing the Redo Log to the global log buffer, mtr immediately starts the step of adding dirty pages to the flush list. The process is divided into three function calls.



Here is also a Link_Buf type lock-free structure recent_closed to track the flush list concurrent write status.


Assuming that the range of the redo log generated by MTR at the time of submission is [start_lsn, end_lsn], after MTR adds the dirty pages corresponding to these redo to a flush list, it immediately marks the segment from start_lsn to end_lsn in the recent_closed structure. recent_closed also maintains the variable M internally. M corresponds to an LSN, which means that all dirty pages smaller than the LSN are added to the flush list.


Unlike redo log writing, MTR needs to wait for the value of M to be not too different from start_lsn before writing to the flush list before it can be written. This is to control the holes on the flush list within a range. The schematic diagram of this process is as follows:



Before MTR is written to the flush list, it needs to wait for the difference between the value of M and start_lsn to be a constant L. This constant measures the degree of disorder in the flush list, which makes the determination of the checkpoint simple (in the actual code, the value of L Is the internal capacity of recent_closed).


As can be seen from the above code, the behavior of actually adding to the flush list in 8.0 is not completely concurrent, but it is not completely serial in 5.7, but is controlled to parallel writes within a range L.


Since MTR needs to wait for the condition start_lsn-M <L to be established before it can be added to the flush list, on the other hand, for each Page in the flush list, if the corresponding modified LSN is Ln, then it can be concluded that the Page corresponding to Ln-L must be It has been added to the flush list, and it must be before the current Page (because the check condition when the Page is added is Ln-L <M, before M is a continuous LSN without holes).


In other words, if the original policy of flushing dirty pages to disk in the order of flush list is continued, only the advancement of Checkpoint needs to be changed from the LSN corresponding to the original Page to LSN-L.


When actually implemented in MySQL 8.0, Checkpoint advancement is still written in accordance with the LSN corresponding to the Page, but it is executed from Checkpoint-L during Recover. These two methods are actually equivalent.


However, in MySQL 8.0, the Recover phase starts from Checkpoint-L, and it may be encountered that Checkpoint-L is the middle position of a certain Redo instead of the starting position, so some extra work has to be done for some boundary conditions. .


4. summary


For the InnoDB storage engine, the processing of Redo Log is the key to achieving transaction persistence. In MySQL 5.7 and before, through two global locks, the MTR submission process is actually serialized to ensure the correctness of RedoLog and dirty page processing. This makes the MTR submission process unable to give full play to the advantages of multi-core due to lock competition.


The Link_buf data structure introduced in 8.0 turns the entire module into a Lock_free mode, which will inevitably bring performance improvements.