MariaDB has been quite active lately. Version 11.8 was already a significant step forward, and beyond the changes to the LTS schedule and EOL durations, the 12.3 LTS release candidate brings some interesting changes to binlogs, pushing performance and reliability even further. Instead of the traditional flat-file binlogs, they can now be integrated into InnoDB's WAL. This is quite a change and will have real implications for production. Although the feature is brand new and further optimizations will likely follow, I figured we could already benchmark the performance gains and weigh the pros and cons.

Every MariaDB (and MySQL) production DBA has felt the pain of the classic binlog performance tax and the replication management that comes with it. Every committed transaction must cross a two-phase commit (2PC) boundary between InnoDB and the binary log — two separate, sequential fsync() calls per transaction group, regardless of what innodb_flush_log_at_trx_commit is set to. The binlog has always been a flat file, written independently of InnoDB’s Write-Ahead Log (WAL), which means:

  • Two durability paths to synchronize on every commit
  • Complex crash recovery logic to reconcile InnoDB state with binlog state
  • sync_binlog=1 + innodb_flush_log_at_trx_commit=1 = two fsyncs, always
  • Performance degrades steeply under high concurrency, especially on cloud disks where fsync latency is measured in milliseconds

MariaDB 12.3 ships a fundamental architectural answer to this problem, tracked under MDEV-34705 and authored by Kristian Nielsen: the binary log is now stored inside InnoDB tablespaces, using InnoDB’s own Write-Ahead Log for durability. This is not an incremental tuning — it is a redesign of the commit pipeline.

To understand what changed, you need to understand what the classic binlog commit path looks like:

BEGIN transaction
  → InnoDB: write undo and redo log records
  → InnoDB: prepare (XA prepare, writes to redo log)
  → Binlog: write event to binlog file
  → Binlog: fsync() [sync_binlog=1]
  → InnoDB: commit (writes commit record to redo log)
  → InnoDB: fsync() [innodb_flush_log_at_trx_commit=1]
COMMIT

The problem: two separate fsync() calls, two separate file paths, and InnoDB cannot commit without first knowing that the binlog has been durably written. If the server crashes between the binlog write and the InnoDB commit, recovery must scan the binlog to find prepared-but-not-committed transactions. This is expensive both in steady-state (latency) and at recovery time (scan time).

Group commit mitigates this partially — multiple transactions can be flushed together — but the fundamental 2PC overhead remains.

With binlog_storage_engine=innodb, the binlog is no longer a separate flat file. It lives inside the InnoDB storage engine, in InnoDB tablespace files with the .ibb extension, pre-allocated at max_binlog_size (default 1 GB each). The commit path becomes:

BEGIN transaction
  → InnoDB: write undo log + binlog event (single redo log)
  → InnoDB: commit (one fsync if innodb_flush_log_at_trx_commit=1)
COMMIT

The two-phase commit between binlog and engine is eliminated. The binlog and data changes are atomic — they land or fail together in the InnoDB WAL.

This has two dramatic consequences:

1. innodb_flush_log_at_trx_commit=0 or =2 becomes safe. Because crash recovery is now handled entirely by InnoDB, you can run with innodb_flush_log_at_trx_commit=0 (no fsync, OS buffer only) and still have a consistent binlog after a crash. Previously, this setting was dangerous for replication because the binlog and InnoDB could diverge after a crash. With the new model, they cannot diverge — they are the same file.

2. With innodb_flush_log_at_trx_commit=1, one fsync replaces two. The most common production setting now only costs a single WAL fsync. Group commit opportunities also improve because the binlog is no longer a separate bottleneck.

(See the MariaDB documentation for innodb_flush_log_at_trx_commit for the full semantics of this parameter.)
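For reference, opting in comes down to a single server option. A minimal my.cnf sketch (the sizes and the durability setting are illustrative, not prescriptive):

```ini
[mariadbd]
# Store the binlog inside InnoDB tablespaces instead of flat files
binlog_storage_engine = innodb

# Each pre-allocated .ibb binlog file gets this size (1G is the default)
max_binlog_size = 1G

# One WAL fsync per commit group; sync_binlog is silently ignored in this mode
innodb_flush_log_at_trx_commit = 1

# Size the redo log generously: binlog events now land there too
innodb_log_file_size = 4G
```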

New binlog files use the .ibb extension and are pre-allocated. Here is what the binlog directory looks like side by side:

Classic FILE binlog:

$ ls -lah /var/lib/mysql/binlog/
drwx------ 2 mysql mysql 4.0K  #binlog_cache_files
-rw-rw---- 1 mysql mysql 1.9K  mysqld-bin.000001
-rw-rw---- 1 mysql mysql 4.0K  mysqld-bin.000001.idx
-rw-rw---- 1 mysql mysql 1.4K  mysqld-bin.000002
-rw-rw---- 1 mysql mysql    0  mysqld-bin.000002.idx
-rw-rw---- 1 mysql mysql   80  mysqld-bin.index

InnoDB binlog:

$ ls -lah /var/lib/mysql/binlog/
-rw-rw---- 1 mysql mysql 1.0G  binlog-000000.ibb
-rw-rw---- 1 mysql mysql 1.0G  binlog-000001.ibb

The difference is immediately striking: the .ibb files are pre-allocated at max_binlog_size (default 1 GB). There is no .index file, no .idx GTID index files, no .state file. GTID state is periodically written as state records inside the binlog itself, controlled by --innodb-binlog-state-interval (bytes between state records). GTID position recovery scans from the last state record forward.

Another subtle difference visible in the mariadb-binlog output: all InnoDB binlog events show end_log_pos 0 (position tracking is handled by InnoDB internally), whereas FILE binlog events show actual byte offsets (end_log_pos 257, end_log_pos 330, etc.).

Events can span multiple .ibb files, and parts of the same event can be interleaved across files. mariadb-binlog coalesces them transparently. For correct output across files, pass all files at once in order.
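In practice that means naming every file, oldest first. A sketch using the file names from the listing above:

```shell
# Pass all .ibb files at once, in sequence, so that events spanning
# files can be reassembled correctly by mariadb-binlog
mariadb-binlog binlog-000000.ibb binlog-000001.ibb > full_binlog.sql
```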

Backup integration is a significant improvement over the old model: mariadb-backup now includes the binlog in the backup by default, transactionally consistent with the data. The backed-up server is not blocked during the binlog copy. A restored backup can be turned into a replica using CHANGE MASTER ... MASTER_DEMOTE_TO_SLAVE=1 directly — no separate binlog position reconciliation needed.

Operational Impact: GTID, Replica Resync, and Split-Brain Recovery

This is arguably the most impactful day-to-day change for DBAs apart from the raw TPS gains. With innodb as the binlog storage engine, GTID is the only supported replication mode. I still run into a lot of MariaDB 10.11 installations, so some DBAs may not be using GTID daily yet. To help them follow the next part, here is a quick reminder of the GTID capabilities that have been around for 12 years already…

What MASTER_USE_GTID = slave_pos Actually Means

A common point of confusion when reading MariaDB documentation for the first time: in the CHANGE MASTER TO statement, slave_pos looks like a placeholder you are supposed to fill in. It is not. It is a literal enum keyword — one of exactly three accepted values:

MASTER_USE_GTID = no           -- classic file/offset mode, GTID disabled
MASTER_USE_GTID = slave_pos    -- use @@gtid_slave_pos (what this replica last received)
MASTER_USE_GTID = current_pos  -- use @@gtid_current_pos (slave_pos + any local writes)

slave_pos is not a value you provide; it is an instruction telling MariaDB to read the starting replication position from the server’s own @@gtid_slave_pos system variable, which was populated automatically during the backup restore. The actual position is determined by the engine, not by you.

For completeness: MariaDB accepts the keyword unquoted (unlike MySQL’s tendency to quote string enums), which is why it looks like a variable name in documentation examples.

Replica Resync Before and After: The Concrete Difference

This is best understood by comparing the full procedure side by side.

With FILE binlog (classic):

Problem: two independent sources of truth to reconcile.

  1. Take fresh backup with mariadb-backup
  2. Read xtrabackup_binlog_info from the backup:
    → binlog.000042 position 198472910
    BUT: this position may not match what InnoDB actually committed
    (MDEV-21611 — mariabackup does not always update the InnoDB
    binlog position during prepare, causing mismatch)
  3. Restore the backup on the replica
  4. Manually declare:
    CHANGE MASTER TO
    MASTER_LOG_FILE = 'binlog.000042', ← you looked this up
    MASTER_LOG_POS = 198472910; ← you looked this up
  5. START SLAVE;
  6. Watch for errors, validate row counts, hope the position was right

The fragility: steps 2 and 4 require you to provide specific data (a filename string and a byte offset integer). If that data is wrong by even one transaction — which MDEV-21611 demonstrates can happen — the replica either re-applies events it already has, or misses events entirely. This produces silent data drift, not an immediate error.

With InnoDB binlog + GTID:

Problem: eliminated. One source of truth.

  1. Take fresh backup with mariadb-backup
    → .ibb binlog files included automatically, consistent with data
  2. Restore the backup on the replica
    → @@gtid_slave_pos populated from InnoDB state automatically
  3. Declare the replica:
    CHANGE MASTER TO
    MASTER_HOST = 'primary',
    MASTER_USER = 'replicator',
    MASTER_PASSWORD = 'replpass',
    MASTER_USE_GTID = slave_pos, ← instruction, not a value
    MASTER_DEMOTE_TO_SLAVE = 1; ← folds local writes into GTID pos
  4. START SLAVE;
    → replica announces its GTID set to the primary
    → primary streams from that point forward
    → done

You never touch a file name or a byte offset. The position is embedded in the InnoDB state from the moment the backup was consistent.

Split-Brain Recovery

Split-brain is where this matters most. Consider a scenario where the primary and replica became isolated:

Primary:  committed GTIDs 0-1-1 through 0-1-10042
Replica:  applied GTIDs   0-1-1 through 0-1-10039  (was lagging at split)
          then received 3 rogue writes (briefly promoted to primary)
          now has:  0-1-10040, 0-1-10041, 0-1-10042  (divergent, different transactions)

With FILE binlog, reconciling this required manually scanning both binlog files to find the exact divergence point — line by line, event by event. Many teams simply wiped the replica and re-provisioned from scratch to be safe. Even experienced DBAs would spend hours on this.

With GTID, the divergence is immediately visible and unambiguous:

-- On the drifted replica:
SELECT @@gtid_slave_pos;
-- Returns: 0-1-10042  (but these last 3 GTIDs are rogue, divergent from primary)

-- On the primary:
SELECT @@gtid_binlog_pos;
-- Returns: 0-1-10042  (completely different transactions at 10040-10042)

To resync, declare the last known good position explicitly and restart:

STOP SLAVE;
RESET SLAVE;
SET GLOBAL gtid_slave_pos = '0-1-10039';  -- last confirmed good GTID from primary
START SLAVE;
-- MariaDB streams 10040-10042 from the primary, no manual intervention

With gtid_strict_mode = ON (recommended), MariaDB will refuse to apply a GTID it has already seen with different content — you get an explicit error rather than silent data corruption. The divergence surface is precisely identified, not guessed at.
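Enabling it is a one-liner; gtid_strict_mode is a standard MariaDB system variable (persist it in my.cnf as well):

```sql
-- Refuse any GTID that conflicts with what this server already logged,
-- turning silent divergence into an immediate replication error
SET GLOBAL gtid_strict_mode = ON;
```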

MASTER_DEMOTE_TO_SLAVE=1 — The Safety Net for Former Primaries

When resetting a server that was briefly acting as a primary (during split-brain, or when repurposing a former primary as a replica), its @@gtid_binlog_pos includes local writes that the current primary does not know about. Without reconciliation, MASTER_USE_GTID = slave_pos would start from @@gtid_slave_pos, which might be behind the local writes — creating a gap.

MASTER_DEMOTE_TO_SLAVE = 1 handles this automatically: it computes the union of @@gtid_slave_pos and @@gtid_binlog_pos, sets that as the new gtid_slave_pos, and proceeds. In plain English: “I may have had local writes — include them in my starting position declaration so I don’t replay things I already have.”

Without MASTER_DEMOTE_TO_SLAVE=1:
— Risk: potential GTID gap or conflict if server had local writes

With MASTER_DEMOTE_TO_SLAVE=1:
— MariaDB automatically: SET gtid_slave_pos = gtid_slave_pos ∪ gtid_binlog_pos
— Then connects to primary from that unified position
— Safe regardless of the server’s previous role

In summary, the GTID + InnoDB binlog combination reduces your split-brain/resync runbook from a multi-step forensic procedure with manual position arithmetic to three commands:

STOP SLAVE;
SET GLOBAL gtid_slave_pos = '<last_known_good_gtid>';
CHANGE MASTER TO MASTER_USE_GTID = slave_pos, MASTER_DEMOTE_TO_SLAVE = 1;
START SLAVE;

For teams (like mine) that have been burned by binlog position mismatches in production, this operational simplification was already a game changer; with binlogs now stored in the InnoDB WAL, the procedure also becomes position-agnostic. Creating a replica or automating a full resync gets faster and simpler, because mariadb-backup captures data and binlog atomically and restoring sets @@gtid_slave_pos automatically. No parsing, no conditionals, no position arithmetic: the same idempotent script works whether you resync after lag, a crash, or a (now even less likely) split-brain.


Galera Cluster and InnoDB Cluster: What Changes (and What Doesn’t)

Why Galera Is Blocked and Why It’s Non-Trivial to Fix

Understanding the block requires understanding what Galera actually does at commit time — and crucially, what it does not use the binlog for between cluster nodes.

Galera nodes do not replicate by shipping binlog events to each other. The replication mechanism is the wsrep write set protocol: at commit time, the node extracts the changed rows as a write set, broadcasts it to all cluster nodes, and all nodes run a certification protocol (conflict detection) before any node commits. The commit is synchronous across all nodes.

Galera write path:
  Client writes to Node A
  → InnoDB: prepare transaction
  → wsrep: extract write set from transaction
  → wsrep: broadcast write set to all nodes
  → wsrep: certification round (all nodes vote — conflicts?)
  → wsrep: apply write set on all nodes simultaneously
  → InnoDB: commit on all nodes
  → binlog written AFTER commit (for async slaves attached to the cluster)

The binlog in a Galera cluster has exactly one purpose: feeding async replicas that hang off one of the cluster nodes for analytics, backup, or reporting. The Galera nodes themselves never read each other’s binlog.

The problem with InnoDB binlog is the certification hook. The wsrep plugin currently intercepts between InnoDB prepare and InnoDB commit — it sits at the 2PC boundary that InnoDB binlog eliminates. With the new architecture, InnoDB prepare and commit are atomic — there is no pause point where wsrep can insert its certification round. The wsrep plugin needs to be re-integrated at a different layer of the storage engine API, which is a significant engineering effort. The GALERA26 label on MDEV-34705 acknowledges this as planned future work, but it is not present in 12.3.

What wsrep needs:
InnoDB prepare → [wsrep certification] → InnoDB commit

What InnoDB binlog does:
InnoDB prepare + commit = atomic, no pause point
→ Incompatible. wsrep cannot insert itself.

Impact on Galera Async Slaves

Even if you are not running Galera nodes with InnoDB binlog, you may be running async replicas attached to a Galera node — a common pattern for offloading analytics queries or running mariadb-backup from a dedicated replica. These are unaffected as long as the Galera node itself continues to use binlog_storage_engine=FILE. The async slave receives standard .bin events from the Galera node and nothing in its pipeline changes.

The restriction is: the Galera node acting as the async replication source must stay on FILE binlog. You cannot mix binlog storage engines within the same replication chain in a meaningful way — the source determines the format.

Production Decision Matrix by Topology

| Topology | InnoDB binlog viable? | Notes |
|---|---|---|
| Standalone primary + async replicas (GTID) | ✅ Yes | Primary use case, fully supported |
| Primary + async replicas (file/offset) | ❌ Not yet | Migrate to GTID first |
| Semi-sync AFTER_COMMIT + async replicas | ✅ Yes | Supported in 12.3.1; AFTER_SYNC is architecturally incompatible |
| Semi-sync AFTER_SYNC | ❌ Never | Architecturally incompatible with 2PC removal |
| Galera cluster nodes | ❌ Not yet | wsrep hook incompatible, planned for future |
| Async slave off a Galera node | ✅ Unaffected | Galera node uses FILE binlog as source |
| MaxScale read/write split | 🔶 Test required | Replication protocol unchanged, failover scripts may need GTID update |
| MySQL InnoDB Cluster | N/A | MySQL-only, not applicable to MariaDB |

Known Limitations and Open Issues (12.3.1 RC)

This feature is marked as opt-in for good reason. The following limitations are documented and/or reported in the JIRA tracker. Read these before recommending it for production.

Documented Limitations (by design, 12.3.1)

| Limitation | Detail |
|---|---|
| Mutual exclusivity | Enabling the new binlog ignores existing .bin files. No in-place migration path from the old to the new binlog format. |
| GTID mandatory | Replicas must use GTID-based replication (CHANGE MASTER ... MASTER_USE_GTID=slave_pos). Filename/offset positions are unavailable (MASTER_POS_WAIT() does not work). |
| Semi-sync: AFTER_SYNC not supported | AFTER_SYNC semi-sync cannot work because 2PC no longer exists. AFTER_COMMIT semi-sync is supported. MDEV-38190 tracks further semi-sync enhancements. |
| sync_binlog ignored | The option is accepted but silently ignored. Durability is controlled by innodb_flush_log_at_trx_commit only. |
| Old filename/offset API gone | BINLOG_GTID_POS(), MASTER_POS_WAIT() unavailable. Use MASTER_GTID_WAIT(). |
| Third-party binlog readers | Tools that read .bin files directly (e.g., older versions of Debezium, Maxwell) will not understand the .ibb format. Tools that connect to the server via the replication protocol may work unmodified. |
| Galera not supported (12.3.1) | The GALERA26 label on MDEV-34705 indicates future support, but wsrep-based clusters cannot use this feature in the current RC. |

Open JIRA Issues (as of March 2026)

MDEV-38462 — InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size with new binlog. Severity: likely Major. The new binlog writes into the InnoDB redo log, which means large transactions or high write rates can exhaust innodb_log_file_size more rapidly than before. If the redo log is undersized for the binlog load, crash recovery may fail. Lab implication: always set innodb_log_file_size generously (≥2 GB) when enabling the new binlog.

MDEV-38304 — InnoDB Binlog to be Stored in Archived Redo Log A follow-on feature request for archiving binlog data within the InnoDB redo log archiving infrastructure. Not a blocker, but signals that the archival story is incomplete.

MDEV-34705 (parent, closed/fixed in 12.3.1, resolved 2026-02-08) — The main implementation MDEV. Review the sub-tasks and linked issues for outstanding items, notably MDEV-38190 (semi-sync enhancements) and MDEV-38307 (wsrep/Galera feasibility study).

Operational Cautions for Your Lab

  • Do not test innodb_flush_log_at_trx_commit=0 on a shared Azure disk — results will be misleading due to OS-level write caching. Use a dedicated Premium SSD P30+ or Ultra Disk.
  • The sync_binlog parameter being silently ignored is a foot-gun during migration: add it to your monitoring and alerting if teams use it as a proxy for “durable binlog.”
  • Pre-allocated 1 GB .ibb files mean disk space consumption “looks” immediately high even under low load.

The Benchmark Design

What We Are Measuring

Three performance dimensions across two configurations:

CONFIG A: binlog_storage_engine=FILE (classic)
CONFIG B: binlog_storage_engine=innodb (new)

Each dimension tested under three durability profiles:

D1: innodb_flush_log_at_trx_commit=1 + sync_binlog=1 (full durability, “gold”)
D2: innodb_flush_log_at_trx_commit=2 + sync_binlog=0 (OS-buffered WAL)
D3: innodb_flush_log_at_trx_commit=0 (no fsync, maximum throughput)

Note: D2/D3 on CONFIG B is where the architectural gain is most visible. D3 on CONFIG A is effectively unsafe for replication; we benchmark it anyway as a theoretical ceiling.

Metrics

| Metric | Tool | Notes |
|---|---|---|
| Write TPS | sysbench oltp_write_only | Primary headline number |
| Commit latency p50/p95/p99 | sysbench | Shows tail latency behavior |
| Large transaction commit time | custom SQL | Single 100MB INSERT |
| Crash recovery time | kill -9 + timer | Wall clock to [Ready for connections] |
| Replica lag under load | Seconds_Behind_Master | 1 primary → 1 replica |
| InnoDB redo log write bytes | SHOW ENGINE INNODB STATUS | Redo amplification |
| fsync count (fio/strace proxy) | iostat + custom | Actual IO calls |
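For the fsync counts, sampling the server process directly works as a cross-check; a sketch of the strace side (assumes a single mariadbd process and sufficient privileges):

```shell
# Summarize fsync/fdatasync calls issued by mariadbd during a run
strace -f -c -e trace=fsync,fdatasync -p "$(pidof mariadbd)"
```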

Concurrency Levels

  • 1 thread (establish single-threaded overhead)
  • 8 threads (typical application concurrency)
  • 32 threads (high concurrency, group commit regime)
  • 64 threads (stress, disk saturation test)

Azure Lab Setup

| Component | Primary | Replica |
|---|---|---|
| VM SKU | Standard_E4ds_v5 (4 vCPU, 32 GB RAM) | Standard_E2ds_v5 (2 vCPU, 16 GB RAM) |
| OS Disk | Premium SSD (128 GB) | Premium SSD (128 GB) |
| Data Disk | P30 Premium SSD 500 GB (5,000 IOPS) | P30 Premium SSD 500 GB (5,000 IOPS) |
| OS | Ubuntu 24.04 LTS | Ubuntu 24.04 LTS |
| MariaDB | 12.3.1 RC (binary tarball) | 12.3.1 RC (binary tarball) |

Host caching was set to None on both P30 data disks to ensure we measure actual I/O latency, not Azure’s host-level read cache. The data directory, InnoDB redo log, and binlog all reside on the P30 data disk (/data/mysql).

Each profile was tested with 2 runs per thread count (averaged). The datadir was fully reinitialized between profiles to prevent cross-contamination.


My Lab Results

The following results were obtained on Azure Standard_E4ds_v5 VMs (4 vCPU, 32 GB RAM) with P30 Premium SSD data disks (500 GB, 5,000 IOPS, host caching disabled). MariaDB 12.3.1 RC installed from binary tarball.

Lab parameters: sysbench oltp_write_only, 4 tables × 1M rows, 60s per run (15s warmup), 2 runs per data point (averaged), 20 GB InnoDB buffer pool, 4 GB redo log.

Write TPS — The D1 Headline: 2.4–3.3× Faster

| Threads | FILE (D1) TPS | InnoDB (D1) TPS | Speedup | FILE p99 | InnoDB p99 |
|---|---|---|---|---|---|
| 1 | 129 | 307 | 2.4× | 16.1 ms | 8.1 ms |
| 8 | 444 | 1,073 | 2.4× | 37.3 ms | 15.1 ms |
| 32 | 1,392 | 4,625 | 3.3× | 43.8 ms | 34.1 ms |
| 64 | 2,564 | 8,279 | 3.2× | 72.1 ms | 29.4 ms |

Even at single-thread concurrency, InnoDB binlog is 2.4× faster. At 64 threads, the gap widens to 3.2× — the elimination of the binlog fsync becomes more valuable as group commit contention increases.

The p99 latency tells the story most clearly: FILE binlog latency rises steeply with concurrency (16 ms → 72 ms), while InnoDB binlog stays dramatically lower (8 ms → 29 ms at 64 threads). The second fsync creates a latency cliff under contention that InnoDB binlog simply does not have.

D2/D3 — Parity When fsyncs Are Already Gone

| Threads | FILE (D2) TPS | InnoDB (D2) TPS | FILE (D3) TPS | InnoDB (D3) TPS |
|---|---|---|---|---|
| 1 | 2,950 | 2,861 | 2,943 | 2,985 |
| 8 | 9,961 | 9,924 | 10,027 | 11,681 |
| 32 | 10,818 | 11,091 | 11,202 | 10,973 |
| 64 | 10,815 | 10,462 | 11,121 | 10,751 |

At D2 (innodb_flush_log_at_trx_commit=2, OS-buffered) and D3 (=0, no fsync), both binlog engines produce nearly identical throughput: ~10,000–11,000 TPS at saturation. The bottleneck shifts from binlog sync overhead to InnoDB page writes and the P30 disk’s 5,000 IOPS ceiling.

This is the control experiment that validates the D1 results: the InnoDB binlog advantage comes specifically from eliminating the sync_binlog=1 fsync. When fsyncs are already absent, there is nothing to eliminate.

Crash Recovery — Same Speed, Different Architecture

| Profile | Recovery time | Pages to recover | Notes |
|---|---|---|---|
| file_d1 | 34.1s | 54,820 | 5+ XA prepared transactions, 2PC reconciliation |
| innodb_d1 | 39.2s | 41,688 | Single-engine recovery, no XA |
| file_d2 | 29.1s | 45,430 | XA prepared, binlog sync needed |
| innodb_d2 | 27.1s | 43,219 | Clean single-path recovery |
| file_d3 | 33.1s | 44,309 | Binlog-InnoDB sync at recovery |
| innodb_d3 | 39.2s | 44,846 | Clean single-path recovery |

All six profiles: zero data loss (1,000,000 rows before crash = 1,000,000 rows after recovery).

Recovery times are comparable (~27–39 seconds). The real difference is architectural:

FILE binlog recovery log shows the classic 2PC reconciliation:

InnoDB: Starting crash recovery from checkpoint LSN=613360854
InnoDB: To recover: 54820 pages
InnoDB: Transaction 1418253 was in the XA prepared state.
InnoDB: Transaction 1418254 was in the XA prepared state.
...

InnoDB binlog recovery log shows single-path recovery — no XA coordination:

InnoDB: Starting crash recovery from checkpoint LSN=6438545394
InnoDB: To recover: 41688 pages
InnoDB: Continuing binlog number 2 from position 583656195.
mariadbd: ready for connections.

The critical operational difference: with FILE binlog at D2/D3 (sync_binlog=0), the binlog can lose events on crash while InnoDB has already committed them — creating silent primary-replica divergence. With InnoDB binlog, this class of failure is architecturally impossible because the binlog is inside the redo log.

Large Transaction Commit — Redo Amplification Is Real

| Metric | FILE (D1) | InnoDB (D1) |
|---|---|---|
| Single 100K-row UPDATE commit time | 1,793 ms | 1,564 ms |
| Redo amplification | 1.03× | 1.98× |
| Raw data modified | 104 MB | 104 MB |
| Redo log written | 107 MB | 206 MB |
| Per-iteration (10K rows) redo | ~5.6 MB | ~10.8 MB |

InnoDB binlog writes ~2× the redo log because every binlog event is also an InnoDB redo log record. Despite this, the commit time is actually slightly faster (1.56s vs 1.79s) because there is no separate binlog fsync to wait for.

For OLTP workloads with small transactions, the redo amplification is negligible — the fsync elimination dominates. For bulk ETL operations producing large redo volumes, this overhead is measurable and should be factored into innodb_log_file_size sizing.

Replication Lag — Same Lag, 56% More Throughput

We measured replication lag with a single async replica (E2ds_v5, 2 vCPU) under sustained 16-thread write load for 180 seconds, monitoring Seconds_Behind_Master every 10 seconds.

| Metric | FILE (D1) | InnoDB (D1) |
|---|---|---|
| Primary TPS during load | 1,276 | 1,994 |
| Lag at 60s | 58s | 89s |
| Lag at 120s | 121s | 148s |
| Lag at 180s (load stops) | 185s | 206s |
| Lag at 300s | 313s | 321s |
| Lag growth rate | ~1.0 s/s | ~1.1 s/s |

The lag growth rate is nearly identical despite the primary pushing 56% more transactions with InnoDB binlog. The bottleneck is the replica’s single-threaded SQL apply — it falls behind at approximately the same rate regardless of how fast the primary commits.

elapsed_s | file_d1 lag | innodb_d1 lag
       10 |          6s |           39s
       60 |         58s |           89s
      120 |        121s |          148s
      180 |        185s |          206s
      240 |        249s |          265s
      300 |        313s |          321s

The implication: InnoDB binlog does not make replication lag worse — it makes the primary faster while the replica remains the limiting factor. Enabling slave_parallel_threads on the replica would likely amplify the advantage further, but that is a test for another day.


Analysis and Recommendations

What the Results Mean for Production

The 2.4–3.3× TPS improvement at D1 durability on Azure P30 disks (5,000 IOPS) is not a synthetic artifact — it represents the removal of a real fsync() call from the commit path. On higher-IOPS storage (Azure Ultra Disk at 10K+ IOPS, NVMe at 100K+ IOPS), the absolute TPS numbers will be higher, but the relative improvement should hold because the 2PC fsync overhead is a fixed latency cost per commit group.

The p99 latency curve tells the production story most clearly: FILE binlog p99 climbs from 16 ms to 72 ms as threads increase from 1 to 64, while InnoDB binlog stays at 8 ms to 29 ms. Tail latency stability under increasing concurrency is exactly what application teams need.

The D2/D3 parity results are equally important: they prove the improvement comes specifically from eliminating sync_binlog=1, not from some general overhead reduction. When fsyncs are already absent, InnoDB binlog adds no benefit — but crucially, it adds no penalty either.

When NOT to Use InnoDB Binlog

Despite the compelling performance numbers:

  1. Large batch/ETL workloads: The 2× redo amplification on large transactions increases redo log pressure. For workloads dominated by bulk INSERTs or mass UPDATEs, size innodb_log_file_size accordingly.
  2. Galera clusters: Not supported in 12.3.1 RC. The wsrep certification hook is incompatible with the new commit path.
  3. Semi-sync AFTER_SYNC: Architecturally incompatible — the 2PC boundary that AFTER_SYNC hooks into no longer exists. AFTER_COMMIT is supported and is the only available wait point with InnoDB binlog.
  4. Third-party binlog readers: Tools that read .bin files directly will not understand .ibb format. Tools using the replication protocol are unaffected.
  5. Disk-constrained environments: Pre-allocated 1 GB .ibb files mean immediate disk usage even under low load.

innodb_log_file_size Recommendation

With binlog_storage_engine=innodb, the redo log absorbs all binlog writes. Our lab used innodb_log_file_size=4G without encountering MDEV-38462. For production with high write throughput, start with 4 GB and monitor SHOW ENGINE INNODB STATUS for a checkpoint age approaching the log file size. If crash recovery reports insufficient innodb_log_file_size, increase it (to 8 GB, for example) and keep monitoring. The 2× redo amplification we measured means your redo log needs to be roughly twice as large as it would be without the InnoDB binlog.
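Checkpoint age can be watched from the LOG section of the status output, and the redo log can be resized online (innodb_log_file_size has been dynamic since MariaDB 10.9); a sketch:

```sql
-- LOG section: compare "Log sequence number" minus "Last checkpoint at"
-- (the checkpoint age) against the configured log capacity
SHOW ENGINE INNODB STATUS\G

-- Grow the redo log without a restart if checkpoint age keeps
-- approaching capacity
SET GLOBAL innodb_log_file_size = 8589934592;  -- 8 GB
```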

Bottom Line

For GTID-based async replication with standard tooling, binlog_storage_engine=innodb delivers a 2.4–3.3× TPS improvement at full durability, with 2–2.5× lower p99 latency under concurrency. The migration requires GTID (mandatory), gives up file/offset positioning (use MASTER_GTID_WAIT() instead), and needs a generous innodb_log_file_size. At relaxed durability (D2/D3), throughput is identical, but crash consistency is now guaranteed, which was never the case with FILE binlog.

More importantly, this changes the architecture decision matrix. Binlog-based replication has always carried a performance tax that pushed some teams toward Galera clusters not because they needed synchronous multi-master or local HA, but simply to avoid that penalty while maintaining an offsite copy for DR. I often meet teams that conflate HA and DR requirements and end up implementing a 3-node Galera cluster spread across 3 sites, paying the cross-site certification latency on every single commit for the sake of a disaster recovery copy that does not need to be synchronous. That trade-off no longer exists. With InnoDB binlog, a simple primary + async replica gives you DR capability at effectively the same write throughput as a standalone server, while keeping your Galera cluster (if you need one) local, where certification latency stays low.

For OLTP workloads, this is the single largest performance improvement available in MariaDB 12.3, and it comes with better crash consistency, simpler automation, and a cleaner separation between HA and DR as a bonus.


References