Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MariaDB GTID Issues #2150

Open
bartvw opened this issue Jan 16, 2025 · 18 comments
Open

MariaDB GTID Issues #2150

bartvw opened this issue Jan 16, 2025 · 18 comments

Comments

@bartvw
Copy link
Contributor

bartvw commented Jan 16, 2025

We are receiving this error when trying to restart Maxwell: Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog

However, when we log into the replica server that Maxwell is connecting to, using mysqlbinlog we can clearly see that the binlog file that corresponds with the GTID 0-1073719201-287920301, which is listed in the positions table, is in one of the binlog files. The transaction is also fairly recent (today), so it is not possible that the server doesn't have it (it is keeping transactions of 2 weeks old).

We are puzzled what could cause this, and how we might fix this. We are currently not able to resume Maxwell because of this problem. Any help would be appreciated!

Here's the context in the log output:

  Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  MysqlSavedSchema - Restoring schema id 234 (last modified at Position[BinlogPosition[1-23-4275169097], lastHeartbeat=0])
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  MysqlSavedSchema - beginning to play deltas...
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  MysqlSavedSchema - played 274 deltas in 102ms
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  BinlogConnectorReplicator - Setting initial gtid to: 1-23-4280651845,1-321-4280653879,0-1073719201-287920301
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  BinaryLogClient - 0-1073719201-287920301,1-321-4280653879
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  BinaryLogClient - Connected to 127.0.0.1:3306 at 0-1073719201-287920301,1-321-4280653879 (sid:666, cid:4014261)
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  BinlogConnectorReplicator - Binlog connected.
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 WARN  BinlogConnectorReplicator - communications failure in binlog:
Jan 16 15:17:13 maxwell maxwell[3735692]: com.github.shyiko.mysql.binlog.network.ServerException: Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog >
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1043) [mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:631) [mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:932) [mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at java.lang.Thread.run(Thread.java:840) [?:?]
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  BinlogConnectorReplicator - Binlog disconnected.
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  TaskManager - Stopping 3 tasks
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 ERROR TaskManager - cause:
Jan 16 15:17:13 maxwell maxwell[3735692]: com.github.shyiko.mysql.binlog.network.ServerException: Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog >
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1043) ~[mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:631) ~[mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:932) ~[mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at java.lang.Thread.run(Thread.java:840) [?:?]
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  TaskManager - Stopping: com.zendesk.maxwell.schema.PositionStoreThread@51fddea9
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  TaskManager - Stopping: com.zendesk.maxwell.bootstrap.BootstrapController@5f3a2ae6
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 INFO  TaskManager - Stopping: com.zendesk.maxwell.replication.BinlogConnectorReplicator@5b05b5d9
Jan 16 15:17:13 maxwell maxwell[3735692]: 2025-01-16 15:17:13 ERROR Maxwell - Maxwell couldn't find the requested binlog, exiting...
Jan 16 15:17:13 maxwell maxwell[3735692]: com.github.shyiko.mysql.binlog.network.ServerException: Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog >
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1043) ~[mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:631) ~[mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:932) ~[mysql-binlog-connector-java-0.27.4.jar:0.27.4]
Jan 16 15:17:13 maxwell maxwell[3735692]:         at java.lang.Thread.run(Thread.java:840) ~[?:?]
Jan 16 15:17:14 maxwell maxwell[3735692]: 2025-01-16 15:17:14 INFO  TaskManager - Stopped all tasks
Jan 16 15:17:14 maxwell systemd[1]: maxwells-daemon.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 16 15:17:14 maxwell systemd[1]: maxwells-daemon.service: Failed with result 'exit-code'.
Jan 16 15:17:14 maxwell systemd[1]: maxwells-daemon.service: Consumed 11.995s CPU time.
Jan 16 15:17:20 maxwell systemd[1]: Stopped maxwells-daemon.service - Maxwell's Deamon. 
@osheroff
Copy link
Collaborator

tricky yeah. definitely use SHOW BINLOG EVENTS on the server in question to verify that the GTID in question is present. What I'm not clear on is like:

  • if a replica doesn't contain any knowledge about the other GTIDs listed in the position, could that cause this error? Ie the GTID position is currently:
1-23-4280651845,1-321-4280653879,0-1073719201-287920301

do you think it could be a problem with the other two domains listed maybe?

Just guessing here, sometimers my guesses work out tho :)

@bartvw
Copy link
Contributor Author

bartvw commented Jan 16, 2025

That's what we thought too, so we manually removed those from the positions table (because nothing important happens on those domains anyway) but it still gave us the same error.
Only the one with the 0 domain is relevant to us.

Interestingly, with SHOW BINLOG EVENTS on the server, we can see that domain 1 has a newer position than the values in the positions table, even when we look at the oldest binlog file. So for some reason, Maxwell has missed those events?

@osheroff
Copy link
Collaborator

osheroff commented Jan 16, 2025 via email

@bartvw
Copy link
Contributor Author

bartvw commented Jan 16, 2025

What happens if you subtract (or add) 1 to the grid position?

@osheroff We have tried that too. It didn't help. Same error.

The only way we could get Maxwell to connect was by using the complete set of GTIDs that was initially in the positions table, and manually updating one of them to the position listed in the binlog file at the top (where it says 'GTID list') so that all of the GTIDs are actually 'known' by the server, but that led to another issue:

Somehow Maxwell now found that it had to go all the way back to the very first schema in its schemas table to find a matching GTID. It seems to find a match by looking at the GTID that is the oldest out of the GTID set, which seems incorrect. Do you have any tip on how to get out of that schema mismatch? We were thinking to manually update the schemas table but we are worried it might create an even bigger mess.

Did anything change with configs on restart?

Yes, we manually stopped it, changed the filter configuration and then tried to restart, which is when it failed.

Was this Maxwell running fine from this server before?

Yes, it had been running without problems for a few months.

@bartvw
Copy link
Contributor Author

bartvw commented Jan 16, 2025

After digging through MariaDB's source code, it does seem that MariaDB, when it has knowledge of 2 GTID domains and the client requests only one of them, assumes that the client wants to get all transactions (starting from position 0) from the domain that was omitted in the request. This kind of makes sense if it assumes the client is interested in replicating all domains and doesn't say that it has already seen those transactions.

https://github.com/MariaDB/server/blob/cacaaebf01939d387645fb850ceeec5392496171/sql/sql_repl.cc#L1047

In that case it makes sense that MariaDB says it cannot find that transaction. But it's still a mystery how we got to that state in the first place.

@osheroff
Copy link
Collaborator

where/when did the missing domain in question originate? I presume it has had no updates in the entire time maxwell has been running?

for mariadb, maxwell gets its opening GTID position via the gtid_binlog_state variable. After skimming through https://mariadb.com/kb/en/gtid/#gtid_binlog_state I think that gtid_current_pos might be the superior way to capture the initial position. Does the state of your world agree with that assessment? It could be that or it could be gtid_binlog_pos. I'm having a bit of trouble really parsing out the subtle differences here.

@bartvw
Copy link
Contributor Author

bartvw commented Jan 17, 2025 via email

@bartvw
Copy link
Contributor Author

bartvw commented Jan 17, 2025 via email

@osheroff
Copy link
Collaborator

osheroff commented Jan 17, 2025 via email

@bartvw
Copy link
Contributor Author

bartvw commented Jan 17, 2025

We figured it out.

As stated above, it was indeed required to have Maxwell report a GTID-set that covers all the domain-ids known by MariaDB (in gtid_binlog_state). We got this to work when we manually updated the positions table to at least cover each domain ID and to make sure that for the inactive domain id, it matched the position that was known by MariaDB as the last position for that domain. After this, it was at least connecting without an error.

That led us to the schema mismatch problem though, and there seems to be a bug in Maxwell here. We were puzzled at first, because the GTID-set in the schemas row that we expected to match (the latest one) seemed to be contained within the GTID set in the positions table.

It turns out though, that the MariadbGtidSet.isContainedWithin() is behaving very differently from GtidSet.isContainedWithin() (which is for MySQL).

The MySQL implementation will return true if all the GTIDs in the GTID-set of the schema are equal or before the position of the GTIDs in the same domain in the positions table.

However, for the MariaDB implementation, it will only return true if all the GTIDs in the GTID-set of the schema are before those in the positions table. In case one of them is equal rather than before, it returns false!

It seems that Maxwell assumes that they both behave in the MySQL-way though. And because one domain-id has zero activity, the schema actually had the exact same GTID for the "inactive" domain as the current GTID for that domain ID. In other words, the position for that domain id hadn't changed since that schema change, so it would never be before the current position in that domain.

We got this to work by manually running a write query on the domain 1 server, to make sure MariaDB incremented its current GTID position for domain id 1. After this, Maxwell matched the schema successfully.

So it seems that the schema-matching logic in Maxwell works for MySQL, but not for MariaDB in this particular edge-case. To fix this, I think it would have to add 1 to the current GTID position before using isContainedWithin to compare it with the schema in case it's connecting to MariaDB.

Steps to reproduce:

  1. Connect to a MariaDB slave that is aware of 2 separate domain ids (from different masters), one active one and one inactive one.
  2. Execute an ALTER statement on the active master (add a column before the first column of a table, for example).
  3. Stop maxwell
  4. Restart maxwell

Since one of the GTIDs in the GTID set has incremented but the other one hasn't, it will not match the latest schema at restart and use the old schema, leading to a mismatch.

I realize this issue is different from the one I initially reported here though, so let me know if you would like me to create a separate issue for this, @osheroff .

@osheroff
Copy link
Collaborator

nice debugging work. no need for a separate issue, maybe let's just change the title to "MariaDB GTID issues".

@osheroff
Copy link
Collaborator

also, re:

for mariadb, maxwell gets its opening GTID position via the gtid_binlog_state variable. After skimming through https://mariadb.com/kb/en/gtid/#gtid_binlog_state I think that gtid_current_pos might be the superior way to capture the initial position. Does the state of your world agree with that assessment? It could be that or it could be gtid_binlog_pos. I'm having a bit of trouble really parsing out the subtle differences here.

can you confirm that your server's gtid_current_pos does contain the domain in question?

@bartvw
Copy link
Contributor Author

bartvw commented Jan 17, 2025

can you confirm that your server's gtid_current_pos does contain the domain in question?

It does. But only for one server id, whereas in gtid_binlog_state it contains an extra server id for that same domain. That server used to be the master but is now decomissioned.

The remaining one in that domain is the slave itself. No write queries are being executed on it directly normally, and since it is the only server that uses that domain id, the gtid position for that domain never increases but it is still maintained by the server.

@bartvw bartvw changed the title Maxwell couldn't find the requested binlog / Could not find GTID state requested by slave in any binlog files MariaDB GTID Issues Jan 17, 2025
@osheroff
Copy link
Collaborator

ok. in your estimation are any of the gtid_* variables then safe to capture an initial position from? Or is it some franken-merging of the 3 variables?

@bartvw
Copy link
Contributor Author

bartvw commented Jan 17, 2025

From what I learned from this investigation, I now think that using gtid_binlog_state should actually work well and also seems most logical.

But we still haven't really figured out how it's possible that maxwell had a GTID position for domain 1 and server id 321 in its positions table that was way before the current GTID for that server id reported by MariaDB. So old even that it was not even in the binlog anymore.

I'm still trying to figure out how we could have gotten to that state, but I don't think that switching to gtid_current_pos to get the initial position would really improve it.

@osheroff
Copy link
Collaborator

Steps to reproduce:

ok, I can repro this trivially, doesn't even require all this master/slave hoodoo, really just any situation where one of the sets isn't moving is hopelessly broken.

...

ok, well, this is embarassing:

 # Changelog
+## [0.27.6](https://github.com/osheroff/mysql-binlog-connector-java/compare/0.27.6...0.27.5) - 2023-01-19
+
+- minor performance improvements
+- fix isContainedWithin for mariadb events

What version is maxwell on? yeah. 0.27.4.

Such is the dumbness of a maintainer-in-exile.

@osheroff
Copy link
Collaborator

#2152

@bartvw
Copy link
Contributor Author

bartvw commented Jan 18, 2025

ok, well, this is embarassing:

 # Changelog
+## [0.27.6](https://github.com/osheroff/mysql-binlog-connector-java/compare/0.27.6...0.27.5) - 2023-01-19
+
+- minor performance improvements
+- fix isContainedWithin for mariadb events

What version is maxwell on? yeah. 0.27.4.

Damn, I hadn't checked.
I assumed it was deliberate behavior of the library. It seems like such a crucial function..
Oh well. That's a quick fix then. Thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants