MariaDB GTID Issues #2150
Comments
tricky yeah. definitely use
do you think it could be a problem with the other two domains listed maybe? Just guessing here, sometimes my guesses work out tho :) |
That's what we thought too, so we manually removed those from the positions table (because nothing important happens on those domains anyway) but it still gave us the same error. Interestingly, with |
Ok, odd. Did anything change with configs on restart? Was this Maxwell running fine from this server before? What happens if you subtract (or add) 1 to the GTID position?
On Jan 16, 2025, at 10:50, Bart van Wissen ***@***.***> wrote:
That's what we thought too, so we manually removed those from the positions table (because nothing important happens on those domains anyway) but it still gave us the same error.
Only the one with the 0 domain is relevant to us.
Interestingly, with SHOW BINLOG EVENTS on the server, we can see that domain 1 has a newer value than the values in the positions table, even when we look at the oldest binlog file. So for some reason, Maxwell has missed those events?
|
@osheroff We have tried that too. It didn't help. Same error. The only way we could get Maxwell to connect was by using the complete set of GTIDs that was initially in the positions table, and manually updating one of them to the position listed in the binlog file at the top (where it says 'GTID list') so that all of the GTIDs are actually 'known' by the server. But that led to another issue: somehow Maxwell now found that it had to go all the way back to the very first schema in its schemas table to find a matching GTID. It seems to find a match by looking at the GTID that is the oldest out of the GTID set, which seems incorrect. Do you have any tips on how to get out of that schema mismatch? We were thinking of manually updating the schemas table, but we are worried it might create an even bigger mess.
Yes, we manually stopped it, changed the filter configuration and then tried to restart, which is when it failed.
Yes, it had been running without problems for a few months. |
After digging through MariaDB's source code, it does seem that MariaDB, when it has knowledge of 2 GTID domains and the client requests only one of them, assumes that the client wants to get all transactions (starting from position 0) from the domain that was omitted in the request. This kind of makes sense if it assumes the client is interested in replicating all domains and doesn't say that it has already seen those transactions. In that case it makes sense that MariaDB says it cannot find that transaction. But it's still a mystery how we got to that state in the first place. |
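To make the situation described above easier to check by hand, here is a minimal sketch that lists which domain ids the server knows about but the client's requested position leaves out. It is not MariaDB or Maxwell code. The helper names are hypothetical, and the sequence numbers for the server-known list are made up for illustration; only the GTID 0-1073719201-287920301 and the domain 1 / server id 321 pair come from this thread.

```python
def parse_gtid_list(text):
    """Parse 'domain-server-seq[,domain-server-seq...]' into {domain: (server_id, seq)}."""
    positions = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        domain, server_id, seq = (int(part) for part in item.split("-"))
        positions[domain] = (server_id, seq)
    return positions


def domains_missing_from_request(requested, server_known):
    """Return the server-known domain ids that the client's position omits."""
    requested_domains = set(parse_gtid_list(requested))
    known_domains = set(parse_gtid_list(server_known))
    return sorted(known_domains - requested_domains)


# The client only sends a position for domain 0 ...
requested = "0-1073719201-287920301"
# ... while the server also knows domain 1 (sequence numbers here are illustrative).
server_known = "0-1073719201-287920400,1-321-5000"

missing = domains_missing_from_request(requested, server_known)
print(missing)
```

Any domain that shows up in `missing` is one the server would, per the reading of the source above, treat as "client wants everything from the start", which can fail once the early binlogs for that domain are gone.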
where/when did the missing domain in question originate? I presume it has had no updates in the entire time maxwell has been running? for mariadb, maxwell gets its opening GTID position via the |
You're probably right that at least one of those missing domains hasn't had
updates for the entire time it has been running, or at least for the time
it retains the binlog.
I think the manual positions adjustment might get us out of this situation
for now (this is a production critical system for us), but I'm hoping you
can advise on the schema sync. I'm pretty certain since the time it has
stopped there have been no schema updates. Does that make it safe to use
"recapture schema"?
…On Fri, Jan 17, 2025, 00:50 Ben Osheroff ***@***.***> wrote:
where/when did the missing domain in question originate? I presume it has
had no updates in the entire time maxwell has been running?
for mariadb, maxwell gets its opening GTID position via the
gtid_binlog_state variable. After skimming through
https://mariadb.com/kb/en/gtid/#gtid_binlog_state I think that
gtid_current_pos might be the superior way to capture the initial
position. Does the state of your world agree with that assessment? It could
be that or it could be gtid_binlog_pos. I'm having a bit of trouble
really parsing out the subtle differences here.
|
I think you're right about those variables by the way. It seems
gtid_binlog_state mostly exists so that the master can find out if the
slave missed anything, even if it is requesting really old server ids which
are no longer used.
There is no need for maxwell to "subscribe" to those when it first
connects.
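A rough sketch of the difference in shape between the two kinds of value: a state string may carry several server ids per domain, while a position carries one GTID per domain. The function below collapses the former into the latter by keeping the highest sequence number per domain. That rule is an assumption made for illustration, not how MariaDB derives its variables internally; the function name, server id 322 and the domain 1 sequence numbers are hypothetical.

```python
def collapse_to_one_gtid_per_domain(state):
    """Keep, for each domain, the entry with the highest sequence number.

    Assumption for illustration only: sequence numbers increase monotonically
    within a domain, so the highest one is the most recent.
    """
    latest = {}
    for item in state.split(","):
        domain, server_id, seq = (int(part) for part in item.strip().split("-"))
        if domain not in latest or seq > latest[domain][1]:
            latest[domain] = (server_id, seq)
    return ",".join(
        f"{domain}-{server_id}-{seq}"
        for domain, (server_id, seq) in sorted(latest.items())
    )


# Domain 1 lists two server ids, one of which is old and no longer in use.
state = "0-1073719201-287920301,1-321-40,1-322-75"
position = collapse_to_one_gtid_per_domain(state)
print(position)
```

The point is only that a client needs one current position per domain when it connects, not an entry for every server id that ever wrote to that domain.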
|
I think recapture schema should be ok. I’d have to take time to re-read the schema matching code to understand deeper.
|
We figured it out. As stated above, it was indeed required to have Maxwell report a GTID-set that covers all the domain-ids known by MariaDB (in
That led us to the schema mismatch problem though, and there seems to be a bug in Maxwell here. We were puzzled at first, because the GTID-set in the
It turns out though, that the
The MySQL implementation will return
However, for the MariaDB implementation, it will only return true if all the GTIDs in the GTID-set of the schema are before those in the positions table. In case one of them is equal rather than before, it returns
It seems that Maxwell assumes that they both behave in the MySQL-way though. And because one domain-id has zero activity, the schema actually had the exact same GTID for the "inactive" domain as the current GTID for that domain ID. In other words, the position for that domain id hadn't changed since that schema change, so it would never be before the current position in that domain.
We got this to work by manually running a write query on the domain 1 server, to make sure MariaDB incremented its current GTID position for domain id 1. After this, Maxwell matched the schema successfully.
So it seems that the schema-matching logic in Maxwell works for MySQL, but not for MariaDB in this particular edge-case. To fix this, I think it would have to add 1 to the current GTID position before using
Steps to reproduce:
Since one of the GTIDs in the GTID set has incremented but the other one hasn't, it will not match the latest schema at restart and use the old schema, leading to a mismatch.
I realize this issue is different from the one I initially reported here though, so let me know if you would like me to create a separate issue for this, @osheroff. |
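The edge case can be illustrated with a small sketch that compares a schema's GTID set against the current position in two ways: strictly before in every domain, and before-or-equal in every domain. This is not Maxwell's or the binlog connector's actual code. The function names are hypothetical, and all sequence numbers except 287920301 are made up for the example.

```python
def parse_gtid_set(text):
    """Parse 'domain-server-seq,...' into {domain: seq} (server id ignored here)."""
    result = {}
    for item in text.split(","):
        domain, _server_id, seq = (int(part) for part in item.strip().split("-"))
        result[domain] = seq
    return result


def all_strictly_before(schema, current):
    """True only if every schema domain is strictly behind the current position."""
    return all(d in current and seq < current[d] for d, seq in schema.items())


def all_before_or_equal(schema, current):
    """True if every schema domain is at or behind the current position."""
    return all(d in current and seq <= current[d] for d, seq in schema.items())


# Domain 0 has moved on since the schema was stored; domain 1 has had no activity.
schema_gtids = parse_gtid_set("0-1073719201-287920000,1-321-40")
current_gtids = parse_gtid_set("0-1073719201-287920301,1-321-40")

strict_match = all_strictly_before(schema_gtids, current_gtids)
lenient_match = all_before_or_equal(schema_gtids, current_gtids)

# A single write on domain 1 bumps its sequence number, as in the workaround above.
after_write = parse_gtid_set("0-1073719201-287920301,1-321-41")
strict_after_write = all_strictly_before(schema_gtids, after_write)

print(strict_match, lenient_match, strict_after_write)
```

With the idle domain, the strict comparison rejects the latest schema while the before-or-equal comparison accepts it, which matches the behaviour reported above and the suggested "add 1 to the current position" fix.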
nice debugging work. no need for a separate issue, maybe let's just change the title to "MariaDB GTID issues". |
also, re:
can you confirm that your server's |
It does. But only for one server id, whereas in The remaining one in that domain is the slave itself. No write queries are being executed on it directly normally, and since it is the only server that uses that domain id, the gtid position for that domain never increases but it is still maintained by the server. |
ok. in your estimation are any of the |
From what I learned from this investigation, I now think that using
But we still haven't really figured out how it's possible that maxwell had a GTID position for domain 1 and server id 321 in its positions table that was way before the current GTID for that server id reported by MariaDB. It was so old that it was no longer even in the binlog. I'm still trying to figure out how we could have gotten to that state, but I don't think that switching to |
ok, I can repro this trivially, doesn't even require all this master/slave hoodoo, really just any situation where one of the sets isn't moving is hopelessly broken. ... ok, well, this is embarrassing:
What version is maxwell on? yeah. 0.27.4. Such is the dumbness of a maintainer-in-exile. |
Damn, I hadn't checked. |
We are receiving this error when trying to restart Maxwell:
Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog
However, when we log into the replica server that Maxwell is connecting to, using mysqlbinlog we can clearly see that the transaction with GTID 0-1073719201-287920301, which is listed in the positions table, is in one of the binlog files. The transaction is also fairly recent (today), so it is not possible that the server doesn't have it (it retains two weeks of transactions).
We are puzzled what could cause this, and how we might fix this. We are currently not able to resume Maxwell because of this problem. Any help would be appreciated!
Here's the context in the log output: