Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safely acquire lock on multiple tables #100

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

rkrage
Copy link
Contributor

@rkrage rkrage commented Jun 5, 2024

This functionality will allow us to improve the safety of operations that take out locks on multiple tables (e.g. adding / removing foreign keys).

I also discovered a bug in the nested lock acquisition logic, so this PR fixes that and adds more tests.

I feel like it's tradition at this point for me to include too many changes in a PR and for @jcoleman to ask me to break it up into smaller commits 😆 (I will try to do that tomorrow done).

# we have already acquired the lock so this check is unnecessary.
# In fact, it could actually cause a deadlock if a blocking query
# was executed shortly after the initial lock acquisition.
break if nested_target_tables
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bug I was referring to in the PR description

# so we need to check for blocking queries on those tables as well
target_tables_for_blocking_transactions = target_tables.flat_map do |target_table|
target_table.partitions(include_sub_partitions: true, include_self: true)
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this here so that we get a fresh list of tables for every iteration of the loop. I think this gives us extra safety if new partitions are added while this method is executing.

begin
method(adjust_timeout_method).call(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do
connection.execute("LOCK #{target_table.fully_qualified_name} IN #{target_table.mode.to_sql} MODE;")
adjust_statement_timeout(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lock timeout applies to each individual relation in the query, while statement timeout applies to the entire query. So, if we were to use lock timeout here, the upper limit for the query timeout would be LOCK_TIMEOUT_SECONDS * <number of tables being locked> which seems incorrect

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a comment to the code explaining why we don't use the (seemingly) obvious lock_timeout GUC.

# where the most recent method call is the last element.
def safely_acquire_lock_for_table_history
@safely_acquire_lock_for_table_history ||= []
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems way cleaner than using a thread variable. Not sure why I did that in the first place...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the argument is that you don't have to worry about how the method is invoked.

But maybe that means we should actually query Postgres for what locks we already currently hold instead of trying to track them ourselves?

@rkrage rkrage force-pushed the multi-table-locks branch 3 times, most recently from 9541a0a to c9fdfaa Compare June 7, 2024 14:33
@rkrage rkrage force-pushed the multi-table-locks branch 8 times, most recently from 4142ddb to 2cfa43d Compare June 20, 2024 12:33
@rkrage rkrage force-pushed the multi-table-locks branch from 2cfa43d to d159713 Compare July 5, 2024 15:10
@rkrage rkrage force-pushed the multi-table-locks branch from d159713 to 0660b33 Compare July 10, 2024 15:34
# we have already acquired the lock so this check is unnecessary.
# In fact, it could actually cause a deadlock if a blocking query
# was executed shortly after the initial lock acquisition.
break if nested_target_tables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to run the loop at all in this case? I.e., why break out of the loop when we could make the whole loop conditional?

@@ -30,8 +30,12 @@ def present?
name.present? && schema.present?
end

def ==(other)
other.is_a?(Relation) && name == other.name && schema == other.schema
def eql?(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this isn't new per se, but I'm wondering again why we don't include mode in equality/hashing (I understand why we include it in a special way in conflicts_with?).

@@ -152,4 +156,26 @@ def valid?
SQL
end
end

class TableCollection < Set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inheriting from Ruby collection classes is usually considered pretty dangerous now (e.g., lots of discussions about some old ActiveSupport classes) because it's easy to end up with an inconsistent API or internally inconsistent data.

For example, I could construct an instance of this class and then add another table to the set and break the assumptions of the #mode method (namely that all modes are the same).

I think it would probably preferable to just maintain and internal ivar of the set instance and present a very limited external API.

begin
method(adjust_timeout_method).call(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do
connection.execute("LOCK #{target_table.fully_qualified_name} IN #{target_table.mode.to_sql} MODE;")
adjust_statement_timeout(PgHaMigrations::LOCK_TIMEOUT_SECONDS) do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a comment to the code explaining why we don't use the (seemingly) obvious lock_timeout GUC.

# where the most recent method call is the last element.
def safely_acquire_lock_for_table_history
@safely_acquire_lock_for_table_history ||= []
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the argument is that you don't have to worry about how the method is invoked.

But maybe that means we should actually query Postgres for what locks we already currently hold instead of trying to track them ourselves?

# The order of the array represents the current call stack,
# where the most recent method call is the last element.
def safely_acquire_lock_for_table_history
@safely_acquire_lock_for_table_history ||= []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we explicitly want to allow nested lock acquisition? If I'm reading the source directly, we didn't explicitly support nested lock acquisition before per se, though we allowed it (as long as it wasn't for the same table).

The nested checks were introduced as a result of #39 to ensure that we didn't upgrade locks, but that was focused on preventing one kind of bug rather than explicitly allowing nested locks. AFAICT it was only incidentally allowed before that.

I can't remember if we need that support for e.g. partitions. But without good justification I'm feeling a bit squeamish about supporting it: it's very easy to get into a bad situation here for multiple reasons:

  1. Nested calls have the same problems vis-a-vis lock timeouts as allowing transactional DDL. You end up with nested timeouts, as well, which breaks our guarantees around timing.
  2. It's easy to order calls in different ways (even with respect to SQL/DML that's executing externally to the migrations!) that could cause deadlocks.

I'm wondering if we should focus instead on safer APIs (like the multiple table locking one here) that target specific use cases directly.

Note: I'm not sure that the concern about deadlocks (2) is actually prevented by this approach either, because I assume Postgres acquires locks on the relations in order of their presentation here...so maybe that's unavoidable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants