fix: reload on resetting to defaults #159

Merged: 25 commits into linux-system-roles:main, Jul 21, 2023

Conversation

BrennanPaciorek
Collaborator

@BrennanPaciorek BrennanPaciorek commented Jul 18, 2023

Enhancement:
Make resetting to defaults reload instead of restart firewalld

Reason:
Reloading firewalld should be enough to complete the configuration reset, while restarting adds downtime that can be used to open a connection that persists after firewalld has finished restarting; such a connection can be used to bypass firewall rules, since firewalld will not block traffic from active connections.

Result:
Minimal downtime when using `previous: replaced`.

Addresses an issue brought up in #140, where due to the restart on resetting to defaults, the feature may not be suitable for production environments.

@codecov

codecov bot commented Jul 18, 2023

Codecov Report

Patch and project coverage are unchanged.

Comparison: base (e8e8769) 53.62% vs. head (29a1126) 53.62%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #159   +/-   ##
=======================================
  Coverage   53.62%   53.62%           
=======================================
  Files           2        2           
  Lines         800      800           
=======================================
  Hits          429      429           
  Misses        371      371           


@richm
Contributor

richm commented Jul 18, 2023

how can we test this?

@BrennanPaciorek
Collaborator Author

BrennanPaciorek commented Jul 18, 2023

how can we test this?

If you mean verifying that this will not introduce bugs into the feature, we could use firewall-cmd with ansible.builtin.command. This would involve:

  • Using the current implementation of `previous: replaced` to reset the firewall configuration
  • Getting some representation of the active firewall configuration (something like a hash of `nft list ruleset`)
  • Restarting the firewalld service instead of reloading it
  • Getting the representation of the active firewall configuration again
  • Comparing the two stored values

If `nft list ruleset` output is less predictable than I think, testing it may be more difficult than this.
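The steps above can be sketched as a small script. Here `ruleset_dump` is a hypothetical stand-in (so the sketch runs unprivileged); a real test would call `nft list ruleset` and actually perform the reset and restart, which need root and a running firewalld:

```shell
#!/bin/sh
# Sketch of the capture-and-compare flow described above.
# `ruleset_dump` is a hypothetical stand-in for `nft list ruleset`;
# the reset/restart step (commented out) needs root and a live firewalld.
ruleset_dump() {
    printf 'table inet firewalld {\n  chain filter_INPUT {\n  }\n}\n'
}

before=$(ruleset_dump | sha256sum | cut -d' ' -f1)
# ...apply the role with `previous: replaced`, then restart firewalld...
after=$(ruleset_dump | sha256sum | cut -d' ' -f1)

if [ "$before" = "$after" ]; then
    echo "rulesets match"
else
    echo "rulesets differ"
fi
```

With a fixed stand-in the hashes trivially match; the interesting question is whether they still match across a real reset followed by a restart.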

If you mean testing minimal downtime: `systemctl reload firewalld` uses firewalld's reload implementation to reload the configuration from files, while `systemctl restart firewalld` turns the firewall off and back on, leaving the firewall off for a few seconds (usually a bad idea in a production environment).

@BrennanPaciorek
Collaborator Author

  • If nft list ruleset output is less predictable than I think, testing it may be more difficult than this.

It appears that restart and reload produce functionally identical configurations, but the rules tend to be reordered slightly within the same chains, so I'll need to find a more accurate way to compare two configurations.
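One way to tolerate that reordering (my sketch, not necessarily what the PR ended up doing) is to sort the dump lines before hashing. Sorting the whole file ignores chain boundaries, so it can produce false matches, but it demonstrates the idea of normalizing before comparing:

```shell
#!/bin/sh
# Hypothetical order-insensitive comparison of two ruleset dumps.
# Sorting the whole file ignores chain structure, so this is only a
# crude approximation of "functionally identical".
rulesets_match() {
    a=$(sort "$1" | sha256sum | cut -d' ' -f1)
    b=$(sort "$2" | sha256sum | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# Demonstration with stand-in dumps that differ only in rule order:
printf 'rule A\nrule B\n' > /tmp/dump1
printf 'rule B\nrule A\n' > /tmp/dump2
rulesets_match /tmp/dump1 /tmp/dump2 && echo "match"
```

A more accurate comparison would sort within each chain rather than across the whole dump.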

@richm
Contributor

richm commented Jul 18, 2023

how can we test this?

If you mean verifying that this will not introduce bugs into the feature, we could use firewall-cmd with ansible.builtin.command. This would involve:

* Using the current implementation of `previous: replaced` to reset the firewall configuration

* Getting some representation of the active firewall configuration (something like a hash of `nft list ruleset`)

* Restarting the firewalld service instead of reloading it

* Getting the representation of the active firewall configuration again

* Comparing the two stored values
  If `nft list ruleset` output is less predictable than I think, testing it may be more difficult than this.

If you mean testing minimal downtime: `systemctl reload firewalld` uses firewalld's reload implementation to reload the configuration from files, while `systemctl restart firewalld` turns the firewall off and back on, leaving the firewall off for a few seconds (usually a bad idea in a production environment).

I mean the latter - check that the fix ensures minimal downtime - is there a way we can automate this - if not, how do we tell QE what to test?

@BrennanPaciorek
Collaborator Author

BrennanPaciorek commented Jul 18, 2023

is there a way we can automate this

Testing service downtime (or uptime) wouldn't work, because with reload, the service technically does not stop.

An idea would be to change a zone to DROP (or set an ICMP block) and place an interface in that zone, then send a set number of packets (or, more precisely, send them at a set interval) that would be dropped under this configuration. If replies come back, it can be taken to mean the firewall is down; if the packets are lost, it can be taken to mean the firewall is up and dropping them. Whichever run loses the most packets can be understood as having had the least downtime.

Alternatively, one could bring up a service such as a simple HTTP server, block it with the firewall, and run a similar test.
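For the ping-based variant, the metric would come from ping's summary line; a small helper like this (hypothetical, not part of the PR) could extract the packet-loss percentage from a saved ping log:

```shell
#!/bin/sh
# Hypothetical helper: pull the packet-loss percentage out of ping output
# saved to a file, e.g. "500 packets transmitted, 495 received, 1% packet loss".
packet_loss() {
    grep -o '[0-9.]*% packet loss' "$1" | cut -d'%' -f1
}

# Demonstration with a canned summary line instead of a live ping:
printf '500 packets transmitted, 495 received, 1%% packet loss, time 5003ms\n' > /tmp/ping.log
packet_loss /tmp/ping.log   # prints 1
```

Real ping output varies slightly between implementations (some print `0.0% packet loss`), so a real test would need to tolerate those formats as well.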

Does this seem like a valid test?

@richm
Contributor

richm commented Jul 18, 2023

is there a way we can automate this

Testing service downtime (or uptime) wouldn't work, because with reload, the service technically does not stop.

An idea would be to change a zone to DROP (or set an ICMP block) and place an interface in that zone, then send a set number of packets (or, more precisely, send them at a set interval) that would be dropped under this configuration. If replies come back, it can be taken to mean the firewall is down; if the packets are lost, it can be taken to mean the firewall is up and dropping them. Whichever run loses the most packets can be understood as having had the least downtime.

Alternatively, one could bring up a service such as a simple HTTP server, block it with the firewall, and run a similar test.

Does this seem like a valid test?

They both seem like valid tests. By "simple http server", maybe `nc -l -p 80 -o file`? Then we could look at file and see if packets were dropped?

@BrennanPaciorek
Collaborator Author

BrennanPaciorek commented Jul 19, 2023

They both seem like valid tests. By "simple http server", maybe `nc -l -p 80 -o file`? Then we could look at file and see if packets were dropped?

nc would be better for mimicking production environments, while ping is probably a more convenient tool for getting the necessary metrics (packet loss %, estimated downtime).

One issue so far in implementing this in a tox-friendly manner is that firewalld accepts outbound traffic on the loopback interface before evaluating policies (where outbound traffic is blocked). Disabling this rule takes some time (~100 ms), which may result in more apparent downtime than in a production environment (or even a less restricted test environment), where this limitation of needing to use the loopback interface for tests does not exist.

The other issue with these tests is that using the `previous: replaced` feature would undo the firewall rules needed to test the effectiveness of the modified feature.
Would comparing the downtime of `systemctl reload firewalld.service` and `systemctl restart firewalld.service` suffice for testing this? If not, we can use a modified reset script to re-add any required rules before restarting or reloading firewalld.

@richm
Contributor

richm commented Jul 19, 2023

They both seem like valid tests - by "simple http server", maybe nc -l -p 80 -o file? then we could look at file and see if packets were dropped?

nc would be better for mimicking production environments, while ping is probably a more convenient tool for getting the necessary metrics (packet loss %, estimated downtime).

One issue so far in implementing this in a tox-friendly manner is that firewalld accepts outbound traffic on the loopback interface before evaluating policies (where outbound traffic is blocked). Disabling this rule takes some time (~100 ms), which may result in more apparent downtime than in a production environment (or even a less restricted test environment), where this limitation of needing to use the loopback interface for tests does not exist.

The other issue with these tests is that using the `previous: replaced` feature would undo the firewall rules needed to test the effectiveness of the modified feature. Would comparing the downtime of `systemctl reload firewalld.service` and `systemctl restart firewalld.service` suffice for testing this?

Yes, I think that would be fine.

If not, we can use a modified reset script to re-add any required rules before restarting or reloading firewalld.

@BrennanPaciorek
Collaborator Author

BrennanPaciorek commented Jul 20, 2023

Okay, I've implemented the described test in a way that is compatible with the integration tests. I can move this to another file, but the test takes some time and probably shouldn't be run on every PR, since it's not testing any code prone to change, just highlighting the security difference made by the one-line change.

podman network create --subnet 172.16.1.0/24 --gateway 172.16.1.1 --interface-name podmanbr0 podmanbr0
trap "podman network rm podmanbr0" EXIT
imageid=$(podman build -q /tmp)
podman run -d --privileged --net podmanbr0 --ip 172.16.1.2 --name test-firewalld --rm "$imageid" /usr/lib/systemd/systemd || exit 1
Collaborator Author

@BrennanPaciorek BrennanPaciorek Jul 20, 2023


The `--rm` option was already provided; should I move it?

@richm
Contributor

richm commented Jul 20, 2023

[citest]

Comment on lines 35 to 45
ping -c 500 -i 0.01 172.16.1.2 1>/tmp/ping1 2>/dev/null &
pid="$!"
trap "rm -f /tmp/ping1" EXIT
podman exec test-firewalld systemctl restart firewalld.service
wait "$pid"

ping -c 500 -i 0.01 172.16.1.2 1>/tmp/ping2 2>/dev/null &
pid="$!"
trap "rm -f /tmp/ping2" EXIT
podman exec test-firewalld systemctl reload firewalld.service
wait "$pid"
Collaborator Author


The order of these sometimes introduces a failure case where a connection is established by `systemctl restart firewalld.service`. I swapped the two tests locally, but am waiting for the CI tests to run.

@BrennanPaciorek
Collaborator Author

Locally, I can't seem to reproduce the same error that is showing on these CI tests. Maybe it has to do with the order in which the tests are being run?

@richm
Contributor

richm commented Jul 20, 2023

Locally, I can't seem to reproduce the same error that is showing on these CI tests. Maybe it has to do with the order in which the tests are being run?

Yes, that must be what it is

test_ping.sh could carry an open connection if restart is done first; swapping the two fixes this, because there should be no downtime with systemctl reload firewalld
__firewall_service is no longer necessary for files/get_files_checksums.sh, so it was removed
@richm
Contributor

richm commented Jul 21, 2023

[citest]

@BrennanPaciorek
Collaborator Author

Modified the test one more time, as some platforms running the test incorrectly reported packet loss because the timeout I had set was too short.

@richm
Contributor

richm commented Jul 21, 2023

[citest]

@richm richm merged commit cb392b6 into linux-system-roles:main Jul 21, 2023