Add new ignored interfaces to NodeNetworkInterfaceDown Alert #3279
Pull Request Test Coverage Report for Build 13305585006 (Coveralls)
	promv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/utils/ptr"
)

var ignoredInterfacesForNetworkDown = []string{
Which interfaces are we actually trying to monitor?
I logged into an existing node, and I see the following interfaces:
8: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 0a:58:0a:87:00:02 brd ff:ff:ff:ff:ff:ff
    inet 10.135.0.2/23 brd 10.135.1.255 scope global ovn-k8s-mp0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe87:2/64 scope link
       valid_lft forever preferred_lft forever
777: 9f17207c3cc0f46@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP group default
    link/ether 82:e5:8e:03:e1:ce brd ff:ff:ff:ff:ff:ff link-netns 5bf0ae73-db72-4bfd-b646-b231f799e15d
    inet6 fe80::80e5:8eff:fe03:e1ce/64 scope link
       valid_lft forever preferred_lft forever
9: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether ce:98:9b:60:6c:df brd ff:ff:ff:ff:ff:ff
    inet6 fe80::cc98:9bff:fe60:6cdf/64 scope link
       valid_lft forever preferred_lft forever
11: 493c31b001d3731@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP group default
    link/ether 7a:0d:d9:56:78:5f brd ff:ff:ff:ff:ff:ff link-netns b1d150aa-91dc-4741-872f-1f71dab1a2da
    inet6 fe80::780d:d9ff:fe56:785f/64 scope link
       valid_lft forever preferred_lft forever
12: 3d50190a313ba21@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP group default
    link/ether 52:77:4d:f5:8c:4c brd ff:ff:ff:ff:ff:ff link-netns 28e514e3-3b79-495a-b5d6-483d779203c5
    inet6 fe80::5077:4dff:fef5:8c4c/64 scope link
       valid_lft forever preferred_lft forever
And many more like 493c31b001d3731 (just an example).
Are we interested in raising alerts when these go down? These are the host side of the veths connecting the pods to OVS.
Also, regarding ovn-k8s-mp0: would we want to know about this? This is the default cluster network management port, which is used for some OVN-K features.
If we think about primary UDN, we will have one of these per primary UDN.
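To make the question concrete, here is a minimal Go sketch that checks the interface names from the listing above against the ignore list. It rests on several assumptions: the list contents are taken from the expression in the PR description (the actual slice in the PR may differ), `isIgnored` is a hypothetical helper, the `device` label is assumed to carry the bare interface name without the `@if2` suffix, and the explicit `^(?:...)$` anchoring imitates the fact that Prometheus label matchers are fully anchored.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ignoredInterfaces mirrors the pattern quoted in the PR description.
// Assumption: the real slice in the PR may contain different entries.
var ignoredInterfaces = []string{"lo", "tunbr", "veth.+", "ovs-system", "genev_sys.+", "br-int"}

// ignoredRe anchors the joined pattern at both ends, which is how
// Prometheus evaluates device!~"<pattern>" label matchers.
var ignoredRe = regexp.MustCompile("^(?:" + strings.Join(ignoredInterfaces, "|") + ")$")

// isIgnored is a hypothetical helper: it reports whether a device name
// would be excluded from the alert by the ignore list.
func isIgnored(device string) bool {
	return ignoredRe.MatchString(device)
}

func main() {
	// Interface names copied from the node listing in this review comment.
	devices := []string{"ovn-k8s-mp0", "9f17207c3cc0f46", "genev_sys_6081", "493c31b001d3731", "3d50190a313ba21"}
	for _, dev := range devices {
		fmt.Printf("%-16s ignored=%t\n", dev, isIgnored(dev))
	}
}
```

Under these assumptions only genev_sys_6081 is ignored; ovn-k8s-mp0 and the host-side pod veths with hexadecimal names would still count toward the alert.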
Do you have any idea, for example, why the NodeNetworkInterfaceFlapping alert, changes(node_network_up{device!~"veth.+|tunbr",job="node-exporter"}[2m]) > 2, only ignores "veth.+" and "tunbr"?
Totally clueless.
I guess they, whoever they are, don't run in OpenShift CI, whose monitor tests would fail on this.
The OpenShift monitoring team also provides the flapping alert for OpenShift customers.
hco-e2e-operator-sdk-sno-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-aws, ci/prow/hco-e2e-operator-sdk-azure, ci/prow/hco-e2e-operator-sdk-sno-azure In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure In response to this:
hco-e2e-upgrade-prev-operator-sdk-sno-azure lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws In response to this:
hco-e2e-consecutive-operator-sdk-upgrades-azure lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-aws, ci/prow/hco-e2e-upgrade-operator-sdk-aws In response to this:
hco-e2e-kv-smoke-gcp lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure In response to this:
@@ -36,7 +48,7 @@ func clusterAlerts() []promv1.Rule {
 		},
 		{
 			Alert: "NodeNetworkInterfaceDown",
-			Expr: intstr.FromString("count by (instance) (node_network_up{device!~\"veth.+|tunbr\"} == 0) > 0"),
+			Expr: intstr.FromString(fmt.Sprintf("count by (instance) (node_network_up{device!~\"%s\"} == 0) > 0", strings.Join(ignoredInterfacesForNetworkDown, "|"))),
Let's get rid of the ugly \", either by using a raw string literal:
- Expr: intstr.FromString(fmt.Sprintf("count by (instance) (node_network_up{device!~\"%s\"} == 0) > 0", strings.Join(ignoredInterfacesForNetworkDown, "|"))),
+ Expr: intstr.FromString(fmt.Sprintf(`count by (instance) (node_network_up{device!~"%s"} == 0) > 0`, strings.Join(ignoredInterfacesForNetworkDown, "|"))),
or by using the %q verb:
- Expr: intstr.FromString(fmt.Sprintf("count by (instance) (node_network_up{device!~\"%s\"} == 0) > 0", strings.Join(ignoredInterfacesForNetworkDown, "|"))),
+ Expr: intstr.FromString(fmt.Sprintf("count by (instance) (node_network_up{device!~%q} == 0) > 0", strings.Join(ignoredInterfacesForNetworkDown, "|"))),
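For reference, a small sketch comparing the original escaped form with the two suggested alternatives. The slice contents are an assumption taken from the expression in the PR description, and `buildExprs` is a hypothetical helper used only for this comparison; it is not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumption: entries inferred from the expression in the PR description.
var ignoredInterfacesForNetworkDown = []string{"lo", "tunbr", "veth.+", "ovs-system", "genev_sys.+", "br-int"}

// buildExprs is a hypothetical helper that renders the alert expression
// in the three ways discussed: escaped quotes, raw string literal, and %q.
func buildExprs() (escaped, raw, quoted string) {
	pattern := strings.Join(ignoredInterfacesForNetworkDown, "|")
	escaped = fmt.Sprintf("count by (instance) (node_network_up{device!~\"%s\"} == 0) > 0", pattern)
	raw = fmt.Sprintf(`count by (instance) (node_network_up{device!~"%s"} == 0) > 0`, pattern)
	// %q wraps the value in double quotes and applies Go string escaping.
	quoted = fmt.Sprintf("count by (instance) (node_network_up{device!~%q} == 0) > 0", pattern)
	return escaped, raw, quoted
}

func main() {
	escaped, raw, quoted := buildExprs()
	fmt.Println(raw)
	fmt.Println("all forms identical:", escaped == raw && raw == quoted)
}
```

The raw string literal and the escaped form always produce the same output. The %q form matches them for this list, but because %q applies Go escaping, it would differ for a pattern containing a backslash or a double quote.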
updated
/approve |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sradco
Signed-off-by: João Vilaça <[email protected]>
2a14398 to d88c529
hco-e2e-operator-sdk-gcp lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-aws, ci/prow/hco-e2e-operator-sdk-azure In response to this:
/lgtm |
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-azure In response to this:
hco-e2e-upgrade-operator-sdk-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-sno-azure, ci/prow/hco-e2e-upgrade-operator-sdk-azure In response to this:
hco-e2e-kv-smoke-gcp lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure In response to this:
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure In response to this:
hco-e2e-upgrade-prev-operator-sdk-sno-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-azure In response to this:
@nunnatsa the failures are unrelated to this PR. Can we override them?
/retest |
@machadovilaca: The following tests failed, say
@nunnatsa it failed again.
What this PR does / why we need it:
Additional network interfaces should be excluded from the NodeNetworkInterfaceDown alert:
count by (instance) (node_network_up{device!~"lo|tunbr|veth.+|ovs-system|genev_sys.+|br-int"} == 0) > 0
/cc @maiqueb
Reviewer Checklist
Jira Ticket:
Release note: