Add new ignored interfaces to NodeNetworkInterfaceDown Alert #3279

Status: Open, 1 commit to be merged into base: main
16 changes: 14 additions & 2 deletions pkg/monitoring/observability/rules/alerts/cluster_alerts.go
@@ -1,11 +1,23 @@
package alerts

import (
"fmt"
"strings"

promv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
"k8s.io/apimachinery/pkg/util/intstr"
"k8s.io/utils/ptr"
)

var ignoredInterfacesForNetworkDown = []string{
Contributor

Which interfaces are we actually trying to monitor?

I logged into an existing node and see the following interfaces:

8: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 0a:58:0a:87:00:02 brd ff:ff:ff:ff:ff:ff
    inet 10.135.0.2/23 brd 10.135.1.255 scope global ovn-k8s-mp0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe87:2/64 scope link 
       valid_lft forever preferred_lft forever
777: 9f17207c3cc0f46@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP group default 
    link/ether 82:e5:8e:03:e1:ce brd ff:ff:ff:ff:ff:ff link-netns 5bf0ae73-db72-4bfd-b646-b231f799e15d
    inet6 fe80::80e5:8eff:fe03:e1ce/64 scope link 
       valid_lft forever preferred_lft forever
9: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether ce:98:9b:60:6c:df brd ff:ff:ff:ff:ff:ff
    inet6 fe80::cc98:9bff:fe60:6cdf/64 scope link 
       valid_lft forever preferred_lft forever
11: 493c31b001d3731@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP group default 
    link/ether 7a:0d:d9:56:78:5f brd ff:ff:ff:ff:ff:ff link-netns b1d150aa-91dc-4741-872f-1f71dab1a2da
    inet6 fe80::780d:d9ff:fe56:785f/64 scope link 
       valid_lft forever preferred_lft forever
12: 3d50190a313ba21@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP group default 
    link/ether 52:77:4d:f5:8c:4c brd ff:ff:ff:ff:ff:ff link-netns 28e514e3-3b79-495a-b5d6-483d779203c5
    inet6 fe80::5077:4dff:fef5:8c4c/64 scope link 
       valid_lft forever preferred_lft forever

And many more like 493c31b001d3731 (just an example).
Are we interested in raising alerts when these go down? These are the host side of the veth pairs connecting the pods to OVS.

Contributor

Also, regarding ovn-k8s-mp0: would we want to know about this one? It is the default cluster network management port, which is used for some OVN-K features.

With primary UDNs, there will be one of these per primary UDN.

Member Author

Do you have any idea why, for example, the NodeNetworkInterfaceFlapping alert, changes(node_network_up{device!~"veth.+|tunbr",job="node-exporter"}[2m]) > 2, only ignores "veth.+" and "tunbr"?

Contributor

Totally clueless.

I guess they, whoever they are, don't run in OpenShift CI, whose monitor tests would fail on this.

Member Author

The OpenShift monitoring team also provides the flapping alert for OpenShift customers.

"lo", // loopback interface
"tunbr", // tunnel bridge
"veth.+", // virtual ethernet devices
"ovs-system", // OVS internal system interface
"genev_sys.+", // OVN Geneve overlay/encapsulation interfaces
"br-int", // OVN integration bridge
}
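
To make the review question above concrete, here is a minimal sketch of which device names the new list would exclude from the alert. It relies on Prometheus anchoring label regex matchers on both ends, and it assumes node-exporter reports the device label without the "@if2" suffix that `ip` prints. The helper `ignoredDeviceRegexp` and the sample name `veth1a2b3c` are hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same entries as the list in the diff.
var ignoredInterfacesForNetworkDown = []string{
	"lo", "tunbr", "veth.+", "ovs-system", "genev_sys.+", "br-int",
}

// ignoredDeviceRegexp (hypothetical helper) mimics how Prometheus evaluates
// device!~'<pattern>': the pattern is anchored at both ends.
func ignoredDeviceRegexp(patterns []string) *regexp.Regexp {
	return regexp.MustCompile("^(?:" + strings.Join(patterns, "|") + ")$")
}

func main() {
	re := ignoredDeviceRegexp(ignoredInterfacesForNetworkDown)
	// Device names taken from the reviewer's listing, plus one made-up veth name.
	devices := []string{
		"lo", "genev_sys_6081", "ovn-k8s-mp0",
		"9f17207c3cc0f46", "493c31b001d3731", "veth1a2b3c",
	}
	for _, d := range devices {
		fmt.Printf("%-16s ignored=%t\n", d, re.MatchString(d))
	}
}
```

Under these assumptions, ovn-k8s-mp0 and the hex-named host-side veth devices do not match any entry, so the alert would still fire for them.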

func clusterAlerts() []promv1.Rule {
return []promv1.Rule{
{
@@ -23,7 +35,7 @@ func clusterAlerts() []promv1.Rule {
},
{
Alert: "HAControlPlaneDown",
- Expr: intstr.FromString("kube_node_role{role=\"control-plane\"} * on(node) kube_node_status_condition{condition=\"Ready\",status=\"true\"} == 0"),
+ Expr: intstr.FromString("kube_node_role{role='control-plane'} * on(node) kube_node_status_condition{condition='Ready',status='true'} == 0"),
For: ptr.To(promv1.Duration("5m")),
Annotations: map[string]string{
"summary": "Control plane node {{ $labels.node }} is not ready",
@@ -36,7 +48,7 @@ func clusterAlerts() []promv1.Rule {
},
{
Alert: "NodeNetworkInterfaceDown",
- Expr: intstr.FromString("count by (instance) (node_network_up{device!~\"veth.+|tunbr\"} == 0) > 0"),
+ Expr: intstr.FromString(fmt.Sprintf("count by (instance) (node_network_up{device!~'%s'} == 0) > 0", strings.Join(ignoredInterfacesForNetworkDown, "|"))),
For: ptr.To(promv1.Duration("5m")),
Annotations: map[string]string{
"summary": "Network interfaces are down",
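
For reference, the expression that the new NodeNetworkInterfaceDown rule produces can be inspected with a small standalone sketch. The function name `networkDownExpr` is hypothetical; the format string and the list entries are copied from the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// networkDownExpr (hypothetical helper) builds the PromQL expression the same
// way the diff does: the ignore patterns are joined with "|" into one regex.
func networkDownExpr(ignored []string) string {
	return fmt.Sprintf(
		"count by (instance) (node_network_up{device!~'%s'} == 0) > 0",
		strings.Join(ignored, "|"),
	)
}

func main() {
	ignored := []string{"lo", "tunbr", "veth.+", "ovs-system", "genev_sys.+", "br-int"}
	fmt.Println(networkDownExpr(ignored))
	// prints: count by (instance) (node_network_up{device!~'lo|tunbr|veth.+|ovs-system|genev_sys.+|br-int'} == 0) > 0
}
```

Single-quoted PromQL strings avoid the escaped double quotes of the previous version. A unit test that pins the generated string would be one way to catch accidental changes to the list.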