[04:33:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[05:43:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[10:21:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[11:21:25] FIRING: [2x] MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[11:46:25] FIRING: [2x] MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[13:15:32] netops, fundraising-tech-ops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#11327066 (Papaul) @Dwisehaupt yes Wednesday 11/5 is ok with me. Let us do 10:00am CT. Thank you.
[13:34:12] netops, Infrastructure-Foundations, SRE: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11327142 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aacadee6-1bf1-45b7-bbed-963884cb38ed) set by cmooney@cumin1003 for 5 d...
[13:34:42] There is something that I am not getting - why are we seeing all those MirrorHighLag alerts?
[13:34:59] in theory we should receive the notification after the 14h are crossed
[13:35:05] and this doesn't seem the case
[13:35:23] the warning fires after 8 hours
[13:42:47] yes something is off, the alert's description says
[13:42:48] Mirrors - /srv/mirrors/debian synchronization lag is behind 10h 32m 22s
[13:44:45] the alert itself is set for 8 hours though
[13:44:47] expr: time() - node_file_age_timestamp_seconds_total{path=~"/srv/mirrors/(debian|ubuntu)"} > 8 * 3600
[13:45:12] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/mirrors.yaml
[13:45:56] or actually... the one below says 14h
[13:46:28] aim seems to be warning alert after 8 hours, critical after 14
[13:46:31] yeah the first is the warning, the latter is the critical
[13:46:46] I'm not sure how the plumbing works from there though, I guess we are seeing the "warning" ones here
[13:46:54] ok my bad I see that https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag is a warning
[13:47:09] yeah so why do we get notified for warnings?
[13:47:16] there is probably a tunable somewhere
[13:47:49] yeah I am not sure, I wasn't really aware we had proper "warning" and "critical" level alerting
[13:48:12] I guess though the issue is not in that YAML file, but in how we deal with the warning level alerts more generally?
[13:48:35] found it, in puppet we define that this channel gets everything
[13:48:46] warnings and criticals, meanwhile operations only criticals
[13:48:53] ok cool
[13:48:58] I'd argue that we should stop getting warnings here
[13:48:59] well maybe that is sensible idk
[13:49:00] wdyt?
[13:49:16] it is super noisy and not actionable
[13:49:27] could be ok... but I wonder what the value of the warning is then?
[13:49:33] they will pop up on karma if we need to check, but here it is pointless
[13:49:36] perhaps we should only have a critical level after 14h?
[13:49:45] ok yeah that's a point
[13:49:57] is the puppet config a global one? i.e. it's for _all_ warnings?
[13:50:15] nono the ones that match our team as label
[13:50:23] modules/alertmanager/templates/alertmanager.yml.erb
[13:50:24] I could see some sense in a warning firing here before it goes to -operations
[13:50:36] I'm not sure how many other alerts we have set this way with warning/critical threshold
[13:51:04] I think the netops ones are all just critical, but they are mostly things we want to know immediately the situation exists, or like after 5 min
[13:51:41] slyngs: o/ git blame shows you as the creator of the warning :D
[13:52:10] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1130964
[13:52:33] if there is no compelling reason to have that warning, I'd remove it
[13:52:58] the critical at 14h is enough in my opinion, and we'd remove some noise from this channel
[13:55:11] I think safer maybe to change this to only have critical, rather than adjust the config that sends warnings for our teams here
[13:55:38] perhaps that makes sense, but without knowing what other alerts have that kind of setup or why it was done to begin with might be safer to just modify this alert?
[13:58:05] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1200068
[13:58:49] more clean and actionable in my opinion, given the current timings for the rsyncs that copy from mirrors etc...
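For anyone reading this log later, a minimal sketch of the behaviour discussed above, assuming only what was said in the channel: a warning above 8 hours of lag, a critical above 14 hours, a team channel that receives both severities and -operations that receives only criticals. The function names, the "team-channel" and "operations" keys and the routing table are hypothetical illustrations, not the actual Alertmanager or puppet configuration.

```python
from typing import Optional

HOUR = 3600

# Thresholds as discussed in the log: warning after 8h, critical after 14h.
WARNING_AFTER_S: Optional[int] = 8 * HOUR
CRITICAL_AFTER_S = 14 * HOUR

# Hypothetical routing table: the team channel gets everything,
# the operations channel gets only criticals.
CHANNEL_SEVERITIES = {
    "team-channel": {"warning", "critical"},
    "operations": {"critical"},
}


def severity_for_lag(lag_seconds: int,
                     warning_after: Optional[int] = WARNING_AFTER_S,
                     critical_after: int = CRITICAL_AFTER_S) -> Optional[str]:
    """Return the severity that would fire for a given lag, or None."""
    if lag_seconds > critical_after:
        return "critical"
    # warning_after=None models the proposal to drop the warning entirely.
    if warning_after is not None and lag_seconds > warning_after:
        return "warning"
    return None


def channels_notified(lag_seconds: int,
                      warning_after: Optional[int] = WARNING_AFTER_S) -> list:
    """Return the sorted list of channels that would be notified."""
    severity = severity_for_lag(lag_seconds, warning_after=warning_after)
    if severity is None:
        return []
    return sorted(name for name, accepted in CHANNEL_SEVERITIES.items()
                  if severity in accepted)


# The lag quoted in the alert description: 10h 32m 22s.
observed_lag = 10 * HOUR + 32 * 60 + 22
print(channels_notified(observed_lag))                      # warning kept
print(channels_notified(observed_lag, warning_after=None))  # warning dropped
```

With the warning kept, a 10h 32m 22s lag notifies only the team channel; with the warning dropped, nothing is sent until the 14h critical is crossed, which matches the intent of the proposed patch.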
[14:27:02] even if the critical were to hit, it's hardly actionable for us
[14:27:16] usually it means that the mirror push on the Ubuntu/Debian side failed one way or the other
[14:27:34] but dropping the warning is definitely right, having a look at the patch in a bit
[14:53:51] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[16:00:49] netops, fundraising-tech-ops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#11328131 (Dwisehaupt) Cool. Let's do it then.