[01:41:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:06:34] FIRING: DiskSpace: Disk space seaborgium:9100:/ 5.277% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:31:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:41:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:51:25] RESOLVED: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:32:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:34] FIRING: DiskSpace: Disk space seaborgium:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:46:34] RESOLVED: DiskSpace: Disk space seaborgium:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:07:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:13:29] good morning. moritzm when it's not interfering with other upgrades I have to roll the upgrade for wmflib across the fleet
[07:15:31] morning!
[07:15:34] https://phabricator.wikimedia.org/T388684#10687015 :(
[07:15:42] good morning and go ahead :-)
[07:17:21] thanks
[07:30:38] and {done}
[07:37:24] ack
[09:32:51] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10693017 (ayounsi) Alarms graphing is working well. {F58951374} On this dashboard as well: https://grafana.wikimedia.org/d/fb...
[09:52:14] Mail, Infrastructure-Foundations, MediaWiki-Watchlist, Moderator-Tools-Team, and 2 others: Notifications about changes by Oznamovatel sent to Janbery doesn't seem to be reliable - https://phabricator.wikimedia.org/T245762#10693047 (matej_suchanek)
[10:50:02] Mail, Infrastructure-Foundations, MediaWiki-Watchlist, Moderator-Tools-Team, and 2 others: Notifications about changes by Oznamovatel sent to Janbery doesn't seem to be reliable - https://phabricator.wikimedia.org/T245762#10693151 (Samwalton9-WMF) Open→Resolved Given the lack of evide...
[11:38:01] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10693273 (ayounsi) Open→Invalid The alert was too sensitive, I made https://gerrit.wikimedia.org/r/c/operations/alerts/+/1132591 to improve it.
[11:40:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105#10693284 (ayounsi) Closing this task as we now have alerting for all the MX running a not too old Junos (and we're upgrading Junos in T364092).
[11:40:08] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105#10693287 (ayounsi) Stalled→Resolved a:ayounsi
[12:38:44] topranks: let's see where that goes https://github.com/openconfig/gnmic/issues/631 :)
[12:39:23] ah cool
[12:39:29] yeah worth opening, I definitely considered it before
[12:39:37] I think the difficulty is gnmic would need access to the model
[12:39:43] in theory that is available from the device though??
[12:40:35] yeah
[12:40:52] it's prometheus-specific afaik too, I used InfluxDB in a past life and you could store text there
[12:41:07] I don't think anything about a tsdb itself would prevent it, just the way prometheus works
[12:41:35] but absolutely if gnmic could automatically get the yang model and do this it'd be great
[12:42:08] even if no numeric values were explicitly in the model it could simply use 'first item=1, second item=2' etc
[12:47:07] topranks: depends, even if the model doesn't have it, the RFC does (or SNMP implementation :) )
[12:47:27] so ideally better to follow that. But yeah if there is nothing defined, sequential would be fine
[12:48:28] it's hard to make a programmatic association from RFC - MIB - YANG I think though
[12:49:11] yeah for sure
[12:49:40] let's see what they say anyway, definitely it would be good
[12:49:57] ideally if it was just a flag in the prom_output, like "convert_text_values: true" or something
[13:08:49] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10693542 (cmooney)
[14:51:24] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10694188 (cmooney) +1. I think I disabled it on the fasw a while ago as it was unable to connect to them, and I was worried about wasting clock cycles trying. But since their upgrades I t...
[14:55:51] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10694205 (joanna_borun) p:Triage→Medium
[14:55:58] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10694206 (ayounsi) a:ayounsi
[14:58:48] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10694219 (joanna_borun) a:jhathaway
[17:15:25] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10694961 (jhathaway) One lower tech option, is to use multiple simple regexes, e.g. ` node /^sretest1002\.eqiad\./, /^sretest1004\.eqiad\./, /^sretest1006\.eqiad...
[17:15:37] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10694962 (jhathaway) p:Triage→Low
[20:00:56] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10695612 (xcollazo) Hello, I would like to exercise this rule by running a very heavy Presto query. Is t...
[21:10:27] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10695918 (BTullis) >>! In T381389#10695612, @xcollazo wrote: > Hello, I would like to exercise this rule...