[00:38:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:43:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:12] swfrench-wmf: yes, access is now handled via cn=logstash-access membership and can be requested via idm.wikimedia.org -> Request permission
[07:56:29] there's an announcement with additional details in the #engineering-all Slack channel (5th of March)
[10:32:56] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932 (Joe) NEW
[10:33:35] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10673130 (Joe)
[10:34:02] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10673134 (Joe)
[10:34:11] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10673136 (FCeratto-WMF) Related to https://phabricator.wikimedia.org/T388127
[11:33:10] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10673629 (cmooney) Things are looking good after the application of the change, an-worker nodes are correctly...
[12:41:32] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10673820 (BTullis) Open→Resolved a:BTullis
[13:09:21] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10673952 (elukey) For record keeping, afaics this module is used by w...
[13:13:47] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10673956 (elukey) @Volans @dcaro from what I can see the repo got a n...
[13:19:11] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10673977 (dcaro) >>! In T354410#10673956, @elukey wrote: > @Volans @d...
[13:35:56] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10674046 (Volans) This is a great news! With our current pinning in s...
[13:36:12] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10674050 (Marostegui)
[13:46:53] back
[13:46:57] er, wrong channel :)
[13:47:08] (irssi changed the window numbering!)
[13:49:06] lol
[13:49:44] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10674127 (elukey) @dcaro have you tried with the current setup.py's c...
[13:57:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:39] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10674173 (dcaro) I'm installing wmcs-cookbooks, that should bring the...
[14:18:00] moritzm: thanks for confirming! I saw the clinic duty docs had been updated, and recall there being an email about using IDM (e.g., for wmf), but wasn't sure authoritatively what https://wikitech.wikimedia.org/wiki/Logstash#Authentication should list (ops + logstash-access?)
[14:18:59] 10:18:39 <+jinxer-wm> FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr1-drmrs (185.15.58.142) - group
[14:19:04] what's this about? topranks XioNoX ^
[14:19:31] sukhe: cool, new alerts being ported to prometheus
[14:19:39] the links been down due to physical issue
[14:19:41] sukhe: it's the same as the Icinga BGP one
[14:19:48] ok :)
[14:19:50] and yep, probably fired due to the new alerts added
[14:19:57] sorry for the noise folks
[14:19:59] (on-call hat)
[14:20:01] sukhe: so known issue, I'll ack it
[14:20:03] swfrench-wmf: ah, that is outdated docs, thanks! I'll fix the wikitech page now
[14:20:05] np
[14:20:12] sukhe: nah it's cool, thanks for the ping
[14:22:30] silenced for 2 days
[14:22:59] <3
[14:24:48] topranks: another nice side feature of migrating from Icinga to AM is that we can ACK a specific interface being down. While in icinga it has to be all or nothing, at the risk of missing more issues
[14:25:34] Yep cool, or probably a specific BGP peer or OSPF int?
[14:25:43] yeah
[14:26:16] thanks, moritzm!
[14:40:55] FIRING: MaxConntrack: Max conntrack at 80.31% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[14:45:55] RESOLVED: MaxConntrack: Max conntrack at 80.31% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[14:52:00] .37
[15:17:54] netops, Infrastructure-Foundations, SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 (cmooney) NEW p:Triage→Medium
[15:20:36] netops, Infrastructure-Foundations, SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10674586 (cmooney) @aborrero as discussed we can possibly arrange a window for Thurs Mar 27th to carry out the remaining steps? Unlike the previous attempt I will lea...
[15:51:04] netops, Infrastructure-Foundations, SRE: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10674734 (cmooney) Config to be applied in first step - P74416
[17:57:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:09:55] FIRING: MaxConntrack: Max conntrack at 80.93% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:14:55] RESOLVED: MaxConntrack: Max conntrack at 80.93% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:15:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:20:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:24:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:29:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:57:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed