[00:07:42] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:07:42] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:14:12] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:43] SRE-tools, netops, Infrastructure-Foundations, SRE: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (ayounsi) I worked around the issue by disabling "dhcp-relay" on cr2-eqiad `install1004:~$ sudo tcpdump -i ens13 "host 10.65.0.1"` is the easiest way to dete...
[05:45:58] volans: found the issue and worked around it
[05:46:00] https://phabricator.wikimedia.org/T337345#8875421
[05:57:48] I have to step away for a bit but I edited my comment with 2 possible longer term fixes
[06:14:12] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:12] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:51] thanks a lot XioNoX
[08:29:59] (PuppetDisabled) firing: Puppet disabled on cuminunpriv1001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[08:44:53] volans, XioNoX: only catching up with this
[08:45:19] sorry, almost certain a result of the changes last week. I tested a re-image but didn't anticipate the CR changes might affect relay from the MR
[08:45:26] I'll try to dig in see if I can work anything out
[08:47:07] <3 thanks a lot, I told jclark that he's unblocked now, so if possible try things not in eqiad :D
[08:48:00] gotcha
[08:54:18] SRE-tools, netops, Infrastructure-Foundations, SRE: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (cmooney) >>! In T337345#8875421, @ayounsi wrote: > So it's either a Junos bug or the need for another nerd knob. > Edit: [[ https://www.juniper.net/documenta...
[08:56:12] heh for once the constant dhcp-spam on the mgmt network is useful for something
[09:01:34] yeah :D
[09:01:42] just this time though
[09:06:35] Hey team, I have to take an unplanned day off. Sorry for late notice.
[09:12:11] Puppet, Infrastructure-Foundations, Machine-Learning-Team, ORES, SRE: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002 (elukey) Open→Declined The ML team is focusing on https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing, the replacement of ORES...
[09:31:19] Puppet, Infrastructure-Foundations: role_owner.prom not getting updated on (re)installed hosts? - https://phabricator.wikimedia.org/T337375 (fgiunchedi)
[09:34:10] spoke to one of the Ganeti maintainers who's also present at the Mini Debconf, he's creating a repo for prometheus-ganeti-exporter under the https://github.com/ganeti parent org, so that we can import it there (and further maintenance can proceed there)
[09:41:05] moritzm: For us to put the exporter we wrote into, or have they create one as well?
[09:43:15] for us to import, but I guess when it's up there (and it seems others have voiced interest for prometheus metrics in the past), I suppose people will file issues or MRs for the metrics they need
[09:43:27] given that our initial focus is on capacity overview
[09:43:53] you should get access for your github handle in a bit
[09:44:31] Insert "Mr. Burns: Eeexcellent" meme
[09:45:10] I'm also sorting out to get added to the Ganeti org on salsa, when that is resolved, I'll also upload the exporter to Debian
[09:47:16] SRE-tools, netops, Infrastructure-Foundations, SRE: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (cmooney) The Juniper [[ https://www.juniper.net/documentation/us/en/software/junos/dhcp/topics/topic-map/dhcp-relay-agent-security-devices.html | docs ]] do s...
[10:14:32] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (jbond) >>! In T316358#8469318, @cmooney wrote: > @jbond I've uploaded a separate patch (above) that makes a stab and working this clos...
[10:22:39] SRE-tools, netops, Infrastructure-Foundations, SRE: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (cmooney) >>! In T337345#8874938, @Volans wrote: > I wonder if this has something to do with https://gerrit.wikimedia.org/r/c/operations/homer/public/+/908346...
[10:22:41] nice!
[12:29:59] (PuppetDisabled) firing: Puppet disabled on cuminunpriv1001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[13:09:55] moritzm: fyi this is you ^^^
[13:30:17] oh, right. that was used for the KDC migration, re-enabling now
[13:32:06] fixed
[13:34:59] (PuppetDisabled) resolved: Puppet disabled on cuminunpriv1001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[13:36:47] cheers :)
[16:12:36] SRE-tools, netops, Infrastructure-Foundations, SRE: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (ayounsi) Copying the commit message as it have the RFO and fix details: The modern DHCP implementation on Juniper devices forwards ALL DHCP packets to the co...
[16:50:00] Mail, Infrastructure-Foundations, Sustainability (Incident Followup): Follow up for mx1001 incident: 2023-05-17 MXQueueHigh on mx1001 - https://phabricator.wikimedia.org/T337257 (Arnoldokoth)
[18:22:13] SRE-tools, netops, Infrastructure-Foundations, SRE: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (Volans) With T336485 almost completed, we could consider integrating the two things, getting this one off exported in some place and then have the `sre....
[19:38:22] SRE-tools, netops, Infrastructure-Foundations, SRE: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (ayounsi) My initial guess was to add them to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common.y...
[19:50:18] SRE-tools, netops, Infrastructure-Foundations, SRE: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (Jclark-ctr) @ayounsi the provisioning script is still failing in row e/f.
dbproxy1026 dbproxy1027
[20:21:42] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:21:42] (SystemdUnitFailed) resolved: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:43:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed