[03:28:33] (SystemdUnitFailed) firing: (2) remove_old_puppet_reports.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:59:42] I will just relocate and be back in a little bit
[07:28:33] (SystemdUnitFailed) firing: (2) remove_old_puppet_reports.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:21] netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (ayounsi)
[09:08:37] topranks, XioNoX: could either of you please rearm the keyholder on cumin2002? I haven't yet made the change that gives me access to the homer passphrase
[09:08:51] sure
[09:09:18] cheers
[09:10:01] moritzm: done
[09:10:44] thx
[09:13:33] (SystemdUnitFailed) firing: (3) ferm.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:30] I got the following traceback in the decom cookbook when decommissioning furud.codfw.wmnet, it fails to retrieve some data from JunOS/switches: https://phabricator.wikimedia.org/P52329
[09:28:54] is this a known issue, or should I open a task? and does something need to be cleaned out manually that the cookbook missed due to the traceback?
[09:52:02] SRE-tools, Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (brouberol)
[09:54:23] SRE-tools, Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (brouberol)
[10:06:54] moritzm: i took a look at the cookbook and AFAICT the cookbook had an error trying to disable the interface on the switch, so it's probably worth checking that manually. the only other task left to run was the cookbook sre.dns.netbox (I'm running that now)
[10:07:02] moritzm: hmm ok, that's an error when it executed the sre.network.configure_switch_interfaces cookbook
[10:07:42] I ran homer against the switch to clear out any config not in sync with netbox
[10:08:29] That diff had just the interface removal for the furud host, so nothing jumping out that there is additional extraneous config on the box that tripped up the cookbook
[10:08:29] https://phabricator.wikimedia.org/P52330
[10:08:41] From the network side things should be clean now
[10:09:04] Might be worth opening a task yes - Arzhel wrote that cookbook, the failure may make more sense to him, nothing jumping out at me as to what went wrong
[10:09:05] dns cookbook has also run now as well
[10:10:43] furud was/is a bit of a special case since it's a server which had two additional disk shelves attached to it (so using a total of 5 Us): https://phabricator.wikimedia.org/T176506
[10:10:56] so I suppose that triggers some special case in the connection to the switches
[10:11:16] there is one other server of that type (flerovium), which I expect will also be decommissioned soon
[10:11:44] not sure - from what I'm reading it just had a normal, single Ethernet connection to switches?
[10:12:17] this is the command that failed
[10:12:17] node=asw-b-codfw.mgmt.codfw.wmnet, rc=255, command='show configuration interfaces xe-7/0/6 | display json'
[10:12:21] but yeah - that cookbook is very much designed to deal with the "regular" scenario, if there is something non-standard to do with the networking it perhaps could fail
[10:12:43] I don't know about the original racking, I only know what's in https://phabricator.wikimedia.org/T176506
[10:12:43] that shouldn't fail tbh
[10:13:04] one other thing to note:
[10:13:28] looks like the complicated stack cabling is between the disks though, doesn't touch network
[10:13:44] these servers were quite old, bought in 2017 and not sure if they ever saw a firmware update
[10:13:52] ok
[10:14:17] jbond: thanks for the info on command
[10:14:18] we'll likely have a repro case soon when flerovium gets taken down
[10:14:21] it works fine right now from CLI
[10:14:22] https://phabricator.wikimedia.org/P52331
[10:14:37] (config changed when I ran homer but that show command should basically always work)
[10:15:09] even if I run it for a spurious non-existent interface number it "works" (no output but no error)
[10:15:25] topranks: i think that rc=255 generally means that there was a more fundamental issue like e.g. ssh rejected the connection because of a bad fingerprint
[10:16:09] ah ok, that makes more sense, yeah could potentially be a connection issue or something
[10:16:37] yes
[10:16:45] maybe we don't need to worry about it too much for now, can leave it and see if it re-occurs?
[10:16:52] I'd say so
[10:17:51] the decom cookbook is run regularly and we don't have similar reports so not urgent yeah
[10:18:19] agreed
[10:19:34] if it happens again with the flerovium decom, I'll ping the channel
[10:26:31] sgtm
[10:39:29] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (jbond) @colewhite thanks
[12:14:34] Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (fgiunchedi)
[12:14:48] Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (fgiunchedi)
[12:48:41] netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (BBlack) > to reduce load on LVS hosts My recollection is that it wasn't really about raw load or PPS at the LVSes. It was that our Linux kernel settings ha...
[12:49:18] Puppet: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (jbond) Open→In progress p:Triage→Medium
[12:49:38] Puppet, Infrastructure-Foundations, Puppet-Infrastructure: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (jbond)
[12:52:38] netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (BBlack) The current puppetized tuneables are at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/8ed59718c7a7603b61d7d42e05726fd11dae5eaa/...
[12:57:52] netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (BBlack) Reading into the code above and the history more and self-correcting: the ratelimiter doesn't apply to PTB packets, just some other informational pac...
[13:35:57] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) @papaul I've done some testing and I'm confident the IP GW moves for the row subnets to the Spines can be done gracefully. I've yet to wo...
[13:43:22] Puppet, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (jbond) This did get broken with the migration to the new puppetdbs as we migrated cumin to use...
[13:44:16] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul) @cmooney thanks for the update. I think we can reuse those the MPO
[16:03:33] (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:33] (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:43:33] (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:48:33] (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
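
Note on the rc=255 remark at 10:15:25: the OpenSSH ssh(1) man page states that ssh exits with the exit status of the remote command, or with 255 if an error occurred. The sketch below shows one way a caller might triage such a return code. It is only an illustration: the function name classify_ssh_rc and its labels are hypothetical and are not part of Spicerack, cumin or the decom cookbook, and it assumes the command was run over plain ssh.

```python
def classify_ssh_rc(rc: int) -> str:
    """Roughly classify the exit status of a command run over ssh.

    Hypothetical helper for illustration only. It relies on the ssh(1)
    convention: ssh returns the remote command's exit status, or 255 if
    ssh itself hit an error (connection refused, host key mismatch,
    authentication failure, ...).
    """
    if rc == 0:
        return "ok"
    if rc == 255:
        # Ambiguous by design: a remote command may also exit with 255,
        # so this only says a transport problem is the likely suspect.
        return "possible-ssh-transport-error"
    return "remote-command-failed"


if __name__ == "__main__":
    # rc=255 is the value reported for the failed 'show configuration' command.
    for rc in (0, 1, 255):
        print(rc, classify_ssh_rc(rc))
```

With this reading, rc=255 points at the connection to asw-b-codfw rather than at the show command itself, which would fit the observation that the same command works from the CLI.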