[02:05:33] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:19] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9933337 (Papaul)
[06:09:15] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:13:24] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9933423 (ayounsi) IPIP encapsulation is a necessary step in the good direction, whatever solution we decide on for load balancing, for th...
[08:30:33] https://phabricator.wikimedia.org/T357415#9905563
[08:30:34] * elukey sigh
[08:30:51] so we need a special license to use redfish on supermicro
[08:30:55] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9933522 (elukey) Current status: * We are following up with Supermicro to customize the default root password for...
[08:31:13] I didn't know they got bought by Juniper :)
[08:32:35] but yeah, that sucks...
[08:33:19] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#9933530 (cmooney)
[08:54:27] we have a lot of things in flight with them, we'll see updates
[08:54:52] maybe we should be prudent before buying more nodes
[09:13:22] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9933563 (cmooney) Open→Resolved Thanks all for the help with this one!
[09:43:27] Juniper bought Supermicro?
[09:46:43] sorry I realise that was a joke - I guess it couldn't be much worse than HPE buying Juniper anyway, I'd forgotten about that!
[09:51:58] :)
[10:09:15] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:50] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9933828 (cmooney) >>! In T368544#9933423, @ayounsi wrote: > An `ip route 0/0` rule would be needed to "clamp" the outbound MTU or MSS (us...
[11:10:33] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:15] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:17:02] any idea why when reimaging a server with --new, I get a CSR generated, but then the wait_for_csr step fails, and I get no CSR listed on the puppetserver?
[12:17:13] anything manual I can do to fix?
[12:17:23] wikikube-worker1029.eqiad.wmnet is an example
[12:20:46] ok it's even weirder. It reinstalls with the old hostname o_o
[12:26:50] claime: Not sure if this is related but I had an issue where I got Puppet 5 rather than 7. Maybe add -p 7?
[12:26:59] Again not sure if it's the same issue
[12:27:07] it has -p 7 in the reimage command
[12:27:17] Then that's not it
[12:34:21] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9934200 (cmooney) @fgiunchedi I was perhaps a little cheeky and merged this, but it was clear the volume of new metrics was well withi...
[12:39:33] claime: That is really weird and I currently have no answer
[12:41:40] mw1412 that was renamed to wikikube-worker1027, this one gets reimaged with the right hostname, but the CSR generation fails, doing it manually doesn't seem to unblock the cookbook either (logging into the server with install-console, cleaning up the certificate request, signing it manually on the puppetmaster)
[12:42:03] and mw1417 that was renamed to wikikube-worker1029, that reimages to the old hostname
[12:42:15] for that one I'm on the console while it reimages trying to see what went wrong
[12:48:45] ok wikikube-worker1029 now reimages to the right hostname, we'll see if that survives
[12:49:10] CSR signing passed for 1029, I have no idea what went wrong
[12:49:15] computers are fantastic
[12:49:28] It's probably computer related yes
[12:50:03] They are less deterministic than computer science has us believe
[12:51:54] ok something went wrong with the renames between wikikube-worker1027 and wikikube-worker1028
[12:54:48] 1028 works as well?
[12:55:23] connecting to wikikube-worker1028.mgmt.eqiad.wmnet lands me on wikikube-worker1027's management interface
[12:55:25] and vice versa
[12:55:34] DNS records are consistent with netbox
[12:55:41] wth
[12:58:14] claime: sounds like the mgmt IPs need to be switched around
[12:58:17] yeah somehow the ip of wikikube-worker1027's management interface points to the management interface of wikikube-worker1028 and vice-versa
[12:58:32] give me 5 mins to wrap up what I'm doing I'll take a look
[12:58:55] I must have messed something up running the renames I guess
[12:58:56] Sorry, I have to run
[12:59:03] no worries
[13:16:14] claime: ok I checked and the non-mgmt IPs for those hosts are in the right places
[13:16:28] so seems like just the mgmt IPs are the wrong way around
[13:17:20] idk how that happened :shrug:
[13:21:19] yeah it's odd, this was done with the new cookbook was it?
[13:25:48] yeah
[13:28:38] claime: ok it's fixed now
[13:28:44] ty <3
[13:28:49] I'm tied up so I can't take a deeper look right now
[13:29:05] easy mistake if there were manual changes, but if cookbook caused this it's worrying
[13:29:11] no worries it may be a one off
[13:29:20] or they were already borked
[13:30:03] ah indeed yeah perhaps
[13:55:37] another rename-related owie: had a timeout during the rename so now I have renamed the instance but not the interfaces 😅
[13:55:41] https://phabricator.wikimedia.org/P65550
[14:02:11] can I just manually change those records, then run the dns, sync hiera and configure interface cookbooks?
[14:14:15] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:37] SRE-tools, Infrastructure-Foundations, Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9934720 (elukey) The safest bet is to use `python3-confluent-kafka` in my opinion, it is pa...
[15:09:15] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:38] hnowlan: having a look now
[15:31:17] yeah so the IPs still have the old names in netbox, but the server name itself was changed
[15:31:23] we can manually update the "dns_name" on these
[15:31:29] https://netbox.wikimedia.org/ipam/ip-addresses/5823/
[15:31:34] https://netbox.wikimedia.org/ipam/ip-addresses/5824/
[15:31:39] https://netbox.wikimedia.org/ipam/ip-addresses/1843/
[15:31:39] SRE-tools, Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744 (elukey) NEW
[15:31:44] and run the dns cookbook, I'll do that now
[15:32:35] topranks: thank you!
[15:33:03] hnowlan: how is it otherwise? do you think that was the only problem?
[15:34:27] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9934956 (elukey) `docker-reporter-base-images.service` on build2001 reports an issue with the dec-puppet-client image: ` [2024-06-28T04...
[15:36:09] topranks: I'm not 100% sure, it bailed out when it couldn't find the renamed host so it more or less stopped before the other cookbooks could run. Do you think running the hiera and interfaces cookbooks should do it?
[15:36:38] you're just renaming the host right?
[15:36:47] not moving it or anything?
[15:37:46] yeah
[15:37:50] just renaming
[15:42:50] ok well those won't matter much then, but yes we can run them
[15:43:17] in fact the dns cookbook just triggered the hiera one so that's now done
[15:43:47] grand, I can give a reimage a go now and see
[15:43:58] I'll update the switch too, but all it will do is change a port description, so won't have a bearing on the reimage
[15:44:05] yeah I'd fire it off again hopefully it'll be ok
[16:03:00] ah no dice, it came back up as mw2300
[16:04:08] I have to take off soonish so I'll leave it downtimed for now and come back to it
[16:17:53] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9935199 (elukey) On db1195 I see for `emacs-nox`: ` MariaDB [debmonitor]> select * from bin_packages_package where name = 'emacs-nox'...
[16:24:00] hnowlan: sorry didn't see till now
[16:24:11] when you say "it came back up as".....
[16:25:27] I'm not sure how that is possible unless the system didn't reinstall the OS at all
[16:26:04] i.e. it just rebooted as it was, hence having the old name
[16:26:46] anyway if it can stay downtimed we can look next week
[16:29:36] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9935247 (elukey) Ok I see, I ran debmonitor inside the dcl image: ` "os": "Debian 12", "uninstalled": [], "update_type": "f...
[18:25:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9935568 (cmooney) I may have spoken too soon when I said things were working fine. It seems in codfw since the change we are only get...
[19:09:15] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:09:15] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed