[00:18:22] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9814762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2002.codf...
[00:21:21] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9814765 (Jhancock.wm) @cmooney I put the server in the wrong vlan. can you fix it for me. private1-a8 to private-a-codfw. th...
[03:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:19] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9815067 (Pppery)
[06:48:47] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9815149 (ayounsi)
[07:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:33:56] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:03] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9815439 (cmooney) @Jhancock.wm @Papaul I'd been using the server in b7 for testing already, but I should be able to move over...
[09:53:56] Mail, collaboration-services, Infrastructure-Foundations, SRE, Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9815860 (Jelto)
[10:20:41] SRE-tools, Spicerack: Spicerack: allow cookbooks to abort execution from __init__ - https://phabricator.wikimedia.org/T365454 (Volans) NEW p:Triage→Medium
[10:21:20] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655#9815952 (Volans) Some use case could be covered with this approach: T365454
[10:56:39] hello! I am attempting to rename and reimage a few servers in codfw and eqiad - when running `sudo cookbook sre.network.configure-switch-interfaces wikikube-ctrl2001` I am getting a failure
[10:56:46] the error is "error: In routing-instance default-switch vlan private1-b6-codfw configured under interface ge-0/0/12.0 does not exist
[10:57:09] not sure if I did something wrong in the rename process or what
[11:01:00] the cookbook works for eqiad hosts (wikikube-ctrl100[123])
[11:02:10] can you wait for the rename cookbook? :D
[11:02:39] he is unstoppable at this point
[11:02:39] I'll defer to top.ranks for the specific error in codfw ;)
[11:44:05] topranks: sorry to bother you again but I'm seeing that same error for wikikube-ctrl200[23] also. Can I fix that myself with a homer run?
[11:45:26] probably....
[11:45:36] homer lsw*codfw* commit "add missing vlans codfw"
[11:45:55] I can run it here though no bother
[11:47:05] ah sure I can, thanks!
[11:47:54] I'm running it right now :P
[11:48:17] almost done, but it may need to be done later; it's only when the netbox change is made in vlan that Homer will try to add it
[11:48:37] ah okay
[11:48:39] thanks!
[11:48:42] hnowlan: if that happens you can also run for a given switch if you know what rack it's in (the above error tells you)
[11:48:43] i.e.
[11:48:53] homer lsw1-b6-codfw* commit "add missing vlan"
[11:49:48] hnowlan: actually it didn't add anything to those
[11:50:27] wikikube-ctrl2003 is in rack A3, so I'd expect it there but nothing...
[11:50:28] * topranks checking
[11:50:47] ah wikikube-2002's interface is on asw
[11:50:53] wikikube-ctrl2002 is in rack C6 which is still on the old network setup, shouldn't need it
[11:50:55] yep
[11:51:40] so yeah, 2002 shouldn't need a change, 2003 is in a new rack but the right vlans were there already
[11:51:48] can you post the error you got?
[11:52:23] 2003 seems happy now
[11:52:32] for 2002 I get: error: In routing-instance default-switch vlan private1-c6-codfw configured under interface ge-60/24.0 does not exist
[11:52:53] say what?
[11:52:55] ok
[11:53:34] god damn it there is just always something with this stuff...
[11:54:04] an oversight plus the fact we're prepping the migration to the new setup in row C means your host has been tried to be added to an as-yet non-existent vlan
[11:54:09] give me a moment
[11:55:23] ahhh oops
[11:57:13] hnowlan: you can run the cookbook to configure the switch ports for wikikube-2002 now
[11:57:17] I'm just updating the dns
[11:57:27] thanks!
[11:59:56] I've changed those vlans to 'reserved' from status 'active' in Netbox now, so they won't get picked again in error
[12:00:10] all sorted on that front, thanks!
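(Editor's sketch) The per-switch tip in this exchange, where the error names the missing vlan and the vlan name carries the rack, could be expressed roughly as below. This is a hypothetical helper, not part of Homer or the cookbooks: `homer_target` and its regex are assumptions drawn only from the one error string and the one `homer lsw1-b6-codfw*` command quoted in this log. As the wikikube-ctrl2002 case shows, a rack still on the old asw setup does not map to an lsw switch, so the result is a hint to verify, not a command to run blindly.

```python
import re
from typing import Optional

# Assumed error shape, copied from the message quoted in the log:
#   "... vlan private1-b6-codfw configured under interface ge-0/0/12.0 does not exist"
# The vlan naming scheme (private1-<rack>-<site>) is inferred from that one example.
VLAN_ERROR = re.compile(
    r"vlan private1-(?P<rack>[a-z]\d+)-(?P<site>[a-z]+) configured under interface"
)


def homer_target(error: str) -> Optional[str]:
    """Return a Homer device glob guessed from a 'vlan does not exist' error.

    Hypothetical helper: returns None when the error does not match the
    assumed shape, so the caller can fall back to asking netops.
    """
    match = VLAN_ERROR.search(error)
    if match is None:
        return None
    # Assumption: the rack switch is named lsw1-<rack>-<site>, as in the
    # `homer lsw1-b6-codfw* commit ...` command quoted in the log.
    return f"lsw1-{match['rack']}-{match['site']}*"


if __name__ == "__main__":
    err = (
        "error: In routing-instance default-switch vlan private1-b6-codfw "
        "configured under interface ge-0/0/12.0 does not exist"
    )
    print(homer_target(err))
```

The helper only builds the device glob; running `homer <glob> commit "<message>"` is left to the operator, since the log shows that whether a rack needs the change depends on which network setup it is on.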
[12:00:16] I'll also adjust the automation for adding them so they're added in that state, mistake was they were added as 'active'
[12:00:43] cool - authdns is being pushed out now so you're good to proceed
[12:01:02] however, I unfortunately have another, possibly unrelated issue with sre.hosts.provision :D
[12:02:14] getting `cookbooks.sre.hosts.provision.ProvisionRunner._config: Not all changes were applied successfully, see the ones reported above that starts with "Updated value..."`
[12:02:43] on first run I get `Unable to auto-detect NIC with link. Pick the one to set PXE on` which is a little concerning, but it fails no matter which interface is used
[12:03:22] let me look - that was with 2002?
[12:03:56] any of them unfortunately
[12:04:52] ok
[12:05:07] I was looking at 2002, the switch port is showing up (which ostensibly it's not based on error message)
[12:06:19] ok something odd is happening, the iDRAC web-gui for wikikube-ctrl2002 does indeed show both its network ports as DOWN
[12:06:27] despite the connected switch showing 'up'
[12:07:12] I've been following https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging so there's a chance something got messed up along the way
[12:07:36] I don't think so to be honest
[12:11:54] hnowlan: so you are just reimaging these and renaming right?
[12:12:10] they are in the same rack locations? no cabling etc would have been changed?
[12:12:55] 2002 port came up after I forced 'power on' the system, but only when it started the PXEboot bit
[12:13:09] can you try the sre.hosts.provision cookbook again?
[12:13:47] yeah no cabling changes or anything
[12:13:51] sure
[12:14:00] any specific host?
[12:14:02] ok yeah - so we should 100% expect the link here to work
[12:14:07] try 2002 as that's the one I powered on
[12:14:13] (and the port is showing 'up' on now)
[12:14:39] this may just be a discrepancy in the workflow
[12:15:22] volans: have you seen that error before?
[12:15:40] sre.hosts.provision cookbook reports "Unable to auto-detect NIC with link. Pick the one to set PXE on"
[12:15:54] the host was powered-down and iDRAC showing both ports as no link
[12:16:01] "Detected link on 2 interfaces" - looks better
[12:16:09] although still not sure which to pick heh
[12:16:10] I think that's bad too :P
[12:16:50] bios only sees one up
[12:16:52] https://usercontent.irccloud-cdn.com/file/4GXqI2BP/image.png
[12:17:31] gonna try NIC.Embedded.1-1-1 and see what happens
[12:17:46] hnowlan: we can probably skip running the provision cookbook tbh
[12:18:07] it sets the BIOS settings we need etc., these realistically should already be configured from before
[12:18:24] NIC.Embedded.1-1-1 is the correct one (always) yes
[12:19:15] it is doing some important-sounding stuff (setting DNS for the iDRAC)
[12:25:03] got another "Raised while handling: Not all changes were applied successfully, see the ones reported above that starts with "Updated value...""
[12:25:20] I do realise I am running this from eqiad for a codfw host, would that affect things?
[12:25:26] I don't actually see any failures in the logs
[12:25:46] in the settings updates that is
[12:25:48] I do see "First attempt to load the new configuration failed, auto-retrying once"
[12:30:00] https://phabricator.wikimedia.org/P62772
[12:40:11] https://salsa.debian.org/installer-team/netcfg/-/commit/f2c500af14c2487e8354dc243a373d20867980fe
[13:06:43] and the bug to ask backporting the change in bookworm https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071574
[13:28:12] topranks: sorry missed the ping, reading backlog
[13:30:47] XioNoX: nice!
[13:31:12] hnowlan: sorry I missed your responses earlier
[13:32:59] seeing both active is weird, it checks which one has a signal according to redfish api
[13:33:02] IIRC
[13:34:18] yeah exactly
[13:34:26] I guess my question was if the host has to be powered on for that
[13:34:43] once I did so manually the iDRAC GUI showed port 1 as "up"
[13:34:58] but then bizarrely the cookbook reported two as up 🤷‍♂️
[13:36:09] might depend also on the idrac version :
[13:36:10] :/
[13:37:42] yeah that's a point
[13:38:09] tbh I'm not sure for the typical "rename" workflow if we need to re-run the provisioning cookbook
[13:38:34] in theory all should be set from before, so maybe it shouldn't be on the list of steps
[13:38:49] I guess ideally it's best to run just to be sure
[13:55:59] the idrac hostname must be changed
[13:56:16] running the provision is overkill, the upcoming cookbook from arzhel will have just one patch call
[13:56:24] as I suggested 1h ago :D
[13:56:53] yeah, let's wait a bit for the cookbook
[14:36:10] so I can roll ahead without?
[14:39:14] I'd love to use the cookbook but we're a little time pressured on these :(
[14:41:54] hnowlan: which step did you do?
[14:42:58] Puppet, Wikidata, Wikidata Dev Team, wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9816912 (Arian_Bozorg) Is this still required here, trying to find a good spot for this task on the board during triage
[14:43:16] XioNoX: I'm at the point of doing `sudo cookbook sre.hosts.provision --no-dhcp --no-users $host` but it's failing with the errors in the paste above
[14:43:29] $host is one of wikikube-ctrl[12]00[123]
[14:44:40] hnowlan: I'd say try a re-image and see what's up
[14:45:11] if it works, we can manually update the idrac hostname
[14:46:22] and even that seems to already be correct `NIC.1#DNSRacName, has already the correct value: wikikube-ctrl2002`
[14:46:22] cool, thanks!
[14:49:27] mgmt dns is correct for all of them yes
[14:49:38] so just proceed to the reimage step I think
[14:55:14] Puppet, Wikidata, Wikidata Dev Team, wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9817031 (AndrewTavis_WMDE) Hey @Arian_Bozorg 👋 Yes, we do still need to check this out. I was thinking that @Lucas_Werkmeister_WMDE and I could disc...
[14:58:05] Puppet, Wikidata, Wikidata Dev Team, wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9817071 (AndrewTavis_WMDE) Ah looking at this, I'm realizing I restated myself as the work that's left in {T364965} is a duplicate of what we want t...
[15:01:02] Puppet, Wikidata, Wikidata Dev Team, wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9817102 (AndrewTavis_WMDE) So basically removing the wdcm.pp related file on GitHub and its Puppet workflows will close both tasks :)
[15:01:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9817104 (Jclark-ctr) a:Jclark-ctr
[15:06:13] no joy on the reimage. I see "Boot to PXE Boot Requested by iDRAC", it does a firmware check and then blank screen. No polling, no waiting to PXE. mgmt IP address looks okay etc
[15:07:14] hnowlan: it boots directly to the disk?
[15:07:56] XioNoX: not fully sure, it appears to want to PXE by all appearances but it never seems to try. There's nothing on the disk to boot to so probably?
[15:09:07] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9817181 (VRiley-WMF) a:Jclark-ctr→VRiley-WMF
[15:09:21] hnowlan: can you share a screenshot of the screen it's stuck at ?
[15:09:39] oh, blank screen..
[15:09:55] I'd suggest upgrading idrac if it hasn't been done first
[15:10:48] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9817194 (VRiley-WMF) Checked the switch, and reseated the cable. It seems to have come back up with no issues. Everything running normally.
[15:11:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9817199 (VRiley-WMF) Open→Resolved
[16:05:03] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9817621 (MoritzMuehlenhoff)
[18:04:26] CAS-SSO, collaboration-services, Infrastructure-Foundations, GitLab (Auth & Access), Release-Engineering-Team (Radar): Add GitLab to offboarding workflow - https://phabricator.wikimedia.org/T339843#9818338 (Pppery)
[18:04:56] CAS-SSO, Infrastructure-Foundations, SRE, Patch-Needs-Improvement: Update CAS to 6.6 - https://phabricator.wikimedia.org/T311235#9818340 (Pppery)
[18:05:13] CAS-SSO, netbox, Infrastructure-Foundations: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002#9818341 (Pppery)
[18:09:00] Puppet, Cloud-VPS, Patch-Needs-Improvement: role::puppetmaster::standalone clones Git repositories as gitpuppet, git-sync-upstream overwrites them as root - https://phabricator.wikimedia.org/T152059#9818362 (Pppery)
[18:12:16] Puppet, Release-Engineering-Team, Patch-Needs-Improvement: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277#9818380 (Pppery)
[18:13:01] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412#9818384 (Pppery)
[18:13:55] netbox, Puppet, Infrastructure-Foundations, observability, SRE: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272#9818385 (Pppery)
[20:23:56] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:20:17] Mail, collaboration-services, Infrastructure-Foundations, SRE, Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9819125 (jhathaway)