[09:02:48] fyi, in the netbox-deploy Gerrit repo: the master branch has been deleted, the wmf-next branch has been renamed main, and a new dev branch has been created. deploy1002 checkout has been updated accordingly
[09:31:16] SRE-tools, SRE: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9858600 (Ladsgroup) Hi, clinic duty again. Can you tag it with a team? Wouldn't I/F be okay here?
[09:38:02] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9858611 (MoritzMuehlenhoff)
[10:21:03] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9858744 (MoritzMuehlenhoff)
[11:07:14] XioNoX: Hey :) I created https://phabricator.wikimedia.org/T366583 for the firmware upgrade issue I encountered yesterday while renaming mw1358, there's the version information in there if you want to add a check to the rename cookbook
[11:08:43] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:49] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:38:07] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9859201 (MoritzMuehlenhoff)
[13:06:20] hi foundations friends, does anyone know about number_of_facts_soft_limit in the puppet agent config?
[13:07:55] no idea :(
[13:08:12] cdanis: is it ok if I roll reboot aux k8s for the kernel updates?
[13:08:16] elukey: always
[13:08:24] and thank you <3
[13:08:35] all right will do it now.. anything strange/fancy that I need to be aware of?
[13:08:38] nah
[13:08:58] "After a while, in the middle of a fire"
[13:09:04] okok proceeding :D
[13:12:36] cdanis: I do yes, there's an older patch of mine https://gerrit.wikimedia.org/r/c/operations/puppet/+/972357, but the current line of thinking is to:
[13:12:44] a) keep the default value
[13:12:59] b) make it overrideable in Hiera where/if needed
[13:13:23] c) check those roles where it triggers whether these facts should get filtered in puppetdb ingestion
[13:13:36] it's just a harmless warning, there's no functional impact
[13:13:57] I think we have a task for it, need to dig in older tasks around the start of the Puppet 7 migration
[13:15:17] thanks, that's very helpful
[13:25:45] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:56:57] SRE-tools, SRE: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859741 (Volans) @Ladsgroup no, not really. It should be one of the owners of the systems with raid0 that are interested in automating this step. So I guess `o11y` in this...
[14:23:43] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:36:55] SRE-tools, observability: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859944 (Ladsgroup) Done. Thanks.
[14:36:56] SRE-tools, observability: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859947 (Ladsgroup)
[14:39:49] aux is of course not supported (yet) in the k8s reboot nodes cookbook, tried to add support via https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1038782
[14:53:28] volans: https://wikitech.wikimedia.org/wiki/Ncmonitor
[15:14:37] qq - trying to test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1037573 (sre.host.provision, no op refactoring) and using sretest1001 I managed to run the cookbook but with --no-dhcp --no-users, since the status is active
[15:14:42] (in netbox)
[15:14:59] tried to change the status to planned, and started the cookbook again, but the test-connection step fails
[15:15:47] is there another way that you can think of to test the whole cookbook, without the need of a new host?
[15:16:07] (super ignorant about these workflows so it may be that the issue is a PEBCAK)
[15:19:53] elukey: your aux cookbook patch lgtm
[15:20:02] <3
[15:51:51] elukey: unless we reset the idrac/bios to factory reset no, but your changes were mostly for the central part of the cookbook not involving dhcp or the users, so don't worry
[15:52:05] if it works with --no-users --no-dhcp that's good enough for me
[15:53:30] brett: that wikipage doesn't tell which operations you're trying to automate, why you want to automatically make modifications to manual repos like the puppet and dns one, etc... A bit more context would be useful
[15:54:04] volans: ack perfect, I wanted to know if there was a trick for future tests :)
[15:54:08] merging then
[15:54:10] usually it's a bad idea to have mixed automatic/manual changes in the same repos, for example all the automatic dns bits that come from netbox are in a different repo that is automatically generated and their files included in the main dns repo
[15:58:04] volans: it's creating CRs, not submitting them. There will be manual approval. As for mixing, operations/dns is just getting symlinks added/removed, nothing in terms of formatting. acme-chief's hieradata entries have been moved to their own files and will be automatically managed. nc_redirects.dat is on its way to being automatically managed and won't be manually managed any more
[15:58:27] So the gerrit user needs CR creation privs if you're the one to approach for this
[16:01:58] any gerrit user can create patches to any repo AFAIK
[16:28:35] hello world, I probably broke something... needed to move wikikube-ctrl1001 to a new rack and now I can't ssh to the mgmt interface and ipmitool is failing
[16:30:42] I believe netbox is correct and I ran sre.dns.netbox, dcops says it looks okay on their side
[16:35:47] (status is it's in the new rack, netbox is updated and I was going to reimage it next)
[16:41:04] kamila_: from where to where did it get moved?
[16:41:44] volans: D6(?) to D7, because it needs 10G network
[16:41:53] did it change IP?
[16:41:57] yes
[16:41:58] I mean mgmt IP
[16:42:06] I don't think that was supposed to change
[16:42:13] but lemme doublecheck with dcops
[16:42:39] yes it shouldn't
[16:42:43] but I'm asking if it changed
[16:44:47] volans: yes, actually
[16:44:50] looks like it did
[16:44:53] why?
[16:44:59] the idrac has the old IP ofc
[16:45:06] that's why you can't connect
[16:45:12] if no one changed it
[16:45:12] well that'd explain it :D
[16:45:18] can the provision script fix that?
[16:45:23] s/script/cookbook/
[16:46:37] right now in a hacky way... potentially, knowing the old IP
[16:46:56] but the problem is that moving from one rack to another shouldn't change mgmt IP, period
[16:46:59] so I'd revert the change
[16:47:04] and it will work again
[16:47:31] wdym revert the change, move it back to the old rack?
[16:47:36] oh, IP in netbox?
[16:47:37] sorry, brain fried
[16:47:48] (I probably owe someone a beer)
[16:47:51] volans: you can try to ping the old IP first to make sure it's still live
[16:50:01] kamila_: why was the host decommissioned?
[16:50:16] that's what removed the IP
[16:50:29] oh
[16:50:31] and then you got a new one
[16:50:34] because the wikitech page told me to?
[16:50:40] which one?
[16:50:44] this wasn't a rename
[16:50:54] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs
[16:51:17] I had the --keep-mgmt-dns, I double-checked
[16:55:56] 2024-06-04 14:01:14,210 kamila 509280 [INFO decommission.py:57 in update_netbox] Skipping removal of DNS names on interface mgmt
[16:56:10] but then it was removed in the dns.netbox run
[16:56:22] hm
[16:56:23] dunno, but I have to step out right now, sorry
[16:56:34] can check it later on
[16:56:45] ok, thank you volans
[16:56:56] if you can restore the previous mgmt IP in netbox and run the sre.dns.netbox cookbook I *think* it should be enough
[16:57:01] ok, thanks a lot
[16:57:02] but I'd like to understand what happens also
[16:57:04] yeah
[16:57:11] well I have 5 more moves to go :D
[17:12:30] effie: thanks for looking at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528. If I can answer any questions about it or the broader context, please let me know
[17:33:20] kostajh: will merge tomorrow, I just had a very quick look before I signed off
[17:57:56] ack
[18:15:45] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:23] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9861173 (cmooney)
[18:19:44] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9861174 (cmooney)
[18:29:00] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9861186 (cmooney)
[18:30:09] volans: Are you the person to talk to regarding getting that sort of account set up?
[18:46:03] brett: in general it's the releng team that owns gerrit. I know there are some other bots already
[18:46:46] damn, I'm so bad at getting the right people. Thanks so much for noticing the question, cdanis :)
[19:13:44] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed