[06:39:34] FIRING: DiskSpace: Disk space seaborgium:9100:/ 5.585% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:49:34] RESOLVED: DiskSpace: Disk space seaborgium:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=seaborgium - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:52:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:34:26] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10439908 (elukey) Open→Resolved a:elukey I think that we can declare this task completed, we are s...
[08:44:02] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2024.11.30 - 2024.12.20): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10439961 (Gehel)
[10:16:21] SRE-tools, Infrastructure-Foundations, SRE, Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#10440092 (elukey) @Volans is this currently an active issue?
[11:41:13] I have a move-vlan issue with wikikube-worker2022.codfw.wmnet. I used the reimage cookbook with --vlan. During the move-vlan part netbox had a timeout and the cookbook failed. A retry says no vlan migration is needed (cookbook sre.hosts.move-vlan preflight wikikube-worker2022 returns the same).
[11:41:14] Can somebody double check this? The IP address and subnet looks old to me (/22) but it's already in on lsw1-c6-codfw connection. So I guess the cookbook failed before updating the IP?
[11:41:14] I also don't get a DHCP response when trying to PXE boot which might be related.
[13:30:40] jelto: o/ still having the issue or did you manage to fix?
[13:34:42] I continued with other wikikube-worker hosts (which work fine so far). Last time wikikube-worker2022 was not getting a DHCP response and was still in the wrong subnet (I think). I can retry the host if you want
[13:38:47] nono lemme check its status, I am not very familiar with the move-vlan but I can check
[13:40:45] great thanks a lot! It's not super urgent, there are a few hundred other wikikube nodes. But at some point this host has to be reimaged properly as well.
[13:41:08] jelto: then let's open a task if you don't mind, so we have it tracked
[13:42:10] I'll open a task in a sec and subscribe you
[13:52:12] from the changelog nothing changed https://netbox.wikimedia.org/dcim/devices/2491/changelog/
[13:53:08] I'll run the cookbook to check
[13:57:36] ah I see what is happening
[13:57:44] when the move is not required, we do RuntimeError('The host is not suitable for the migration, see above.')
[13:58:05] that in turn causes the reimage cookbook to fail
[13:58:31] I found the python stack trace in the cookbook logs, let me create the task, one sec
[14:05:13] no ok something is weird
[14:05:19] `sudo cookbook sre.hosts.move-vlan reimage wikikube-worker2022` does PASS
[14:05:34] it is the pre_flight() that fails
[14:05:56] better: it was probably the pre_flight() that failed during reimage, due to the netbox timeout
[14:06:11] I opened https://phabricator.wikimedia.org/T383228 :)
[14:06:52] so wikikube-worker2022 does not need a move-vlan at all? And the /22 subnet is fine for the host?
[14:07:11] jelto: yes I think you can kick off the reimage again, it should work just fine
[14:07:27] I'll retry, give me a moment
[14:08:33] super
[14:08:44] I updated the task, if it works we can close
[14:22:59] reimage is failing too, see my update in T383228#10441028. Probably it's more a PXE issue instead of move-vlan? The cookbook is still running but it failed to PXE boot and the host is already booted the old OS from disk
[14:22:59] T383228: wikikube-worker2022 move vlan failed due to netbox timeout - https://phabricator.wikimedia.org/T383228
[14:31:54] jelto: trying to reprovision it, I am wondering if the NIC is not set up correctly for PXE boot
[14:32:02] but can you confirm that the move-vlan issue is gone?
[14:34:44] move-vlan returns "Server not in a vlan requiring a migration, nothing to do. 👍", so no issue on that side
[14:34:55] okok
[14:35:01] in provision i see
[14:35:01] ==> Detected link on 2 interfaces. Pick the one to set PXE on:
[14:35:04] ['NIC.Embedded.1-1-1', 'NIC.Embedded.2-1-1']
[14:35:09] that is already strange
[14:35:20] link on two interfaces?
[14:36:40] I'd have to compare that with the other machines. I'm not sure what the default config looks like
[14:36:40] jayme do you know how the default looks like. I think you poked around in PXE settings for a few stuck hosts as well?
[14:37:25] yeah we did it for the wikikube control planes a month ago IIRC
[14:37:37] but in that case, the PXE settings were on the wrong NIC
[14:37:47] maybe this is the case, but usually the link in provision is not found
[14:37:54] never seen it in two NICs :D
[14:38:11] setting PXE now on NIC.Embedded.1-1-1
[14:38:18] then I'll manually check via console
[14:39:55] Skipped set of attribute NIC.Embedded.1-1-1 -> LegacyBootProto, has already the correct value: PXE
[14:39:58] Skipped set of attribute NIC.Embedded.2-1-1 -> LegacyBootProto, has already the correct value: NONE
[14:40:04] so maybe it is 2-1-1
[14:40:24] we can also try that :)
[14:41:53] I want to check the console first, and then possibly ping the dcops folks to ask for a quick verification of cabling
[14:45:51] ok thanks! let me know if you need a second pair of eyes. I'll not modify pxe settings or reboot the host any more if you are checking the config. My cookbook is still running though and waiting for the debian installer to start
[14:46:14] jelto: you can kill it
[14:47:06] jelto: sorry, was afk. But you figured it out I guess
[14:47:51] elukey: you mean ctrl+c the reimage cookbook?
[14:48:26] yes exactly, if it is stuck waiting for d-i yes
[14:48:31] it will trigger a rollback
[14:48:59] I stopped the cookbook
[14:53:43] thanks! for some reason I can't run the provision cookbook, it always return a failure when uploading the new config
[14:53:46] weird
[15:05:15] ok so only one NIC shows a link up, and it is the one set for PXE (1-1-1)
[15:05:24] so far it looks good, not sure what's wrong
[15:05:46] jelto: can you kick off another reimage so I can check from the console?
[15:06:15] yes sure, I'll start the reimage now
[15:09:43] the cookbook forced a PXE boot and is waiting at the moment
[15:10:59] yeah it failed to dhcp yes, nothing new
[15:11:01] you can kill
[15:11:28] ack, I stopped the cookbook
[15:18:09] jelto: ah! I recalled that we had some issue in the past, and I found
[15:18:12] "FYI, I had to use --use-http-for-dhcp for wikikube-worker13[21-27] that are plugged in on their embedded gigabit NIC."
[15:18:24] since we now default to TFTP only
[15:18:30] can you try with the extra setting?
[15:19:18] Puppet, cloud-services-team, Cloud-VPS: Preserve formatting and comments etc. in ENC Hiera - https://phabricator.wikimedia.org/T250622#10441312 (Andrew) I would very much like this to work and I also don't immediately know how to do it :(
[15:19:53] Puppet, cloud-services-team, Cloud-VPS: Preserve formatting and comments etc. in ENC Hiera - https://phabricator.wikimedia.org/T250622#10441315 (joanna_borun) p:Triage→Medium
[15:21:58] a good hint, I'll try with --use-http-for-dhcp now
[15:22:40] I don't recall why the embedded gigabit nic doesn't support tftp
[15:22:53] but in case, we may want to rethink about tftp only defaults
[15:43:11] Mail, Infrastructure-Foundations, observability, Observability-Logging, Sustainability (Incident Followup): Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171#10441474 (herron) Open→Resolved a:herron Cleaning up old tasks
[15:58:13] anyone using sretest2001? I would like to re-image to try and reproduce the supermicro tcp bug
[15:58:36] o/ go ahead!
[16:00:12] thanks!
[16:02:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:58:56] jelto: did it work?
[16:59:04] (curious, otherwise we can do it tomorrow)
[16:59:22] no unfortunately not, same error. I answered in the task
[17:00:01] ok I'll try to recheck tomorrow, very strange
[17:00:36] thanks a lot :) And yes as I mentioned it's not super urgent and luckily the only host in wikikube codfw so far.
[17:02:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:36:26] netops, Ceph, Infrastructure-Foundations, SRE, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10441947 (dcaro) @cmooney what's needed to get this rolling? I'll make time whenever you are able :)
[20:30:41] o/, I would like to run homer in eqiad but I'm seeing `- as-path NTT-VERIZON "^2914 701$";` in the diff, is it OK to proceed?
[20:34:37] hmm, I'm not sure why that was added kamila_, cdanis do you know?
[20:34:55] jhathaway: I don't, I suspect it was probably a netops doing it, as it looks like outbound traffic engineering
[20:35:05] no they don't :D
[20:35:26] :)
[20:35:42] I uh
[20:35:47] do you have router access kamila_ ?
[20:36:03] how do I find out? :D
[20:36:05] no, I don't
[20:36:05] there is a read-only mode you can get :)
[20:36:13] or I didn't last time I asked if I do :D
[20:36:59] you could for instance copy jayme's stanza for read-only access https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/config/common.yaml#129
[20:37:01] if you wanted :)
[20:37:15] kamila_: which crX-eqiad is that on?
[20:37:39] cr1
[20:39:22] cdanis@re0.cr1-eqiad> show system commit
[20:39:24] 0 2024-12-29 22:02:42 UTC by cmooney via cli commit synchronize
[20:39:34] yeah okay that makes sense
[20:39:45] one of the N network things that went wrong over the holidays
[20:39:52] oh, right
[20:40:22] cdanis: yeah good catch, this should be the Jimmy Carter thing
[20:40:51] kamila_: if you want access send me a homer-public patch adding a key and I'll stamp
[20:41:08] I'm worried that by having router access, I'll end up using it XD
[20:41:25] ...ok thanks :D
[20:41:36] cdanis@re0.cr1-eqiad> show configuration |compare rollback 1
[20:41:38] [edit policy-options as-path-group AVOID-PATHS]
[20:41:40] as-path TI { ... }
[20:41:42] + as-path NTT-VERIZON "^2914 701$";
[20:41:44] so yeah, it was that
[20:41:47] mhm
[20:41:59] (note that adding a new key is quite the pain)
[20:42:18] sukhe: is it really? don't you just do it from cumin?
[20:42:29] so probably have cdanis or me do the immediate thing
[20:42:50] cdanis: last I did it, I had to sit there and type yes for a lot of devices
[20:42:54] haha hmm
[20:43:02] I think we probably don't need that traffic engineering in place anymore and we can just roll forward with the from-homer config
[20:43:11] I can also leave this for tomorrow, I can just add silences for my hosts, but I'd prefer not to because eqiad is already a bit tight
[20:43:43] and yes, I would also assume that we don't need that anymore
[20:43:53] but it's my night, so :D
[20:44:16] I'll keep an eye and re-add it if necessary
[20:44:26] ok, thanks a ton cdanis <3
[20:46:01] thanks folks
[20:46:21] haha sorry for summoning you!
[20:46:42] sorry that had slipped my mind, I'll re-add it tomorrow via a homer patch it's probably worth having there but not at all important
[20:46:50] topranks: if you'd just stop showing up at all hours we wouldn't've had this problem today ;)
[20:47:20] :D
[20:47:35] haha
[20:47:49] nor if I'd started my host juggling at a reasonable hour, sorry about that '^^
[20:51:12] no probs Kamila, apologies for the inconvenience
[20:52:15] no worries, thanks for saving things over the holidays <3
[21:08:08] Mail, Infrastructure-Foundations, Trust-and-Safety: Mail from Bishzilla to emergency@wikimedia.org is possibly getting lost - https://phabricator.wikimedia.org/T338032#10442771 (mdaniels5757) This seems to still be a problem: my emails (10:38am and 11:50am today, US Eastern Time) got seemed to get l...
[23:07:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[23:17:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[23:17:45] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:27:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting