[03:15:06] FIRING: [9x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[03:33:33] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[07:15:06] FIRING: [9x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:33:33] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[08:49:51] FIRING: [9x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:52:06] moritzm: o/
[09:52:36] I am trying to reimage sretest2006, the new node with a BOSS card exposing two nvmes as a single raid1 dev
[09:52:59] still getting the "no root partition defined" error during d-i
[09:53:38] I did some manual hacks on apt1002 to tune the hwraid1-1dev config a bit, replacing /dev/sda with /dev/nvme0 etc...
[09:53:44] but I end up with the same error
[10:00:44] is the host still up? I can SSH into it and see if I can see anything obvious
[10:05:27] it is in d-i at the moment
[10:05:38] in the d-i's shell
[10:05:53] if you want to jump to the mgmt console I can log out
[10:07:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:08:16] I am checking /var/log/partman and syslog
[10:11:21] you can stay on the console, having a look via SSH now
[10:12:21] the device is called /dev/nvme0n1, not nvme0, though?
[10:12:51] influenced by the great NIC naming scheme of systemd or so :-)
[10:12:52] I tried that one too IIRC, but it failed
[10:13:02] I can retry, maybe in my hacks I missed it
[10:13:24] let me quickly check the logs first
[10:13:36] does apt1002 still have the hacked config from the boot attempt?
[10:14:33] yep I called it the same as hwraid1-1dev but with -nvme at the end
[10:14:41] I just changed it with your suggestion
[10:14:49] ok
[10:15:42] let's try it with the updated partman recipe
[10:17:24] ack doing it
[10:22:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:18] no, same error
[10:23:55] let me check via SSH with the fresh config
[10:27:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10978400 (cmooney) Open→Resolved Link remains stable, closing task.
[10:33:09] elukey: I think I found the error, can you reboot to see if it works now?
[10:33:36] or kick off the reimage cookbook, not sure how it was initiated
[10:34:34] hwraid-1dev-nvme.cfg on apt1002 set grub-installer/bootdev to "nvme0n1", but it needs to be "/dev/nvme0n1"
[10:34:46] I edited it on the apt1002 copy
[10:36:04] my bad, I misedited the file, testing
[10:40:10] I cleaned up a few timers on puppetserver2003, the alert from above should settle soon
[10:41:00] moritzm: progress! Now it complains about a missing EFI partition
[10:41:15] I'll have a look over SSH
[10:42:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:51:57] I think the root problem is that on sretest2006 /var/lib/partman/devices is empty, it's the file which is read by partman-efi eventually
[10:57:18] totally ignorant about it, TIL
[10:58:12] going to get a quick lunch, I'll try to follow up in a bit, thanks a lot for the hints
[10:58:26] it is not super urgent, but it would be nice to make it work
[10:58:32] no idea if we plan to use BOSS cards more
[10:59:16] enjoy lunch, we can pick this up when you're back
[11:28:06] back!
[11:33:33] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[11:37:02] https://tracker.debian.org/news/1605955/accepted-partman-efi-108-source-into-unstable/ fixed various EFI partman bugs, we could try to reimage sretest2006 with trixie to rule these out
[11:37:37] our config seems all correct, so it seems like something mis-detects the EFI partition internally
[11:41:33] lemme try
[11:51:40] I need to upgrade the trixie image to rc2, the current one is no longer compatible with the udebs fetched
[11:56:05] ah okok, I got stuck in d-i
[11:56:22] I'll wait
[12:00:47] I've updated the image, but we'll need a puppet run on apt1002. Which files did you edit? Then we can copy them aside and put them back after Puppet has run
[12:01:40] ah only preseed for 2006, nothing big
[12:02:20] ok, then I'll re-enable puppet now on apt1002
[12:02:57] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978626 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1003 depool for host wikikube-worker2042.c...
[12:03:51] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978627 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1003 depool for host wikikube-worker2046.c...
[12:03:52] netops, Infrastructure-Foundations: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978629 (akosiaris) >>! In T398433#10974433, @ayounsi wrote: > Sweet, what about 12:00UTC on Monday 7th ? wikikube-worker204[26] have been d...
[12:04:47] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978630 (akosiaris)
[12:08:08] netops, Infrastructure-Foundations, SRE: DNS resolution not working on Juniper virtual-chassis switches eqiad - https://phabricator.wikimedia.org/T398690#10978636 (cmooney) Open→Declined Gonna close this one for now, we only have a small number of these switches left and we are planning t...
[12:09:26] elukey: puppet has run on apt1002, you can re-add the config change and then we can kick off a fresh reimage
[12:13:06] doing it
[12:15:16] started
[12:20:36] meh, same error sadly
[12:22:05] same error, but I tried to keep going with the reimage, to see if it eventually fails at boot or not
[12:22:09] just as a try
[12:22:13] ok
[12:26:02] hmmh, grub-install says "cannot find EFI directory"
[12:30:36] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978750 (ayounsi) Open→Resolved a:ayounsi {F63349871} Much better. Thanks for the depool, you can repool th...
[12:34:55] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978765 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1003 pool for host wik...
[12:35:36] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978766 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1003 pool for host wik...
[12:46:01] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978804 (akosiaris) wikikube workers repooled.
[12:47:32] netops, Infrastructure-Foundations, serviceops: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433#10978810 (Ladsgroup) db2146 is also repooling
[12:50:06] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:08:48] with production debmonitor also moved to 0.6.6, should I go ahead with the change that moves the docker-report to build2002?
[13:12:14] moritzm: fine for me, just a heads up that I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1166819 (still pcc-testing/reviewing it) that absents the old timers
[13:12:24] the registry-based
[13:12:31] but you can go ahead
[13:30:46] very weird, I am getting unauthorized now when I try to run docker report for staging-eqiad and ml-staging-codfw
[13:31:05] ok, I'll merge this tomorrow!
[13:47:56] interesting
[13:47:57] authentication.go:73] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2025-07-07T13:25:12Z is after 2025-07-05T11:36:00Z,
[13:48:27] (7297 minutes ago). Puppet is disabled. elukey - testing docker-repor
[13:48:32] * elukey cries in a corner
[13:48:49] found the bug!
[14:00:10] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10979139 (Arnoldokoth)
[15:11:25] FIRING: SystemdUnitFailed: sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:33] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:44:28] ^ also cleaned up sync-puppet-ca.timer on puppetserver2003
[15:46:25] RESOLVED: SystemdUnitFailed: sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:26] CAS-SSO, Infrastructure-Foundations: CAS not letting new Toolsbeta-logging developer account log in - https://phabricator.wikimedia.org/T397651#10979946 (SLyngshede-WMF) > There is a second cn=toolsbeta-logging object in ou=projects,dc=wikimedia,dc=org for the service project for the same purpose, but th...
[16:08:08] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10979971 (Jgreen) Duplicate→Resolved
[16:50:06] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:42:33] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980408 (Dzahn)
[18:42:39] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980413 (Dzahn)
[18:42:49] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980415 (Dzahn)
[18:43:01] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980417 (Dzahn)
[18:43:13] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980419 (Dzahn)
[18:43:25] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980421 (Dzahn)
[18:43:45] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980425 (Dzahn)
[19:28:16] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980634 (Jhancock.wm) looks like part of the problem was a tripped breaker in D3. still investigating the rest and checking ser...
[19:33:33] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[19:44:38] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980766 (Jhancock.wm) reset the tripped breaker in D3. On the secondary switch. No indication of a similar issue in D8. possi...
[20:00:56] FIRING: MaxConntrack: Max conntrack at 80.14% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:05:55] RESOLVED: MaxConntrack: Max conntrack at 84.11% on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:50:06] FIRING: [8x] PuppetConstantChange: Puppet performing a change on every puppet run on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[23:32:09] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10981528 (Dzahn)
[23:33:33] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/core/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts