[01:25:40] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH)
[01:33:28] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH)
[07:23:18] hello folks
[07:24:01] there are some cp40xx nodes that failed to reload vcl, and I see corresponding pybal alerts for pooled but down
[07:26:28] ah I see some of them are in the decom process
[07:26:31] (just checked puppet)
[07:27:36] the confd alerts are related to 4037,4045,4047,4049
[07:29:29] I see new nodes, but for example 4049 should be in service (afaics from https://gerrit.wikimedia.org/r/c/operations/puppet/+/845550)
[07:33:37] ahhh
[07:33:37] Unused backend be_cp4049_ulsfo_wmnet, defined:
[07:33:41] ('/etc/varnish/wikimedia_upload-frontend.vcl' Line 209 Pos 9)
[07:33:44] backend be_cp4049_ulsfo_wmnet {
[07:37:04] and on cp4037
[07:37:05] Backend host '"cp4029.ulsfo.wmnet"' could not be resolved to an IP address:
[07:37:23] maybe a temporary weird state?
[07:38:29] seems so yes
[07:50:35] so two current issues 1) lvs/pybal is not happy about the deprecated nodes, at least for Icinga
[07:50:45] 2) vcl-reload failed on 4 nodes in ulsfo
[07:51:26] vgutierrez: --^
[07:57:18] pybal looks strange
[07:57:19] elukey@lvs4006:~$ curl http://localhost:9090/alerts
[07:57:20] CRITICAL - uploadlb6_80: Servers cp4033.ulsfo.wmnet....
[07:57:23] but in the logs
[07:58:11] Removing server cp4033.ulsfo.wmnet (no longer found in new configuration)
[07:58:16] (in the corresponding service)
[08:00:28] Thanks elukey
[08:00:33] I'll take a look asap
[08:12:29] vgutierrez: lemme know if I can help! Really curious now :)
[08:24:47] elukey: so I'm gonna let it stay like that and let b.black and s.ukhe handle it
[08:24:53] elukey: ulsfo is currently depooled
[08:25:54] work related to https://phabricator.wikimedia.org/T317247
[08:26:13] ack!
[08:27:10] and I've acked the alerts on icinga
[08:27:12] sorry about the noise
[09:05:57] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (AlexisJazz)
[09:29:11] HTTPS, Traffic, SRE, serviceops, and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Vgutierrez) acme-chief will deploy the unified cert shipping `wikifunctions.org` and `*.wikifunctions.or...
[09:29:49] Traffic, Beta-Cluster-Infrastructure, SRE, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (TheresNoTime) If someone could purge `https://upload.wikimedia.beta.wmflabs.org/wikipedia/en/thumb/1/13/Bert_Self-portrait2....
[09:41:31] Traffic, Beta-Cluster-Infrastructure, SRE, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (Vgutierrez) I've followed the steps mentioned by @TheresNoTime but sadly it didn't help at all. Please consider that varnish...
[10:01:22] HTTPS, Traffic, SRE, serviceops, and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Vgutierrez)
[11:06:12] sukhe: ^^
[11:06:21] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ssingh)
[11:06:32] sukhe: mainly my conversation with elu.key earlier today
[11:06:32] what did I do now!? :P
[11:06:40] icinga was screaming for lvs4xxx and several cp4xxx instances
[11:06:44] luca got worried
[11:06:52] oh
[11:06:56] I've acked the alerts on icinga
[11:06:57] sorry luca
[11:07:46] I wonder if this is related to the cookbook decomm failures
[11:07:48] but I will check shortly
[11:11:17] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4046.ulsfo.wmnet with OS buster
[11:11:38] were the hosts powered off?
[11:12:54] volans: yeah, so for say cp4035
[11:12:57] Failed to power off, manual intervention required: Remote IPMI for cp4035.mgmt.ulsfo.wmnet failed (exit=1): b''
[11:13:12] but they were powered off manually, at least as per my conversation with rob
[11:13:18] ok
[11:24:07] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4046.ulsfo.wmnet with OS buster executed with errors: - cp4046 (**FA...
[11:43:22] volans: if you were looking for a host to check for API changes, I think I have one for you :)
[11:43:31] that is, if you are interested and have the time
[11:43:36] it's failing to boot up: cp4046
[11:44:04] at least the console COM2 settings look correct
[11:45:04] sukhe: can it wait after lunch?
[11:45:09] of course yep
[11:45:18] I am going to move on to another host in the meantime
[11:45:19] great, then I'll look at it right after
[11:45:20] thanks
[11:45:25] so I will reboot it and leave it at the BIOS, you can attach and check
[11:45:33] ok
[11:45:35] the cookbook is stalling because the host is not booting anyway
[11:45:42] ack
[11:45:51] take your time, not urgent and thanks!
[12:18:25] Traffic, SRE: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (fgiunchedi) >>! In T321547#8343951, @BCornwall wrote: > Perhaps this is because the severity is set to warning rather than critical? For the IRC notifications to -traffic you are correct re: the severity, h...
[12:44:10] sukhe: so I can do anything to cp4046 correct?
[12:50:13] yes please
[12:50:38] ulsfo is depooled and additionally I reverted the cp4046 patch too so all yours
[12:50:44] ack thx
[13:02:55] sukhe: do you happen to have a host where the reimage worked?
[13:03:03] just to compare
[13:03:07] I don't have to disrupt it
[13:04:23] Traffic, SRE: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (fgiunchedi)
[13:04:23] volans: cp4038 worked perfectly
[13:04:31] great, thx
[13:37:38] sukhe: o/ no problem at all, I didn't notice ulsfo depooled (my bad) and saw lvs + confd alerts so I thought to ping :)
[13:39:36] sukhe: so, cp4038 has the same mgmt config as cp4046, the only "diff" is that cp4046 has an additional top level config for PCIeSSD.Slot.3-1 with only one item: {'PCIeSSDsecureErase': 'False'}
[13:39:41] that cp4038 has not
[13:40:22] for the rest they are identical across 3191 config settings
[13:40:36] excluding of course the MAC addresses ;)
[13:44:22] looks like cp4046 has 2 PCIeSSD slots and cp4038 just one
[13:44:42] that's expected, 4046 is upload so two disks and 4038 is text so one
[13:44:48] volans: I have a new data point, cp4039
[13:45:08] I am logged in to the BIOS and you can connect to the console but it has the incorrect serial COM settings
[13:45:21] it has "on without console redirection", should be "with"
[13:45:28] port is COM1, should be COM2
[13:45:35] redirection after boot should be disabled
[13:45:45] so yeah, these three, at the first glance
[13:45:56] all those are set manually by rob.h, as those settings are different from all the other servers we have and the cookbook doesn't yet know how to set them
[13:46:05] interesting
[13:46:36] I'm checking if it's an idrac version problem or an r450 problem or the combination of both
[13:46:44] thank you <3
[13:46:51] you can check cp4039 too
[13:46:52] I won't manually change the settings
[13:46:57] till you are done
[13:47:16] (cp4039 is sitting in the BIOS)
[13:49:29] let me grab real quick the settings and you can modify them
[13:49:54] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Manual_steps I have been following these
[13:51:25] sukhe: cp4039 all yours, I have the config, feel free to modify
[13:52:13] thanks volans!
[13:59:58] ok those new settings seem specific to r450, another r440 with even newer idrac firmware has the old ones
[14:02:10] yeah the COM ones should be specific to the r450
[14:02:17] though I am pretty sure we are missing something else as well
[14:02:23] I am looking
[14:02:33] the weird thing is of course how smooth cp4038 booted up
[14:03:29] the capability to boot and you to see the boot via console are independent though
[14:04:04] so I'd say to go step by step, first have all with the same bios configs
[14:04:15] yep, just following the list
[14:04:22] then see on case-by-case the ones that are not reimaging and why
[14:04:26] do they get DHCP?
[14:04:44] is pxelinux sent by the install server?
[14:05:47] right now for example with cp4039, I am at
[14:05:51] [52/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get uptime for cp4039.ulsfo.wmnet
[14:05:59] and the last update on the serial console is:
[14:06:09] boot:
[14:06:09] Loading debian-installer/amd64/linux... ok
[14:06:10] Loading debian-installer/amd64/initrd.gz...ok
[14:06:20] which does take a while but yeah, not ten minutes
[14:06:36] so I am now wondering why this is failing. any suggestions on where to look?
[14:08:24] so lpxelinux.0 was sent to cp4039 and you saw it
[14:08:44] the next thing could be the firmware
[14:08:50] are you using the buster+5.10?
[14:08:53] *image
[14:08:54] yep
[14:09:13] everything is the same as cp4038 in that sense
[14:09:18] but I guess the firmware, not sure
[14:09:24] robh said he updated it
[14:10:03] so yeah, I am guessing we can start there
[14:10:47] host is pingable, but port 22 is closed
[14:11:16] sukhe: there is a cookbook to upgrade firmwares, but not sure if they need a specific version
[14:11:47] yeah I have no idea about that all so I might as well wait for dcops/robh to come online and let them double check :D
[14:15:30] because ping works and telnet 22 not that would suggest that d-i has not been fully loaded
[14:15:40] but basic connectivity is ok
[14:46:03] sukhe: the firmware version of all the drivers in the System -> Inventory -> Firmware page are exactly the same between 4038, 4039 and 4046
[14:58:01] they have the 1.6.5 BIOS and I see there is the 1.7.5 on the dell website, but not sure if it could be part of the problem or not
[14:58:17] that's a totally different version from the one for the R440
[14:59:03] ok! thanks volans for the debugging
[14:59:12] I am doing cp4039 right now and will move on to the others
[14:59:14] 1.6.5 is from 24 May 2022
[14:59:24] 1.7.5 30 Sep 2022
[14:59:46] yeah but we haven't gone for 1.7.5 on the other cases that have worked, AFAIK
[15:00:17] the main issue in the 4039 case is that the manually-fixed serial redirection settings were not persisting when rebooting to reimage.
[15:00:32] they were fixed+saved again and stuck this time, but we don't really know why yet
[15:00:49] let's see how the other hosts go...
[15:00:55] I guess, since we really don't know why
[15:01:13] I suspect there could be some interaction going on between the manual fixups and the various automatons
[15:01:16] anyway I'm sending a patch to adapt the cookbook for the R450 special case
[15:01:21] we can try with that
[15:01:29] (re: settings to persist changes for one boot, and/or re-setting bad serial settings, not sure?)
[15:11:52] sukhe, bblack: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/849597
[15:23:21] netops, Infrastructure-Foundations, SRE, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (cmooney) Thanks, looking at the config, and reading some docs, we have it set up so it should not have any impact: ` c...
[15:43:13] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4040.ulsfo.wmnet with OS buster
[15:44:30] sukhe: I'm ready to re-run the provision cookbook with the patch, on which hosts should I run it?
[15:45:46] volans: cp4046 is a good candidate
[15:45:50] and thanks
[15:47:42] sukhe: according to the new cookbook cp4046 has already the correct values
[15:47:45] BIOS.Setup.1-1 -> SerialComm, has already the correct value: OnConRedir
[15:47:48] BIOS.Setup.1-1 -> SerialPortAddress, has already the correct value: Com2
[15:48:57] :]
[15:49:03] any other to try?
[15:49:10] I am reimaging cp4040
[15:49:16] so maybe try cp4041, just to check
[15:49:41] I think cp4040 might fail, so probably a good candidate to check
[15:49:42] let's see
[15:49:43] I can do both cp4038 and cp4039 if it doesn't disrupt anything
[15:49:58] both have already been reimaged and are working fine
[15:50:35] given that they were set manually ideally it should be run on all, to be sure we are consistent
[15:51:38] ok cp4041 it is
[15:52:01] ok :)
[15:52:52] cp4041 needs the fix for the serial... the cookbook is changing it
[15:52:57] ah!
[15:53:10] cp4040 probably as well but let's wait for cp4041 to finish
[15:53:15] you can test cp4040 next please
[15:53:23] [43/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get uptime for cp4040.ulsfo.wmnet
[15:53:27] don't think this is coming up either
[15:54:22] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4040.ulsfo.wmnet with OS buster executed with errors: - cp4040 (**FA...
[15:55:06] ok, doing cp4040 in parallel
[15:55:43] sukhe: ok to proceed with cp4040 too?
[15:56:15] just a minute, I will let you know
[15:56:18] ok
[15:57:00] ok I was booting into BIOS
[15:57:04] the serial settings look fine
[15:57:12] but let's try what the cookbook says
[15:57:14] go ahead with cp4040
[15:57:26] ack
[15:58:27] confirmed, cp4040 no changes
[15:59:39] cp4041 finished now and was supposedly fixed
[15:59:57] hmm ok
[16:00:05] thanks
[16:00:11] I will continue to debug cp4040 then
[16:00:16] maybe something else is missing
[16:00:22] thanks for the help!
[16:00:27] it's been a jounry
[16:00:29] journey
[16:17:40] Traffic, SRE, Patch-For-Review, Performance-Team (Radar), Sustainability (Incident Followup): Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (BBlack) a:BBlack Updates! Since this ticket was last active, there's been progress on various fronts with th...
[16:28:56] (HAProxyEdgeTrafficDrop) firing: 31% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:38:56] (HAProxyEdgeTrafficDrop) resolved: 63% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:43:20] Traffic, SRE: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (BCornwall) Seems reasonable to me. It looks like the alert fired as you demonstrated.
[17:32:03] netops, Infrastructure-Foundations, SRE: Netbox network report failing - timeout error getting connected_endpoint prefix - https://phabricator.wikimedia.org/T321704 (cmooney) a:cmooney
[17:48:08] netops, Infrastructure-Foundations, SRE: Netbox network report failing - timeout error getting connected_endpoint prefix - https://phabricator.wikimedia.org/T321704 (cmooney) Hmm... so I went back a moment ago to look at this when I got some time, and of course the report has re-run and completed ok....
[20:05:48] netops, Infrastructure-Foundations, SRE: Netbox network report failing - timeout error getting connected_endpoint prefix - https://phabricator.wikimedia.org/T321704 (Volans) This seems related to the Netbox slowness that we've seen recently that @ayounsi was looking at, but no smoking gun was found s...
[20:16:29] netops, Infrastructure-Foundations: Investigate why frmon-codfw.wikimedia.org is not accessible from untrust zone. - https://phabricator.wikimedia.org/T321735 (Jgreen)
[22:50:08] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4048.ulsfo.wmnet with OS buster
[23:23:50] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4048.ulsfo.wmnet with OS buster executed with errors: - cp4048 (**FA...
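
Note on the serial-settings check discussed around 15:47: the log shows the patched provision cookbook reporting, per BIOS attribute, whether a host already has the desired value. A minimal Python sketch of that kind of comparison might look like the following. This is not the code from the cookbook patch; the function name `plan_changes`, the `DESIRED` table, and the shape of the input dict are all hypothetical. Only the attribute group `BIOS.Setup.1-1`, the keys `SerialComm` and `SerialPortAddress`, and the values `OnConRedir` and `Com2` come from the log. The "bad" values in the example are placeholder strings, not confirmed iDRAC values.

```python
# Hypothetical illustration only; not taken from the cookbook patch.
# Desired serial console settings, as reported in the log for the R450 hosts.
DESIRED = {
    "BIOS.Setup.1-1": {
        "SerialComm": "OnConRedir",
        "SerialPortAddress": "Com2",
    },
}


def plan_changes(current, desired=DESIRED):
    """Compare current settings with desired ones.

    Returns a tuple (changes, messages): `changes` holds only the
    attributes that need to be set, `messages` has one line per attribute.
    """
    changes = {}
    messages = []
    for group, settings in desired.items():
        have = current.get(group, {})
        for key, want in settings.items():
            got = have.get(key)
            if got == want:
                messages.append(
                    f"{group} -> {key}, has already the correct value: {want}"
                )
            else:
                changes.setdefault(group, {})[key] = want
                messages.append(
                    f"{group} -> {key}: {got!r} will be changed to {want!r}"
                )
    return changes, messages


if __name__ == "__main__":
    # Placeholder values standing in for a host with wrong serial settings.
    example = {
        "BIOS.Setup.1-1": {
            "SerialComm": "placeholder-bad-value",
            "SerialPortAddress": "Com1",
        }
    }
    changes, messages = plan_changes(example)
    for line in messages:
        print(line)
```

Returning the planned changes separately from the messages keeps the check free of side effects, so the same comparison could be run on every host (as suggested at 15:50:35) before anything is written to the BIOS.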