[06:21:50] serviceops, Shellbox: Code in Shellbox specific to WMF production - https://phabricator.wikimedia.org/T357949#9571044 (Legoktm) >>! In T357949#9566576, @Joe wrote: > I don't see a good solution for than other than maintaining separate branches and backporting changes to the main branch to a `wmf` branch,...
[07:39:35] serviceops, Similarusers: Remove similar-users service from k8s - https://phabricator.wikimedia.org/T345274#9571082 (Joe) >>! In T345274#9267674, @kostajh wrote: >>>! In T345274#9267646, @JMeybohm wrote: >>>>! In T345274#9131624, @kostajh wrote: >>> @Niharika @Tchanders any concerns with this? >> >> Ca...
[07:58:56] serviceops, Shellbox: Code in Shellbox specific to WMF production - https://phabricator.wikimedia.org/T357949#9571094 (Joe) Before we move into finding solutions, I'd like to understand better what is the goal we want to accomplish: * If the goal is to make what we upload to packagist cleaner, `composer....
[07:59:14] serviceops, Shellbox: Code in Shellbox specific to WMF production - https://phabricator.wikimedia.org/T357949#9571095 (Joe) p:Triage→Low
[10:24:52] serviceops, MW-on-K8s, SRE, Scap, Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9571471 (Clement_Goubert) >>! In T358117#9570020, @thcipriani wrote: > ... > Running httpbb against an mwdebug server before roll...
[11:55:55] serviceops, Datacenter-Switchover: SRE comms for Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T358286#9571724 (jijiki)
[12:15:47] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571765 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2351.codfw.wmnet with OS bullseye
[12:15:54] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571766 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2353.codfw.wmnet with OS bullseye
[12:16:00] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2382.codfw.wmnet with OS bullseye
[12:16:03] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571769 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2419.codfw.wmnet with OS bullseye
[12:16:09] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571770 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2426.codfw.wmnet with OS bullseye
[12:16:11] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571772 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2444.codfw.wmnet with OS bullseye
[12:16:13] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571771 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2428.codfw.wmnet with OS bullseye
[12:52:32] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571861 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2353.codfw.wmnet with OS bullseye completed: - mw2353 (**PASS**)...
[12:55:25] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571864 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2444.codfw.wmnet with OS bullseye completed: - mw2444 (**PASS**)...
[12:57:43] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571867 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2351.codfw.wmnet with OS bullseye completed: - mw2351 (**PASS**)...
[13:04:52] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2382.codfw.wmnet with OS bullseye completed: - mw2382 (**WARN**)...
[13:08:09] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2426.codfw.wmnet with OS bullseye completed: - mw2426 (**WARN**)...
[13:08:44] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571875 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2428.codfw.wmnet with OS bullseye completed: - mw2428 (**PASS**)...
[13:10:04] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571879 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2419.codfw.wmnet with OS bullseye completed: - mw2419 (**WARN**)...
[13:19:43] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571908 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2394.codfw.wmnet with OS bullseye completed: - mw2394 (**WARN**)...
[15:32:00] o/ hi from ml-team, I need some help with a 500 error when CI is pushing the model image. here is the log: https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual-publish/69/execution/node/59/log/
[15:32:13] the new image size increased ~2G, wondering if the error is due to hitting a layer limit
[15:35:30] serviceops, WMF-JobQueue, Patch-For-Review, Unstewarded-production-error, and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745#9572316 (Clement_Goubert) p:High→Unbreak!
[15:36:26] serviceops, WMF-JobQueue, Patch-For-Review, Unstewarded-production-error, and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745#6414422 (Clement_Goubert) Raising this to UBN, we're definitely losing too many jobs.
[15:40:46] serviceops, WMF-JobQueue, Patch-For-Review, Unstewarded-production-error, and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745#9572342 (Clement_Goubert) 22.8k errors in the last 3 days, for reference https://log...
[15:44:40] aiko: do you know the ~ max layer size?
[16:06:47] jayme: ~4.5G
[16:06:57] ouch :|
[16:07:16] what is the layer limit?
[16:08:39] there is a limit to the "filesize" that can be uploaded to the nginx in front of the registry, basically enforced by the size of a tmpfs, which is 2GB
[16:14:07] ok I see. can it be adjusted?
[16:14:25] I'm also looking if i can reduce the layer size
[16:16:43] serviceops, Machine-Learning-Team: docker-pkg fails to upload big Docker images to the registry - https://phabricator.wikimedia.org/T335177#9572447 (JMeybohm) reference {T288198} for posterity
[16:16:53] for reference, this was the issue we had for this https://phabricator.wikimedia.org/T288198
[16:17:21] yes we could increase, but not without effort (as it's a tmpfs - we'd need more ram)
[16:17:44] alternatively we could move that cache to disk (slow) which would need more testing (and a bigger disk :))
[16:19:23] serviceops, Data-Engineering, WMF-JobQueue, Patch-For-Review, and 3 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745#9572462 (Clement_Goubert)
[16:39:20] aiko: are you in a rush with this or can it wait until next week?
[16:43:12] no I'm not in a rush. thanks for the info!
[16:44:23] I'll see if I can reduce the layer size first. will let you know if we really need a bigger size (open a ticket for further discussion etc)
[16:44:54] ok, cool
[16:45:49] thank you :)
[17:13:30] serviceops, MW-on-K8s, SRE, Scap, Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9572625 (thcipriani) so ` httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444` from...
[17:16:21] serviceops, MW-on-K8s, SRE, Scap, Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9572630 (dancy) >>! In T358117#9572625, @thcipriani wrote: > so ` httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug...
[18:11:56] serviceops, ops-codfw: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9572779 (Jhancock.wm) drive has been replaced. Physically I don't have any alarms, but let me know if the you are still having issues with the RAID.
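
Note on the layer-size discussion above (15:32 to 16:45): the upload limit described there is about 2GB per uploaded file, and the problematic layer was reported as roughly 4.5G. A minimal sketch of a pre-push check follows. It is not taken from the log or from any registry tooling: the function name, the layer names, and the reading of "2GB" as 2 GiB are all assumptions for illustration, and the caller is expected to supply layer sizes from whatever source they have.

```python
# Hypothetical pre-push check: flag image layers that would not fit through
# a size-capped upload buffer. Nothing here talks to a real registry.

GIB = 1024 ** 3

# Assumption: the "2GB" tmpfs mentioned in the log is treated as 2 GiB.
UPLOAD_LIMIT_BYTES = 2 * GIB


def oversized_layers(layer_sizes, limit_bytes=UPLOAD_LIMIT_BYTES):
    """Return (name, size_in_bytes) pairs for layers larger than limit_bytes.

    layer_sizes maps a layer name (or digest) to its size in bytes.
    The result is sorted with the largest layer first.
    """
    too_big = [
        (name, size) for name, size in layer_sizes.items() if size > limit_bytes
    ]
    return sorted(too_big, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative sizes only: a small base layer and a ~4.5 GiB model layer.
    layers = {"base": 300 * 1024 ** 2, "model": int(4.5 * GIB)}
    for name, size in oversized_layers(layers):
        print(f"{name}: {size / GIB:.1f} GiB is over the upload limit")
```

Splitting the large layer into several smaller ones, as suggested in the conversation, would make a check like this come back empty without touching the tmpfs size.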