[00:05:11] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10706433 (Scott_French) I was originally imagining we would serialize the pushes in `build-images.py`, but @dduvall pointed out this afternoon that setting `max-concurrent-uploads` to 1 in dockerd's conf...
[08:00:18] serviceops, Abstract Wikipedia team, Traffic, Patch-For-Review, Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10706790 (akosiaris) The issue was found. It's effectively a race condition. We figured out tha...
[08:43:35] serviceops, Abstract Wikipedia team, Traffic, Patch-For-Review, Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10706902 (akosiaris) Open→Resolved a:akosiaris I'll resolve this. The fix has...
[10:32:01] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10707251 (elukey) Thanks a lot for the detailed investigation @Scott_French! I applied the suggested fix to deploy1003, let's see if it improves the situation. A couple of high level thoughts: 1) I...
[10:33:26] serviceops: Harmonise configs between API gateway and REST gateway - https://phabricator.wikimedia.org/T390946 (hnowlan) NEW
[12:29:29] serviceops, MW-Interfaces-Team: Migrate mw-interfaces-team jobs to mw-cron - https://phabricator.wikimedia.org/T388541#10707646 (HCoplin-WMF) Thanks for the info! We missed the end of March, but can pull it into an early April sprint.
[13:45:03] serviceops, Abstract Wikipedia team, Traffic, Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10707998 (Jdforrester-WMF) >>! In T390854#10706902, @akosiaris wrote: > I'll resolve this. The fix has worked, t...
[13:49:12] serviceops, collaboration-services, Infrastructure-Foundations, SRE, SRE-tools: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10708019 (ABran-WMF) a:ABran-WMF
[14:05:36] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708098 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1420 to wikikube-worker1166 completed: - mw1420 (**PASS**) - ✔️ Down...
[14:19:36] serviceops, MW-Interfaces-Team: Migrate mw-interfaces-team jobs to mw-cron - https://phabricator.wikimedia.org/T388541#10708157 (Clement_Goubert) @HCoplin-WMF There isn't much to do on your side, except tell us if we can run that job manually outside of its normal schedule, and how critical it is. We wil...
[14:42:41] serviceops, MW-on-K8s: Restart CronJobs on failure of the service mesh - https://phabricator.wikimedia.org/T390972 (Clement_Goubert) NEW
[15:06:41] serviceops, MW-on-K8s, Patch-For-Review: Restart CronJobs on failure of the service mesh - https://phabricator.wikimedia.org/T390972#10708461 (Clement_Goubert) `podFailurePolicy` isn't available on the kubernetes version currently running our production wikikube clusters.
[15:06:49] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1166.eqiad.wmnet with OS bookworm
[15:07:28] serviceops, MW-on-K8s, Patch-For-Review: Restart CronJobs on failure of the service mesh - https://phabricator.wikimedia.org/T390972#10708468 (Clement_Goubert)
[15:07:34] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 2 others: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984#10708469 (Clement_Goubert)
[15:11:00] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1437 to wikikube-worker1167 completed: - mw1437 (**PASS**) - ✔️ Down...
[15:15:01] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708601 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1438 to wikikube-worker1168 completed: - mw1438 (**PASS**) - ✔️ Down...
[15:17:07] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708611 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1167.eqiad.wmnet with OS bookworm
[15:17:28] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708613 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1168.eqiad.wmnet with OS bookworm
[15:41:31] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708954 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1166.eqiad.wmnet with OS bookworm completed: - wikik...
[15:52:18] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709035 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1167.eqiad.wmnet with OS bookworm completed: - wikik...
[15:55:29] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709068 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1168.eqiad.wmnet with OS bookworm completed: - wikik...
[16:06:22] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709114 (ops-monitoring-bot) pool host wikikube-worker[1166-1168].eqiad.wmnet by hnowlan@cumin1002 with reason: None
[16:06:31] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709115 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by hnowlan@cumin1002 pool for host wikikube-worker[1166-1168].eqiad.wmnet completed: - wik...
[16:06:41] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998 (hnowlan) NEW
[16:07:34] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709138 (hnowlan)
[16:11:25] serviceops, decommission-hardware: decommission mw2278, mw2279 - https://phabricator.wikimedia.org/T391001#10709171 (hnowlan)
[17:40:43] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998#10709614 (VRiley-WMF)
[17:40:49] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998#10709617 (VRiley-WMF) Open→Resolved a:VRiley-WMF This has been completed
[18:08:49] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10709844 (Scott_French) Thank you very much, @elukey! >>! In T390251#10707251, @elukey wrote: > Thanks a lot for the detailed investigation @Scott_French! I applied the suggested fix to deploy1003,...
[18:12:16] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10709850 (dancy) >>! In T390251#10709844, @Scott_French wrote: >> 1) It would be useful if the issue reproduced via curl, [...] > > While I've not been able to do so since, I //think// @dancy was in...
[21:47:11] serviceops, Abstract Wikipedia team, Wikimedia-production-error: function-orchestrator-main-orchestrator pods down in codfw due to issue in envoy config(?) - https://phabricator.wikimedia.org/T391047 (Jdforrester-WMF) NEW
[23:20:26] serviceops: Turn down MediaWiki image builds for PHP 7.4 - https://phabricator.wikimedia.org/T391057 (Scott_French) NEW
[23:20:42] serviceops: Turn down MediaWiki image builds for PHP 7.4 - https://phabricator.wikimedia.org/T391057#10711152 (Scott_French) p:Triage→Medium
[23:23:20] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10711155 (Scott_French)
[23:30:24] serviceops, Patch-For-Review: Migrate mw-script to PHP 8.1 - https://phabricator.wikimedia.org/T387917#10711175 (Scott_French) Since there are no further actions explicitly tracked here, and indeed I've manually tested the new 8.1-default and 7.4-fallback, I'm going to close this out. Removing the 7...
[23:30:32] serviceops, Patch-For-Review: Migrate mw-script to PHP 8.1 - https://phabricator.wikimedia.org/T387917#10711177 (Scott_French) In progress→Resolved
[23:35:10] serviceops: Turn down MediaWiki image builds for PHP 7.4 - https://phabricator.wikimedia.org/T391057#10711181 (Scott_French)