[01:03:53] serviceops, MW-on-K8s, SRE: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11471277 (thcipriani) Recapping my understanding: - We deploy a change that changes a large number of files -- either a new version deploy (e.g., T408272#1136991...
[03:08:57] serviceops, Epic, MediaWiki-Platform-Team (Kanban Board): Migrate Wikimedia production from PHP 8.1 to PHP 8.3 - https://phabricator.wikimedia.org/T360995#11471442 (Krinkle)
[08:21:43] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11471758 (elukey) ` root@ms-fe2009:~# swift stat docker_registry_codfw Account: AUTH_docker...
[08:22:35] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11471761 (MatthewVernon) I understand why #sre-swift-storage got tagged, but: replication between eqiad an...
[08:23:51] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11471763 (MatthewVernon) p:Triage→High
[09:03:09] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11471835 (MatthewVernon) OK, the above turns out not to be true, it's just what I thought was true for my...
[10:15:48] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472099 (MatthewVernon) At least one of the problems is that the container is damaged - there are objects...
[10:18:32] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472114 (MatthewVernon) And we can see on that server that the sync is going nowhere... ` background.log:...
[10:19:35] serviceops, Ceph, SRE-swift-storage, Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11472116 (MatthewVernon) Whatever we do, it should not involve trying to get swift to sync betw...
[10:21:13] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472133 (MatthewVernon) And the summary: ` Dec 18 08:51:02 ms-be2081 container-sync: Since Thu Dec 18 07:...
[10:43:56] serviceops, observability: Create a visual representation of where each service is active from, any given time - https://phabricator.wikimedia.org/T327663#11472173 (MLechvien-WMF) A new metric `wmf_dnsdiscovery_service_active_active` (value = 1 for Active/Active, 0 for Active/Passive) is exported by Cumi...
[11:17:44] serviceops, observability: Create a visual representation of where each service is active from, any given time - https://phabricator.wikimedia.org/T327663#11472295 (Clement_Goubert) Looks good to me, we may want to try and tune a more compact viz but I haven't been able to find a form that'd work. On po...
[11:36:28] serviceops, observability: Create a visual representation of where each service is active from, any given time - https://phabricator.wikimedia.org/T327663#11472378 (MLechvien-WMF) Yes indeed, easy change and would be cleaner on the dashboard. I'll test what the script gathers for those excluded services...
[12:01:59] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472466 (elukey) ` elukey@ms-be2081:~$ sudo journalctl -u swift-container-sync.service| egrep "\.db " | e...
[12:03:40] serviceops, Patch-For-Review: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861#11472468 (Clement_Goubert) >>! In T390861#11463543, @JMeybohm wrote: > Two questions/suggestions in this regard: > * I see that we also have wikikube-ctrl2006 racked (T406596), would it...
[12:18:04] serviceops, Patch-For-Review: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861#11472476 (JMeybohm) >>! In T390861#11472468, @Clement_Goubert wrote: > I think that would be the first wikikube nodes that we use UEFI on I think, so we may want to pay a little more at...
[13:12:54] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472651 (MatthewVernon) There are false-positives in that list (e.g. the last one is a good object, but t...
[13:17:11] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472671 (MatthewVernon) Earlier, we tested an approach (used before with ghost swift objects cf T327253)...
[13:22:27] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472688 (MatthewVernon) A few other notes (beyond "we should stop using swift_container_sync already"): L...
[13:22:29] serviceops, Page Content Service: Production error: worker died, restarting - https://phabricator.wikimedia.org/T394659#11472691 (Jgiannelos) Open→Invalid
[13:40:55] serviceops, Ceph, Data-Platform-SRE, Infrastructure-Foundations, and 2 others: Design and build the next generation of container-registry service for the WMF production realm - https://phabricator.wikimedia.org/T413080 (BTullis) NEW
[14:09:45] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472929 (elukey) >>! In T413008#11472688, @MatthewVernon wrote: > I think the pragmatic next step is to d...
[14:16:41] serviceops, Ceph, Data-Platform-SRE, Infrastructure-Foundations, and 2 others: Design and build the next generation of container-registry service for the WMF production realm - https://phabricator.wikimedia.org/T413080#11472986 (akosiaris) Thanks for this Ben. Overall, this is similar to what I...
[14:18:26] serviceops, observability: Create a visual representation of where each service is active from, any given time - https://phabricator.wikimedia.org/T327663#11472996 (MLechvien-WMF) It seems it's producing valid data for the missing services, giving us the types for about 20% more services. We'll still ha...
[14:22:58] serviceops, Ceph, Data-Platform-SRE, Infrastructure-Foundations, and 2 others: Design and build the next generation of container-registry service for the WMF production realm - https://phabricator.wikimedia.org/T413080#11473061 (Volans) There is also an opportunity here to try to consolidate the...
[14:29:18] serviceops, Ceph, Data-Platform-SRE, Infrastructure-Foundations, and 2 others: Design and build the next generation of container-registry service for the WMF production realm - https://phabricator.wikimedia.org/T413080#11473120 (akosiaris) >>! In T413080#11473061, @Volans wrote: > There is also a...
[14:36:49] serviceops, MW-on-K8s, SRE: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11473183 (Scott_French) @thcipriani - Thanks for pulling together T412265#11471277. Indeed, your understanding here is correct. //Cause// We believe this is an...
[14:42:54] serviceops, Ceph, Data-Platform-SRE, Infrastructure-Foundations, and 2 others: Design and build the next generation of container-registry service for the WMF production realm - https://phabricator.wikimedia.org/T413080#11473214 (BTullis)
[14:50:30] serviceops, Ceph, Data-Platform-SRE, Infrastructure-Foundations, and 2 others: Design and build the next generation of container-registry service for the WMF production realm - https://phabricator.wikimedia.org/T413080#11473253 (BTullis)
[14:56:51] serviceops, Ceph, Data-Platform-SRE, Infrastructure-Foundations, and 2 others: Design and build the next generation of container-registry service for the WMF production realm - https://phabricator.wikimedia.org/T413080#11473278 (BTullis) >>! In T413080#11472986, @akosiaris wrote: > I would sugges...
[15:07:55] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473341 (Scott_French) Many thanks for investigating this @MatthewVernon and @elukey. It's interesting h...
[15:08:54] serviceops, MW-on-K8s, SRE: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11473354 (MatthewVernon) Adding or removing hosts from the swift rings will create more "churn" - you have to make incremental changes to the swift rings, deploy...
[15:17:38] serviceops, Ceph, Data-Platform-SRE, Infrastructure-Foundations, and 2 others: Design and build the next generation of container-registry service for the WMF production realm - https://phabricator.wikimedia.org/T413080#11473377 (akosiaris) Thanks for amendments!
[15:19:53] serviceops, Epic, MediaWiki-Platform-Team (Kanban Board): Migrate Wikimedia production from PHP 8.1 to PHP 8.3 - https://phabricator.wikimedia.org/T360995#11473410 (Krinkle) Open→Resolved Closing this task as the checklist is done, with the below remaining points declined this time. > DX:...
[15:21:59] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473425 (elukey) I've executed the following from ms-fe2009, deleting the objects that Matthew highlighte...
[15:23:44] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473438 (MatthewVernon) My unfounded suspicion is that the bad objects were trying to be uploaded during...
[15:34:09] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473485 (elukey) After the container-sync restart on ms-be2081, I noticed the following errors and I trie...
[15:43:57] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473497 (elukey) `
[16:21:48] serviceops, Epic, MediaWiki-Platform-Team (Kanban Board): Migrate Wikimedia production from PHP 8.1 to PHP 8.3 - https://phabricator.wikimedia.org/T360995#11473594 (Krinkle) >>! In T360995#11437108, @Krinkle wrote: > Proposed changes for the checklist so far: > > * ([diff](https://wikitech.wiki...
[16:32:18] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473631 (MatthewVernon) Yes, I think we're at "give it some time", but I think we've unblocked replicatio...
[17:28:30] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473749 (MatthewVernon) ` Dec 18 14:36:26 ms-be2081 container-sync: Since Thu Dec 18 13:36:25 2025: 12 sy...
[17:34:07] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473764 (MatthewVernon) Found some more with `journalctl -o cat -u swift-container-sync.service -g 'Unkno...
[18:25:47] serviceops, Infrastructure-Foundations, SRE, SRE-swift-storage, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473918 (MatthewVernon) ` Dec 18 18:24:54 ms-be2081 container-sync: Since Thu Dec 18 17:24:44 2025: 12 sy...
[18:49:07] serviceops, MediaWiki-extensions-Score, Wikimedia-SVG-rendering, Upstream: Deploy Lilypond 2.24 with cairo support to shellbox containers - https://phabricator.wikimedia.org/T385404#11473984 (AnthonyFok) The Debian LilyPond package with the Cairo backend enabled was finally backported to Debian 1...
[19:01:15] serviceops, DC-Ops, ops-eqiad, SRE: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11474018 (VRiley-WMF)
[19:02:52] serviceops, DC-Ops, ops-eqiad, SRE: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11474039 (VRiley-WMF) wikikube-worker1360 B2 U18 wikikube-worker1361 B4 U36 wikikube-worker1362 C3 U37 wikikube-worker1363 C4 U28 wikikube-worker1364 C5 U31 wikikube...
[19:30:07] serviceops, MW-on-K8s: Restart CronJobs on failure of the service mesh - https://phabricator.wikimedia.org/T390972#11474129 (matmarex)
[19:30:08] serviceops, MediaWiki-Platform-Team, SRE Observability: MediaWiki periodic job startupregistrystats-mediawikiwiki failed - https://phabricator.wikimedia.org/T410764#11474130 (matmarex)
[19:30:28] serviceops, MediaWiki-Platform-Team, SRE Observability: MediaWiki periodic job startupregistrystats-mediawikiwiki failed - https://phabricator.wikimedia.org/T410764#11474133 (matmarex)