[07:14:01] serviceops, API Platform (RESTBase Deprecation Roadmap): Remove recommendation-api from the REST API offerings - https://phabricator.wikimedia.org/T390517 (akosiaris) NEW
[07:23:53] serviceops, API Platform (RESTBase Deprecation Roadmap): Remove recommendation-api from the REST API offerings - https://phabricator.wikimedia.org/T390517#10692680 (akosiaris)
[07:26:09] FYI wikikube-worker1039 seems to have gone down
[07:28:09] during the weekend I think, but it's not present in the grafana host overview dashboard
[07:29:48] serviceops, API Platform (RESTBase Deprecation Roadmap): Remove recommendation-api from the REST API offerings - https://phabricator.wikimedia.org/T390517#10692686 (akosiaris) Per https://w.wiki/DeQh, we are well within our estimates of the daily number of requests still reaching this service. It's in...
[07:34:07] volans: and 0 repercussions? That's nice.
[07:34:38] I dunno about repercussions :D I just noticed it while running debdeploy :)
[07:36:59] yeah, I expect 0 tbh as far as the services go
[07:37:16] we lost some cluster-level capacity, but the services haven't been affected. Everything got rescheduled
[07:37:50] nice
[07:47:48] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10692713 (akosiaris) I think we just saw this again today. @dcausse just tried a scap deployment and got errors on the `Started sync-testservers-k8s` scap step. ` 9m32s Warning Failed...
[08:13:07] serviceops, API Platform (RESTBase Deprecation Roadmap): Remove recommendation-api from the REST API offerings - https://phabricator.wikimedia.org/T390517#10692763 (akosiaris) p:Triage→Medium requestctl rules set; the API now responds with 403. Let's keep it like this for a week or so to see if anyone...
[08:13:14] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10692766 (elukey) This is what I see on rdb2009: ` 127.0.0.1:6382> KEYS *65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb* 1) "blobs::sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff...
[08:37:53] serviceops, SRE, Wikidata, Wikimedia-Site-requests, and 2 others: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10692848 (seanleong-WMDE)
[08:38:40] serviceops, SRE, Wikidata, Wikimedia-Site-requests, and 2 others: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10692855 (seanleong-WMDE) a:seanleong-WMDE
[09:20:46] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10692974 (akosiaris) Digging into this a lot more with @elukey today, in the blob `65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb`, I finally found a culprit at: `lang=bash deploy10...
[09:29:29] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10692999 (elukey) I also found: ` # Define a cache for immutable blobs and manifests # inactive time here probably needs to match what is # set in proxy_cache_valid below. proxy_cache_path /var/cache...
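For context on the rdb2009 excerpt just above: the registry keeps a blob-descriptor cache in redis, and the KEYS query simply checks whether the suspect digest is present there. A hedged sketch of that lookup follows; the port and key pattern come from the quoted comment, while the redis-cli invocation and running it locally on rdb2009 are assumptions.
`lang=bash
# Sketch only: look up the suspect layer in the registry's redis
# blob-descriptor cache. KEYS scans the whole keyspace, which is
# acceptable for a one-off debugging query.
redis-cli -p 6382 KEYS '*65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb*'
# A hit like "blobs::sha256:<digest>" only means the registry knows the blob;
# it says nothing about whether the bytes served for that digest are intact.
`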
[10:28:44] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, and 2 others: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10693101 (BTullis)
[10:30:23] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, and 2 others: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10693104 (BTullis) All snapshot hosts have now been upgraded to PHP 8.1.
[10:34:00] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10693111 (elukey) ` elukey@cumin1002:~$ sudo cumin --force 'registry*' 'curl -s -k http://localhost:5000/v2/restricted/mediawiki-webserver/blobs/sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e926...
[10:54:51] serviceops, MediaWiki-Engineering, SRE-OnFire, Sustainability (Incident Followup): Reduce the amount of messages sent through channel:Memcached during failures - https://phabricator.wikimedia.org/T390529 (jijiki) NEW
[10:54:55] serviceops, Patch-For-Review: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10693166 (elukey) p:Triage→High
[11:33:02] wikikube-worker1039 hard reset from idrac and back up now
[11:37:31] serviceops, Release-Engineering-Team, SRE-OnFire, Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531 (jijiki) NEW
[11:38:22] serviceops, Release-Engineering-Team, SRE-OnFire, Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#10693275 (jijiki)
[11:55:29] serviceops, Patch-For-Review: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10693329 (elukey) Disabled the nginx blob cache (but left the auth one intact); so far I see the correct layer being served by all nodes. Time to test another deployment :)
[12:05:03] serviceops, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10693365 (Clement_Goubert) >>! In T384970#10688038, @Jhancock.wm wrote: > @Clement_Goubert i finished all but one server (2331). Luca is trying to...
[12:30:12] serviceops, Patch-For-Review: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10693459 (dcausse) >>! In T390251#10693329, @elukey wrote: > Disabled the nginx blob cache (but left the auth one intact), so far I see the correct layer being served by all node...
[12:54:16] serviceops, MediaWiki-Platform-Team: Migrate "startupregistrystats" maintenance script to k8s-mw-cron (mediawiki-platform-team) - https://phabricator.wikimedia.org/T388540#10693501 (Clement_Goubert) `testwiki` periodic job migrated to kubernetes: `lang=bash kubectl get cronjobs.batch mediawiki-main-star...
[13:31:29] serviceops, MediaWiki-Platform-Team: Migrate "startupregistrystats" maintenance script to k8s-mw-cron (mediawiki-platform-team) - https://phabricator.wikimedia.org/T388540#10693643 (Clement_Goubert) Because of a misconfiguration in the `mediawiki` chart, the 13:10 UTC run for testwiki was not successful....
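The cumin check in the 10:34:00 comment is the useful verification trick here: registry blobs are content-addressed, so whatever a host returns for `sha256:<digest>` must hash back to that digest. A hedged reconstruction follows, with the cumin invocation, repository path and digest taken from the (truncated) excerpt and the `sha256sum` comparison added as an assumption.
`lang=bash
# Sketch: fetch the suspect layer from every registry host and hash it.
# Any host whose output is not the digest itself is serving a corrupted copy.
DIGEST=65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb
sudo cumin --force 'registry*' \
  "curl -s -k http://localhost:5000/v2/restricted/mediawiki-webserver/blobs/sha256:${DIGEST} | sha256sum"
`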
[13:49:41] serviceops, Commons, MediaWiki-File-management, SRE, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#10693912 (Ladsgroup) >>! In T266155#9766334, @Bawolff wrote: > I think if we did deliver the wrong thu...
[13:53:38] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10693939 (elukey) Summary: my theory is that some corruption happened in /var/cache/nginx-docker-registry (on the root partition, no dedicated space) when registry2* started to show > 95% of root part...
[14:15:56] serviceops, MediaWiki-Platform-Team: Migrate "startupregistrystats" maintenance script to k8s-mw-cron (mediawiki-platform-team) - https://phabricator.wikimedia.org/T388540#10694039 (Clement_Goubert) The fixed chart was deployed in time for the 14:10 UTC run, which seems to have been successful. [[ https:...
[14:51:01] serviceops, Patch-For-Review: Migrate mw-script to PHP 8.1 - https://phabricator.wikimedia.org/T387917#10694185 (MSantos) This is great! I have a few questions: Will that be a standard moving forward every time we migrate to a new version? Is this work enough to create the desired structure when we upgr...
[15:05:46] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10694251 (elukey) @Scott_French I think that we can close and re-open if the issue re-surfaces, what do you think?
[15:14:37] serviceops, Release-Engineering-Team, SRE-OnFire, Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#10694284 (jijiki)
[15:34:36] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10694351 (Clement_Goubert) >>! In T385782#10563236, @Urbanecm_WMF wrote: >>>! In T385782#10563189, @Clement_Goubert wrote: >> Thanks for the cle...
[15:52:47] serviceops, Sustainability (Incident Followup): Consider removing envvars.inc from MediaWiki images - https://phabricator.wikimedia.org/T390573#10694454 (jijiki)
[16:44:45] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10694780 (Scott_French) Many thanks, @elukey and @akosiaris - this is consistent with what we were seeing last week, i.e., the bad blob seemed to only exist in the nginx cache (whereas missing in the...
[16:46:45] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10694794 (akosiaris) Open→Resolved a:akosiaris I am drafting an incident response here for what it's worth: https://wikitech.wikimedia.org/wiki/Incidents/2025-03-31_docker-registry_corrup...
[17:34:38] serviceops, Discovery-Search (2025.03.22 - 2025.04.11): Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538#10695049 (EBernhardson) I ran a test invocation to see how it would work and it seems to have worked as expected: ` mwscript-k8s --attach extensions/CirrusSearch/m...
[18:46:54] I tried to deploy a helmfile change on k8s-aux to admin_ng, to add a namespace. I am seeing a new release "ceph-csi-rbd" when doing that. Would that be expected?
[18:47:25] to add ceph csi block device support?
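A cheap way to sanity-check the 13:53:38 theory (the blob cache living on the root partition with no dedicated space) is to compare cache size against root filesystem usage on a registry host. A minimal sketch, with the cache path taken from the comment and everything else an assumption:
`lang=bash
# If the nginx blob cache has no dedicated filesystem, its growth shows up
# directly as root partition usage on the registry hosts.
df -h /
du -sh /var/cache/nginx-docker-registry
`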
[18:47:55] docs say that serviceops should do this but it's not wikikube, it's aux
[18:53:48] aux is i/f afaik
[18:53:50] not serviceops
[18:54:33] I am not familiar with that change though, aside from the very high-level thing that it is used to provide storage
[18:55:15] ah! ok, I will ask in IF and the people I see in the git history of admin_ng/values/aux-k8s.yaml
[18:56:17] I expected it to affect namespaces and namespace-certificates. there is also kube-state-metrics, helm-state-metrics and ceph-csi-rbd
[18:57:49] maybe it's related to this revert: Revert "benthos-mw-accesslog-metrics:
[18:59:45] mutante: not related to that, I believe
[19:00:40] *nod* ok, thanks
[19:12:57] (P.S. looks like I can run helmfile -e aux-k8s-codfw -i diff -l name=namespaces -l name=namespace-certific
[19:13:09] so to deploy my new namespace and leave the rest alone)
[19:19:59] yea, I did that; namespace releases updated, metrics and ceph skipped. also the diff was only for codfw, not eqiad.
[19:26:32] serviceops: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251#10695479 (dancy) Resolved→Open This is happening to me right now: ` Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-03-31-190719-publish-81":...
[19:27:10] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10695492 (dancy)
[19:28:21] * Raine deployed {helm,kube}-state-metrics in aux-k8s-codfw (I believe I had made them appear earlier today with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1132587), so only ceph things remain, serviceops things are cleaned up
[19:28:43] Raine: :) cool, thanks
[19:29:18] yes, next time something is happening with k8s-aux I won't use this channel :)
[19:31:59] mutante: which docs exactly, https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service ?
[19:32:16] if they pointed you here, then somebody should update them
[19:33:09] yes, https://wikitech.wikimedia.org/wiki/Kubernetes/Remove_a_service#Deploy_changes_to_helmfile.d/admin_ng
[19:33:30] ack, thanks, I'll rephrase it
[19:33:39] is bold and edits:
[19:33:41] Commit, and ask '''somebody from Service Ops (for wikikube) or IF (for aux cluster) to validate and merge.'''
[19:34:38] excellent, thank you mutante <3
[19:41:16] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10695555 (Scott_French) This is a Heisenbug of the most irritating kind: I just pulled the `128b91e8163d40642d2bdd410f8544bee05ee9cb6a28190d0eca8a79f5bd2e8c` blob for all of {registry2004, registry20...
[19:51:26] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10695580 (Dzahn) We aren't the only ones who report they set up their own docker registry and then ran into unknown blobs... once they added nginx or another reverse proxy in front of it. A lot of tal...
[19:58:46] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10695585 (Dzahn) Our nginx config template has: ` proxy_set_header X-Forwarded-Proto $scheme; ` Others claim it fixed it for them to set it to a hard https, based on "it seems to be an issue with...
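Returning to the aux-k8s admin_ng thread from 19:12-19:19 above: helmfile treats multiple -l selectors as a union, so the same selectors used for the diff can scope the apply to just the two namespace releases, leaving kube-state-metrics, helm-state-metrics and ceph-csi-rbd untouched. A sketch under that assumption, reusing the release names mentioned in the discussion (run once per environment):
`lang=bash
# Diff first, then apply, limited to the namespace-related releases only.
helmfile -e aux-k8s-codfw -i diff  -l name=namespaces -l name=namespace-certificates
helmfile -e aux-k8s-codfw -i apply -l name=namespaces -l name=namespace-certificates
`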
[20:15:02] serviceops: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#10695674 (Scott_French) I've checked the logs on registry2004 and 2005, and unlike the first instance of this, I do not see any upload failures for this blob. This looks like the "normal" sequence of...
[21:18:54] serviceops, Patch-For-Review: Migrate mw-script to PHP 8.1 - https://phabricator.wikimedia.org/T387917#10695968 (Scott_French) >>! In T387917#10694185, @MSantos wrote: > This is great! I have a few questions: > > Will that be a standard moving forward every time we migrate to a new version? Is this work...
[23:34:39] serviceops, Release-Engineering-Team, SRE-OnFire, Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#10696405 (Scott_French) A couple of thoughts: I think it would make a lot of s...