[05:50:58] serviceops, MW-on-K8s, ServiceOps new, ServiceOps-SharedInfra, SRE: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11544011 (jasmine_)
[09:13:46] serviceops, MW-Interfaces-Team, Reader Growth Team, ServiceOps new, and 3 others: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169#11544252 (MLechvien-WMF)
[09:23:55] serviceops, MW-Interfaces-Team, Reader Growth Team, ServiceOps new, and 3 others: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169#11544264 (Aklapper)
[09:24:03] serviceops, MW-Interfaces-Team, Reader Growth Team, ServiceOps new, and 3 others: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169#11544267 (Aklapper) Thanks everyone for looking into this / creating pa...
[10:10:19] serviceops: Informal chats with software engineers around wmf - https://phabricator.wikimedia.org/T400272#11544408 (jijiki) This work has been done and captured on google docs. Please ping me if you would like to chat about this or have a look at my anonymised notes :)
[10:10:27] serviceops: Informal chats with software engineers around wmf - https://phabricator.wikimedia.org/T400272#11544409 (jijiki) In progress→Resolved
[10:23:43] serviceops, MW-Interfaces-Team, Reader Growth Team, ServiceOps new, and 3 others: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169#11544421 (TheDJ) >>! In T415169#11543567, @aaron wrote: > [EDIT] I mean...
[11:11:18] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11544544 (MoritzMuehlenhoff) p:Triage→High
[11:11:27] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11544546 (MoritzMuehlenhoff) Open→Resolved All cleaned up
[11:19:17] serviceops, Traffic, MediaWiki-Platform-Team (Q3 Kanban Board), OKR-Work: api-gateway helm chart: rest routes should return retry-after when a rate limit applies. - https://phabricator.wikimedia.org/T405636#11544549 (daniel) In progress→Resolved
[11:53:31] serviceops: low rate of mw-memcached errors - https://phabricator.wikimedia.org/T371881#11544630 (jijiki) Open→Resolved I am closing this as it has not manifested itself for a long time
[11:57:13] serviceops, ServiceOps new: Experiment with Memcached Proxy - https://phabricator.wikimedia.org/T363723#11544672 (jijiki)
[12:24:29] serviceops, Community-Tech, MinT, Wishlist intake gadget (Translations): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#11544760 (jijiki) Open→Resolved Closing due to inactivity. Feel free to reopen if needed.
[12:25:11] A_smart_kitten: I've deployed the restbase patch (sorry I didn't get to it yesterday), but tbh I don't know how to test it '^^ I think it's probably alright, since restbase isn't crashing :D but lmk if there's a problem
[12:26:02] Raine: Thank you! and no worries about not getting to it yesterday, AFAICS the wikis themselves haven't been created yet so no issues caused by the delay
[12:26:15] yay :D
[12:26:23] https://kai.wikipedia.org/api/rest_v1/?spec is loading so that seems good maybe? :D
[12:26:43] fair, that's probably sufficient :D
[12:27:12] That's not answered by RESTbase anymore afaik but maybe the patch makes it work idk
[12:27:20] Mystery boxes
[12:29:13] if mathoid works for those wikis it should be good, that's all it enables these days
[12:30:16] true
[12:36:34] claime: out of interest, how would one know for certain if a route is being answered by RESTBase?
[12:36:37] just asking as e.g. https://kai.wikipedia.org/api/rest_v1/media/math/check/type is returning a header of the format "Server: restbase10xx" for me just now - would that e.g. be a sign that things are working as intended post-deploy?
[12:36:54] Yep
[12:37:03] yay!
[12:37:24] If it's not restbase but mediawiki responding, you'd get Server: mw-api-ext...
[12:46:24] serviceops, Prod-Kubernetes, ServiceOps new: Update app.job module in deployment-charts - https://phabricator.wikimedia.org/T356885#11544817 (jijiki) a:jijiki
[12:46:30] serviceops, Prod-Kubernetes, ServiceOps new: Update app.job module in deployment-charts - https://phabricator.wikimedia.org/T356885#11544821 (jijiki) p:Triage→Medium
[13:28:26] serviceops, MediaWiki-Platform-Team (Radar): Enable extstore to a subset of memcached servers (experiment) - https://phabricator.wikimedia.org/T352885#11544978 (jijiki) Stalled→Resolved I am closing this as we do not have anything actionable atm.
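The Server-header check discussed above can be sketched as a small shell helper. This is a minimal sketch, not an established test procedure: the header prefixes (`restbase10xx`, `mw-api-ext...`) are taken from the conversation, the `classify_backend` function name is made up for illustration, and the commented-out curl pipeline assumes the route answers HEAD requests.

```shell
# Hypothetical helper: classify which backend answered a rest_v1 route,
# based on the Server response header values mentioned in the log.
classify_backend() {
  # $1: value of the Server response header
  case "$1" in
    restbase*)   echo "restbase"  ;;  # e.g. "restbase1033" - served by RESTBase
    mw-api-ext*) echo "mediawiki" ;;  # route has moved to MediaWiki
    *)           echo "unknown"   ;;
  esac
}

# Illustrative fetch (network call; assumes the endpoint answers HEAD):
# server=$(curl -sI "https://kai.wikipedia.org/api/rest_v1/media/math/check/type" \
#   | awk -F': ' 'tolower($1) == "server" {print $2}' | tr -d '\r')
# classify_backend "$server"
```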
[13:29:41] serviceops: Identify areas covered by the Production Readiness checklist - https://phabricator.wikimedia.org/T400476#11544984 (jijiki) Open→Resolved This work has been done and documented on asana under WE6.2
[13:38:52] serviceops, Prod-Kubernetes, ServiceOps new, Kubernetes: kube-scheduler failed to start during sre.k8s.wipe-cluster - https://phabricator.wikimedia.org/T406201#11545007 (jijiki)
[13:42:56] serviceops, ServiceOps new, GitLab (CI & Job Runners): failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid - https://phabricator.wikimedia.org/T406392#11545017 (1...
[13:45:56] serviceops, Prod-Kubernetes, ServiceOps new, Kubernetes: charlie wiped cluster redeployment use-case - https://phabricator.wikimedia.org/T406212#11545027 (jijiki)
[13:46:36] serviceops, Prod-Kubernetes, ServiceOps new, Kubernetes: charlie wiped cluster redeployment use-case - https://phabricator.wikimedia.org/T406212#11545030 (jijiki) p:Triage→High
[13:47:39] serviceops, MW-on-K8s: Support shell to mw-experimental pod - https://phabricator.wikimedia.org/T405688#11545037 (jijiki) p:Triage→Medium
[13:49:31] serviceops, MW-on-K8s, ServiceOps new: Support shell to mw-experimental pod - https://phabricator.wikimedia.org/T405688#11545048 (jijiki)
[13:50:12] serviceops, MW-on-K8s, Prod-Kubernetes, ServiceOps new: Support shell to mw-experimental pod - https://phabricator.wikimedia.org/T405688#11545055 (jijiki)
[13:53:22] serviceops, MediaWiki-Engineering, ServiceOps new, Epic, Performance Issue: Limit the number of expensive API queries a user can perform - https://phabricator.wikimedia.org/T405472#11545068 (jijiki)
[13:56:44] serviceops, DC-Ops, SRE: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11545085 (MLechvien-WMF) @jasmine_ are you doing this task? Please ask others if you don't find the capacity
[14:13:14] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: MW deployments shouldn't need a hard-coded kubernetesVersion - https://phabricator.wikimedia.org/T388969#11545164 (MLechvien-WMF) p:Medium→High
[15:37:20] serviceops, Prod-Kubernetes, ServiceOps new: Update app.job module in deployment-charts - https://phabricator.wikimedia.org/T356885#11545643 (MLechvien-WMF) @jijiki does this need to be scheduled this quarter and why? I'm inclined to move it to Backlog until next quarter
[15:38:01] swfrench-wmf: I just deployed the node22 enabled mobileapps image with the new flags for memory limits
[15:39:05] nemo-yiannis: ah, thanks for the heads-up! how are things looking? :)
[15:39:11] dunno, checking now
[15:40:32] * swfrench-wmf thumbs up
[15:41:12] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Fix thumbor discovery records and make swift use them - https://phabricator.wikimedia.org/T397618#11545669 (MLechvien-WMF) @JMeybohm @Clement_Goubert this sounds like something we may need to do before next Kubernetes upgrade (or at leas...
[15:41:57] swfrench-wmf: i don't see anything problematic on grafana, but i assume if we encounter the same latency issue, it needs a bit of time for memory limits to kick in
[15:42:05] reading through the task, it looks like pod unavailability [0] and latency as seen by wikifeeds [1] were two good signals for badness.
[15:42:05] [0] https://grafana.wikimedia.org/goto/hnqW3LIDg?orgId=1
[15:42:05] [1] https://grafana.wikimedia.org/goto/j6o7qLIDR?orgId=1
[15:42:50] ... but IIRC it took a while for those to creep back up after a restart, so this will probably need to soak for a while
[15:43:05] 👍
[15:43:16] thanks again for moving this forward!
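For context on the memory-limit flags discussed above: `--max-old-space-size` caps V8's old-generation heap in megabytes, so a runaway process exits with an out-of-memory error instead of growing without bound. A minimal sketch follows; the `node_heap_opts` helper name and the 700 MiB value are made up for illustration and are not the values used for mobileapps.

```shell
# Hypothetical helper: compose the V8 heap-cap flag for NODE_OPTIONS.
# --max-old-space-size takes a value in megabytes.
node_heap_opts() {
  echo "--max-old-space-size=$1"
}

# Illustrative usage (700 is a made-up value):
#   NODE_OPTIONS="$(node_heap_opts 700)" node server.js
# In a container image, setting NODE_OPTIONS in the pod environment applies
# the flag without having to change the image's entrypoint.
```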
[15:43:34] <3 nemo-yiannis
[16:10:06] serviceops, Patch-For-Review: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#11545813 (Scott_French)
[16:18:29] serviceops, Patch-For-Review: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#11545841 (Scott_French) I've merged {T412265} into this task, as we believe it's another manifestation of the same class of failure modes discussed here. One key point of note...
[22:59:15] serviceops, Content-Transform-Team, Wikifeeds, Wikipedia-Android-App-Backlog: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13 - https://phabricator.wikimedia.org/T410296#11547120 (Scott_French) A couple of hours in after @Jgiannelos set `--max-old-space-size`...