[05:50:58] serviceops, MW-on-K8s, ServiceOps new, ServiceOps-SharedInfra, SRE: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11544011 (jasmine_)
[09:13:46] serviceops, MW-Interfaces-Team, Reader Growth Team, ServiceOps new, and 3 others: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169#11544252 (MLechvien-WMF)
[09:23:55] serviceops, MW-Interfaces-Team, Reader Growth Team, ServiceOps new, and 3 others: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169#11544264 (Aklapper)
[09:24:03] serviceops, MW-Interfaces-Team, Reader Growth Team, ServiceOps new, and 3 others: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169#11544267 (Aklapper) Thanks everyone for looking into this / creating pa...
[10:10:19] serviceops: Informal chats with software engineers around wmf - https://phabricator.wikimedia.org/T400272#11544408 (jijiki) This work has been done and captured on google docs. Please ping me if you would like to chat about this or have a look at my anonymised notes :)
[10:10:27] serviceops: Informal chats with software engineers around wmf - https://phabricator.wikimedia.org/T400272#11544409 (jijiki) In progress→Resolved
[10:23:43] serviceops, MW-Interfaces-Team, Reader Growth Team, ServiceOps new, and 3 others: Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') - https://phabricator.wikimedia.org/T415169#11544421 (TheDJ) >>! In T415169#11543567, @aaron wrote: > [EDIT] I mean...
[11:11:18] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11544544 (MoritzMuehlenhoff) p:Triage→High
[11:11:27] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11544546 (MoritzMuehlenhoff) Open→Resolved All cleaned up
[11:19:17] serviceops, Traffic, MediaWiki-Platform-Team (Q3 Kanban Board), OKR-Work: api-gateway helm chart: rest routes should return retry-after when a rate limit applies. - https://phabricator.wikimedia.org/T405636#11544549 (daniel) In progress→Resolved
[11:53:31] serviceops: low rate of mw-memcached errors - https://phabricator.wikimedia.org/T371881#11544630 (jijiki) Open→Resolved I am closing this as it has not manifested itself for a long time
[11:57:13] serviceops, ServiceOps new: Experiment with Memcached Proxy - https://phabricator.wikimedia.org/T363723#11544672 (jijiki)
[12:24:29] serviceops, Community-Tech, MinT, Wishlist intake gadget (Translations): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#11544760 (jijiki) Open→Resolved Closing due to inactivity. Feel free to reopen if needed.
[12:25:11] A_smart_kitten: I've deployed the restbase patch (sorry I didn't get to it yesterday), but tbh I don't know how to test it '^^ I think it's probably alright, since restbase isn't crashing :D but lmk if there's a problem
[12:26:02] Raine: Thank you! and no worries about not getting to it yesterday, AFAICS the wikis themselves haven't been created yet so no issues caused by the delay
[12:26:15] yay :D
[12:26:23] https://kai.wikipedia.org/api/rest_v1/?spec is loading so that seems good maybe? :D
[12:26:43] fair, that's probably sufficient :D
[12:27:12] That's not answered by RESTbase anymore afaik but maybe the patch makes it work idk
[12:27:20] Mystery boxes
[12:29:13] if mathoid works for those wikis it should be good, that's all it enables these days
[12:30:16] true
[12:36:34] claime: out of interest, how would one know for certain if a route is being answered by RESTBase?
[12:36:37] just asking as e.g. https://kai.wikipedia.org/api/rest_v1/media/math/check/type is returning a header of the format "Server: restbase10xx" for me just now - would that e.g. be a sign that things are working as intended post-deploy?
[12:36:54] Yep
[12:37:03] yay!
[12:37:24] If it's not restbase but mediawiki responding, you'd get Server: mw-api-ext...
[12:46:24] serviceops, Prod-Kubernetes, ServiceOps new: Update app.job module in deployment-charts - https://phabricator.wikimedia.org/T356885#11544817 (jijiki) a:jijiki
[12:46:30] serviceops, Prod-Kubernetes, ServiceOps new: Update app.job module in deployment-charts - https://phabricator.wikimedia.org/T356885#11544821 (jijiki) p:Triage→Medium
[13:28:26] serviceops, MediaWiki-Platform-Team (Radar): Enable extstore to a subset of memcached servers (experiment) - https://phabricator.wikimedia.org/T352885#11544978 (jijiki) Stalled→Resolved I am closing this as we do not have anything actionable atm.
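The Server-header check discussed above can be sketched as a small shell helper. This is a minimal sketch, not an established test procedure: the header prefixes (`restbase10xx`, `mw-api-ext...`) are taken from the conversation, the `classify_backend` function name is made up for illustration, and the commented-out curl pipeline assumes the route answers HEAD requests.

```shell
# Hypothetical helper: classify which backend answered a rest_v1 route,
# based on the Server response header values mentioned in the log.
classify_backend() {
  # $1: value of the Server response header
  case "$1" in
    restbase*)   echo "restbase"  ;;  # e.g. "restbase1033" - served by RESTBase
    mw-api-ext*) echo "mediawiki" ;;  # route has moved to MediaWiki
    *)           echo "unknown"   ;;
  esac
}

# Illustrative fetch (network call; assumes the endpoint answers HEAD):
# server=$(curl -sI "https://kai.wikipedia.org/api/rest_v1/media/math/check/type" \
#   | awk -F': ' 'tolower($1) == "server" {print $2}' | tr -d '\r')
# classify_backend "$server"
```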
[13:29:41] serviceops: Identify areas covered by the Production Readiness checklist - https://phabricator.wikimedia.org/T400476#11544984 (jijiki) Open→Resolved This work has been done and documented on asana under WE6.2
[13:38:52] serviceops, Prod-Kubernetes, ServiceOps new, Kubernetes: kube-scheduler failed to start during sre.k8s.wipe-cluster - https://phabricator.wikimedia.org/T406201#11545007 (jijiki)
[13:42:56] serviceops, ServiceOps new, GitLab (CI & Job Runners): failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid - https://phabricator.wikimedia.org/T406392#11545017 (1...
[13:45:56] serviceops, Prod-Kubernetes, ServiceOps new, Kubernetes: charlie wiped cluster redeployment use-case - https://phabricator.wikimedia.org/T406212#11545027 (jijiki)
[13:46:36] serviceops, Prod-Kubernetes, ServiceOps new, Kubernetes: charlie wiped cluster redeployment use-case - https://phabricator.wikimedia.org/T406212#11545030 (jijiki) p:Triage→High
[13:47:39] serviceops, MW-on-K8s: Support shell to mw-experimental pod - https://phabricator.wikimedia.org/T405688#11545037 (jijiki) p:Triage→Medium
[13:49:31] serviceops, MW-on-K8s, ServiceOps new: Support shell to mw-experimental pod - https://phabricator.wikimedia.org/T405688#11545048 (jijiki)
[13:50:12] serviceops, MW-on-K8s, Prod-Kubernetes, ServiceOps new: Support shell to mw-experimental pod - https://phabricator.wikimedia.org/T405688#11545055 (jijiki)
[13:53:22] serviceops, MediaWiki-Engineering, ServiceOps new, Epic, Performance Issue: Limit the number of expensive API queries a user can perform - https://phabricator.wikimedia.org/T405472#11545068 (jijiki)
[13:56:44] serviceops, DC-Ops, SRE: Reimage sretest2009 as a wikikube worker and assess performance - https://phabricator.wikimedia.org/T400871#11545085 (MLechvien-WMF) @jasmine_ are you doing this task? Please ask others if you don't find the capacity
[14:13:14] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: MW deployments shouldn't need a hard-coded kubernetesVersion - https://phabricator.wikimedia.org/T388969#11545164 (MLechvien-WMF) p:Medium→High
[15:37:20] serviceops, Prod-Kubernetes, ServiceOps new: Update app.job module in deployment-charts - https://phabricator.wikimedia.org/T356885#11545643 (MLechvien-WMF) @jijiki does this need to be scheduled this quarter and why? I'm inclined to move it to Backlog until next quarter
[15:38:01] swfrench-wmf: I just deployed the node22 enabled mobileapps image with the new flags for memory limits
[15:39:05] nemo-yiannis: ah, thanks for the heads-up! how are things looking? :)
[15:39:11] dunno, checking now
[15:40:32] * swfrench-wmf thumbs up
[15:41:12] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Fix thumbor discovery records and make swift use them - https://phabricator.wikimedia.org/T397618#11545669 (MLechvien-WMF) @JMeybohm @Clement_Goubert this sounds like something we may need to do before next Kubernetes upgrade (or at leas...
[15:41:57] swfrench-wmf: i don't see anything problematic on grafana, but i assume if we encounter the same latency issue, it needs a bit of time for memory limits to kick in
[15:42:05] reading through the task, it looks like pod unavailability [0] and latency as seen by wikifeeds [1] were two good signals for badness.
[15:42:05] [0] https://grafana.wikimedia.org/goto/hnqW3LIDg?orgId=1
[15:42:05] [1] https://grafana.wikimedia.org/goto/j6o7qLIDR?orgId=1
[15:42:50] ... but IIRC it took a while for those to creep back up after a restart, so this will probably need to soak for a while
[15:43:05] 👍
[15:43:16] thanks again for moving this forward!
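For context on the memory-limit flags discussed above: `--max-old-space-size` caps V8's old-generation heap in megabytes, so a runaway process exits with an out-of-memory error instead of growing without bound. A minimal sketch follows; the `node_heap_opts` helper name and the 700 MiB value are made up for illustration and are not the values used for mobileapps.

```shell
# Hypothetical helper: compose the V8 heap-cap flag for NODE_OPTIONS.
# --max-old-space-size takes a value in megabytes.
node_heap_opts() {
  echo "--max-old-space-size=$1"
}

# Illustrative usage (700 is a made-up value):
#   NODE_OPTIONS="$(node_heap_opts 700)" node server.js
# In a container image, setting NODE_OPTIONS in the pod environment applies
# the flag without having to change the image's entrypoint.
```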
[15:43:34] <3 nemo-yiannis
[16:10:06] serviceops, Patch-For-Review: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#11545813 (Scott_French)
[16:18:29] serviceops, Patch-For-Review: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#11545841 (Scott_French) I've merged {T412265} into this task, as we believe it's another manifestation of the same class of failure modes discussed here. One key point of note...
[22:59:15] serviceops, Content-Transform-Team, Wikifeeds, Wikipedia-Android-App-Backlog: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13 - https://phabricator.wikimedia.org/T410296#11547120 (Scott_French) A couple of hours in after @Jgiannelos set `--max-old-space-size`...