[08:17:40] serviceops, sre-alert-triage: Alert in need of triage: WidespreadPuppetFailure - https://phabricator.wikimedia.org/T389037 (LSobanski) NEW
[08:18:42] serviceops, sre-alert-triage: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T389038 (LSobanski) NEW
[08:46:59] serviceops, Infrastructure-Foundations, Maps (Kartotherian): Scale up Kartotherian on Wikikube and move live traffic to it - https://phabricator.wikimedia.org/T386926#10640577 (elukey) I already started seeing some OOM kills for various pods due to the kartotherian-main container exceeding the allowe...
[08:51:21] serviceops, Infrastructure-Foundations, Maps (Kartotherian): Scale up Kartotherian on Wikikube and move live traffic to it - https://phabricator.wikimedia.org/T386926#10640602 (elukey) I can also confirm that Kartotherian on bare metal doesn't receive any more traffic: ` elukey@cumin1002:~$ sudo cum...
[08:56:17] serviceops, sre-alert-triage: Alert in need of triage: WidespreadPuppetFailure - https://phabricator.wikimedia.org/T389037#10640633 (MoritzMuehlenhoff) Open→Declined This is caused by WIP setup nodes for the parallel Bookworm cluster, but not affecting any production workloads.
[09:32:27] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update wikikube-staging-eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T389045 (JMeybohm) NEW
[10:15:08] hey folks!
[10:15:51] Kartotherian's pods started to be "recycled" by the OOM killer since they reached (after some days of serving requests) their maximum allowed memory
[10:16:41] We have clearly a memory leak, we kinda know how to fix it (see https://phabricator.wikimedia.org/T386926 for more info) but it may take a bit since some dev time is needed to add the necessary code to Kartotherian.
[10:17:16] I am inclined to leave things as they are and proceed with the removal of kartotherian from bare metals, since the impact of OOMs is currently limited for external users
[10:17:43] it is a compromise that I don't like a lot, but it seems a good-enough solution for the moment
[10:17:53] lemme know if you think otherwise
[11:00:13] this is alright indeed, thank you for letting us know luca
[11:07:22] thanks :)
[11:08:02] dcausse: o/ re: kartotherian and wdqs, IIUC kartotherian uses it for the geoshapes endpoint (from the config: {{ .Values.app.wdqs.endpoint }}/bigdata/namespace/wdq/sparql)
[11:11:29] <_joe_> elukey: tbh letting these things crash and restart when OOM is ok
[11:11:36] <_joe_> as long as they don't all crash at the same time
[11:13:09] _joe_ I feel very sad but yeah I agree
[11:13:30] <_joe_> elukey: the sadness begins where karthoterian does, though
[11:14:47] serviceops, MediaWiki-Platform-Team: Migrate mediawiki-platform-team jobs to mw-cron - https://phabricator.wikimedia.org/T388540#10641161 (Tgr) The non-ResourceLoader-related ones have been discussed in {T385866}.
[11:16:42] elukey: ack, but still unclear what would be a good test scenario, happy to be around during the deploy tho to monitor things on the wdqs side
[11:49:48] elukey: if it is every couple of days, it's not perfect, but it's probably quite ok.
[12:30:00] 👋 Is it OK if i deploy changeprop now?
There was a pending patch from Friday that hnowlan already reviewed
[12:50:12] nemo-yiannis: sure
[12:50:31] 👍
[12:54:29] done
[13:00:54] serviceops, sre-alert-triage: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T389038#10641477 (Clement_Goubert) →Duplicate dup:T383032
[13:00:54] serviceops, Abstract Wikipedia team (25Q3 (Jan–Mar)): wikifunction httpbb tests fail because of title case issue - https://phabricator.wikimedia.org/T383032#10641479 (Clement_Goubert)
[13:01:15] cheers yiannis
[13:02:42] serviceops, Abstract Wikipedia team (25Q3 (Jan–Mar)): wikifunction httpbb tests fail because of title case issue - https://phabricator.wikimedia.org/T383032#10641502 (Clement_Goubert) Downtiming again.
[13:03:27] serviceops, MediaWiki-extensions-ReadingLists, MW-Interfaces-Team, RESTBase Sunsetting: Switchover plan from RESTbase to REST Gateway for Reading Lists endpoints - https://phabricator.wikimedia.org/T384891#10641503 (Dbrant)
[13:06:58] serviceops, MediaWiki-Platform-Team: Migrate mediawiki-platform-team jobs to mw-cron - https://phabricator.wikimedia.org/T388540#10641549 (Clement_Goubert)
[13:10:00] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10641561 (Jdforrester-WMF)
[13:10:06] serviceops, MediaWiki-extensions-ReadingLists, MW-Interfaces-Team, RESTBase Sunsetting: Switchover plan from RESTbase to REST Gateway for Reading Lists endpoints - https://phabricator.wikimedia.org/T384891#10641563 (HCoplin-WMF) Timeline update based on mobile team capacity: - If possible, let's...
[13:12:34] serviceops, MediaWiki-extensions-ReadingLists, MW-Interfaces-Team, RESTBase Sunsetting: Switchover plan from RESTbase to REST Gateway for Reading Lists endpoints - https://phabricator.wikimedia.org/T384891#10641575 (HCoplin-WMF)
[13:13:49] serviceops, Datacenter-Switchover, User-notice: MoveComms support for March 2025 Datacentre switchover - https://phabricator.wikimedia.org/T387444#10641588 (Trizek-WMF) Than you both for your help!
[13:15:50] serviceops, MediaWiki-Platform-Team: Migrate mediawiki-platform-team jobs to mw-cron - https://phabricator.wikimedia.org/T388540#10641598 (Tgr) The remainder is for [[https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaMaintenance/+/5b73dccad5df76c042c19746a1bbe1556915667b/blameStartupRegistry.p...
[13:34:46] serviceops, Page Content Service, RESTBase Sunsetting, Code-Health-Objective, and 2 others: Move PCS endpoints behind API Gateway - https://phabricator.wikimedia.org/T264670#10641681 (Dbrant)
[14:31:18] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Fix dependencies between admin_ng deployments - https://phabricator.wikimedia.org/T389080 (JMeybohm) NEW
[14:50:41] serviceops, Kubernetes, Patch-For-Review: Add pod ip address blocks to staging-eqiad - https://phabricator.wikimedia.org/T386232#10642039 (JMeybohm) Open→Resolved staging-eqiad switched to the new ip pool today (T389045)
[14:50:44] serviceops, collaboration-services, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update wikikube-staging-codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T384450#10642045 (JMeybohm) In progress→Resolved
[14:50:47] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 2 others: Update wikikube-staging-eqiad to kubernetes 1.31 - https://phabricator.wikimedia.org/T389045#10642047 (JMeybohm) Open→Resolved
[14:51:48] serviceops,
collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 2 others: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984#10642050 (JMeybohm)
[14:53:28] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Check/update grafana dashboards for k8s 1.31 - https://phabricator.wikimedia.org/T389084 (JMeybohm) NEW
[14:56:06] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Check/update grafana dashboards for k8s 1.31 - https://phabricator.wikimedia.org/T389084#10642087 (JMeybohm) p:Triage→High
[14:56:12] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Fix dependencies between admin_ng deployments - https://phabricator.wikimedia.org/T389080#10642089 (JMeybohm) p:Triage→Medium
[14:59:19] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: wipe-cluster cookbook should check if systemd services have started properly - https://phabricator.wikimedia.org/T389086 (JMeybohm) NEW
[14:59:28] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: wipe-cluster cookbook should check if systemd services have started properly - https://phabricator.wikimedia.org/T389086#10642114 (JMeybohm) p:Triage→Medium
[15:09:41] serviceops, Prod-Kubernetes, Kubernetes: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408#10642162 (JMeybohm)
[15:11:35] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update kube-state-metrics for k8s 1.31 - https://phabricator.wikimedia.org/T388387#10642176 (JMeybohm) a:kamila
[15:13:20] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 2 others: Ensure the correct helm version is used for each cluster - https://phabricator.wikimedia.org/T388390#10642197
(JMeybohm) a:kamila
[15:19:05] serviceops, Kubernetes, Patch-For-Review: Add pod ip address blocks to staging - https://phabricator.wikimedia.org/T386232#10642250 (JMeybohm)
[16:03:19] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 2 others: Ensure all required kubectl versions are installed on deploy hosts - https://phabricator.wikimedia.org/T388388#10642466 (JMeybohm) Defining an explicit require for the apt-get update call does also not work.. As...
[16:23:42] serviceops, Discovery-Search (2025.03.01 - 2025.03.21): Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538#10642577 (Gehel)
[16:35:07] serviceops, Data-Engineering, Discovery-Search, Data-Platform-SRE (2025.03.01 - 2025.03.21): Search Update Pipeline requests to Action API are logged as coming from 127.0.0.1 - https://phabricator.wikimedia.org/T388855#10642657 (Gehel)
[16:51:00] serviceops, Discovery-Search (2025.03.01 - 2025.03.21): Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538#10642790 (dcausse) @Clement_Goubert the move of CirrusSearch maint scripts to mwscript-k8s is blocked on T382398 and I suspect that mw-cron might have the same issu...
[16:53:15] serviceops, Discovery-Search (2025.03.01 - 2025.03.21): Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538#10642800 (dcausse)
[16:55:00] serviceops, Discovery-Search (2025.03.01 - 2025.03.21): Migrate discovery-search jobs to mw-cron - https://phabricator.wikimedia.org/T388538#10642811 (Clement_Goubert) >>! In T388538#10642790, @dcausse wrote: > @Clement_Goubert the move of CirrusSearch maint scripts to mwscript-k8s is blocked on T382398...
[16:57:13] serviceops: Mediawiki maint scripts using service proxied by the tls proxy might fail when running with mwscript-k8s - https://phabricator.wikimedia.org/T382398#10642830 (Clement_Goubert) {T387208} should work around that issue until we have a proper sidecar. Basically `MwScript.php` checks that the tls-prox...
[17:03:16] Hey, regarding T388140: We are running cache prewarming for another round of wikis rollout for the PCS/RESTBase deprecation. Prewarming should be done today, and changeprop rules are enabled. Next step is to send production traffic. Is there anyone who can help us with that this week?
[17:04:32] serviceops: Mediawiki maint scripts using service proxied by the tls proxy might fail when running with mwscript-k8s - https://phabricator.wikimedia.org/T382398#10642895 (dcausse) @Clement_Goubert the script mentioned in this ticket now runs properly, will mark this task as dup of T387208, thanks!
[17:04:53] serviceops: Mediawiki maint scripts using service proxied by the tls proxy might fail when running with mwscript-k8s - https://phabricator.wikimedia.org/T382398#10642901 (dcausse) →Duplicate dup:T387208
[17:04:56] serviceops, MW-on-K8s: Ensure tls-proxy container is started before launching main container - https://phabricator.wikimedia.org/T387208#10642903 (dcausse)
[17:05:22] nemo-yiannis: it's DC switchover week. Expect capacity to be way lower this week. Also, if you can delay it for like a few days, it would be awesome!
[17:08:39] ok sounds good
[17:45:56] serviceops, Security-Team: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10643214 (sbassett) A couple of things: # Yes, we still need to keep `mediawiki_job_generatecaptcha.timer` for now, I believe. @Reedy could confirm that as well. # I'm not personally aware of...
[17:46:20] serviceops, Security-Team, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10643218 (sbassett)
[18:41:17] serviceops, Security-Team, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10643437 (Reedy) Yeah, we (unfortunately) need to keep the job around, and almost certainly for quite a while now. Depending on how {T250227} all goes, we may not need to f...
[19:00:24] hi folks. relaying a question from the web team here. specifically, T387881. are there any restrictions that prevent them from serving .md|.json|.txt files from /w/ under app servers?
[19:00:37] I don't think I see any but I want to be sure about it before giving a response
[19:01:08] or is mediawiki-config wmgProhibitedFileExtensions it?
[19:02:02] the wmg is for stuff via Special:Upload
[19:02:11] Depends what they mean by... serving sukhe
[19:02:40] Reedy: ah.
[19:03:15] well they are running an A/B test and they want to put 8000 static files (txt,md,json) under /w/
[21:13:22] serviceops, Patch-For-Review, PHP 8.1 support: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006#10644093 (Scott_French) Open→Resolved Alright, that should complete the remaining cleanup. Thanks, all!