[02:39:30] serviceops, MW-Interfaces-Team: Migrate mw-interfaces-team jobs to mw-cron - https://phabricator.wikimedia.org/T388541#10644978 (HCoplin-WMF) Hey -- could we get a little bit more information about the origin and urgency of this request? Is there a deadline by which everything must be migrated?
[07:46:25] serviceops, Deployments, Release-Engineering-Team: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 (hashar) NEW
[07:47:32] serviceops, Deployments, Release-Engineering-Team: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#10645316 (hashar) That has happening again when doing the backport this morning. I have filed another task T389169 since...
[07:47:48] serviceops, Deployments, Release-Engineering-Team: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645320 (hashar)
[07:48:59] serviceops, Deployments, Release-Engineering-Team: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645321 (hashar) p:Triage→Unbreak! I am marking this an {nav Unbreak Now!} since the test failed repeatedly and I don't...
[07:53:29] serviceops, Deployments, Release-Engineering-Team: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645344 (jnuche) Noting this also happened last night during the train presync: ` 03:32:44 Executing check 'check_testservers_...
[08:24:43] serviceops, Deployments, Release-Engineering-Team: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645425 (hashar) I have looked at logstash https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002 with `message:500` ....
[08:29:13] serviceops, Deployments, Release-Engineering-Team, Wikimedia-production-error: UnexpectedValueException: Invalid server index # causes eployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645444 (hashar)
[08:40:06] serviceops, Deployments, Release-Engineering-Team, Wikimedia-production-error: UnexpectedValueException: Invalid server index # causes eployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645475 (hashar) I am pretty sure tha...
[08:54:42] serviceops, Deployments, Release-Engineering-Team, Wikimedia-production-error: UnexpectedValueException: Invalid server index # causes eployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645505 (Ladsgroup) a:Ladsgroup
[09:07:43] serviceops, Deployments, Release-Engineering-Team, Wikimedia-production-error: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645539 (Aklapper)
[09:18:18] serviceops, DBA, Deployments, Release-Engineering-Team, and 2 others: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645562 (Ladsgroup)
[09:48:14] serviceops, MW-Interfaces-Team: Migrate mw-interfaces-team jobs to mw-cron - https://phabricator.wikimedia.org/T388541#10645657
(akosiaris) I 'll add that this is related to the PHP 8.1 upgrade goal. PHP 8.1 is available only on Kubernetes and thus stalling this will stall the PHP 8.1 goal. As far as dea...
[09:53:41] serviceops, DBA, Deployments, Release-Engineering-Team, and 2 others: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645681 (hashar) Open→Resolved Th...
[10:01:33] serviceops, DBA, Deployments, Release-Engineering-Team, and 2 others: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645701 (Ladsgroup) Sorry for breaking it...
[10:15:54] serviceops, Security-Team, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10645740 (Clement_Goubert) >>! In T388531#10643437, @Reedy wrote: > Yeah, we (unfortunately) need to keep the job around, and almost certainly for quite a while now. > > De...
[10:16:24] serviceops, MW-on-K8s: Convert captchaloop to kubernetes CronJob - https://phabricator.wikimedia.org/T380167#10645744 (Clement_Goubert)
[10:16:25] serviceops, Security-Team, SecTeam-Processed: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531#10645743 (Clement_Goubert)
[10:29:08] serviceops, DBA, Deployments, Release-Engineering-Team, and 2 others: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645801 (Ammarpad)
[10:36:52] serviceops, Wikimedia-production-error: startupregistrystats-testwiki fails to run on php-1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389182 (Clement_Goubert) NEW p:Triage→High
[10:38:06] serviceops, Wikimedia-production-error: startupregistrystats-testwiki fails to run on php-1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389182#10645853 (Clement_Goubert)
[10:41:23] serviceops, MediaWiki-Platform-Team, Wikimedia-production-error: startupregistrystats-testwiki fails to run on php-1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389182#10645860 (Clement_Goubert)
[10:43:45] serviceops, MediaWiki-Platform-Team, Wikimedia-production-error: startupregistrystats-testwiki fails to run on php-1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389182#10645866 (jnuche) p:High→Unbreak!
[11:22:12] hnowlan: Hiii, I'm bumping thumbnail steps to 20% (since it's bumping the default thumbsize, a lot is being re-generated). The graphs look okay for thumbor: https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=1m&from=now-30d&to=now but do you want us to bump replicas or anything?
[11:25:20] Amir1: prrrrroably okay - when was the last bump, just out of curiosity?
[11:25:53] the percentage? I've been doing that 5% every day. The default thumbsize?
Never
[11:26:41] another few replicas couldn't hurt
[11:28:45] could we hold until after the services switchover later?
[11:31:04] my main problem is that it'll take more than a month already (5% in any day you can deploy = four days in every week) so I rather not keep it blocked for long
[11:31:10] but one or two days is definitely fine
[11:33:25] it's probably fine if we just bump replicas https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1128840
[11:34:25] Thanks!
[12:51:29] serviceops, MW-on-K8s: Convert captchaloop to kubernetes CronJob - https://phabricator.wikimedia.org/T380167#10646376 (Reedy) If as mentioned in the other tasks, the unlimit isn’t in place, the captchaloop script should be redundant…
[12:52:23] serviceops, MediaWiki-Platform-Team, Wikimedia-production-error: startupregistrystats-testwiki fails to run on php-1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389182#10646379 (Jdforrester-WMF) Live thoughts. * Error coming from this job: ` profile::mediawiki::periodic_job { 'startupregist...
[12:57:19] serviceops, MediaWiki-Platform-Team, Wikimedia-production-error: startupregistrystats-testwiki fails to run on php-1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389182#10646387 (Tgr) Is this happening now or are these errors in the past? {T388646} caused a bunch of temporary localization cache i...
[13:51:48] serviceops, MediaWiki-Platform-Team, Wikimedia-production-error: startupregistrystats-testwiki fails to run on php-1.44.0-wmf.21 - https://phabricator.wikimedia.org/T389182#10646674 (Clement_Goubert) Open→Resolved a:Clement_Goubert Seems fixed now, but the errors I posted were from th...
[15:05:10] serviceops, Datacenter-Switchover, Patch-For-Review: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155#10647033 (ops-monitoring-bot) hnowlan@cumin2002 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Switchover - T385155 s...
[15:35:00] serviceops, Datacenter-Switchover, Patch-For-Review: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155#10647196 (ops-monitoring-bot) hnowlan@cumin2002 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Switchover - T385155 c...
[15:46:35] serviceops, Datacenter-Switchover: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10647257 (Scott_French)
[15:53:54] serviceops: Align mw-on-k8s alerts with capacity pools - https://phabricator.wikimedia.org/T389224 (Scott_French) NEW
[15:54:05] serviceops: Align mw-on-k8s alerts with capacity pools - https://phabricator.wikimedia.org/T389224#10647329 (Scott_French)
[15:57:07] serviceops: Align mw-on-k8s alerts with capacity pools - https://phabricator.wikimedia.org/T389224#10647352 (Clement_Goubert) Silence `7112e3a2-4430-401a-b5d5-f43b42a2f5ed` on `alertname="PHPFPMTooBusy", release="canary"` created for 6 days (next Monday)
[16:49:53] serviceops, Infrastructure-Foundations, Maps (Kartotherian): Scale up Kartotherian on Wikikube and move live traffic to it - https://phabricator.wikimedia.org/T386926#10647682 (elukey) Open→Resolved a:elukey Today I moved the kartotherian svc fully to wikikube, changing the LVS config...
[17:22:33] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 2 others: Ensure the correct helm version is used for each cluster - https://phabricator.wikimedia.org/T388390#10647879 (kamila) Open→Stalled
[17:39:45] serviceops, Discovery-Search, Data-Platform-SRE (2025.03.01 - 2025.03.21): Search Update Pipeline requests to Action API are logged as coming from 127.0.0.1 - https://phabricator.wikimedia.org/T388855#10647971 (Ottomata)
[17:41:06] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10647983 (Scott_French)
[18:21:58] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243 (mszabo) NEW
[18:22:14] ^^^ this is kinda important as it is causing performance.wikimedia.org to show junk data
[18:23:38] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243#10648347 (mszabo) Example PHP 8.1 flamegraph with broken excimer: https://performance.wikimedia.org/arclamp/svgs/daily/2025-03-17.excimer-k8s-php8.all.svgz?s=ChangeTags Com...
[18:36:34] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10648384 (Novem_Linguae)
[19:13:01] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10648478 (Jdforrester-WMF)
[20:19:52] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243#10648692 (Krinkle)
[20:25:30] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243#10648713 (Krinkle) > This is causing performance.wikimedia.org flamegraphs to show bogus data. Specifically, it means we have no telemetry on the first few whole seconds of...
[20:26:06] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243#10648717 (Krinkle)
[20:26:14] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10648719 (Krinkle)
[20:26:22] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243#10648720 (Krinkle)
[20:26:36] serviceops, Data-Engineering, Data-Engineering-Radar, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432#10648721 (Krinkle)
[20:48:44] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243#10648807 (Scott_French) Thanks for flagging.
There are a couple of moving parts to coordinate, but I'll aim to get a build with 1.2.3 out this week.
[20:49:13] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243#10648808 (Scott_French) a:Scott_French
[21:38:18] serviceops, DC-Ops, decommission-hardware, ops-eqiad, SRE: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10649090 (VRiley-WMF) kafka-main1001, kafka-main1002, kafka-main1003, kafka-main1004 have been...
[21:39:18] serviceops, DC-Ops, decommission-hardware, ops-eqiad, SRE: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10649096 (VRiley-WMF)
[22:58:30] serviceops, Excimer, MediaWiki-Platform-Team: Update Excimer to 1.2.3 in production - https://phabricator.wikimedia.org/T389243#10649436 (Scott_French) This turned out to be a bit more involved than expected, in part because the Debian PHP team has not uploaded 1.2.3 to unstable yet (e.g., associated...