[04:23:31] Beta cluster down for anyone else? It seems apart from a few URLs with varnish hits, everything else is timing out
[04:24:28] e.g. https://en.wikipedia.beta.wmflabs.org/wiki/Special:Blankpage
[04:24:46] I note that https://www.wikimedia.beta.wmflabs.org/ does respond (HTML/CSS, images time out)
[04:25:12] Request served via deployment-cache-text08 deployment-cache-text08, Varnish XID 35356738
[04:25:12] Error: 503, Backend fetch failed at Tue, 18 Mar 2025 04:24:46 GMT
[04:25:21] https://www.wikimedia.beta.wmflabs.org/portal/wikimedia.org/assets/img/Wikinews-logo_sister.svg
[04:53:19] WikimediaDebug: Increase WikimediaDebug session length - https://phabricator.wikimedia.org/T389129#10645081 (Krinkle) I suspect the 15min timer might the issue here, although if you're operating on the assumption that this timer is working correctly, I can certainly see why that seems the problem. I noticed...
[05:04:58] Project beta-code-update-eqiad build #539652: FAILURE in 1 min 57 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/539652/
[05:09:06] Some requests get a response after 62-65s, and others presumably longer / hit http 503.
[05:09:25] Maybe fixed itself meanwhile..
[05:09:31] Spoke too soon...
[05:15:04] Yippee, build fixed!
[05:15:04] Project beta-code-update-eqiad build #539653: FIXED in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/539653/
[07:05:26] (update) aklapper: Include login.wm.o and auth.wm.o in OAuth CSP rule [repos/phabricator/extensions] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/extensions/-/merge_requests/49 (https://phabricator.wikimedia.org/T376803)
[07:15:33] Scap: scap backport should only fetch the deployed branch to avoid spammy output - https://phabricator.wikimedia.org/T389167 (hashar) NEW
[07:46:25] Deployments, Release-Engineering-Team, serviceops: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169 (hashar) NEW
[07:47:32] Deployments, Release-Engineering-Team, serviceops: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#10645316 (hashar) That has happening again when doing the backport this morning. I have filed another task T389169 since...
[07:47:46] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216#10645319 (hashar)
[07:47:48] Deployments, Release-Engineering-Team, serviceops: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645320 (hashar)
[07:48:59] Deployments, Release-Engineering-Team, serviceops: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645321 (hashar) p:Triage→Unbreak! I am marking this an {nav Unbreak Now!} since the test failed repeatedly and I don't...
[07:53:29] Deployments, Release-Engineering-Team, serviceops: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645344 (jnuche) Noting this also happened last night during the train presync: ` 03:32:44 Executing check 'check_testservers_...
[08:24:43] Deployments, Release-Engineering-Team, serviceops: Deployment fails due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645425 (hashar) I have looked at logstash https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002 with `message:500` ....
[08:29:12] Deployments, Release-Engineering-Team, serviceops, Wikimedia-production-error: UnexpectedValueException: Invalid server index # causes eployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645444 (hashar)
[08:35:24] (PS3) Arthur taylor: Update list of phpunit config files to copy to log directory [integration/quibble] - https://gerrit.wikimedia.org/r/1113983 (https://phabricator.wikimedia.org/T378797)
[08:35:33] (CR) Arthur taylor: Update list of phpunit config files to copy to log directory (1 comment) [integration/quibble] - https://gerrit.wikimedia.org/r/1113983 (https://phabricator.wikimedia.org/T378797) (owner: Arthur taylor)
[08:40:06] Deployments, Release-Engineering-Team, serviceops, Wikimedia-production-error: UnexpectedValueException: Invalid server index # causes eployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645475 (hashar) I am pretty sure tha...
[08:43:17] (CR) CI reject: [V:-1] Update list of phpunit config files to copy to log directory [integration/quibble] - https://gerrit.wikimedia.org/r/1113983 (https://phabricator.wikimedia.org/T378797) (owner: Arthur taylor)
[08:54:42] Deployments, Release-Engineering-Team, serviceops, Wikimedia-production-error: UnexpectedValueException: Invalid server index # causes eployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645505 (Ladsgroup) a:Ladsgroup
[09:07:43] Deployments, Release-Engineering-Team, serviceops, Wikimedia-production-error: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645539 (Aklapper)
[09:18:18] Deployments, Release-Engineering-Team, DBA, serviceops, and 2 others: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645562 (Ladsgroup)
[09:42:34] Krinkle: I am still seeing slow responses and 503 errors from beta
[09:52:52] Phabricator, Project-Admins: Allow the use of team projects as representation of teams (restrict their project membership) - https://phabricator.wikimedia.org/T126055#10645676 (Aklapper)
[09:53:34] Phabricator, Project-Admins: Allow the use of team projects as representation of teams (restrict their project membership) - https://phabricator.wikimedia.org/T126055#10645680 (Aklapper)
[09:53:41] Deployments, Release-Engineering-Team, DBA, serviceops, and 2 others: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645681 (hashar) Open→Resolved Th...
[09:59:39] who knows really
[10:00:10] there are bunch of spam from `deployment-jobrunner05` which has `program: php7.2-fpm`
[10:00:15] PHP Warning: preg_match(): Compilation failed: unrecognised compile-time option bit(s) at offset 0 in /srv/mediawiki/php-master/includes/libs/http/MultiHttpClient.php on line 722
[10:01:19] we have dropped 7.2 back in fall 2022
[10:01:33] Deployments, Release-Engineering-Team, DBA, serviceops, and 2 others: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645701 (Ladsgroup) Sorry for breaking it...
[10:03:43] also
[10:03:52] mpm_worker | deployment-mediawiki14 | AH00288: scoreboard is full, not at MaxRequestWorkers[mpm_worker:error] [pid 3242082:tid 3242082] AH00288: scoreboard is full, not at MaxRequestWorkers
[10:06:00] Apache also often spams `AH01079: failed to make connection to backend: 127.0.0.1` or `Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1:8000 (127.0.0.1:8000) failed`
[10:06:04] which sounds like php-fpm is borked
[10:13:04] that is from php-fpm restart
[10:19:30] pff
[10:20:03] looks like nobody bothered to file a task
[10:20:06] so I'd do it :)
[10:29:08] Deployments, Release-Engineering-Team, DBA, serviceops, and 2 others: UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests - https://phabricator.wikimedia.org/T389169#10645801 (Ammarpad)
[10:38:09] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216#10645854 (Clement_Goubert)
[11:28:28] thanks hashar
[11:46:23] (CR) Lucas Werkmeister (WMDE): [C:+1] "looks good to me but I can’t +2 here anyways ^^" [integration/quibble] - https://gerrit.wikimedia.org/r/1113983
(https://phabricator.wikimedia.org/T378797) (owner: Arthur taylor)
[11:50:02] maintenance-disconnect-full-disks build 685137 integration-agent-docker-1044 (/: 26%, /srv: 96%, /var/lib/docker: 33%): OFFLINE due to disk space
[11:55:02] maintenance-disconnect-full-disks build 685138 integration-agent-docker-1044 (/: 26%, /srv: 80%, /var/lib/docker: 32%): RECOVERY disk space OK
[13:03:52] Phabricator (phabricator-next), Release-Engineering-Team (Doing 😎): Unable to preview MP4 video in Phabricator task comments and descriptions - https://phabricator.wikimedia.org/T309222#10646407 (Aklapper) Open→Resolved Should work now: ` aklapper@phab1004:/srv/phab/phabricator$ sudo ./bin/co...
[13:12:51] WikimediaDebug: Increase WikimediaDebug session length - https://phabricator.wikimedia.org/T389129#10646422 (Tgr) I use the extension in Chrome; as far as I can tell the timeout seems to work correctly there. (The specific reproduction steps you give certainly don't work for me.) 15 minutes is just too short...
[13:31:14] Krinkle: dwalden: beta should be fixed now :) The glory details are in https://phabricator.wikimedia.org/T389181
[13:31:51] there is another issue which is that when php-fpm is restarted, connections are apparently dropped
[13:32:05] or at least Apache does not buffer the incoming connections which end up being dropped
[13:32:20] which might be an issue in production as well
[13:32:46] but since we have moved to kubernetes, I guess we no more have to restart php-fpm
[13:33:10] and instead have the traffic switched from containers running image N toward containers having image N+1
[13:38:50] o/ we're seeing some gitlab jobs stuck, e.g.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/465299
[14:31:14] (merge) aklapper: Include login.wm.o and auth.wm.o in OAuth CSP rule [repos/phabricator/extensions] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/extensions/-/merge_requests/49 (https://phabricator.wikimedia.org/T376803)
[15:18:42] !log run CommunityUpdates config schema migration `foreachwikiindblist growthexperiments extensions/CommunityConfiguration/maintenance/migrateConfig.php CommunityUpdates` (T387737)
[15:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:18:46] T387737: Community updates module: allow to set a white background for images in dark mode - https://phabricator.wikimedia.org/T387737
[15:21:25] Phabricator (phabricator-next), Release-Engineering-Team, collaboration-services: Deploy Phabricator/Phorge 2025-03-18 - https://phabricator.wikimedia.org/T389220 (brennen) NEW
[15:21:50] (open) brennen: update submodules for 2025-03-18 release [repos/phabricator/deployment] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/64 (https://phabricator.wikimedia.org/T389220)
[15:22:35] Phabricator (Upstream), Upstream: Add another way to add two factor auth than application (SMS, email, etc) - https://phabricator.wikimedia.org/T187256#10647143 (TheDJ)
[15:22:54] Phabricator (Upstream), Upstream: Add another way to add two factor auth than application (SMS, email, etc) - https://phabricator.wikimedia.org/T187256#10647146 (TheDJ)
[15:47:23] (merge) brennen: update submodules for 2025-03-18 release [repos/phabricator/deployment] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/64 (https://phabricator.wikimedia.org/T389220)
[15:52:07] Phabricator (phabricator-next), Release-Engineering-Team, collaboration-services, Patch-For-Review: Deploy Phabricator/Phorge 2025-03-18 -
https://phabricator.wikimedia.org/T389220#10647298 (brennen) Deployed to https://phabricator.wmcloud.org/ in #vps-project-devtools.
[16:13:55] Beta-Cluster-Infrastructure, WMF-JobQueue: Jobs are not being enqueued in beta - https://phabricator.wikimedia.org/T387631#10647478 (Ottomata) I don't know what is wrong, but some details to help you search: - job events are posted to [[ https://github.com/wikimedia/operations-mediawiki-config/blob/mast...
[16:16:04] Phabricator (2025-03-18), Release-Engineering-Team, collaboration-services, Patch-For-Review: Deploy Phabricator/Phorge 2025-03-18 - https://phabricator.wikimedia.org/T389220#10647489 (Aklapper) Open→Resolved a:brennen
[16:17:16] Phabricator (2025-03-18), Release-Engineering-Team (Doing 😎): Uninstall Phlux (Phabricator application) - https://phabricator.wikimedia.org/T389117#10647498 (Aklapper) Open→Resolved
[16:17:46] Phabricator (2025-03-18), Release-Engineering-Team (Doing 😎): Disallow crawling /project/reports/ - https://phabricator.wikimedia.org/T388961#10647500 (Aklapper) Open→Resolved
[16:17:51] Phabricator (2025-03-18), Release-Engineering-Team (Doing 😎), Wikimedia-Phabricator-Extensions, Technical-Debt: SecurityPolicyEnforcerAction: !empty($forced_policies) is always falsy - https://phabricator.wikimedia.org/T385872#10647512 (Aklapper) Open→Resolved
[16:19:17] Phabricator (2025-03-18), Wikimedia-Phabricator-Extensions: Literal newlines on displayed text - https://phabricator.wikimedia.org/T389024#10647531 (Aklapper) Open→Resolved a:Aklapper This isn't reachable anymore in the UI since deploying https://gitlab.wikimedia.org/repos/phabricator/ext...
[16:19:40] (update) aklapper: Remove "Burnup Graph" project menu item code [repos/phabricator/extensions] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/extensions/-/merge_requests/53 (https://phabricator.wikimedia.org/T388664)
[16:19:46] (update) aklapper: Remove "Burnup Graph" project menu item code [repos/phabricator/extensions] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/extensions/-/merge_requests/53 (https://phabricator.wikimedia.org/T388664)
[16:21:41] (merge) aklapper: Remove "Burnup Graph" project menu item code [repos/phabricator/extensions] (wmf/stable) - https://gitlab.wikimedia.org/repos/phabricator/extensions/-/merge_requests/53 (https://phabricator.wikimedia.org/T388664)
[16:22:09] Beta-Cluster-Infrastructure, WMF-JobQueue: Jobs are not being enqueued in beta - https://phabricator.wikimedia.org/T387631#10647547 (bd808) >>! In T387631#10647478, @Ottomata wrote: > Some [[ https://wikimedia.slack.com/archives/C05H0JYT85V/p1742312280257499 | discussion in Slack ]] seems to indicate thi...
[16:22:17] Phabricator (phabricator-next), Release-Engineering-Team (Doing 😎), Wikimedia-Phabricator-Extensions, Technical-Debt: Remove "Burnup Graph" project menu item and custom ProjectBurnupGraphProfileMenuItem code - https://phabricator.wikimedia.org/T388664#10647548 (Aklapper) Stalled→Open
[16:25:50] Phabricator (2025-03-18), Release-Engineering-Team (Doing 😎), Wikimedia-Phabricator-Extensions, Browser-Support-Google-Chrome: Phab login via SUL works only on second time with Chrome (due to CSP and redirect) - https://phabricator.wikimedia.org/T376803#10647587 (Aklapper) Open→Resolve...
[16:28:30] Beta-Cluster-Infrastructure, WMF-JobQueue: Jobs are not being enqueued in beta - https://phabricator.wikimedia.org/T387631#10647607 (bd808) > Also, the jobqueue is empty: I think that test is red herring.
The `showJobs.php` maintenance script does not work the https://wikitech.wikimedia.org/wiki/MediaWi...
[16:38:32] (update) oblivian: Allow multiple kubernetes clusters to be used [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/681 (https://phabricator.wikimedia.org/T388761)
[16:38:41] (update) oblivian: Allow multiple kubernetes clusters to be used [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/681 (https://phabricator.wikimedia.org/T388761)
[16:42:30] Continuous-Integration-Config, Testing Support, Browser-Tests, Test-Platform (The Next One): Remove wdio-video-reporter from all repositories - https://phabricator.wikimedia.org/T294341#10647659 (SDunlap) a:zeljkofilipin
[16:51:06] Beta-Cluster-Infrastructure, WMF-JobQueue: Jobs are not being enqueued in beta - https://phabricator.wikimedia.org/T387631#10647693 (bd808) Enqueuing a job with `eval.php` seems to work: `lang=shell-session bd808@deployment-mwmaint03:~$ mwscript eval.php enwiki --d 1 DEPRECATION WARNING: Maintenance scri...
[16:55:03] Beta-Cluster-Infrastructure, WMF-JobQueue: Jobs are not being enqueued in beta - https://phabricator.wikimedia.org/T387631#10647724 (Daimona) Looks like the job I was complaining about also gets enqueued: `lang=shell-session daimona@deployment-kafka-main-6:~$ kafkacat -b localhost:9092 -C -t eqiad.media...
[16:56:50] Beta-Cluster-Infrastructure, WMF-JobQueue: Are eqiad.mediawiki.job.CampaignEventsFindPotentialInvitees jobs being processed in beta? - https://phabricator.wikimedia.org/T387631#10647729 (bd808)
[17:07:35] Beta-Cluster-Infrastructure, WMF-JobQueue: Are eqiad.mediawiki.job.CampaignEventsFindPotentialInvitees jobs being processed in beta? - https://phabricator.wikimedia.org/T387631#10647782 (bd808) `lang=shell-session root@deployment-changeprop-1:~# systemctl status changeprop --no-pager -l ● changeprop.serv...
[17:10:11] Beta-Cluster-Infrastructure, WMF-JobQueue: Are eqiad.mediawiki.job.CampaignEventsFindPotentialInvitees jobs being processed in beta? - https://phabricator.wikimedia.org/T387631#10647801 (bd808) And suddenly I remember working on {T388043} recently.
[17:47:35] (open) dancy: spiderpig: Admin users are automatically authorized [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/689 (https://phabricator.wikimedia.org/T383947)
[17:47:37] (update) dancy: spiderpig: Admin users are automatically authorized [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/689 (https://phabricator.wikimedia.org/T383947)
[17:48:01] Beta-Cluster-Infrastructure, Release-Engineering-Team: PHP on Beta cluster fails due to mismatching PCRE - https://phabricator.wikimedia.org/T387276#10648022 (Daimona) Resolved→Open These seem to be occurring still, but only on deployment-jobrunner05. The rate is ~800 errors per minute. [[https:/...
[17:50:14] (update) dancy: spiderpig: Admin users are automatically authorized [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/689 (https://phabricator.wikimedia.org/T383947)
[18:05:10] Beta-Cluster-Infrastructure, observability: Bring beta cluster logstash to a readable state - https://phabricator.wikimedia.org/T389239 (Daimona) NEW
[18:09:43] Beta-Cluster-Infrastructure, Release-Engineering-Team: PHP on Beta cluster fails due to mismatching PCRE - https://phabricator.wikimedia.org/T387276#10648249 (Daimona) Also noting that these entries have `program: php7.2-fpm` which is confusing. Presumably a reference to update somewhere ([[https://gerri...
[18:17:24] (open) dancy: spiderpig: Drop queued column from JobCard [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/690 (https://phabricator.wikimedia.org/T383835)
[18:17:26] (update) dancy: spiderpig: Drop queued column from JobCard [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/690 (https://phabricator.wikimedia.org/T383835)
[18:32:16] (update) dancy: spiderpig: Drop queued column from JobCard [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/690 (https://phabricator.wikimedia.org/T383835)
[18:35:54] (update) dancy: spiderpig: Drop queued column from JobCard [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/690 (https://phabricator.wikimedia.org/T383835)
[18:35:58] (update) dancy: spiderpig: Drop queued column from JobCard [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/690 (https://phabricator.wikimedia.org/T383835)
[18:37:08] (merge) dancy: spiderpig: Drop queued column from JobCard [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/690 (https://phabricator.wikimedia.org/T383835)
[18:53:14] Continuous-Integration-Config, Continuous-Integration-Infrastructure, ci-test-error (WMF-deployed Build Failure): CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416#10648413 (Daimona) 3 out of 3 merged patches in CampaignEvents today failed due timeouts: -...
[19:12:58] Continuous-Integration-Config, Continuous-Integration-Infrastructure, ci-test-error (WMF-deployed Build Failure): CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416#10648479 (Daimona) Just happened again in gate-and-submit for [[https://gerrit.wikimedia.org...
[19:23:40] (approved) thcipriani: spiderpig: Admin users are automatically authorized [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/689 (https://phabricator.wikimedia.org/T383947) (owner: dancy)
[19:23:59] (update) thcipriani: spiderpig: Admin users are automatically authorized [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/689 (https://phabricator.wikimedia.org/T383947) (owner: dancy)
[19:26:47] (merge) thcipriani: spiderpig: Admin users are automatically authorized [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/689 (https://phabricator.wikimedia.org/T383947) (owner: dancy)
[20:13:47] (open) dancy: Release 4.141.2 [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/691
[20:15:56] (merge) dancy: Release 4.141.2 [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/691
[20:19:49] Beta-Cluster-Infrastructure, observability: Bring beta cluster logstash to a readable state - https://phabricator.wikimedia.org/T389239#10648691 (bd808) As written I don't think this task is actionable. "Fix all of the code so it stops logging errors" is not within the scope of the tiny handful of folks...
[20:25:03] !log Rebooting deployment-jobrunner05 because things just seem weird (T387631, T387276)
[20:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:25:07] T387631: Are eqiad.mediawiki.job.CampaignEventsFindPotentialInvitees jobs being processed in beta?
- https://phabricator.wikimedia.org/T387631
[20:25:07] T387276: PHP on Beta cluster fails due to mismatching PCRE - https://phabricator.wikimedia.org/T387276
[20:25:34] Continuous-Integration-Config, Continuous-Integration-Infrastructure, Release-Engineering-Team, ci-test-error (WMF-deployed Build Failure): CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416#10648714 (brennen)
[20:51:35] Beta-Cluster-Infrastructure, observability: Bring beta cluster logstash to a readable state - https://phabricator.wikimedia.org/T389239#10648814 (Daimona) Yeah, that's why I was being vague and said "ideally". I don't think it needs to be a "fix everything" either, it's fine if there are some errors. It'...
[20:54:44] Beta-Cluster-Infrastructure, Release-Engineering-Team: PHP on Beta cluster fails due to mismatching PCRE - https://phabricator.wikimedia.org/T387276#10648820 (bd808) Open→Resolved Rebooting deployment-jobrunner05 seems to have fixed it's damage. I don't see any smoking gun in the /var/log/apt...
[20:59:45] Beta-Cluster-Infrastructure, WMF-JobQueue: Are eqiad.mediawiki.job.CampaignEventsFindPotentialInvitees jobs being processed in beta? - https://phabricator.wikimedia.org/T387631#10648835 (Daimona) Open→Resolved a:bd808 The above seems to have done it, thank you! \o/ I was able to generate...
[21:04:24] Beta-Cluster-Infrastructure: deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud configured to talk to parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T389252 (bd808) NEW
[21:04:59] Beta-Cluster-Infrastructure: deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud configured to talk to parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T389252#10648880 (bd808) Open→In progress p:Triage→High a:bd808
[21:05:25] Beta-Cluster-Infrastructure, WMF-JobQueue: Are eqiad.mediawiki.job.CampaignEventsFindPotentialInvitees jobs being processed in beta? - https://phabricator.wikimedia.org/T387631#10648888 (Daimona) Also, double-checking: `lang=shell-session daimona@deployment-mwlog02:~$ wc -l /srv/mw-log/JobExecutor.lo...
[21:20:06] thoughts welcome re: https://phabricator.wikimedia.org/T388416
[21:20:15] https://phabricator.wikimedia.org/T388416
[21:20:33] ::sigh:: - i am good at linux clipboards - T388416
[21:20:34] T388416: CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416
[21:26:13] brennen: does a "computers suck" help? ;P
[21:26:39] Reedy: at least it's emotionally relevant :P
[21:44:32] So, I'm not seeing any patterns on the CI side
[21:44:59] I also did not find anything interesting while quickly looking at the MW artifacts of a single failed build.
[21:46:00] Hmmmmm wait, could it be that we're hitting a MW ratelimit?!
https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74/39759/artifact/log/mw-ratelimit.log/*view*/
[21:46:20] Continuous-Integration-Config, Continuous-Integration-Infrastructure, Release-Engineering-Team, ci-test-error (WMF-deployed Build Failure): CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416#10649190 (brennen) See also: - {T380061} - {T371913}
[21:46:26] Or in other words: we're definitely hitting it, but can it be what causes the slowness, and why is it only happening now
[21:53:23] Actually, I think I just misread the message. The code in question increments the ratelimit counter, but doesn't actually prevent the action.
[22:04:32] GitLab (Pipeline Services Migration🐤), collaboration-services, Wikidata, Wikidata Query UI, and 3 others: move query.wikidata.org to kubernetes - https://phabricator.wikimedia.org/T350793#10649326 (EBernhardson)
[22:31:16] Beta-Cluster-Infrastructure, Wikifunctions: "Exec error in changeprop" for wikifunctions.beta.wmflabs.org - https://phabricator.wikimedia.org/T389274 (Daimona) NEW
[22:34:13] Beta-Cluster-Infrastructure, observability: Bring beta cluster logstash to a readable state - https://phabricator.wikimedia.org/T389239#10649375 (Daimona) >>! In T389239#10648814, @Daimona wrote: > I'm also curious to see what happens once the current issues with deployment-jobrunner05 are resolved, as...
[22:37:18] Continuous-Integration-Config, Continuous-Integration-Infrastructure, Release-Engineering-Team, ci-test-error (WMF-deployed Build Failure): CI jobs failing with various timeouts (March 2025) - https://phabricator.wikimedia.org/T388416#10649382 (Daimona) So, the "Failed to wait for mediawiki.base"...
[23:21:38] >22:38:14 [0-0] PASSED in chrome - /tests/selenium/specs/content_editable.js
[23:21:59] >23:21:42 Build timed out (after 60 minutes). Marking the build as failed.
[23:21:59] >23:21:42 Build was aborted
[23:22:00] gah
[23:22:10] silly browser tests