[06:51:44] 10Continuous-Integration-Infrastructure, 10Gerrit, 06Release-Engineering-Team, 06collaboration-services, and 2 others: Fetches from Gerrit aborted due to: GnuTLS recv error (-54): Error in the pull function - https://phabricator.wikimedia.org/T420865#12027747 (10ABran-WMF) 05In progress→03Open a:03ABr... [08:59:59] hi, dcausse and I ran into a problem during yesterday's mw-config change, where the new opensearch backend gave a different response compared to the old one. i was looking for an option to debug it, and it seems that https://wikitech.wikimedia.org/wiki/Mw-experimental is the easiest one (even though it feels like overkill) [09:00:25] what is the policy for using mw-experimental? how do we make sure no one is using mw-experimental already? [09:09:20] (03PS1) 10DCausse: CirrusSearch: add a dependency to WikibaseLexemeCirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) [09:28:57] (03PS2) 10DCausse: CirrusSearch: add a dependency to WikibaseLexemeCirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) [09:50:18] (03CR) 10Hashar: "FAILURE https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83-selenium/61653//console" [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [09:54:36] (03PS3) 10DCausse: CirrusSearch: add a dependency to WikibaseLexemeCirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) [10:02:42] atsukoito: I don't know about mw-experimental but you could reach #wikimedia-serviceops (and maybe Effie, I think she is the one that worked on it) [10:03:12] some in our team certainly know about it but are on the US west coast and will show up in 5/6 hours [10:05:11] atsukoito: very few folks use it, and usually for very specific work that is not accommodated by mw-debug, 
so just go ahead [10:40:20] 10Phabricator, 06Release-Engineering-Team (Doing 😎): Requesting permissions to bulk move tasks from one project to another - https://phabricator.wikimedia.org/T429378#12028574 (10DMburugu) Thanks for returning the permissions. I understand how this would be caught in a routine inactivity sweep. I've also e... [10:50:51] 10Phabricator, 10Wikimedia-Phabricator-Extensions: Add an author column to the "Related Changes in Gerrit" table - https://phabricator.wikimedia.org/T429105#12028615 (10Aklapper) Hi, that code is located in https://gitlab.wikimedia.org/repos/phabricator/extensions/-/blob/wmf/stable/src/customfields/GerritPatch... [11:08:01] 10Phabricator, 06collaboration-services: Add cache policy to static resources in phab.wmfusercontent.org - https://phabricator.wikimedia.org/T429019#12028712 (10Aklapper) FYI, CDN/caching related stuff in the Phab codebase itself is around `getCacheHeaders()` in https://we.phorge.it/source/phorge/browse/master... [11:17:41] (03CR) 10Lucas Werkmeister (WMDE): "> FAILURE https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83-selenium/61653//console" [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [11:40:13] (03CR) 10Hashar: "Thanks Lucas! We went to add it and retriggered the job manually which failed again ;)" [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [11:46:46] 10Continuous-Integration-Config, 10Wikidata, 10Wikidata Lexicographical data, 10wmde-wikidata-tech, 13Patch-For-Review: Move Wikibase Lexeme's New Lexeme Special Page component's repository to Wikimedia Gerrit - https://phabricator.wikimedia.org/T424098#12028818 (10Aklapper) [12:02:07] hashar, effie: thanks, we managed to reproduce it in shell so far! 
[12:07:20] 10Continuous-Integration-Config, 10Diffusion, 10Phabricator: integration-agent-docker machines excessively pull some Wikibase related Git repos in Diffusion - https://phabricator.wikimedia.org/T349921#12028938 (10Aklapper) FYI numbers are slowly decreasing: ` mysql:phstats@m3-slave.eqiad.wmnet [phabricator_m... [13:17:48] 10GitLab (Account Approval), 06Release-Engineering-Team: Requesting GitLab account activation for pushpaktiwari - https://phabricator.wikimedia.org/T429477 (10Pushpaktiwari) 03NEW [13:23:14] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team (Doing 😎): Add PyPy 3.11 to Wikimedia CI - https://phabricator.wikimedia.org/T423607#12029182 (10Xqt) [13:23:19] 10Beta-Cluster-Infrastructure, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 13Patch-For-Review: Write lightweight OCI-image-based Puppet plans for beta cluster - https://phabricator.wikimedia.org/T425585#12029180 (10bking) [13:30:28] 10GitLab (Account Approval), 06Release-Engineering-Team: Requesting GitLab account activation for Pushpaktiwari - https://phabricator.wikimedia.org/T429477#12029229 (10Pushpaktiwari) [13:38:40] 10GitLab (Account Approval), 06Release-Engineering-Team (Doing 😎): Requesting GitLab account activation for Pushpaktiwari - https://phabricator.wikimedia.org/T429477#12029314 (10Aklapper) 05Open→03Resolved a:03Aklapper I've approved your GitLab account. Happy hacking! [13:39:05] 10Continuous-Integration-Infrastructure, 10Gerrit, 06Release-Engineering-Team, 06collaboration-services, and 2 others: Fetches from Gerrit aborted due to: GnuTLS recv error (-54): Error in the pull function - https://phabricator.wikimedia.org/T420865#12029320 (10Lucas_Werkmeister_WMDE) >>! In T420865#12022... 
[13:49:25] (03PS1) 10Hslater: Zuul: [mediawiki/extensions/ContentDroplets] Re-enable master jobs [integration/config] - 10https://gerrit.wikimedia.org/r/1303435 [14:04:50] (03CR) 10Hashar: [C:03+2] Zuul: [mediawiki/extensions/ContentDroplets] Re-enable master jobs [integration/config] - 10https://gerrit.wikimedia.org/r/1303435 (owner: 10Hslater) [14:06:32] (03Merged) 10jenkins-bot: Zuul: [mediawiki/extensions/ContentDroplets] Re-enable master jobs [integration/config] - 10https://gerrit.wikimedia.org/r/1303435 (owner: 10Hslater) [14:07:10] (03CR) 10Hashar: [C:03+2] "Deployed!" [integration/config] - 10https://gerrit.wikimedia.org/r/1303435 (owner: 10Hslater) [14:44:08] 10GitLab (CI & Job Runners), 06collaboration-services: Add additional exporter for GitLab Runner metrics - https://phabricator.wikimedia.org/T347038#12029768 (10fgiunchedi) I would welcome the ability to monitor (recurring) pipelines for last timestamp of success/failure (e.g. I want to know if this has been f... [14:50:06] 06Release-Engineering-Team (Doing 😎), 10Catalyst (Luka Ijo Pimeja Jan), 07Essential-Work: Decouple dev & prod profiles in the OpenTofu stack - https://phabricator.wikimedia.org/T428443#12029788 (10jnuche) 05Open→03Resolved Done. No deployment was necessary after the changes. Prod & dev can now be ass... [14:56:33] (03PS4) 10Hashar: Zuul: [CirrusSearch] add WikibaseLexemeCirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [14:56:48] 06Release-Engineering-Team (Doing 😎), 06Abstract Wikipedia team, 10Catalyst (Luka Ijo Pimeja Jan), 07Essential-Work: Some wikilambda-catalyst-end-to-end jobs are failing with `ERROR: Failed to create wikifunctions environment: 400 Client Error: Bad Reques... - https://phabricator.wikimedia.org/T424815#12029805 [15:02:13] (03CR) 10Hashar: "I went to reproduce the build locally with Quibble. 
From `WikibaseLexeme` I installed the npm dependencies and ran:" [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [15:09:24] (03CR) 10Hashar: "Browser tests SUCCESS, 🎉 https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83-selenium/61737//console" [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [15:25:40] 10Beta-Cluster-Infrastructure, 06Data-Platform-SRE (2026-06-05 - 2026-06-26), 13Patch-For-Review: Write lightweight OCI-image-based Puppet plans for beta cluster - https://phabricator.wikimedia.org/T425585#12029958 (10bking) Per conversation with @dcausse , on-wiki search is currently down on the beta cluste... [15:29:06] 10Continuous-Integration-Config, 10MinervaNeue: A MinervaNeue resourceloader module has hard dependency on Echo - https://phabricator.wikimedia.org/T429501 (10hashar) 03NEW [15:42:42] (03CR) 10Lucas Werkmeister (WMDE): "Thanks, great work!" [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [16:13:22] (03CR) 10Hashar: "Filed as T429501 and I have sent https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1303474 to drop the dependency." 
[integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [16:13:50] (03open) 10jelto: gitlab-runner: bump image version to alpine-v19.0.1 [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/615 (https://phabricator.wikimedia.org/T426164) [16:14:16] 10GitLab (Infrastructure), 06Release-Engineering-Team, 06collaboration-services, 13Patch-For-Review: Upgrade GitLab to major version 19 - https://phabricator.wikimedia.org/T426164#12030356 (10Jelto) [16:17:34] 10GitLab (Infrastructure), 06Release-Engineering-Team, 06collaboration-services, 13Patch-For-Review: Upgrade GitLab to major version 19 - https://phabricator.wikimedia.org/T426164#12030376 (10Jelto) All hosts were upgraded accidentally during an unrelated package upgrade. So all hosts are on 19.0, I also b... [16:18:30] 10Continuous-Integration-Infrastructure, 07Jenkins, 10Castor: Waiting for the completion of castor-save-workspace-cache sometimes takes almost 4 minutes for core tests - https://phabricator.wikimedia.org/T418974#12030386 (10hashar) →14Duplicate dup:03T427471 [16:18:31] 10Continuous-Integration-Infrastructure, 10Castor, 06Test Platform (Tallinn 27): Speedup mwext-codehealth-master-non-voting Castor job - https://phabricator.wikimedia.org/T427471#12030383 (10hashar) [16:19:12] (03update) 10jelto: gitlab-runner: bump image version to alpine-v19.0.1 [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/615 (https://phabricator.wikimedia.org/T416707 https://phabricator.wikimedia.org/T426164) [16:24:04] 10Continuous-Integration-Infrastructure, 07Jenkins, 10Castor: Waiting for the completion of castor-save-workspace-cache sometimes takes almost 4 minutes for core tests - https://phabricator.wikimedia.org/T418974#12030415 (10hashar) This was later mentioned when investigating the slowness of mwext-codehea... 
[16:36:39] (03PS5) 10Hashar: Zuul: [CirrusSearch] add WikibaseLexemeCirrusSearch [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [16:38:58] (03CR) 10Hashar: "Amended to add `Echo` for now. Eventually it can be removed later if T429501 get resolved." [integration/config] - 10https://gerrit.wikimedia.org/r/1303351 (https://phabricator.wikimedia.org/T428975) (owner: 10DCausse) [16:47:13] 10GitLab (Project and group requests), 06Release-Engineering-Team, 07Essential-Work: GitLab Private Repository Request for: CSP Research - https://phabricator.wikimedia.org/T420485#12030508 (10thcipriani) 05In progress→03Stalled [16:50:37] 06Gerrit-Privilege-Requests: Request membership in extension-SphinxSearch group for NicJansma - https://phabricator.wikimedia.org/T396496#12030511 (10Aklapper) Hi, I think best way to move forward here would be bringing up this proposal on the [wikitech-l@ mailing list](https://lists.wikimedia.org/hyperkitty/lis... [16:51:08] 10GitLab (Project and group requests), 06Release-Engineering-Team, 07Essential-Work: GitLab Private Repository Request for: production access meta data - https://phabricator.wikimedia.org/T411642#12030512 (10brennen) 05In progress→03Resolved a:03brennen Created: https://gitlab.wikimedia.org/admin/p... [17:01:48] 06Gerrit-Privilege-Requests: Request membership in extension-SphinxSearch group for NicJansma - https://phabricator.wikimedia.org/T396496#12030538 (10hashar) The SphinxSearch has been created a while ago by at least @svemir [[ https://www.mediawiki.org/wiki/User:Svemir_Brkic | mw:User:Svemir_Brkic ]]. I am not... 
[17:02:13] (03open) 10dancy: .pipeline/blubber.yaml: Bump bookworm base image [repos/releng/gitlab-terraform-images] (wmf/stable) - 10https://gitlab.wikimedia.org/repos/releng/gitlab-terraform-images/-/merge_requests/13 (https://phabricator.wikimedia.org/T416707) [17:03:35] (03merge) 10dancy: .pipeline/blubber.yaml: Bump bookworm base image [repos/releng/gitlab-terraform-images] (wmf/stable) - 10https://gitlab.wikimedia.org/repos/releng/gitlab-terraform-images/-/merge_requests/13 (https://phabricator.wikimedia.org/T416707) [17:04:59] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:08:11] ^ except that is NOT supposed to recover :p [17:08:40] even with what looked like it was the fix.. [17:08:59] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:09:11] (03PS1) 10Pwangai: jjb: [maven-java] Authenticate sonar via SONAR_TOKEN [integration/config] - 10https://gerrit.wikimedia.org/r/1303491 [17:16:26] 10GitLab (CI & Job Runners), 06Release-Engineering-Team, 06collaboration-services: Standardize Debian package builds on GitLab CI - https://phabricator.wikimedia.org/T304491#12030662 (10fnegri) I had to build a .deb package today (for https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils) and I think t... 
[17:16:49] (03CR) 10Pwangai: "Fixing this in CI avoids waiting on a wmf-jvm-parent-pom release/fix for the deprecated sonar.login" [integration/config] - 10https://gerrit.wikimedia.org/r/1303491 (owner: 10Pwangai) [17:19:34] (03open) 10dancy: .pipeline/blubber.yaml: Use gitlab-terraform-images:wmf-v1.9.0-1 image [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/616 [17:23:20] (03update) 10dancy: .pipeline/blubber.yaml: Use gitlab-terraform-images:wmf-v1.9.0-1 image [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/616 [17:23:27] (03update) 10dancy: .pipeline/blubber.yaml: Use gitlab-terraform-images:wmf-v1.9.0-1 image [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/616 [17:35:10] (03update) 10dancy: .gitlab-ci.yml: Install black via pip in the lint job [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/617 [17:35:15] (03open) 10dancy: .gitlab-ci.yml: Install black via pip in the lint job [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/617 [17:37:33] (03update) 10dancy: Bump base images, and use pip to install black [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/616 [17:37:34] (03update) 10dancy: Bump base images, and use pip to install black [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/616 [17:37:54] (03close) 10dancy: .gitlab-ci.yml: Install black via pip in the lint job [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/617 [17:37:55] (03update) 10dancy: .gitlab-ci.yml: Install black via pip in the lint job 
[repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/617 [17:38:50] (03merge) 10dancy: Bump base images, and use pip to install black [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/616 [17:42:01] (03open) 10neriah: Add author column to Gerrit patches table [repos/phabricator/extensions] (wmf/stable) - 10https://gitlab.wikimedia.org/repos/phabricator/extensions/-/merge_requests/66 (https://phabricator.wikimedia.org/T429105) [17:42:20] (03open) 10dancy: gitlab-runner: bump image version to alpine-v19.0.1 [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/618 (https://phabricator.wikimedia.org/T426164) [17:43:21] (03close) 10dancy: gitlab-runner: bump image version to alpine-v19.0.1 [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/615 (https://phabricator.wikimedia.org/T416707 https://phabricator.wikimedia.org/T426164) (owner: 10jelto) [17:43:44] (03merge) 10dancy: gitlab-runner: bump image version to alpine-v19.0.1 [repos/releng/gitlab-cloud-runner] - 10https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/618 (https://phabricator.wikimedia.org/T426164) [17:44:22] 10Phabricator, 10Wikimedia-Phabricator-Extensions, 13Patch-For-Review: Add an author column to the "Related Changes in Gerrit" table - https://phabricator.wikimedia.org/T429105#12030817 (10neriah) >>! In T429105#12028615, @Aklapper wrote: > How is the patch author so relevant in the Phab ticket when there is... [18:16:55] 06Release-Engineering-Team (Priority Backlog 📥): Allow configuration of canary and production checks based on deployment target - https://phabricator.wikimedia.org/T428971#12030988 (10Scott_French) Thanks for surfacing this as a task, @bd808. 
Zooming out a bit, I think there are a couple of interlocking concern... [18:36:46] (03PS1) 10Dduvall: zuul: Define hello-world-container for testing [integration/config] - 10https://gerrit.wikimedia.org/r/1303526 [18:37:37] does anyone know what I should do to get ci running? It looks stuck again [18:38:08] ^ dduvall thcipriani [18:40:24] blek. i can try restarting jenkins [18:40:49] oh jenkins. I thought it was zuul 😅 [18:41:08] well, yeah. zuul appears to have everything queued [18:41:21] jenkins has available executors but isn't doing new work [18:41:26] I see [18:42:30] hashar: thoughts? [18:42:48] ah, sorry it's late for antoine [18:43:03] sorry I am useless [18:44:49] but I have found the instructions to restart jenkins [18:44:51] me too. i'm just yolo-ing :) [18:45:07] !log restarting jenkins due to stuck zuul queues [18:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [18:45:14] 😆 [18:46:58] PROBLEM - jenkins_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [18:47:33] Does it take a while to restart? [18:48:52] cool, jenkins failed to restart [18:49:17] trying a stop/start [18:49:51] safeRestart wasn't very safe [18:49:58] RECOVERY - jenkins_service_running on contint1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [18:49:59] but it appears to be back up now [18:50:48] Thanks dduvall ! [18:50:56] np! 
[18:51:30] !log performed stop/start of jenkins service on contint1002 following failed safeRestart [18:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [18:52:22] (03CR) 10Dduvall: [C:03+2] zuul: Define hello-world-container for testing [integration/config] - 10https://gerrit.wikimedia.org/r/1303526 (owner: 10Dduvall) [18:56:56] (03Merged) 10jenkins-bot: zuul: Define hello-world-container for testing [integration/config] - 10https://gerrit.wikimedia.org/r/1303526 (owner: 10Dduvall) [18:59:52] dduvall: jeena: that is Zuul being lagged out cause it has too many merges to perform. It will recover eventually [19:00:22] that can be seen at the bottom of https://integration.wikimedia.org/zuul/ , the second graph (top right) [19:00:23] Queue (Jenkins jobs + Zuul functions) [19:00:38] which links to https://grafana.wikimedia.org/d/ad656c66-d8b5-4b09-a54b-61e7df71fb17/zuul-3a-3a-gearman-prometheus?orgId=1&from=now-2d&to=now&timezone=utc [19:03:24] well or maybe not [19:06:17] it appeared to me that jenkins wasn't scheduling new work [19:06:19] contint1002$ gearadmin --status|grep merger:merge [19:06:19] merger:merge 1414 2 2 [19:06:26] 1414 merges pending, 2 workers running [19:06:32] it's moving now after the restart [19:07:44] but perhaps there are/were multiple issues [19:08:45] since I see you guys here talking about jenkins: one quick question.. should jenkins be installed by Debian package or by scap.. when we talk about the new jenkins host that we wanted to switch to [19:08:58] because we do it one way on contint and the other on releases [19:09:38] and I am still trying to debug how to properly disable it .. 
which changes based on that answer [19:11:30] dduvall: Zuul waits for the merge to have been completed before starting the Jenkins job [19:11:50] so when there is a huge queue of merger:merge functions, no jobs get to run until some merges get processed [19:11:57] (03PS1) 10Dduvall: zuul: Gather facts for hello-world-container [integration/config] - 10https://gerrit.wikimedia.org/r/1303532 [19:12:12] good news: the underlying issue in Zuul has been fixed in newer versions zuul [19:13:32] 06Release-Engineering-Team, 06collaboration-services: SystemdUnitFailed - jenkins on contint1002 - https://phabricator.wikimedia.org/T429530#12031209 (10Dzahn) [19:13:42] 06Release-Engineering-Team, 06collaboration-services: SystemdUnitFailed - jenkins on contint1002 - https://phabricator.wikimedia.org/T429530#12031210 (10Dzahn) 05Open→03Resolved a:03Dzahn [19:13:43] (03PS2) 10Dduvall: zuul: Gather facts for test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/1303532 [19:13:52] (03CR) 10Dduvall: [C:03+2] zuul: Gather facts for test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/1303532 (owner: 10Dduvall) [19:14:50] 10GitLab (Infrastructure), 06Release-Engineering-Team, 06collaboration-services: Upgrade GitLab to major version 19 - https://phabricator.wikimedia.org/T426164#12031215 (10Dzahn) ` Today at 10:08 AM Hmm.. Gitlab upgrade broke Gerritlab. .. anything my team should do? No. I prepared a fix for Gerritlab: ht... [19:15:04] mutante: re Jenkins is installed using a Debian package on both hosts [19:15:19] but on the release hosts the upgrade is driven by scap [19:15:23] hashar: yea, that is the status quo. but is that the goal? [19:15:26] on CI that is manual (apt install) [19:15:36] and we want to keep doing it both ways? 
[19:15:39] yes [19:15:54] the CI Jenkins will be decommissioned as we switch to the newer Zuul [19:16:17] ok, so the future is no more Debian package [19:16:26] nope [19:16:32] it is still used to install the releases Jenkins [19:16:34] (03Merged) 10jenkins-bot: zuul: Gather facts for test jobs [integration/config] - 10https://gerrit.wikimedia.org/r/1303532 (owner: 10Dduvall) [19:16:37] but that apt install is driven by scap [19:17:04] oh right. which is like the worst of both worlds :) [19:17:08] ok, thanks for now! [19:17:29] just wanted to make sure which side of that "if else" to keep debugging [19:17:48] the scap script: https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/blob/master/scap/scripts/update_jenkins.sh?ref_type=heads [19:17:54] thanks [19:18:12] and the shell script that runs apt install is provided by Puppet: modules/jenkins/manifests/init.pp: source => 'puppet:///modules/jenkins/apt_update_jenkins.sh', [19:18:30] right, I remember now [19:18:51] (I guess it was done this way so that we have a single sudo rule for that apt_update_jenkins script) [19:19:10] I think I might have even suggested that myself :) [19:19:24] to get around a problem with install, ack [19:19:50] jeena: dduvall: Zuul got backlogged due to a chain of changes being sent to mediawiki/core [19:20:47] thanks hashar [19:20:58] it will catch up eventually [19:21:44] and the queue can be seen at https://grafana.wikimedia.org/d/ad656c66-d8b5-4b09-a54b-61e7df71fb17/zuul-3a-3a-gearman-prometheus?orgId=1&from=now-3h&to=now&timezone=utc&viewPanel=panel-10 [19:28:30] dduvall: jeena sorry I missed the fun, thanks for looking into the stuckness. 
The zuul merges thing hashar is talking about has some docs on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Very_high_queue_of_merger:merge_functions ; it happens sometimes when there are huge relation chains like the one winding through CI right now [19:28:32] (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1271913 ). It looks like jenkins is stuck, but really zuul is working through a giant queue of work which is burying work that it could hand off to jenkins. The only options are to wait or to make all the zuul git operations fail (by making the repo read-only on the contint hosts), but that's a pretty drastic step; it's usually best to wait [19:28:34] [19:33:14] ah we got doc! thanks thcipriani ;) [20:00:50] 10Continuous-Integration-Infrastructure, 10Gerrit, 06Release-Engineering-Team: Backtrack usage of `recheck` in Gerrit to find CI failures - https://phabricator.wikimedia.org/T429539 (10hashar) 03NEW [20:05:20] 10Continuous-Integration-Infrastructure, 10Gerrit, 06Release-Engineering-Team: Backtrack usage of `recheck` in Gerrit to find CI failures - https://phabricator.wikimedia.org/T429539#12031421 (10hashar) [20:06:19] cscott: I have filed your idea of a `recheck` metric as https://phabricator.wikimedia.org/T429539 . 
I think I will tackle that on Friday, that sounds like a fun project [20:11:09] the backlog visible at https://integration.wikimedia.org/zuul/ seems to be entirely in low precedence pipelines ( patch-performance , codehealth, coverage) [20:11:22] and the patch for the backport window managed to merge ahead of the backlog [20:11:37] (operations/mediawiki-config and patches to wmf/* branches are in pipelines with higher precedence) [20:11:41] 🎉 [20:12:09] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Zuul, and 3 others: Make puppet-compiler execution run with higher priority, not like other 'experimental' jobs - https://phabricator.wikimedia.org/T414621#12031447 (10CDanis) Today (2026-06-17) was a re... [20:29:54] 06Release-Engineering-Team (Radar), 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review, 07User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#12031521 (10dancy) [20:49:17] 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Zuul, and 3 others: Make puppet-compiler execution run with higher priority, not like other 'experimental' jobs - https://phabricator.wikimedia.org/T414621#12031600 (10Jdforrester-WMF) Yeah, it'd be grea... 
[20:57:35] hashar, thcipriani: interesting, thanks for the info [20:58:14] * dduvall empties head of zuul v2 and goes back to banging it against zuul v14 [20:58:39] That's a lot of brain trauma [21:04:10] (03PS1) 10Dduvall: zuul: Change test/gerrit-ping job to hello-world-container [integration/config] - 10https://gerrit.wikimedia.org/r/1303565 [21:04:26] (03CR) 10Dduvall: [C:03+2] zuul: Change test/gerrit-ping job to hello-world-container [integration/config] - 10https://gerrit.wikimedia.org/r/1303565 (owner: 10Dduvall) [21:17:15] (03Merged) 10jenkins-bot: zuul: Change test/gerrit-ping job to hello-world-container [integration/config] - 10https://gerrit.wikimedia.org/r/1303565 (owner: 10Dduvall) [21:24:32] FIRING: [2x] InstanceDown: Project deployment-prep instance deployment-deleteme is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:29:32] RESOLVED: [2x] InstanceDown: Project deployment-prep instance deployment-deleteme is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:47:48] (03PS2) 10Pwangai: jjb: [maven-java] Authenticate sonar via SONAR_TOKEN [integration/config] - 10https://gerrit.wikimedia.org/r/1303491 (https://phabricator.wikimedia.org/T429547) [22:07:32] FIRING: InstanceDown: Project deployment-prep instance deployment-cache-text08 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:07:37] 10Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T429552 (10wmcs-alerts) 03NEW [22:12:32] RESOLVED: InstanceDown: Project deployment-prep instance deployment-cache-text08 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:39:14] the coverage and patch-performance pipelines haven't processed a single job in the past 5 hours [22:40:14] lots of stuff is pretty backlogged [22:40:27] Jenkins restart a few hours ago [22:40:42] (03PS1) 10Dduvall: zuul: Refactor run-container 
variable names [integration/config] - 10https://gerrit.wikimedia.org/r/1303580 [22:40:49] (03CR) 10Dduvall: [C:03+2] zuul: Refactor run-container variable names [integration/config] - 10https://gerrit.wikimedia.org/r/1303580 (owner: 10Dduvall) [22:49:18] thcipriani: queues have been backed up for 5 hrs+. when do we consider it dire enough to "make the merger:merge fail fast by preventing read access to the git repository" ? [22:49:32] (03Merged) 10jenkins-bot: zuul: Refactor run-container variable names [integration/config] - 10https://gerrit.wikimedia.org/r/1303580 (owner: 10Dduvall) [22:55:25] * thcipriani looks [22:55:59] yeah, I think the procedure is called for; all those jobs waiting on zuul will fail, but they'll fail fast. [22:56:56] fixing the issue with jenkins being started on that new jenkins host. confirmed just now it has zero effect on the prod CI server [22:59:17] well, let's see if we can find what repo is awaiting 2,100 merge jobs: merger:merge 2099 2 2 [23:02:21] neat, I don't see the pending merges in the logs... [23:03:29] !log re-enabled puppet on contint1003 - triple checked puppet does NOT start jenkins anymore. BOTH masked AND stopped while running untouched on contint1002. 
after: gerrit:1303578 | (T418521) (T428791) [23:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [23:03:34] T418521: setup 2 contint machines for jenkins - https://phabricator.wikimedia.org/T418521 [23:03:34] T428791: PuppetDisabled - contint1003 - https://phabricator.wikimedia.org/T428791 [23:03:36] I guess it would have been 5 hours ago at this point [23:06:26] (03PS1) 10Dduvall: zuul: Ensure container vars are available in localhost play [integration/config] - 10https://gerrit.wikimedia.org/r/1303584 [23:06:39] (03CR) 10Dduvall: [C:03+2] zuul: Ensure container vars are available in localhost play [integration/config] - 10https://gerrit.wikimedia.org/r/1303584 (owner: 10Dduvall) [23:06:52] 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 06collaboration-services, 13Patch-For-Review: setup 2 contint machines for jenkins - https://phabricator.wikimedia.org/T418521#12032006 (10Dzahn) puppet code has been fixed to ensure if jenkins is set as "masked" it is ALSO stopped, with... [23:08:30] ok, so I guess it did recover; i.e., that is gate-and-submit and test are running. But the lower priority pipelines are backed up. Again, waiting will eventually process those jobs (after the other pipelines have cleared out of jobs). To unstick, you can restart zuul, but that drops everything in process and we will just...never process those pipelines. [23:08:49] (03Merged) 10jenkins-bot: zuul: Ensure container vars are available in localhost play [integration/config] - 10https://gerrit.wikimedia.org/r/1303584 (owner: 10Dduvall) [23:09:37] 10Gerrit, 06collaboration-services: Investigate Gerrit root disk usage and logging - https://phabricator.wikimedia.org/T425667#12032013 (10Dzahn) I am not sure yet where this needs to be fixed: ` [sre-collaboration-services] [FIRING:1] AlertLintProblem collaboration-services (/srv/alerts/ops/team-collaborati... [23:09:43] things in test and gate+submit can be rechecked/resubmitted. 
Unsure if we have a thing to retrigger the other pipelines(?) [23:10:31] hmm, i guess if it's only low prio pipelines, perhaps we can continue to wait [23:11:01] it looks like everything in test and gate+submit is about to fail anyway [23:11:55] restarting would set zuul on a happier path and coverage will generate with the next thing. [23:13:47] the magic of blocking merge operations works if we catch it right away, otherwise we have to restart zuul to get it to catch up and that just drops everything :/ [23:14:11] so... restart zuul? [23:16:40] !log restarted zuul to clear up 5 hrs of stuck queues [23:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [23:19:43] thanks dduvall I was just considering whether we could register a phony worker to drain the jobs, net result would be the same, but they would get back a failure message from jenkins-bot. Something to consider for future...probably something we should have thought of already :/ [23:21:08] imo any resource investment should be in service of moving away from this decrepit system [23:22:24] I ... wonder if this part of this system is unavoidable. Like it still has to prepare changes. [23:22:45] I mean, the gearman part goes away. The backup from giant patch chains will remain. [23:23:56] > so when there is a huge queue of merger:merge functions, no jobs get to run until some merges get processed [23:24:11] > good news: the underlying issue in Zuul has been fixed in newer versions zuul [23:24:15] from hashar above [23:24:57] oh [23:25:09] ok ¯\_(ツ)_/¯ [23:57:33] 06Release-Engineering-Team (Radar), 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review, 07User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#12032107 (10SomeRandomDeveloper) This seems to have caused {T429559}. I assume this was supposed to be addressed by {T42...
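Note on the `gearadmin --status` check used in the discussion above: the stuck queue was spotted by reading the `merger:merge` row by hand (`merger:merge 1414 2 2`, later `merger:merge 2099 2 2`). The following is a minimal sketch of how that check could be scripted, so that a backlog is caught early, while blocking merge operations is still an option. It assumes the usual Gearman status columns (function name, jobs in queue, jobs running, available workers). The helper names and the threshold of 500 are hypothetical and were not mentioned in the channel.

```python
from typing import NamedTuple


class FunctionStatus(NamedTuple):
    queued: int   # jobs in the queue for this function (assumed column 2)
    running: int  # jobs currently being run (assumed column 3)
    workers: int  # workers able to run the function (assumed column 4)


def parse_gearman_status(text: str) -> dict[str, FunctionStatus]:
    """Parse `gearadmin --status` style output into a dict keyed by function."""
    statuses = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            # Skips blank lines and the "." terminator of the status listing.
            continue
        name, *counts = fields
        try:
            queued, running, workers = (int(c) for c in counts)
        except ValueError:
            continue  # not a status row
        statuses[name] = FunctionStatus(queued, running, workers)
    return statuses


def pending_merges(text: str, function: str = "merger:merge") -> int:
    """Return the queue size for the merge function, or 0 if it is absent."""
    status = parse_gearman_status(text).get(function)
    return 0 if status is None else status.queued


# Hypothetical alerting threshold, chosen only for illustration.
BACKLOG_THRESHOLD = 500


def is_backlogged(text: str, threshold: int = BACKLOG_THRESHOLD) -> bool:
    return pending_merges(text) > threshold


if __name__ == "__main__":
    sample = "merger:merge 1414 2 2\n.\n"  # row quoted in the log above
    print(pending_merges(sample), is_backlogged(sample))
```

In practice the input would come from running `gearadmin --status` on the contint host, as was done by hand at 19:06, and feeding its output to `is_backlogged`.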