[06:46:17] at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/775938/2 when I click "rebase" there is no option to rebase on parent change, it says that it is up to date, but then in the relation chain on the side it knows that the parent is not current, am I missing something here?
[06:51:15] nevermind, I guess there was just some delay, now it shows up
[07:07:45] DannyS712: yeah there is a cache. Once a change has merged to a branch, the branch has moved forward and all open changes for the repository have to be reindexed / mergeability-checked again
[07:07:49] which takes a bit of time
[07:08:08] I found that if I force-refresh the page that fixes it too
[07:08:26] oh
[07:08:33] but my issue was not about a change being merged, but rather about adding a patchset to the parent change
[07:08:41] maybe there is another cache involved then :\
[08:06:11] !log move kafka logging in deployment-prep to fixed uid/gid - T296982
[08:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:06:13] T296982: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982
[08:29:30] !log move kafka main in deployment-prep to fixed uid/gid - T296982
[08:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:29:32] T296982: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982
[08:37:55] !log move kafka jumbo in deployment-prep to fixed uid/gid - T296982
[08:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[08:37:57] T296982: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982
[10:18:05] Project beta-build-scap-deb build #277: FAILURE in 1 hr 54 min: https://integration.wikimedia.org/ci/job/beta-build-scap-deb/277/
[10:41:28] Project beta-build-scap-deb build #278: STILL FAILING in 23 min: https://integration.wikimedia.org/ci/job/beta-build-scap-deb/278/
[14:25:47] maintenance-disconnect-full-disks build 388551 integration-agent-docker-1037 (/: 30%, /srv: 97%, /var/lib/docker: 38%): OFFLINE due to disk space
[14:32:19] maintenance-disconnect-full-disks build 388552 integration-agent-docker-1037 (/: 30%, /srv: 73%, /var/lib/docker: 37%): RECOVERY disk space OK
[14:49:56] !log Zuul: Enforce Postgres and SQLite support via in-mediawiki-tarball
[14:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:55:47] maintenance-disconnect-full-disks build 388557 integration-agent-docker-1037 (/: 30%, /srv: 97%, /var/lib/docker: 37%): OFFLINE due to disk space
[15:01:02] maintenance-disconnect-full-disks build 388558 integration-agent-docker-1037 (/: 30%, /srv: 51%, /var/lib/docker: 36%): RECOVERY disk space OK
[15:01:43] (Queue (Jenkins jobs + Zuul functions) alert) firing: Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert
[15:06:42] (Queue (Jenkins jobs + Zuul functions) alert) firing: (2) Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert
[15:25:37] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:42:42] ouch ^
[15:43:04] 15k jobs queued, nice
[15:43:15] beta jobs stuck again too. I'm re-reading the old tickets about it to see where we might try some hacking
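The stale-rebase state DannyS712 ran into above can be checked without waiting for the UI: Gerrit exposes its mergeability computation through the standard REST API. A minimal sketch follows, using the change number from the discussion; everything else (anonymous access, no error handling) is an illustration rather than RelEng tooling.

```python
# Ask Gerrit directly whether it considers a change mergeable, bypassing
# any stale UI state. Uses Gerrit's standard "Get Mergeable" endpoint.
import json
import requests

GERRIT = "https://gerrit.wikimedia.org/r"
CHANGE = 775938  # the operations/mediawiki-config change discussed above

def get_mergeable(change_id: int) -> dict:
    # Gerrit prefixes every JSON response with the magic line ")]}'"
    # to defend against XSSI, so strip the first line before parsing.
    resp = requests.get(f"{GERRIT}/changes/{change_id}/revisions/current/mergeable")
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])

print(get_mergeable(CHANGE))  # e.g. {'submit_type': 'MERGE_IF_NECESSARY', 'mergeable': True}
```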
[15:43:31] I see you mentioning locks, etc., but no references to code.
[15:45:03] https://phabricator.wikimedia.org/T72597 seems to be the most detailed ticket on the problem.
[15:48:16] ooh, some code mentioned in there
[15:51:13] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:52:16] (PS1) Jaime Nuche: requirements: add missing dependency [tools/scap] - https://gerrit.wikimedia.org/r/797336
[15:54:50] Project-Admins: Request: Indiana Wikimedians group - https://phabricator.wikimedia.org/T308696 (Dominicbm) @Aklapper I don't know if I put this request in the right place, but just wanted to ping you in case.
[15:58:25] (CR) Ahmon Dancy: [C: +2] requirements: add missing dependency [tools/scap] - https://gerrit.wikimedia.org/r/797336 (owner: Jaime Nuche)
[15:59:40] (PS2) Ahmon Dancy: Make the patch verifier also perform the submit action. [tools/train-dev] - https://gerrit.wikimedia.org/r/794770
[16:01:05] Project-Admins: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706 (JArguello-WMF) Hello @Aklapper! I'm a project manager working with the Data Engineering team, and I need to create subprojects for every sprint. May I have access to that feature, pl...
[16:05:25] Release-Engineering-Team (Doing), Scap: Stage new scap release - https://phabricator.wikimedia.org/T307086 (jnuche) Open→Resolved
[16:05:27] Release-Engineering-Team (Priority Backlog 📥), Scap, Infrastructure-Foundations, serviceops: Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (jnuche)
[16:06:34] hashar: I'm going to enable one of the FINE* log levels for a bit.
[16:06:55] dancy: on jenkins CI ? sure :)
[16:07:02] yeah
[16:07:14] I often create custom loggers
[16:07:40] I think the logs are held in memory and the buffer has some size limit to avoid filling the whole memory
[16:07:44] ooh, there may already be a lot of detailed info in the log section in the UI. Didn't see that until just now.
[16:07:49] at least I never caused Jenkins to go oom as a result
[16:07:55] I was just looking at jenkins.log before.
[16:08:37] I would love the buckets defined via the web ui to somehow end up in the ELK stack
[16:09:02] for the jenkins.log I am not quite sure what ends up there, maybe anything that is a warning or worse
[16:13:07] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[16:15:12] couldn't be me making https://grafana-rw.wikimedia.org/d/25McYJunk/beta-deployment?orgId=1&from=now%2Fd&to=now the other day because I noticed the beta deploy jobs get stuck every now and then..
[16:15:43] nod. Very annoying
[16:20:29] I started seeing what we can get from the Jenkins API ( https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/api/json?tree=allBuilds[id,actions[blockedTimeMillis]] ) then realised how much I dislike the Jenkins API
[16:20:50] hehe
[16:56:56] would (temporarily) throwing some more Jenkins agents at it help?
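For the Jenkins API query dancy links at 16:20, here is a rough sketch of pulling the queue-blocked time per build. It assumes the builds carry the blockedTimeMillis field (provided by the Metrics plugin's TimeInQueueAction, which is what the tree= expression selects); actions that don't export it come back as empty objects, hence the filtering.

```python
# Fetch per-build blocked-in-queue durations from the Jenkins JSON API,
# the same query dancy was experimenting with above.
import requests

JOB = "https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad"
TREE = "allBuilds[id,actions[blockedTimeMillis]]"

data = requests.get(f"{JOB}/api/json", params={"tree": TREE}).json()
for build in data["allBuilds"]:
    # Each build has many actions; only the queue-metrics one carries
    # blockedTimeMillis, the rest serialize as empty/None entries.
    blocked = [a["blockedTimeMillis"] for a in build["actions"]
               if a and "blockedTimeMillis" in a]
    if blocked:
        print(build["id"], f"{blocked[0] / 1000:.1f}s blocked in queue")
```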
[16:57:01] Beta-Cluster-Infrastructure, Data-Engineering: deployment-kafka-jumbo-5 in deployment-prep without role - https://phabricator.wikimedia.org/T309006 (Ottomata) Open→Resolved a: Ottomata Indeed! I suppose this was an oversight when I recreated these as buster nodes. But how did Kafka get appli...
[16:57:52] I don't think so. The problem has been cleared now (I think someone performed some manual actions to do that).
[16:58:09] It's a bug.
[16:59:48] oh I just meant generally at the moment, beta deploy aside - https://grafana-rw.wikimedia.org/d/000000284/continuous-integration?orgId=1&viewPanel=7&refresh=30s&from=now-30m&to=now sorta suggests we're not getting "ahead" of the curve yet?
[17:20:58] Continuous-Integration-Infrastructure, MinervaNeue, Vector, Accessibility, and 4 others: Add automated accessibility tests in CI to generate accessibility benchmarks for Skins - https://phabricator.wikimedia.org/T301184 (LGoto) a: bwang→nray
[17:25:43] maintenance-disconnect-full-disks build 388587 integration-agent-docker-1036 (/: 30%, /srv: 100%, /var/lib/docker: 37%): OFFLINE due to disk space
[17:30:36] maintenance-disconnect-full-disks build 388588 integration-agent-docker-1036 (/: 30%, /srv: 57%, /var/lib/docker: 36%): RECOVERY disk space OK
[17:44:07] (CR) Jeena Huneidi: [C: +2] "good idea" [tools/train-dev] - https://gerrit.wikimedia.org/r/794770 (owner: Ahmon Dancy)
[17:44:43] (Merged) jenkins-bot: Make the patch verifier also perform the submit action. [tools/train-dev] - https://gerrit.wikimedia.org/r/794770 (owner: Ahmon Dancy)
[17:45:52] TheresNoTime: gotcha
[17:49:16] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 (Jdlrobson)
[18:01:54] Yippee, build fixed!
[18:01:54] Project beta-build-scap-deb build #279: FIXED in 55 sec: https://integration.wikimedia.org/ci/job/beta-build-scap-deb/279/
[18:35:44] !log Upgrading beta cluster scap to 4.7.1-1+0~20220523183110.280~1.gbpaa0826
[18:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:36:24] Project beta-scap-sync-world build #52406: FAILURE in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52406/
[18:37:07] Project beta-scap-sync-world build #52407: STILL FAILING in 2.7 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52407/
[18:37:46] !log Reverted to scap 4.7.1-1+0~20220505181519.270~1.gbpeb47ae in beta cluster
[18:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:39:10] Yippee, build fixed!
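The "Work requests waiting in Zuul Gearman server" checks that keep firing in this log read their number from Gearman's plain-text admin protocol. A minimal sketch of polling it follows; the host and port are assumptions (4730 is Gearman's default, and the real check runs server-side), and the "status" command returns one tab-separated line per function: name, total queued, currently running, available workers.

```python
# Poll the Gearman admin protocol for queued work, the same figure the
# icinga check and the zuul-gearman Grafana dashboard are built on.
import socket

def gearman_status(host: str = "contint2001.wikimedia.org", port: int = 4730):
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(b"status\n")
        buf = b""
        while not buf.endswith(b".\n"):  # listing is terminated by a lone "."
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    for line in buf.decode().splitlines():
        if line == ".":
            break
        name, queued, running, workers = line.split("\t")
        if int(queued) > 0:
            print(f"{name}: {queued} queued, {running} running, {workers} workers")

gearman_status()
```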
[18:39:10] Project beta-scap-sync-world build #52408: FIXED in 1 min 3 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52408/
[18:56:38] (PS1) Ahmon Dancy: parse_wmf_version: Handle "master" [tools/scap] - https://gerrit.wikimedia.org/r/797421
[19:03:21] (CR) Ahmon Dancy: [C: +2] parse_wmf_version: Handle "master" [tools/scap] - https://gerrit.wikimedia.org/r/797421 (owner: Ahmon Dancy)
[19:06:57] (Queue (Jenkins jobs + Zuul functions) alert) firing: Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert
[19:08:09] (Merged) jenkins-bot: parse_wmf_version: Handle "master" [tools/scap] - https://gerrit.wikimedia.org/r/797421 (owner: Ahmon Dancy)
[19:51:58] (PS2) Jeena Huneidi: Add tests: scap backport changes with dependencies [tools/train-dev] - https://gerrit.wikimedia.org/r/793553 (https://phabricator.wikimedia.org/T308474)
[19:52:07] (CR) Jeena Huneidi: Add tests: scap backport changes with dependencies (3 comments) [tools/train-dev] - https://gerrit.wikimedia.org/r/793553 (https://phabricator.wikimedia.org/T308474) (owner: Jeena Huneidi)
[19:53:27] (CR) Ahmon Dancy: [C: +2] Add tests: scap backport changes with dependencies [tools/train-dev] - https://gerrit.wikimedia.org/r/793553 (https://phabricator.wikimedia.org/T308474) (owner: Jeena Huneidi)
[19:54:09] (Merged) jenkins-bot: Add tests: scap backport changes with dependencies [tools/train-dev] - https://gerrit.wikimedia.org/r/793553 (https://phabricator.wikimedia.org/T308474) (owner: Jeena Huneidi)
[20:01:13] Continuous-Integration-Config: Coverage pipeline appears stuck - https://phabricator.wikimedia.org/T309047 (TheresNoTime)
[20:09:39] Beta-Cluster-Infrastructure, Release-Engineering-Team (Radar), Discovery, Discovery-Search (Current work), Patch-For-Review: Deploy new bullseye elastic cluster nodes on deployment-prep - https://phabricator.wikimedia.org/T299797 (bking)
[20:10:22] Beta-Cluster-Infrastructure, Release-Engineering-Team (Radar), Discovery, Discovery-Search (Current work), Patch-For-Review: Deploy new bullseye elastic cluster nodes on deployment-prep - https://phabricator.wikimedia.org/T299797 (bking) Stretch servers have been deleted, deployment-prep elas...
[20:10:39] Beta-Cluster-Infrastructure, Release-Engineering-Team (Radar), Discovery, Discovery-Search (Current work), Patch-For-Review: Deploy new bullseye elastic cluster nodes on deployment-prep - https://phabricator.wikimedia.org/T299797 (bking) Open→Resolved
[20:10:43] Beta-Cluster-Infrastructure, Release-Engineering-Team (Radar): Migrate deployment-prep away from Debian Stretch to Buster/Bullseye - https://phabricator.wikimedia.org/T278641 (bking)
[20:24:41] Continuous-Integration-Infrastructure, Zuul: Coverage pipeline appears stuck - https://phabricator.wikimedia.org/T309047 (hashar) The changes in the `patch-performance` or `coverage` pipelines have a low precedence and are only triggered after everything else. The spike is probably what has caused the problem and...
[20:26:52] Continuous-Integration-Infrastructure, Zuul: Coverage pipeline appears stuck - https://phabricator.wikimedia.org/T309047 (TheresNoTime) >>! In T309047#7951183, @hashar wrote: > The changes in the `patch-performance` or `coverage` pipelines have a low precedence and are only triggered after everything else. > > The...
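On the scap change merged at 19:08, "parse_wmf_version: Handle 'master'": the sketch below is a hypothetical re-implementation showing the shape of the problem, not scap's actual code. Train versions like 1.39.0-wmf.13 parse into comparable integer tuples, while the literal branch name "master" has no numeric parts and needs a special case that sorts it newest.

```python
# Hypothetical sketch of a wmf-version parser with a "master" special case.
import re

def parse_wmf_version(version: str) -> tuple:
    # "master" is always newer than any numbered train branch, so give it
    # a sort key greater than any tuple of integers.
    if version == "master":
        return (float("inf"),)
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-wmf\.(\d+)$", version)
    if not m:
        raise ValueError(f"unparseable wmf version: {version!r}")
    return tuple(int(g) for g in m.groups())

assert parse_wmf_version("1.39.0-wmf.13") > parse_wmf_version("1.39.0-wmf.12")
assert parse_wmf_version("master") > parse_wmf_version("1.39.0-wmf.13")
```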
[20:27:56] hashar: ^ does that also include the jobs in `postmerge`? *That* would make slightly less sense
[20:27:59] TheresNoTime: there is a long tail of issues with our obsolete Zuul CI system unfortunately :(
[20:28:10] yeah, postmerge has low precedence as well
[20:28:47] https://gerrit.wikimedia.org/g/integration/config/+/refs/heads/master/zuul/layout.yaml#603
[20:28:58] that is the definition for the coverage pipeline, it is set as low
[20:29:17] the job servers run any job in the "high" queue
[20:29:23] then in the "normal" queue
[20:29:28] and finally the "low" queue
[20:29:43] so as long as there are jobs added to the "high" or "normal" queues, the "low" jobs would never run
[20:29:52] hm *shrug*, well I won't pretend to know enough about that to make a comment :P
[20:30:04] there is also an issue that a merge of the proposed change against the branch takes a minute or more for the big repos
[20:30:38] yeah ;D
[20:30:56] it is not that complicated, but there are surely a few layers of faulty stacks badly interacting with each other
[20:31:14] I will probably dig into the logs tomorrow to find out what happened
[20:31:26] *just keep adding more jenkins agents!*
[20:31:33] partly
[20:31:45] and probably some other daemons running the git merge operations
[20:31:45] * TheresNoTime was joking.. but if it works.. :D
[20:31:55] s/works/helps
[20:39:54] (PS3) Jforrester: Stop branching the CodeReview extension for Wikimedia production [tools/release] - https://gerrit.wikimedia.org/r/593353 (https://phabricator.wikimedia.org/T116948)
[20:40:00] (CR) Jforrester: [C: +2] "It's done. Finally." [tools/release] - https://gerrit.wikimedia.org/r/593353 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester)
[20:41:47] (Merged) jenkins-bot: Stop branching the CodeReview extension for Wikimedia production [tools/release] - https://gerrit.wikimedia.org/r/593353 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester)
[20:41:55] Release-Engineering-Team (Seen), MediaWiki-extensions-CodeReview, Wikimedia-Site-requests, Patch-For-Review, Technical-Debt: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948 (Jdforrester-WMF) Open→Resolved a: Jdforrester-WMF
[21:15:55] Release-Engineering-Team (Doing), MW-on-K8s, Release Pipeline, User-brennen: scap backport change_url: Update to use new zuul plugin - https://phabricator.wikimedia.org/T308474 (jeena) In progress→Resolved
[21:15:57] Release-Engineering-Team (Doing), MW-on-K8s, Release Pipeline, User-brennen: Scap backport change_url command - https://phabricator.wikimedia.org/T287042 (jeena)
[21:26:42] (Queue (Jenkins jobs + Zuul functions) alert) firing: (2) Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert
[21:27:59] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[21:30:21] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 39622 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[21:31:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
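hashar's description of the queue drain order is what explains the starvation reported in T309047. The toy model below (an illustration, not Zuul's actual scheduler) shows why: a worker always takes from the highest-precedence non-empty queue, so a "low" job such as coverage waits for as long as anything keeps arriving in "high" or "normal".

```python
# Toy model of strict-precedence queue draining: "low" jobs starve while
# higher-precedence queues keep receiving work.
from collections import deque

queues = {"high": deque(), "normal": deque(), "low": deque()}

def next_job():
    for precedence in ("high", "normal", "low"):
        if queues[precedence]:
            return queues[precedence].popleft()
    return None

queues["low"].append("mwcore-phpunit-coverage-patch")
queues["normal"].extend(f"gate-job-{i}" for i in range(3))
while (job := next_job()) is not None:
    print(job)  # the coverage job only prints after the normal queue drains
```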
[21:35:16] (PS6) Hashar: Make pytest error out on warnings [tools/scap] - https://gerrit.wikimedia.org/r/793846 (owner: Ahmon Dancy)
[21:35:18] (PS1) Hashar: checks: wait after kill [tools/scap] - https://gerrit.wikimedia.org/r/797509
[21:37:54] (CR) Hashar: checks: wait after kill (1 comment) [tools/scap] - https://gerrit.wikimedia.org/r/797509 (owner: Hashar)
[21:38:13] GitLab (Project Migration), Release-Engineering-Team (GitLab-a-thon 🦊), User-dduvall: Implement linting and unit tests for Blubber on GitLab CI - https://phabricator.wikimedia.org/T307534 (jeena) In progress→Resolved
[21:38:19] GitLab (Project Migration), Release-Engineering-Team (GitLab-a-thon 🦊), User-dduvall: Create Blubber repo on GitLab, archive Gerrit repo - https://phabricator.wikimedia.org/T307533 (jeena)
[21:38:32] (CR) Hashar: "Rebased on top of https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/797509 "checks: wait after kill" which fixes the pytest failure " [tools/scap] - https://gerrit.wikimedia.org/r/793846 (owner: Ahmon Dancy)
[21:41:32] hashar: Is zuul-merger single-threaded?
[21:42:13] GitLab (CI & Job Runners), Release-Engineering-Team (GitLab-a-thon 🦊), Patch-For-Review, User-brennen: Authenticate trusted runners for registry access against GitLab using temporary JSON Web Token - https://phabricator.wikimedia.org/T308501 (dduvall) p: Triage→Medium a: dduvall
[21:44:07] All evidence indicates it, so I'll go with that until told otherwise.
[21:44:19] GitLab (Infrastructure), Release-Engineering-Team, serviceops, User-brennen: GitLab major release: 15.x - https://phabricator.wikimedia.org/T309062 (brennen)
[21:44:33] GitLab (Infrastructure), Release-Engineering-Team, serviceops, User-brennen: GitLab major release: 15.x - https://phabricator.wikimedia.org/T309062 (brennen) p: Triage→Medium
[21:44:38] dancy: yeah it is
[21:44:48] dancy: we have two instances running, one on each contint server
[21:45:04] but it is super slow, notably when a repo is huge and has a ton of branches
[21:45:24] + the disks are slow
[21:45:24] like mediawiki/core. :-)
[21:45:29] yeah exactly
[21:45:39] or wikibase / MinervaNeue etc
[21:45:40] GitLab (CI & Job Runners), Release-Engineering-Team (GitLab-a-thon 🦊), Patch-For-Review, User-brennen: Authenticate trusted runners for registry access against GitLab using temporary JSON Web Token - https://phabricator.wikimedia.org/T308501 (brennen) Open→In progress
[21:45:44] GitLab (CI & Job Runners), Release-Engineering-Team (GitLab-a-thon 🦊), Patch-For-Review, User-brennen: Deploy buildkitd to trusted GitLab runners - https://phabricator.wikimedia.org/T308271 (brennen)
[21:46:02] + zuul runs the low precedence jobs only when it has nothing else to do
[21:46:22] and the jobs in coverage / patch-performance have a mutex so only one of them runs at a time
[21:46:29] so they are essentially waiting for a slot
[21:46:35] it will recover eventually
[21:47:12] ah I see you already responded on https://phabricator.wikimedia.org/T309047
[21:47:20] then if the mwcore-phpunit-coverage-patch and mediawiki-fresnel-patch-docker jobs are fast enough, maybe they can be included in the test pipeline
[21:47:20] yeah
[21:47:36] So someone really merged like 600 commits all at once?
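The mutex hashar mentions at 21:46 compounds the precedence problem: even once executors free up, coverage / patch-performance builds drain strictly one at a time. A toy illustration of that behaviour (not Zuul's implementation), using a single shared lock:

```python
# Toy model of a job mutex: many builds may be scheduled concurrently,
# but the shared lock serializes them, so a backlog drains one by one.
import threading
import time

coverage_mutex = threading.Lock()  # one lock shared by the whole pipeline

def run_coverage_build(build_id: int):
    with coverage_mutex:  # only one holder at a time
        print(f"build {build_id} running")
        time.sleep(0.1)  # stand-in for the ~3-minute coverage job

threads = [threading.Thread(target=run_coverage_build, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```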
[21:47:49] https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-patch/buildTimeTrend seems like it takes 3 minutes
[21:48:36] more or less
[21:48:45] the 600 jobs in gearman are zuul-merger requests
[21:48:49] Rude!
[21:49:06] and a series of patches with a bunch of Depends-On ends up creating a lot of such merge requests
[21:49:17] it is an issue in our current zuul which got addressed upstream
[21:49:20] but well ...
[21:50:51] dancy: yeah I'd already given the issue my uninformed opinion :P
[21:51:13] I think the 600 or so gearman functions waiting are related to all the patches pending in zuul
[21:51:29] TheresNoTime: you have raised the alarm, which is great!
[21:51:43] (Queue (Jenkins jobs + Zuul functions) alert) firing: (2) Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert
[21:52:15] ssh contint1001.wikimedia.org tail -F /var/log/zuul/merger-debug.log
[21:52:18] ssh contint2001.wikimedia.org tail -F /var/log/zuul/merger-debug.log
[21:52:27] dancy: ^ that is what I use to watch Zuul merger
[21:52:36] Annoyingly merger-debug.log shows when a merge starts but not when it finishes.
[21:52:36] probably should be sent to ELK somehow
[21:52:40] I suspect it's just the flood of libup patches
[21:52:54] libup waits for the test+gate-and-submit queues to be low before sending more patches
[21:53:04] but it doesn't wait for the coverage/postmerge/etc. ones
[21:53:05] who runs that bot (:
[21:53:06] hi legoktm :)
[21:53:10] o/
[21:53:26] Hey legoktm!
[21:54:18] there has been a large spike between 15:13 UTC and 15:33 UTC roughly
[21:54:39] which quickly resolved by 15:49 UTC
[21:54:43] ref: https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&from=1653317106013&to=1653321406667
[21:54:52] the rest (which is still alive) is the long tail
[21:57:19] legoktm: I don't think libup is much at fault since as you said it throttles the changes being sent
[21:58:09] I think it's at fault for the big backlog, because it keeps sending in new patches without letting the deprioritized queues clear theirs (unless I'm out of date on how executors are assigned these days)
[21:58:19] https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&from=1653265286954&to=1653326133500 hm, but even so, it was creeping up (> 100) prior to the spike today?
[21:59:52] https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&from=now-30d&to=now&viewPanel=21 biggest spike we've had in a month.. o.o
[22:00:24] legoktm: how very true
[22:00:46] Looks like it takes about 57 seconds for zuul merger to process a mediawiki/core merge.
[22:01:02] anyway it will recover eventually
[22:07:44] Continuous-Integration-Infrastructure, Zuul: Coverage and patch-performance pipelines appear stuck - https://phabricator.wikimedia.org/T309047 (hashar)
[22:10:33] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[22:10:41] Continuous-Integration-Infrastructure, Zuul: Coverage and patch-performance pipelines appear stuck - https://phabricator.wikimedia.org/T309047 (hashar) After a chat with @TheresNoTime @dancy and @Legoktm on IRC. There has been a large spike between 15:13 UTC and 15:33 UTC roughly. Most probably due to a...
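Since merger-debug.log records when a merge starts but not when it finishes (dancy, 21:52), and zuul-merger is single-threaded, one way to approximate dancy's "about 57 seconds per mediawiki/core merge" figure is the gap between consecutive start lines. A sketch follows; the log line pattern here is an assumption, not the exact Zuul format.

```python
# Estimate per-merge durations from gaps between consecutive merge-start
# lines in merger-debug.log. Valid only because the merger is
# single-threaded, so starts are strictly sequential.
from datetime import datetime
import re

# Assumed line shape: "2022-05-23 15:25:01,123 ... Updating repository ..."
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .* Updating repository")

starts = []
with open("/var/log/zuul/merger-debug.log") as log:
    for line in log:
        if (m := STAMP.match(line)):
            starts.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))

for prev, cur in zip(starts, starts[1:]):
    print(f"~{(cur - prev).total_seconds():.0f}s")
```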
[22:10:42] I posted a summary
[22:11:33] ty \o/
[22:13:05] dancy: the reason for the zuul-merger slowdown is the multiple branches
[22:13:29] it has a very poor way of resetting the repo which iterates across all the branches present locally, and until recently we did not even prune stale branches. I have fixed that: https://phabricator.wikimedia.org/T220606
[22:13:59] but core / wikibase (well, a lot of repositories) still have a ton of branches (old release ones, the wmf/* ones)
[22:14:12] so zuul-merger keeps iterating across those
[22:14:30] Nod.. We should hack it to only operate on the branch needing merge.
[22:14:39] which eventually we should convert to tags or move to refs/attic/heads/*
[22:14:45] which is https://phabricator.wikimedia.org/T303828
[22:15:09] which, well, I have nicely forgotten to follow up on after our Spring Cleaning sprint :-\
[22:21:43] (Queue (Jenkins jobs + Zuul functions) alert) resolved: Queue (Jenkins jobs + Zuul functions) alert - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert
[22:21:49] GitLab (Project Migration), Release-Engineering-Team (GitLab-a-thon 🦊), User-dduvall: Update Blubber documentation, codesearch, and other references for new GitLab location - https://phabricator.wikimedia.org/T307535 (jeena) a: jeena
[22:22:02] GitLab (Project Migration), Release-Engineering-Team (GitLab-a-thon 🦊), User-dduvall: Create Blubber repo on GitLab, archive Gerrit repo - https://phabricator.wikimedia.org/T307533 (jeena)
[22:22:39] GitLab (Project Migration), Release-Engineering-Team (GitLab-a-thon 🦊), User-dduvall: Update Blubber documentation, codesearch, and other references for new GitLab location - https://phabricator.wikimedia.org/T307535 (jeena) Open→In progress
[22:32:06] I will drop that here, courtesy of Mukunda: https://www.textualize.io/
[22:32:27] a ui framework for the terminal :]
[22:32:29] Very pretty
[22:32:49] I blame Mukunda for killing my night. I must try something with it before heading to bed :]
[22:33:50] You can whip up a nice interface for scap backport.
[22:34:37] WITH ONE BIG [DEPLOY] BUTTON
[22:59:45] (CR) Ahmon Dancy: [C: +2] checks: wait after kill [tools/scap] - https://gerrit.wikimedia.org/r/797509 (owner: Hashar)
[23:04:03] (Merged) jenkins-bot: checks: wait after kill [tools/scap] - https://gerrit.wikimedia.org/r/797509 (owner: Hashar)
[23:13:27] (CR) Ahmon Dancy: [C: +2] Make pytest error out on warnings [tools/scap] - https://gerrit.wikimedia.org/r/793846 (owner: Ahmon Dancy)
[23:14:18] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 (Jdforrester-WMF)
[23:14:22] Release-Engineering-Team (Priority Backlog 📥), MW-1.39-notes (1.39.0-wmf.12; 2022-05-16), Patch-For-Review, Release, Train Deployments: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218 (Jdforrester-WMF)
[23:15:21] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 (Jdforrester-WMF) ##### Risky Patch! 🚂🔥 * **Change**: https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/593353 * **Summary**: ** Branchi...
[23:21:01] (Merged) jenkins-bot: Make pytest error out on warnings [tools/scap] - https://gerrit.wikimedia.org/r/793846 (owner: Ahmon Dancy)
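On the zuul-merger reset behaviour hashar describes at 22:13: the sketch below paraphrases it from the discussion and T220606 rather than quoting the Zuul v2 source. The cost scales with the number of local branches because every one gets checked out and hard-reset, and the pruning step is the recent fix that stops branches deleted on the remote from being reset forever; moving old wmf/* branches to tags or refs/attic/heads/* (T303828) would shrink the loop itself.

```python
# Paraphrased sketch of a per-branch repo reset, the pattern that makes
# zuul-merger slow on repositories with hundreds of branches.
import subprocess

def git(repo: str, *args: str) -> str:
    return subprocess.check_output(["git", "-C", repo, *args], text=True)

def reset_repo(repo: str):
    # --prune is the T220606 fix: drop local tracking state for branches
    # that no longer exist on the remote instead of resetting them forever.
    git(repo, "fetch", "--prune", "origin")
    branches = git(repo, "for-each-ref", "--format=%(refname:short)", "refs/heads")
    for branch in branches.split():
        # One checkout + hard reset per branch: cheap once, expensive
        # hundreds of times on slow disks.
        git(repo, "checkout", branch)
        git(repo, "reset", "--hard", f"origin/{branch}")
```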
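And for the closing banter, the obligatory one-big-[DEPLOY]-button, as a minimal sketch against Textual (https://www.textualize.io/, the framework Mukunda shared). The API names follow current Textual releases and may not match whatever version existed at the time; the deploy action is a placeholder, not a scap integration.

```python
# Minimal Textual app: one big DEPLOY button.
from textual.app import App, ComposeResult
from textual.widgets import Button

class DeployApp(App):
    def compose(self) -> ComposeResult:
        # A single red button filling the screen.
        yield Button("DEPLOY", id="deploy", variant="error")

    def on_button_pressed(self, event: Button.Pressed) -> None:
        if event.button.id == "deploy":
            self.exit("choo choo 🚂")  # a real UI would invoke scap backport here

if __name__ == "__main__":
    print(DeployApp().run())
```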