[02:34:57] (CR) Jforrester: [C: +2] Archive operations/debs/pybal [integration/config] - https://gerrit.wikimedia.org/r/975386 (https://phabricator.wikimedia.org/T347623) (owner: BCornwall)
[02:36:15] (Merged) jenkins-bot: Archive operations/debs/pybal [integration/config] - https://gerrit.wikimedia.org/r/975386 (https://phabricator.wikimedia.org/T347623) (owner: BCornwall)
[02:36:52] !log Zuul: Archive operations/debs/pybal for T347623
[02:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[02:36:56] T347623: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623
[02:37:36] (CR) Jforrester: "extension-broken is already an anti-pattern. I don't think we should add second 'broken' templates like this." [integration/config] - https://gerrit.wikimedia.org/r/975374 (owner: Zoranzoki21)
[02:37:38] (CR) Jforrester: [C: -1] Zuul: Introduce extension-broken-php81-or-later [integration/config] - https://gerrit.wikimedia.org/r/975374 (owner: Zoranzoki21)
[02:41:54] (CR) Jforrester: [C: +2] parameter_functions.py: Add dependency BlogPage for the SportsTeams extension [integration/config] - https://gerrit.wikimedia.org/r/975367 (owner: Zoranzoki21)
[02:43:41] (Merged) jenkins-bot: parameter_functions.py: Add dependency BlogPage for the SportsTeams extension [integration/config] - https://gerrit.wikimedia.org/r/975367 (owner: Zoranzoki21)
[04:49:41] Release-Engineering-Team, Diffusion-Repository-Administrators, Projects-Cleanup: Archive Gerrit repositories "operations/software/hhvm-dev*" (20141017) - https://phabricator.wikimedia.org/T351600 (hashar) At a quick glance, all three repositories got forked from upstream, apparently by @ori who work...
[07:23:33] Continuous-Integration-Config, MediaWiki-Core-Tests, Quality-and-Test-Engineering-Team, SonarQube Bot, and 2 others: Improve speed of codehealth checks - https://phabricator.wikimedia.org/T351561 (kostajh) I did some searching, I am not sure if it's possible to disable the taint analysis, but per...
[08:52:29] (CR) Hashar: [C: +2] zuul: [mediawiki/extensions/AntiSpoof] Run phan with UserMerge [integration/config] - https://gerrit.wikimedia.org/r/975434 (owner: Umherirrender)
[08:52:31] (CR) Hashar: [C: +2] zuul: [mediawiki/extensions/CentralNotice] Run phan with UserMerge [integration/config] - https://gerrit.wikimedia.org/r/975435 (owner: Umherirrender)
[08:52:33] (CR) Hashar: [C: +2] zuul: [mediawiki/extensions/GlobalBlocking] Run phan with UserMerge [integration/config] - https://gerrit.wikimedia.org/r/975436 (owner: Umherirrender)
[08:52:35] (CR) Hashar: [C: +2] zuul: [mediawiki/extensions/WikiLove] Run phan with UserMerge [integration/config] - https://gerrit.wikimedia.org/r/975437 (owner: Umherirrender)
[08:54:22] (Merged) jenkins-bot: zuul: [mediawiki/extensions/AntiSpoof] Run phan with UserMerge [integration/config] - https://gerrit.wikimedia.org/r/975434 (owner: Umherirrender)
[08:54:24] (Merged) jenkins-bot: zuul: [mediawiki/extensions/CentralNotice] Run phan with UserMerge [integration/config] - https://gerrit.wikimedia.org/r/975435 (owner: Umherirrender)
[08:54:26] (Merged) jenkins-bot: zuul: [mediawiki/extensions/GlobalBlocking] Run phan with UserMerge [integration/config] - https://gerrit.wikimedia.org/r/975436 (owner: Umherirrender)
[08:54:28] (Merged) jenkins-bot: zuul: [mediawiki/extensions/WikiLove] Run phan with UserMerge [integration/config] - https://gerrit.wikimedia.org/r/975437 (owner: Umherirrender)
[08:55:51] (CR) Hashar: "Deployed" [integration/config] - https://gerrit.wikimedia.org/r/975367 (owner: Zoranzoki21)
[09:00:28] GitLab (CI &
Job Runners), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, Patch-For-Review: Migrate to using new GitLab CI runner authentication scheme - https://phabricator.wikimedia.org/T344951 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jelt...
[09:29:57] (Abandoned) Zoranzoki21: Zuul: Introduce extension-broken-php81-or-later [integration/config] - https://gerrit.wikimedia.org/r/975374 (owner: Zoranzoki21)
[09:34:14] GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, Patch-For-Review: Migrate to using new GitLab CI runner authentication scheme - https://phabricator.wikimedia.org/T344951 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jelto@cu...
[09:37:08] Project beta-scap-sync-world build #130210: FAILURE in 1 min 58 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/130210/
[09:47:22] Yippee, build fixed!
[09:47:22] Project beta-scap-sync-world build #130211: FIXED in 2 min 8 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/130211/
[09:48:33] (CR) Hashar: [C: -1] "I filed that task more than 3 years ago, my understanding at the time is the error breaks rendering of diff due to a change made in MediaW" [integration/config] - https://gerrit.wikimedia.org/r/975375 (https://phabricator.wikimedia.org/T250967) (owner: Zoranzoki21)
[09:49:05] GitLab (Project Migration), collaboration-services: Migrate SRE repositories to GitLab - operations/debs - https://phabricator.wikimedia.org/T341991 (LSobanski)
[10:17:20] GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, Patch-For-Review: Migrate to using new GitLab CI runner authentication scheme - https://phabricator.wikimedia.org/T344951 (Jelto)
[10:20:33] GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, Patch-For-Review:
Migrate to using new GitLab CI runner authentication scheme - https://phabricator.wikimedia.org/T344951 (Jelto) Reimage of a Trusted Runner worked, the Runner is available again aft...
[11:24:57] (CR) Zoranzoki21: Zuul: [mediawiki/extensions/WikEdDiff] Don't use selenium for tests (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/975375 (https://phabricator.wikimedia.org/T250967) (owner: Zoranzoki21)
[11:45:19] Beta-Cluster-Infrastructure, Machine-Learning-Team, ORES, Wikimedia-production-error: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T243553 (isarantopoulos)
[11:47:20] Beta-Cluster-Infrastructure, RESTBase, RESTBase Sunsetting: Parsoid instance on beta not accesible from restbase CI/dev envs - https://phabricator.wikimedia.org/T350353 (Vgutierrez) hmm there must have been some change impacting the kind of certificate used for that endpoint. Right now it's using a W...
[11:47:32] (CR) Zoranzoki21: Zuul: [mediawiki/extensions/WikEdDiff] Don't use selenium for tests (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/975375 (https://phabricator.wikimedia.org/T250967) (owner: Zoranzoki21)
[11:47:58] Continuous-Integration-Infrastructure, Jenkins, Infrastructure-Foundations, Puppet CI: PCC: worker out of disk space - https://phabricator.wikimedia.org/T336350 (jbond)
[11:55:47] Continuous-Integration-Infrastructure, Jenkins, Infrastructure-Foundations, Puppet CI: create systemd timer toi clean up failed pcc jobs - https://phabricator.wikimedia.org/T351634 (jbond)
[12:19:44] Continuous-Integration-Infrastructure, Jenkins, Infrastructure-Foundations, Puppet CI: create systemd timer to clean up failed pcc jobs - https://phabricator.wikimedia.org/T351634 (jbond) p:Triage→Medium
[12:20:15] Continuous-Integration-Infrastructure, Jenkins, Infrastructure-Foundations, Puppet CI, Patch-For-Review: PCC: worker out of disk space -
https://phabricator.wikimedia.org/T336350 (jbond)
[12:20:21] Continuous-Integration-Infrastructure, Jenkins, Infrastructure-Foundations, Puppet CI: create systemd timer to clean up failed pcc jobs - https://phabricator.wikimedia.org/T351634 (jbond) Open→Resolved Timer has now been deployed
[12:49:41] (CR) Hashar: [C: +2] "Deploying since https://gerrit.wikimedia.org/r/c/wmf-jvm-utils/+/975353 passes after I have fixed `sonar:sonar`." [integration/config] - https://gerrit.wikimedia.org/r/975352 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[12:50:52] (CR) CI reject: [V: -1] Add Java 11 to wmf-jvm-utils [integration/config] - https://gerrit.wikimedia.org/r/975352 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[12:52:12] (CR) Hashar: [C: +2] Add Java 11 to wmf-jvm-utils [integration/config] - https://gerrit.wikimedia.org/r/975352 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[12:53:27] (Merged) jenkins-bot: Add Java 11 to wmf-jvm-utils [integration/config] - https://gerrit.wikimedia.org/r/975352 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[13:03:10] (PS1) Hashar: Add Java 11 to wikimedia-event-utilities [integration/config] - https://gerrit.wikimedia.org/r/975800 (https://phabricator.wikimedia.org/T350587)
[13:04:45] (CR) Hashar: [C: +2] Add Java 11 to wikimedia-event-utilities [integration/config] - https://gerrit.wikimedia.org/r/975800 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[13:05:58] (CR) CI reject: [V: -1] Add Java 11 to wikimedia-event-utilities [integration/config] - https://gerrit.wikimedia.org/r/975800 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[13:06:38] (CR) Hashar: [C: +2] "INFO:jenkins_jobs.builder:Creating jenkins job wikimedia-event-utilities-maven-java11-docker" [integration/config] - https://gerrit.wikimedia.org/r/975800 (https://phabricator.wikimedia.org/T350587) (owner:
Hashar)
[13:07:24] Continuous-Integration-Infrastructure, Release-Engineering-Team, Patch-For-Review: Remove Java 8 images from integration/config - https://phabricator.wikimedia.org/T350587 (hashar)
[13:08:21] (Merged) jenkins-bot: Add Java 11 to wikimedia-event-utilities [integration/config] - https://gerrit.wikimedia.org/r/975800 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[13:08:25] Continuous-Integration-Infrastructure, Release-Engineering-Team, Patch-For-Review: Remove Java 8 images from integration/config - https://phabricator.wikimedia.org/T350587 (hashar)
[13:21:01] Continuous-Integration-Infrastructure, Release-Engineering-Team, Patch-For-Review: Remove Java 8 images from integration/config - https://phabricator.wikimedia.org/T350587 (hashar)
[13:28:32] (PS1) Hashar: Archive search/cirrus-streaming-updater [integration/config] - https://gerrit.wikimedia.org/r/975810
[13:29:29] (CR) Hashar: [C: +2] Archive search/cirrus-streaming-updater [integration/config] - https://gerrit.wikimedia.org/r/975810 (owner: Hashar)
[13:30:44] (Merged) jenkins-bot: Archive search/cirrus-streaming-updater [integration/config] - https://gerrit.wikimedia.org/r/975810 (owner: Hashar)
[13:37:29] Continuous-Integration-Config, Continuous-Integration-Infrastructure: Migrate all CI jobs from buster to bullseye or later and drop buster testing support - https://phabricator.wikimedia.org/T335765 (hashar)
[13:38:04] (CR) Hashar: [C: +2] Add Java 11 jobs to analytics/gobblin-wmf [integration/config] - https://gerrit.wikimedia.org/r/974196 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[13:38:16] Continuous-Integration-Infrastructure, Release-Engineering-Team, Patch-For-Review: Remove Java 8 images from integration/config - https://phabricator.wikimedia.org/T350587 (hashar) Open→Resolved a:hashar I have added Java 11 to most repositories, filed some tasks to archive some
repositor...
[13:39:34] (Merged) jenkins-bot: Add Java 11 jobs to analytics/gobblin-wmf [integration/config] - https://gerrit.wikimedia.org/r/974196 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[13:42:06] (PS1) Hashar: Add back java8 to gate for analytics/gobblin-wmf [integration/config] - https://gerrit.wikimedia.org/r/975813 (https://phabricator.wikimedia.org/T350587)
[13:42:20] (CR) Hashar: [C: +2] Add back java8 to gate for analytics/gobblin-wmf [integration/config] - https://gerrit.wikimedia.org/r/975813 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[13:43:43] (Merged) jenkins-bot: Add back java8 to gate for analytics/gobblin-wmf [integration/config] - https://gerrit.wikimedia.org/r/975813 (https://phabricator.wikimedia.org/T350587) (owner: Hashar)
[14:01:22] Beta-Cluster-Infrastructure, Observability-Metrics, SRE Observability (FY2023/2024-Q2): De-provision beta-specific Prometheus - https://phabricator.wikimedia.org/T344974 (fgiunchedi) The time has come, I'll clean up beta prometheus!
[14:09:31] Release-Engineering-Team (Seen), MW-on-K8s, SRE, Traffic, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) As you might have noticed by the patches here, we've pivoted as traffic splitting to the canaries via kube-proxy converges over hours, not seconds wh...
[14:21:40] Beta-Cluster-Infrastructure, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q2): De-provision beta-specific Prometheus - https://phabricator.wikimedia.org/T344974 (fgiunchedi) Open→Resolved a:fgiunchedi This is done, the beta prometheus instance is gone and so...
[14:27:16] (CR) Hashar: [C: +2] Zuul: [mediawiki/extensions/HeaderFooter] Use extension-quibble instead of extension-quibble-composer [integration/config] - https://gerrit.wikimedia.org/r/975372 (owner: Zoranzoki21)
[14:28:32] (Merged) jenkins-bot: Zuul: [mediawiki/extensions/HeaderFooter] Use extension-quibble instead of extension-quibble-composer [integration/config] - https://gerrit.wikimedia.org/r/975372 (owner: Zoranzoki21)
[14:31:13] Gerrit, Wikidata, Wikidata Query UI, [DEPRECATED] wdwb-tech, and 2 others: wikidata-query-gui-build doesn’t work when latest commit is by dependabot (commit-msg hook adds Change-Id in wrong place) - https://phabricator.wikimedia.org/T295601 (ItamarWMDE) Open→Resolved Validated on https://...
[14:34:00] Gerrit, Wikidata, Wikidata Query UI, [DEPRECATED] wdwb-tech, and 2 others: wikidata-query-gui-build doesn’t work when latest commit is by dependabot (commit-msg hook adds Change-Id in wrong place) - https://phabricator.wikimedia.org/T295601 (hashar) Congratulations! :]
[14:50:11] GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, Patch-For-Review: Migrate to using new GitLab CI runner authentication scheme - https://phabricator.wikimedia.org/T344951 (Jelto) There is no proper way of deleting the old reimaged runner beside del...
[14:56:57] GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, Patch-For-Review: Migrate to using new GitLab CI runner authentication scheme - https://phabricator.wikimedia.org/T344951 (Jelto)
[15:01:29] GitLab (CI & Job Runners), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, Patch-For-Review: Migrate to using new GitLab CI runner authentication scheme - https://phabricator.wikimedia.org/T344951 (Jelto) Open→Resolved The old `profile::gitlab::runner::registration_...
[15:02:33] Project-Admins, Tracking-Neverending: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706 (Cleo_Lemoisson) Hi, I'm doing project management for the #secteam and I'm requesting addition to the #acl_project-admins group to be able to create both ad...
[15:21:52] (PS1) Hashar: Update maven-javadoc-plugin to 3.3.0 [integration/gearman-java] - https://gerrit.wikimedia.org/r/975835 (https://phabricator.wikimedia.org/T351413)
[15:31:43] (PS2) Hashar: Update maven-javadoc-plugin to 3.3.0 [integration/gearman-java] - https://gerrit.wikimedia.org/r/975835 (https://phabricator.wikimedia.org/T351413)
[15:33:00] (PS3) Zoranzoki21: Zuul: [mediawiki/extensions/WikEdDiff] Don't use selenium for tests [integration/config] - https://gerrit.wikimedia.org/r/975375 (https://phabricator.wikimedia.org/T250967)
[15:36:42] (PS4) Zoranzoki21: Zuul: [mediawiki/extensions/WikEdDiff] Don't use selenium for tests [integration/config] - https://gerrit.wikimedia.org/r/975375 (https://phabricator.wikimedia.org/T250967)
[15:54:22] (CR) CI reject: [V: -1] Update maven-javadoc-plugin to 3.3.0 [integration/gearman-java] - https://gerrit.wikimedia.org/r/975835 (https://phabricator.wikimedia.org/T351413) (owner: Hashar)
[16:09:31] Gerrit, Release-Engineering-Team (Radar), CAS-SSO, Infrastructure-Foundations, and 3 others: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (jbond) a:jbond→None
[16:11:44] (PS1) Hashar: Switch to Wikimedia parent pom [integration/gearman-java] - https://gerrit.wikimedia.org/r/975847
[16:12:07] (CR) Hashar: [C: -1] Switch to Wikimedia parent pom [integration/gearman-java] - https://gerrit.wikimedia.org/r/975847 (owner: Hashar)
[16:18:59] (CR) CI reject: [V: -1] Switch to Wikimedia parent pom [integration/gearman-java] - https://gerrit.wikimedia.org/r/975847 (owner: Hashar)
[16:47:37] hashar: do you have
any idea what the root partition usage may be caused by on gerrit1003, it picked up yesterday: https://grafana.wikimedia.org/d/nX8li17Sk/overview-sre-collab?from=now-7d&orgId=1&to=now&viewPanel=40
[16:47:55] Or anyone else who's around
[16:50:17] I am in a meeting, will check when it has completed :)
[16:56:55] Gerrit, Release-Engineering-Team, collaboration-services: gerrit1003 root partition filing up - https://phabricator.wikimedia.org/T351658 (hashar)
[16:59:06] hashar: Did you delete a bunch of historical tags from mediawiki/vendor.git in the last month or so? I saw some backscroll here that looked like you were pruning some historical things from gerrit repos so I thought I would ask.
[16:59:42] Gerrit, Release-Engineering-Team, collaboration-services: gerrit1003 root partition filing up - https://phabricator.wikimedia.org/T351658 (hashar)
[17:00:31] I don't know if they've filed a bug yet, but a volunteer poked me over the weekend about the dev environment for Striker being busted, and part of the problem was that tags it was using had been removed from mediawiki/vendor.git.
[17:17:21] Gerrit, Release-Engineering-Team, collaboration-services: gerrit1003 root partition filing up - https://phabricator.wikimedia.org/T351658 (hashar) From the cache: ` Name |Entries | AvgGet |Hit Ratio| | Mem Disk Space|...
[17:27:24] bd808: hi, not on mediawiki/vendor.git that does not ring a bell
[17:30:37] Gerrit, Release-Engineering-Team, collaboration-services: gerrit1003 root partition filing up - https://phabricator.wikimedia.org/T351658 (hashar) After restart: ` -rw-r--r-- 1 gerrit2 gerrit2 2.9G Nov 20 17:28 gerrit_file_diff.h2.db -rw-r--r-- 1 gerrit2 gerrit2 7.6G Nov 20 17:28 git_file_diff.h2.db...
[17:31:42] Beta-Cluster-Infrastructure, RESTBase, RESTBase Sunsetting: Parsoid instance on beta not accesible from restbase CI/dev envs - https://phabricator.wikimedia.org/T350353 (Vgutierrez) as mentioned on IRC: ` it looks like profile::tlsproxy::envoy::ssl_provider should be set to acme for depl...
[17:37:19] hashar: thanks. Maybe it has been some longer term thing that I just didn't think about when I made the Striker dev environment. It looks like all wmf/* tags prior to wmf/1.41.0-wmf.3 have been removed from mediawiki/vendor.git. Unfortunately git doesn't actually track metadata history to show when particular tags were added/removed.
[17:39:16] It is not an end of the world problem. Mostly I was interested in knowing if there was a benefit to pruning those tags that I had not thought of.
[18:08:42] bd808: I filed a task about dropping old REL branches from the super projects mediawiki/extensions and mediawiki/skins
[18:08:48] for mediawiki/vendor I have no clue
[19:26:53] (PS1) Subramanya Sastry: Don't suppress the footer anymore [integration/visualdiff] - https://gerrit.wikimedia.org/r/975885
[19:28:40] (CR) Subramanya Sastry: [C: +2] Re-enable DiscussionTools since Parsoid supports it now [integration/visualdiff] - https://gerrit.wikimedia.org/r/975113 (owner: Subramanya Sastry)
[19:28:43] (CR) Subramanya Sastry: [C: +2] Don't suppress the footer anymore [integration/visualdiff] - https://gerrit.wikimedia.org/r/975885 (owner: Subramanya Sastry)
[19:28:57] (CR) Subramanya Sastry: Don't suppress the footer anymore [integration/visualdiff] - https://gerrit.wikimedia.org/r/975885 (owner: Subramanya Sastry)
[19:29:26] (Merged) jenkins-bot: Re-enable DiscussionTools since Parsoid supports it now [integration/visualdiff] - https://gerrit.wikimedia.org/r/975113 (owner: Subramanya Sastry)
[20:06:05] Project-Admins, Tracking-Neverending: Requests for addition to the #acl*Project-Admins group (in comments) -
https://phabricator.wikimedia.org/T706 (Ladsgroup) >>! In T706#9345215, @Cleo_Lemoisson wrote: > Hi, I'm doing project management for the #secteam and I'm requesting addition to the #acl_project-...
[21:10:21] Hello dancy, I'm trying to build and push an image to the docker-registry from within gitlab-ci (in a merge request). And this is the first time since removing the trusted runner https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commit/802cac99d8eb81f9ef210285340f2acfa970735b
[21:10:21] It's not working anymore.
[21:10:21] I've tagged my gitlab pipeline "trusted" again, then I allowed the worker to pick jobs from unprotected branches, and now I'm stuck with buildkit not supporting those syntaxes: https://gitlab.wikimedia.org/repos/releng/kokkuri/-/blob/main/lib/image.py?ref_type=heads#L13
[21:10:21] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/167385
[21:10:21] Do you have any idea what I'm doing wrong?
[21:48:33] Hello, does anyone know why https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47923/console is failing? Here is the patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/974621
[21:50:44] Have you tried re-running the tests to see if they pass a second time?
[21:51:50] aqu: dancy is out this week, from the output I glean that we don't allow the dockerfile-upstream builder on trusted runners, but it sounds like this worked before?
[21:52:38] JustHannah: +1 what Reedy said, and it looks like Kri.nkle beat us both, it should be re-running tests now, looks like an ephemeral problem
[21:56:10] Hello thcipriani, it used to work. We are not running this pipeline very often.
[21:56:10] I've got the same result with docker/dockerfile:latest, which is the default if you don't specify the syntax.
[21:58:14] Thanks for taking a look.
Here is the result with docker/dockerfile:latest https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/167411
[22:07:38] hrm, well this error is coming from buildkit, and it's upset about that docker/dockerfile frontend. But https://gerrit.wikimedia.org/r/965157/ seems to imply this is ok. Still looking
[22:09:31] Thank you! looks like it failed again :-(
[22:10:19] JustHannah: link to the job?
[22:11:52] aqu: indeed, that is the crux of the problem, we only allow the blubber gateway on trusted runners with buildkit: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/gitlab_runner.yaml#83 looks like that's been in place since July—did this job succeed since then?
[22:12:22] https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47923/console
[22:13:58] well that's odd.
[22:14:15] seems to be happening on the same runner
[22:17:31] ty thcipriani. Indeed, our last build with this pipeline was in July (20230711). We used to have an ad hoc custom runner on VCS to build. I can revive it. But will I be able to push to docker-registry?
[22:20:05] JustHannah: well, I guess you're not alone: https://phabricator.wikimedia.org/T282893 now lemme see how we fixed this last time...
[22:21:00] 22:14:48 rsync: [receiver] mkstemp "/cache/.phpcs.01b241d8c2f0.58a7b1068f58.cache.H0ApeS" failed: Permission denied (13)
[22:21:00] 22:14:48 rsync: [receiver] mkstemp "/cache/.phpcs.02e459ed8923.8926235aba9a.cache.yOWx8Q" failed: Permission denied (13)
[22:23:38] aqu: no, push access is limited to the trusted runners. I (as a manager who has no idea) am not sure of the workaround here (if there is one). Would you mind filing a task and tagging "GitLab (CI & Job Runners)" + "Release-Engineering-Team"? The experts are out this week from the releng side, I'm afraid :(
[22:28:21] Reedy: yeah, no idea how this directory is getting created as root.
That seems to be what the job says is happening.
[22:29:10] also the workspace on that instance is now gone, so presumably it can't happen...again?
[22:29:33] famous last words :D
[22:29:51] JustHannah: seems to be working now: https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47936/console
[22:31:47] yeah, it is! Thank you so much!
[22:34:08] Continuous-Integration-Infrastructure, Release-Engineering-Team (Priority Backlog 📥), ci-test-error: Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied" - https://phabricator.wikimedia.org/T282893 (thcipriani) Some notes today: Seems like the workspace directory i...
[22:34:47] ^ pasted my notes from digging into that; tl;dr: workspace owned by root. The hard part is figuring out how it got into that state :\
[22:51:41] easy
[22:51:50] B L A M E D O C K E R
[22:53:03] that also got reported on https://phabricator.wikimedia.org/T346723 albeit from the Jenkins jobs (which use Castor/rsync for cache)
[22:53:07] so somehow something runs as root
[22:53:35] or that one is a different issue
[22:54:20] what's interesting is the timestamp on the file: some job that ran created a cache dir at 19:34 today and that's the job that created the issue
[22:55:04] and I have another Docker conspiracy theory on https://phabricator.wikimedia.org/T282893
[22:55:10] which is oddly similar
[22:55:44] Continuous-Integration-Infrastructure, ci-test-error: mkstemp "/cache/.phpcs.011c3bbfdf57.36abd737b3f6.cache.tcHNqQ" failed: Permission denied (13) - https://phabricator.wikimedia.org/T346723 (hashar) That might share a similar root cause as {T282893}
[22:56:28] the best I could do was to do a `ls -l` at the start of the job in the hope of having some more details
[23:28:41] Continuous-Integration-Infrastructure, Release-Engineering-Team (Priority Backlog 📥), ci-test-error: Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied" -
https://phabricator.wikimedia.org/T282893 (hashar) The one before that was a success https://integration.wikim...
[23:30:34] Continuous-Integration-Infrastructure, Release-Engineering-Team (Priority Backlog 📥), ci-test-error: Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied" - https://phabricator.wikimedia.org/T282893 (thcipriani) Timestamps are interesting here. This job happened at `...
[23:32:05] thcipriani: I also posted some logs :D
[23:33:59] I am pretty sure it is a container that is still being ripped off
[23:34:08] but docker does it asynchronously
[23:34:44] and since we have deleted the `cache` directory in the postbuild step which does a `find $WORKSPACE --delete` as root, the cache is deleted
[23:35:34] but if a container is still behind, Docker gets confused because the dir vanished and I guess it magically recreates it to fulfil the mount to the still running container
[23:35:38] it is a mess :(
[23:36:00] I should sleep, I woke up at 5am this morning and it is past midnight
[23:37:03] maybe enabling debug logs in the docker daemon can help pinpoint that
[23:37:27] so this is the job the previous job exploded while waiting on: https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/4112339/console
[23:37:31] hieradata/cloud/eqiad1/integration/common.yaml:
[23:37:33] profile::ci::docker::settings:
[23:37:33] # Logging is unnecessary in CI as container output is streamed to Jenkins
[23:37:33] log-driver: none
[23:37:45] I note it says: 19:34:46 Creating directory holding cache
[23:38:00] hehe
[23:38:47] but that line should be about the creation of the directory on integration-castor05
[23:38:54] right
[23:39:01] and then it gets sync'd at the start of the next job
[23:39:26] the job then rsyncs from whatever instance ran the build to fetch the caches and store them on the integration-castor05 instance
[23:40:03] with a trick which is that it is not rsync being run on the remote side but:
[23:40:23] docker run --rm -i --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --entrypoint=/usr/bin/rsync ...
[23:40:34] (due to `rsync --rsync-path=` )
[23:40:36] so yeah
[23:40:52] if that one runs after the parent build deleted everything
[23:40:56] that would recreate the cache dir
[23:41:22] you can paste the timestamped build output of https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/4112339/console to the task
[23:42:17] on the parent build the ls is empty at 19:34:41
[23:42:37] but that castor save workspace build starts rsync at 19:34:46
[23:42:44] which creates the dir
[23:42:46] fun
[23:43:07] right, so the only possible directory it could be, is the one on castor. Now how did it get there?
[23:43:31] no job for that workspace ran between those two...
[23:43:38] at least on that node
[23:43:44] no no
[23:43:47] the sequence is
[23:43:53] https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47917/console
[23:43:56] gets killed
[23:44:20] the postbuild step triggers the castor-save-workspace-cache at 19:33:59
[23:44:30] something somehow errors out at 19:34:37
[23:44:44] the build loses track of that sub job and executes the remaining build steps
[23:44:50] the cache directory gets erased
[23:45:01] at 19:34:41 the ls -l shows everything is empty
[23:45:07] the build completes
[23:45:38] meanwhile the `castor-save-workspace-cache` build is still in the Jenkins build queue and eventually starts at 19:34:45
[23:45:46] which is AFTER the parent build has completed
[23:45:55] it does the rsync which creates the dir as root
[23:46:10] rsyncs nothing (or, well, even ends up erasing the existing cache)
[23:46:25] completes successfully at 19:34:48 but has left behind a cache dir
[23:47:17] and that castor save job has:
[23:47:20] This run spent:
[23:47:20] 45 sec waiting;
[23:47:20] 3.3 sec build duration;
[23:47:20] 48 sec total from scheduled to completion.
[23:47:41] so I think it is an issue with the postbuild script not cancelling the job while it is waiting to be scheduled
[23:48:04] and the save workspace cache should probably not try to create the dir on the remote
[23:48:21] anyway, I think you found the root cause! \o/
[23:50:59] oh
[23:54:18] didn't realize I found it even when it was staring me in the face, but your timeline makes sense. Yeah, fix: remove the mkdir in castor-save-workspace-cache and instead bail out if the directory doesn't exist?
[23:55:26] iirc the mkdir is made on castor05
[23:55:35] that is to hold the cache and have a destination where to rsync to
[23:55:47] I think the fix would have to be made in the --rsync-path
[23:56:13] maybe make it run as --user=nobody
[23:56:48] + some shortcircuit before to ensure the remote `cache` actually exists, no idea whether that is doable with rsync
[23:57:10] eg if I do `rsync source:cache /mycache/xyz`
[23:57:35] can rsync be made to gracefully abort if the source does not exist?
[23:58:14] but I guess it still has to run the command given to `--rsync-path`
[23:58:54] that is run via the shell
[23:59:54] oh
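
The "shortcircuit" idea from the end of the discussion can be sketched as a small wrapper that would stand in for the command given to `--rsync-path`. This is only an illustration under assumptions, not the actual castor job code: the function names, the image name `example/rsync-image` and the rsync arguments are hypothetical, and the real `docker run` invocation is elided in the log. The sketch checks that the workspace `cache` directory still exists before starting the container, and it uses `--mount type=bind` rather than `--volume`, since Docker creates a missing host path as a root-owned directory for `--volume` but reports an error for `--mount`.

```python
import subprocess
import tempfile
from pathlib import Path


def docker_rsync_command(cache_dir, image, rsync_args):
    """Build the remote-side command (hypothetical image and arguments).

    --mount makes Docker fail on a missing source instead of creating it.
    """
    return [
        "docker", "run", "--rm", "-i",
        "--mount", f"type=bind,source={cache_dir},target=/cache",
        "--entrypoint=/usr/bin/rsync",
        image,
        *rsync_args,
    ]


def serve_cache(cache_dir, image, rsync_args, runner=subprocess.run):
    """Run the remote rsync only if the workspace cache still exists.

    Returns True if the command ran, False if the run was skipped.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        # The parent build already wiped its workspace: skip the sync so
        # that nothing recreates the directory as root.
        return False
    runner(docker_rsync_command(cache_dir, image, rsync_args), check=True)
    return True


if __name__ == "__main__":
    # Dry run with a fake runner, so neither Docker nor rsync is needed.
    with tempfile.TemporaryDirectory() as workspace:
        missing = Path(workspace) / "cache"
        calls = []
        ran = serve_cache(missing, "example/rsync-image", ["--server"],
                          runner=lambda cmd, check: calls.append(cmd))
        print(ran, calls, missing.exists())
```

The directory check alone still leaves a short window between the test and the container start; in this sketch the `--mount` form is what covers that window. It does not address the other half of the diagnosis, namely that the queued castor-save-workspace-cache build is not cancelled when the parent build moves on.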