[00:44:05] 10Phabricator: Requesting Phabricator + LDAP account username rename to ajhalili2006 - https://phabricator.wikimedia.org/T339998 (10AndreiJirohOnDevsCentral) [00:48:26] 10Phabricator: Requesting Phabricator + LDAP account username rename to ajhalili2006 - https://phabricator.wikimedia.org/T339998 (10AndreiJirohOnDevsCentral) [01:09:09] 10GitLab (Auth & Access), 10Release-Engineering-Team, 10Patch-For-Review, 10User-brennen: Create bot to sync LDAP groups with related GitLab groups - https://phabricator.wikimedia.org/T319211 (10brennen) > Maybe `/srv/gitlab-settings` +1 [05:48:34] 10Phabricator, 10SRE: phabricator: make jnuche and dancy admins, remove dzahn - https://phabricator.wikimedia.org/T339174 (10Dzahn) Thank you, @taavi :) [05:49:16] 10Gerrit, 10LDAP-Access-Requests, 10SRE: LDAP/Gerrit: replace Daniel Z with Jelto in gerritadmins - https://phabricator.wikimedia.org/T339161 (10Dzahn) Thank you @ssingh :) [06:06:23] 10GitLab (Infrastructure), 10serviceops-collab, 10Patch-For-Review: gitlab_default_can_create_group setting deprecation - https://phabricator.wikimedia.org/T330282 (10CodeReviewBot) jelto closed https://gitlab.wikimedia.org/repos/releng/gitlab-settings/-/merge_requests/19 Disable creation of top-level groups [06:07:49] 10GitLab (Infrastructure), 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: Upgrade GitLab to major version 16 - https://phabricator.wikimedia.org/T338460 (10CodeReviewBot) jelto merged https://gitlab.wikimedia.org/repos/releng/gitlab-settings/-/merge_requests/32 move can_create_group... [06:09:59] 10GitLab (Infrastructure), 10serviceops-collab, 10Patch-For-Review: gitlab_default_can_create_group setting deprecation - https://phabricator.wikimedia.org/T330282 (10Jelto) 05Open→03Resolved This was also tracked in T338460 and a duplicate MR was uploaded in https://gitlab.wikimedia.org/repos/releng/git... [06:10:22] 10GitLab (Infrastructure), 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: Upgrade GitLab to major version 16 - https://phabricator.wikimedia.org/T338460 (10Jelto) [06:10:27] 10GitLab (Infrastructure), 10serviceops-collab, 10Patch-For-Review: gitlab_default_can_create_group setting deprecation - https://phabricator.wikimedia.org/T330282 (10Jelto) [06:37:53] (03PS2) 10Hashar: Add TimedMediaHandler to docroot [integration/docroot] - 10https://gerrit.wikimedia.org/r/931320 (https://phabricator.wikimedia.org/T338458) (owner: 10TheDJ) [06:39:25] (03CR) 10Hashar: [C: 03+2] "I have amended the commit message to expand the TMH acronym to TimedMediaHandler." [integration/docroot] - 10https://gerrit.wikimedia.org/r/931320 (https://phabricator.wikimedia.org/T338458) (owner: 10TheDJ) [06:39:58] (03Merged) 10jenkins-bot: Add TimedMediaHandler to docroot [integration/docroot] - 10https://gerrit.wikimedia.org/r/931320 (https://phabricator.wikimedia.org/T338458) (owner: 10TheDJ) [07:24:22] 10Phabricator: Requesting Phabricator + LDAP account username rename to ajhalili2006 - https://phabricator.wikimedia.org/T339998 (10Aklapper) Hi, please file separate tickets for separate systems. This is tagged with #Phabricator so let's handle only Phabricator here. 
[07:24:32] 10Phabricator: Requesting Phabricator username rename to ajhalili2006 - https://phabricator.wikimedia.org/T339998 (10Aklapper) [07:25:31] 10Phabricator: Requesting Phabricator username rename to ajhalili2006 - https://phabricator.wikimedia.org/T339998 (10Aklapper) 05Open→03Resolved a:03Aklapper [07:47:17] 10Project-Admins: Add puppet-core, puppet-infra tags - https://phabricator.wikimedia.org/T336153 (10Aklapper) 05Open→03Resolved p:05Triage→03Medium Thanks! [08:11:15] (03PS1) 10Hashar: Set label["Verified"].copyAllScoresIfNoChange = false [test/gerrit-ping] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/931873 (https://phabricator.wikimedia.org/T336660) [08:11:48] (03CR) 10Hashar: [C: 04-1] "Applying it to test/gerrit-ping.git with:" [All-Projects] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/919831 (https://phabricator.wikimedia.org/T336660) (owner: 10Hashar) [08:12:01] (03CR) 10Hashar: [V: 03+2 C: 03+2] Set label["Verified"].copyAllScoresIfNoChange = false [test/gerrit-ping] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/931873 (https://phabricator.wikimedia.org/T336660) (owner: 10Hashar) [08:14:05] (03CR) 10Hashar: [V: 03+2 C: 03+2] "The diff:" [test/gerrit-ping] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/931873 (https://phabricator.wikimedia.org/T336660) (owner: 10Hashar) [08:17:37] 10Continuous-Integration-Config, 10Gerrit, 10Patch-For-Review, 10Upstream: mwext-phpunit-coverage-patch-docker votes, gets overridden - https://phabricator.wikimedia.org/T336660 (10hashar) I have applied [[ https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/931873 | the change ]] to the `test/gerrit-ping.... [08:39:17] 10GitLab, 10Release-Engineering-Team: GitLab loses track which link you clicked when session expires - https://phabricator.wikimedia.org/T340011 (10taavi) [08:53:09] 10GitLab (Project Migration), 10Release-Engineering-Team (They Live 🕶️🧟), 10Infrastructure-Foundations, 10serviceops-collab: New LDAP user to trigger Jenkins downstream jobs - https://phabricator.wikimedia.org/T338950 (10MoritzMuehlenhoff) >>! In T338950#8928186, @Dzahn wrote: > Hey @jBond_WMF @Muehlenhoff... [09:24:09] Project beta-code-update-eqiad build #449104: 04FAILURE in 1 min 8 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449104/ [09:35:32] Yippee, build fixed! [09:35:32] Project beta-code-update-eqiad build #449105: 09FIXED in 2 min 30 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449105/ [09:53:52] Project beta-code-update-eqiad build #449106: 04FAILURE in 44 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449106/ [10:01:33] Yippee, build fixed! 
[10:01:33] Project beta-code-update-eqiad build #449107: 09FIXED in 7 min 19 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449107/ [10:23:41] Project beta-code-update-eqiad build #449109: 04FAILURE in 29 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449109/ [10:23:51] Project beta-scap-sync-world build #108434: 04FAILURE in 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/108434/ [10:26:05] 11:23:40 /usr/lib/git-core/git-submodule: 567: cd: can't cd to BlueSpiceSocialBlog [10:26:05] 11:23:40 Unable to find current revision in submodule path 'BlueSpiceSocialBlog' [10:26:24] 10:24:09 /usr/lib/git-core/git-submodule: 567: cd: can't cd to LdapAuthentication [10:26:24] 10:24:09 Unable to find current revision in submodule path 'LdapAuthentication' [10:26:28] something ain't happy [10:28:17] its broked [10:28:23] BlueSpiceSocialBlog and LdapAuthentication on beta? [10:29:27] yeah [10:29:33] I didn't check all the failures above [10:32:02] Lucas_WMDE: beta clones the full extensions.git repo [10:33:01] 10GitLab (Project Migration), 10Release-Engineering-Team (They Live 🕶️🧟), 10Infrastructure-Foundations, 10serviceops-collab: New LDAP user to trigger Jenkins downstream jobs - https://phabricator.wikimedia.org/T338950 (10MoritzMuehlenhoff) There were no objections on IRC, so I went ahead and created the ou... [10:34:00] ah ok [10:35:23] Yippee, build fixed! [10:35:23] Project beta-code-update-eqiad build #449110: 09FIXED in 2 min 22 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449110/ [10:50:37] 10GitLab (Project Migration), 10Release-Engineering-Team (They Live 🕶️🧟), 10Infrastructure-Foundations, 10serviceops-collab: New LDAP user to trigger Jenkins downstream jobs - https://phabricator.wikimedia.org/T338950 (10MoritzMuehlenhoff) The new OU is now listed at https://wikitech.wikimedia.org/wiki/SR... [10:52:44] Yippee, build fixed! [10:52:44] Project beta-scap-sync-world build #108435: 09FIXED in 17 min: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/108435/ [10:54:09] Project beta-code-update-eqiad build #449111: 04FAILURE in 1 min 24 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449111/ [10:57:19] 10GitLab (Project Migration), 10Release-Engineering-Team (GitLab V: Event Horizon 🌄), 10Patch-For-Review, 10User-brennen: Migrate mediawiki/tools/release/ to GitLab - https://phabricator.wikimedia.org/T290260 (10jbond) @dancy the git::clone change failed on the following machines contint2001: /srv/dev-ima... [10:58:32] lol [11:01:46] Yippee, build fixed! [11:01:46] Project beta-code-update-eqiad build #449112: 09FIXED in 7 min 21 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449112/ [11:23:52] Project beta-code-update-eqiad build #449114: 04FAILURE in 17 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449114/ [11:35:26] Yippee, build fixed! [11:35:26] Project beta-code-update-eqiad build #449115: 09FIXED in 2 min 25 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449115/ [11:40:37] maintenance-disconnect-full-disks build 501981 integration-agent-docker-1037 (/: 29%, /srv: 99%, /var/lib/docker: 45%): OFFLINE due to disk space [11:41:10] 10GitLab (Integrations), 10Release-Engineering-Team, 10serviceops-collab, 10User-brennen: gitlab: enable github integrations - https://phabricator.wikimedia.org/T335565 (10jbond) 05Open→03Declined >>! 
In T335565#8949912, @LSobanski wrote: > @jbond is the request to enable log in with GitHub on our GitL... [11:45:34] maintenance-disconnect-full-disks build 501982 integration-agent-docker-1037 (/: 29%, /srv: 38%, /var/lib/docker: 42%): RECOVERY disk space OK [11:50:32] 10GitLab (Project Migration), 10Release-Engineering-Team (They Live 🕶️🧟), 10Infrastructure-Foundations, 10serviceops-collab: New LDAP user to trigger Jenkins downstream jobs - https://phabricator.wikimedia.org/T338950 (10jnuche) @MoritzMuehlenhoff thanks a lot for looking into this. I've created the user... [11:54:01] Project beta-code-update-eqiad build #449116: 04FAILURE in 1 min 8 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449116/ [11:57:45] Project beta-code-update-eqiad build #449117: 04STILL FAILING in 3 min 44 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449117/ [12:10:17] Project beta-code-update-eqiad build #449118: 04STILL FAILING in 7 min 16 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449118/ [12:15:26] Project beta-code-update-eqiad build #449119: 04STILL FAILING in 2 min 25 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449119/ [12:23:46] Project beta-code-update-eqiad build #449120: 04STILL FAILING in 45 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449120/ [12:35:21] Project beta-code-update-eqiad build #449121: 04STILL FAILING in 2 min 20 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449121/ [12:38:18] hmm [12:41:29] 10Beta-Cluster-Infrastructure, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) [12:43:30] 10Beta-Cluster-Infrastructure, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) p:05Triage→03Unbreak! Guess this is blocking testing things on beta (#shared-build-failure?), so setting //UBN!// [12:44:03] unbreak now means it'll get fixed quicker ;P /s [12:45:04] Project beta-code-update-eqiad build #449122: 04STILL FAILING in 2 min 3 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449122/ [12:47:47] !log deployment-prep: `[samtar@deployment-deploy03 ~]$ sudo -u jenkins-deploy scap prep auto --no-log-message --verbose` T340030 [12:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [12:47:50] T340030: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 [12:50:59] 10Beta-Cluster-Infrastructure, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) Verbose run didn't really give anything useful :/ ` 12:48:47 https://gerrit.wikimedia.org/r/mediawiki/skins checked... [12:54:15] Project beta-code-update-eqiad build #449123: 04STILL FAILING in 1 min 14 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449123/ [12:54:32] Project beta-scap-sync-world build #108439: 04FAILURE in 3.5 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/108439/ [12:59:07] TheresNoTime: I think I had similar message yesterday ( cd: can't cd to ) [13:01:55] oh something is very up with `/srv/mediawiki-staging` on deployment-deploy03.... 
`/srv/mediawiki-staging/php-master` isn't meant to just contain all the extensions is it... they're meant to be in `/srv/mediawiki-staging/php-master/extensions` iirc? [13:02:46] and `/srv/mediawiki-staging/php` is symlinked to `/srv/mediawiki-staging/php-1.41.0-wmf.13` which doesn't exist.. [13:03:41] Project beta-code-update-eqiad build #449124: 04STILL FAILING in 40 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449124/ [13:03:51] yeah that is "normal" [13:04:04] I am writing the explanation [13:04:29] just forced a puppet run and it's doing `Notice: /Stage[main]/Beta::Autoupdater/Git::Clone[beta-mediawiki-core]/Exec[git_set_origin_beta-mediawiki-core]/returns: executed successfully (corrective) [13:04:29] Info: Git::Clone[beta-mediawiki-core]: Scheduling refresh of Exec[/bin/rm -r /srv/mediawiki-staging/php-master/extensions] [13:04:29] Notice: /Stage[main]/Beta::Autoupdater/Exec[/bin/rm -r /srv/mediawiki-staging/php-master/extensions]: Triggered 'refresh' from 1 event [13:04:29] Notice: /Stage[main]/Beta::Autoupdater/Git::Clone[beta-mediawiki-extensions]/File[/srv/mediawiki-staging/php-master/extensions]/ensure: created (corrective)` [13:05:55] TheresNoTime: And in production the wmf branch has the extensions and skins in submodules. Thus scap backport can do a submodule update from the mediawiki/core root. On beta that is not possible since mediawiki/core has no submodules. [13:05:59] that is my theory :] [13:06:03] 10Beta-Cluster-Infrastructure, 10Scap, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10hashar) ` $ scap prep auto ... scap.runcmd.FailedCommand: Command 'git submodule foreach --recursive /usr/bin/git -C /srv... [13:06:41] well puppet is trying to correct *something*.. [13:07:22] ahhh yeah jbond merged a change about `git::clone` to make it do the right thing when changing the branch/remote url. So maybe git::clone learned to `rm -r` [13:08:08] !log deployment-prep: `[samtar@deployment-deploy03 mediawiki-staging (master u=)]$ sudo puppet agent -tv` T340030, nb. taking a while to do corrective actions.. [13:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:08:11] T340030: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 [13:08:29] I think we should overhaul those bits [13:08:55] create a deployment-prep branch in mediawiki/core which has the extensions/skins registered as submodules [13:09:13] and on the deployment host clone mediawiki/core.git deployment-prep branch + submodules [13:09:59] that might be more robust and save us from cloning every single extension/skin that is attached as a submodule of mediawiki/extensions.git @master and mediawiki/skins.git @master (which have pretty much everything) [13:10:18] *something something* beta needs a code steward :p [13:10:36] yeah there are tasks for that [13:10:50] one is in a process pipeline which is itself abandoned [13:11:23] my stance is we should shut it down and if people really care get it properly resourced/staffed [13:11:25] (though as an aside, it's encouraging to see things like test.wikipedia being run from k8s!) [13:12:06] Project beta-code-update-eqiad build #449125: 04STILL FAILING in 2 min 19 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449125/ [13:12:09] hashar: I agree, beta only really gets attention when it (really) breaks..
:( [13:13:11] 10Beta-Cluster-Infrastructure, 10Scap, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) ^ ` [samtar@deployment-deploy03 mediawiki-staging (master u=)]$ sudo puppet agent -tv Info: Using configur... [13:13:20] Project beta-code-update-eqiad build #449126: 04STILL FAILING in 19 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449126/ [13:16:38] 10Beta-Cluster-Infrastructure, 10Scap, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) >>! In T340030#8952654, @hashar wrote: > ` > $ scap prep auto > ... > scap.runcmd.FailedCommand: Command 'g... [13:19:26] 10Beta-Cluster-Infrastructure, 10Scap, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10hashar) I think there is a race condition with Puppet `git::clone`. The beta-code-update-eqiad triggered at 9:23:00 UTC a... [13:20:18] 10Beta-Cluster-Infrastructure, 10Scap, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10hashar) > Oh wait, it's looking for .gitmodules in /srv/mediawiki-staging/php-master, but it's in /srv/mediawiki-staging/... [13:20:34] TheresNoTime: I think I rephrased what you wrote earlier sorry [13:20:34] maintenance-disconnect-full-disks build 502001 integration-agent-docker-1027 (/: 29%, /srv: 98%, /var/lib/docker: 48%): OFFLINE due to disk space [13:20:45] anyway the issue is Puppet doing: [13:20:46] Jun 21 09:24:07 (Git::Clone[beta-mediawiki-core]) Scheduling refresh of Exec[/bin/rm -r /srv/mediawiki-staging/php-master/extensions] [13:20:54] which nukes the extensions directory (THAT IS WRONG) [13:21:15] and the first jenkins failure happened exactly when that puppet command ran [13:21:21] so there is a race condition [13:21:39] there's now a copy of all extensions in both `/srv/mediawiki-staging/php-master` and `/srv/mediawiki-staging/php-master/extensions` — if I `rm -rf /srv/mediawiki-staging/php-master` and then trigger a puppet run again, will that correct back to at least a known good file structure? [13:22:06] hashar: true, but that rm -r is followed by `Notice: /Stage[main]/Beta::Autoupdater/Git::Clone[beta-mediawiki-extensions]/File[/srv/mediawiki-staging/php-master/extensions]/ensure: created (corrective)` ? [13:22:20] gotta dig :/ [13:23:00] and there should be no extensions directly in php-master [13:23:24] Project beta-code-update-eqiad build #449127: 15ABORTED in 22 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449127/ [13:23:31] and there should be no extensions directly in php-master. /srv/mediawiki-staging/php-master/VisualEditor is wrong [13:24:03] yeah, I'm thinking of just deleting everything in `php-master` and then getting puppet to recreate it correctly?
[13:25:07] Project beta-update-databases-eqiad build #68189: 15ABORTED in 5 min 6 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/68189/ [13:25:42] maintenance-disconnect-full-disks build 502002 integration-agent-docker-1027 (/: 29%, /srv: 7%, /var/lib/docker: 45%): RECOVERY disk space OK [13:26:08] possibly [13:26:19] I am wondering how extensions ended up cloned there though [13:26:22] * TheresNoTime is going to [13:27:07] the extensions directly inside php-master have times ranging between 11:54 and 11:57 UTC [13:27:35] 10Beta-Cluster-Infrastructure, 10Scap, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10jbond) > Notice: /Stage[main]/Beta::Autoupdater/Git::Clone[beta-mediawiki-core]/Exec[git_set_origin_beta-mediawiki-core]/... [13:27:52] hashar: TheresNoTime: i added a comment to the task [13:28:19] jbond: OH [13:28:29] tl;dr scap::master is cloning operations/mediawiki-config to /srv/mediawiki-staging [13:28:55] !log deployment-prep: `[samtar@deployment-deploy03 php-master (master *% u=)]$ sudo rm -rfv /srv/mediawiki-staging/php-master/*` T340030 [13:28:56] but why would it kill /srv/mediawiki-staging/php-master ? [13:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:28:57] T340030: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 [13:29:10] jbond: I'd already started the delete when I saw your message, damn. [13:29:38] there may be a bit more to it than that, still looking [13:30:23] 10Beta-Cluster-Infrastructure, 10Scap, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) >>! In T340030#8952772, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-releng), href=... [13:30:39] (Queue (Jenkins jobs + Zuul functions) alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert [13:30:44] at least the `rm -r php-master/extensions` has been around for a while, that is to leave room to clone mediawiki/extensions.git [13:32:19] hashar: can you pause the beta jenkins jobs? [13:32:35] good point [13:32:59] not sure them starting up while puppet is rebuilding `/srv/mediawiki-staging/php-master` will be helpful.. [13:33:10] Project beta-code-update-eqiad build #449128: 15ABORTED in 9.2 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449128/ [13:34:41] do you want to ping me when done. I'm not sure exactly what's going on but it looks like git::clone{"${stage_dir}/php-master} is fighting with git::clone{operations/mediawiki-config}.
they don't clone to the same dir but one is a subdir of the other so there could be some strange issue with that [13:35:14] (03PS1) 10Hashar: jjb: disable beta cluster jobs [integration/config] - 10https://gerrit.wikimedia.org/r/931936 (https://phabricator.wikimedia.org/T340030) [13:36:34] they are being disabled [13:37:02] TheresNoTime: jenkins jobs are disabled [13:37:12] (03CR) 10Hashar: [C: 03+2] jjb: disable beta cluster jobs [integration/config] - 10https://gerrit.wikimedia.org/r/931936 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [13:37:40] hm, so puppet finished, `/srv/mediawiki-staging/php-master` looks like it "should", I ran puppet again (force of habit to run puppet twice to make sure everything applied) and it *again* did `Info: Git::Clone[beta-mediawiki-core]: Scheduling refresh of Exec[/bin/rm -r /srv/mediawiki-staging/php-master/extensions]`, even though it had just finished cloning them all [13:37:49] fun times [13:38:12] the rm comes from modules/beta/manifests/autoupdater.pp [13:38:33] and I definitely have not written that since I always use a capital R: `rm -R` :] [13:39:07] but at least there's no duplicate clone of all the extensions in `/srv/mediawiki-staging/php-master` now.. :) [13:39:42] it has refreshonly => true and subscribe => git::clone['beta-mediawiki-core'] [13:39:46] (03Merged) 10jenkins-bot: jjb: disable beta cluster jobs [integration/config] - 10https://gerrit.wikimedia.org/r/931936 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [13:40:05] so if Puppet executes Git::Clone["beta-mediawiki-core"] that causes the rm to happen [13:40:30] hm :/ [13:40:52] jbond: I'm finished with the things I was trying.. `/srv/mediawiki-staging/php-master` is now back to "normal" [13:40:53] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:41:21] some changes were made this morning to git::clone which may cause it to send a refresh notification to its subscribers and thus trigger the rm -r php-master/extensions [13:41:22] TheresNoTime: ok checking puppet debug output now [13:41:37] ^ oh cool *and* there's like a million `VisualEditor` patches in zuul D: [13:41:52] Zuul will recover eventually :] [13:42:09] hashar: ahh yes that would explain it. when puppet came along and set the origin url it would have triggered the rm [13:42:24] i guess after that puppet was not good enough to fix things itself [13:42:49] puppet is looking good now though and i think what i was looking at was unrelated [13:42:58] in its current state, I wonder if a manual scap would work? [13:43:14] *manually running scap [13:43:17] gotta try to be sure :] [13:43:28] hashar: did you want to, or shall I? :) [13:43:35] please do :] [13:43:53] this way I can pretend I am still ignoring beta [13:44:04] jbond: thanks for the help! [13:44:09] * TheresNoTime will run `sudo -u jenkins-deploy scap prep auto --no-log-message --verbose`, look ok? [13:44:27] yeah that is the command that was mentioned on the task, that looks good [13:44:29] (and yes, thank you j/bond!)
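For readers following along, here is a minimal sketch of the Puppet wiring being described above. The resource titles match what the log quotes (`beta-mediawiki-core`, the `rm -r` exec in modules/beta/manifests/autoupdater.pp, the `refreshonly`/`subscribe` pair), but the parameters and exact structure are assumptions for illustration, not a copy of the real manifest:

```puppet
# Sketch only -- simplified from the behaviour described in the chat,
# not the actual contents of modules/beta/manifests/autoupdater.pp.
git::clone { 'beta-mediawiki-core':
    directory => '/srv/mediawiki-staging/php-master',
    origin    => 'https://gerrit.wikimedia.org/r/mediawiki/core.git',
}

# refreshonly => true means this exec only runs when it receives a refresh
# event; because it subscribes to the core clone, any "corrective" change on
# that resource (such as a new git_set_origin exec firing) re-runs the rm
# and wipes the extensions checkout out from under a running Jenkins job.
exec { '/bin/rm -r /srv/mediawiki-staging/php-master/extensions':
    refreshonly => true,
    subscribe   => Git::Clone['beta-mediawiki-core'],
}

git::clone { 'beta-mediawiki-extensions':
    directory => '/srv/mediawiki-staging/php-master/extensions',
    origin    => 'https://gerrit.wikimedia.org/r/mediawiki/extensions.git',
    require   => Exec['/bin/rm -r /srv/mediawiki-staging/php-master/extensions'],
}
```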
[13:44:57] !log deployment-prep: `[samtar@deployment-deploy03 php-master (master *% u=)]$ sudo -u jenkins-deploy scap prep auto --no-log-message --verbose` T340030 [13:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:44:59] T340030: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 [13:45:50] I am tempted to migrate that mess to a monorepo :] [13:46:22] I would say "if it's not broke, don't fix it", but.... :-P [13:49:11] modules/mediabackup/manifests/worker.pp: git::clone { 'operations/mediawiki-config': [13:49:14] fun :] [13:49:16] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) ^ above got further than previously, but then error'd with: ` 13:46:50 Update https:/... [13:49:35] and it is not the only one, so bunch of codes relies on a copy of mediawiki-config which is not scap deployed [13:50:39] (Queue (Jenkins jobs + Zuul functions) alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DQueue+%28Jenkins+jobs+%2B+Zuul+functions%29+alert [13:50:51] fatal: no such branch: 'HEAD..' nice [13:51:06] hashar: for https://phabricator.wikimedia.org/T340030#8952830, I could just go into `skins/` and do `sudo -u jenkins-deploy git pull origin master`.. right? [13:51:46] well it is not cloned so ... [13:52:15] then Puppet should have cloned it? [13:52:49] ah no it is the jenkins job cloning it [13:53:00] so yeah pull [13:53:03] okay :) [13:53:09] or maybe even reclone it [13:53:52] skins$ git rev-parse HEAD [13:53:52] 225e32a657f6995da4b8ff884875675abc308a5f [13:53:58] \o/ [13:55:58] !log deployment-prep: Pulled `skins/`, then `sudo -u jenkins-deploy scap prep auto --verbose` T340030 [13:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [13:56:01] T340030: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 [14:00:10] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) This time failed with: ` Another git process seems to be running in this repository,... [14:04:35] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) Ran again, worked! :D [14:04:40] hashar: try re-enabling the CI jobs? :D [14:06:25] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [14:06:25] I will in a few [14:06:33] are both Puppet and scap prep auto running fine now? [14:07:30] just testing sync-world (even though that was fine..) 
and then will test puppet [14:10:05] (03PS1) 10Hashar: Revert "jjb: disable beta cluster jobs" [integration/config] - 10https://gerrit.wikimedia.org/r/931941 (https://phabricator.wikimedia.org/T340030) [14:17:49] looks like scap is taking its time :] [14:18:37] really is ^^' [14:19:18] it is probably busy resyncing everything from scratch [14:19:59] slowest rsync ever :D [14:20:52] Hello! We tested the new host that we will switch releases over to (releases1003), and it looks good. Thanks to jnuche for the help! Next step is to plan the actual transition. How do you folks want to handle informing people who have upload access? I have a list of people who have shell access to the host, do you want to email them? Or perhaps email tech@ instead? I'm aiming to do this on Monday next week if there are no objections. [14:21:48] eoghan: I don't think we have a process. The sole use case would be for mediawiki tarball which I guess is solely Reedy [14:22:08] I don't think anyone ever tells me [14:22:17] +1 on monday I am available :] [14:22:18] I manage to work it out fine ;) [14:22:20] Are there not people who upload things manually to the releases hosts? [14:22:24] Only me [14:22:28] Oh, cool. [14:22:43] Reedy: Hey, FYI I'm going to switch the releases host on Monday from releases1002 to releases1003. Hope that's ok :D [14:22:48] Job done. [14:22:51] :D [14:22:54] lol [14:22:57] though potentially people could try to download the tarballs, so it might be worth announcing it to wikitech-l for good measure [14:23:21] eoghan: As long as https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org gets updated too [14:23:28] also congrats eoghan on doing all the preparation steps to switch it :] [14:24:09] hashar: The content served by the hosts should be the same, everything will be synced from whatever the primary host is to the old hosts. We'll keep the old hosts around (with an MOTD banner if anyone uses ssh) for a few weeks after and watch the logs for who connects. [14:24:35] The big risk is uploading to a replica (because using an old hostname) and being overwritten by the sync from the primary [14:24:45] Reedy: I'll make sure that's on my list, thanks! [14:25:34] https://phabricator.wikimedia.org/T334435 is the task we're tracking, I'll put my checklist for the migration and decom in there shortly. [14:26:25] hashar: don't re-enable the CI jobs (: sync-world failed [14:27:27] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) ***aaaaaah!*** — the `sync-world` failed right at the end; ` 14:25:20 php-fpm-restar... [14:27:51] Reedy: we should be able to have the doc to point to `releases.discovery.wmnet` (currently points me to releases1002) [14:28:04] with a ssh config having: [14:28:08] Host *.discovery.wmnet [14:28:14] ProxyJump bast1003.wikimedia.org [14:28:59] hashar: The downside of that is that someone who sshes to releases.d.w will get a host key error if we move from one host to another. [14:28:59] TheresNoTime: oh joy :-\ [14:30:24] doing another prep, but I'm pretty sure puppet or something is conflicting with the (long running) scap processes and "refreshing" the extensions dir part the way through.. 
probably didn't affect things much as scap normally runs pretty quickly [14:31:01] Modify: 2023-06-21 14:28:41.023672334 +0000 [14:31:09] that is the extension.json [14:31:57] `14:28:24 Fatal error: Error Loading extension. Unable to open file /srv/mediawiki-staging/php-master/extensions/PagedTiffHandler/extension.json` [14:32:00] yuppppp D: [14:32:09] Jun 21 14:29:45 deployment-deploy03 puppet-agent[31824]: (/Stage[main]/Beta::Autoupdater/Git::Clone[beta-mediawiki-extensions]/Exec[git_set_origin_beta-mediawiki-extensions]/returns) executed successfully (corrective) [14:32:13] fun [14:32:18] grr [14:32:25] so Puppet keeps mangling things ? :( [14:32:41] Jun 21 14:24:45 deployment-deploy03 puppet-agent[31824]: (/Stage[main]/Beta::Autoupdater/Exec[/bin/rm -r /srv/mediawiki-staging/php-master/extensions]) Triggered 'refresh' from 1 event [14:32:43] ah yeah :] [14:32:45] it does [14:33:00] so it again `rm -r` the extensions at 14:24 [14:33:09] which well .. deletes them all [14:33:26] why does it need to "refresh" them like that :/ [14:33:27] mediawiki fails cause I guess the command is invoked from mediawiki-staging and the files have vanished? [14:33:39] and then puppet reclones everything [14:33:44] looks like it, yeah [14:34:00] and the rm is notified by: Jun 21 14:24:24 deployment-deploy03 puppet-agent[31824]: (/Stage[main]/Beta::Autoupdater/Git::Clone[beta-mediawiki-core]/Exec[git_set_origin_beta-mediawiki-core]/returns) executed successfully (corrective) [14:34:08] aka Puppet clones mediawiki/core [14:34:36] which on a fresh host creates a php-master/extensions/ directory that is in mediawiki/core which would thus prevent the clone of mediawiki/extensions [14:34:43] going to try a sync one more time and see if I get lucky, but failing that (and even then..) someone needs to take a look at puppet :(( [14:34:44] hence why after cloning core we delete ./extensions [14:35:19] the thing is git::clone is apparently not reentrant? [14:35:30] * TheresNoTime is not that comfortable with puppet so wouldn't like to guess.. [14:35:48] the Exec has title `git_set_origin_beta-mediawiki-core` which is a change merged this morning to `git remote set-url origin ...` [14:35:59] so I think it keeps running it [14:36:23] which thus causes the Git::Clone["beta-mediawiki-core"] to be considered as having changed [14:36:32] which triggers the notification to the exec rm -r extensions [14:36:34] and things explode [14:36:53] so: [14:36:59] hashar: if/when you get a moment, would you be so kind as to comment that on the task? ^^ [14:37:02] 1) that exec should not do anything unless it has to [14:37:21] 2) we can add a guard to not delete php-master/extensions if it is already a git directory (should be easy) [14:37:40] TheresNoTime: I can't! I have disabled copy pasting on my system to avoid repeating code [14:37:44] (just kidding) [14:37:58] hahah [14:38:53] is there a way to pause puppet on a machine? [14:38:58] (disable the service?) [14:39:03] s/disable/stop [14:41:32] https://wikitech.wikimedia.org/wiki/Puppet#Maintenance [14:41:43] sudo disable-puppet "some reason - T12345" [14:41:43] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345 [14:42:10] ty!
[14:43:20] !log deployment-prep: `[samtar@deployment-deploy03]$ sudo disable-puppet "T340030"` T340030, seeing if it *is* puppet to blame here [14:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [14:43:23] T340030: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 [14:43:33] TheresNoTime: task updated with a summary of the above :] [14:43:41] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10hashar) Ah that is never ending :] From IRC: 14:24:45 deployment-deploy03 puppet-agent[31824]: (/S... [14:43:42] thank you! :D [14:44:28] (03CR) 10Daniel Kinzler: [C: 03+1] "Fine with me!" [integration/config] - 10https://gerrit.wikimedia.org/r/931706 (owner: 10Jforrester) [14:48:23] oh fun [14:48:27] skins are handled differently [14:51:42] hashar: I haven't read all scrollback but lemme know if there's something I can help with [14:52:16] https://phabricator.wikimedia.org/T340030#8953010 is a good summary fwiw [14:52:25] Thanks. Reading. [14:53:26] remote: https://gerrit.wikimedia.org/r/c/operations/puppet/+/931949 beta: avoid erasing extensions when already present [14:54:29] (okay, as expected, with `sudo disable-puppet "T340030"` on `deployment-deploy03`, scap prep/sync runs successfully) [14:54:30] T340030: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 [14:55:54] Looks like the issue is in good hands. [14:56:03] dancy: good morning :] [14:56:20] your git::clone patches got merged this morning and have a side effect on the Puppet code which manages the git repo on beta [14:56:28] but well the fault is really in the beta manifest :) [14:57:01] then I suspect git::clone sets the remote origin on each run, but I haven't investigated yet [14:57:33] I think the `git remote set-url` guard does not work properly: `unless => "[ \"\$(${git} remote get-url ${remote_name})\" = \"${remote}\" ]"`. OR it is something else [15:04:36] (03CR) 10Jforrester: [C: 03+2] Revert "Enable the parsoid extension when testing Flow" [integration/config] - 10https://gerrit.wikimedia.org/r/931706 (owner: 10Jforrester) [15:06:08] (03Merged) 10jenkins-bot: Revert "Enable the parsoid extension when testing Flow" [integration/config] - 10https://gerrit.wikimedia.org/r/931706 (owner: 10Jforrester) [15:07:25] !log Zuul: Revert "Enable the parsoid extension when testing Flow" [15:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:10:16] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10hashar) > A) have git clone to only set the origin when it is changed (I caught that on codereview... [15:11:10] TheresNoTime: I am cherry picking https://gerrit.wikimedia.org/r/c/operations/puppet/+/931949 [15:11:18] should prevent the deletion of extensions at lest [15:11:20] least [15:12:18] hashar: Will you also have to do that for the skins.git dir? [15:12:44] James_F: nop, skins are handled differently using a `git init /srv/mediawiki-staging/php-master/skins` [15:12:49] … [15:12:51] * James_F sighs.
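To make the suspected guard failure concrete: the `unless` condition quoted just above compares the repository's current remote URL to Puppet's `$remote` parameter as plain strings, so any mismatch (for example the trailing `.git` difference discovered later in the log) makes the exec "corrective" on every agent run. A sketch follows; the `unless` string is copied from the quote above, while the exec title pattern comes from the log and the remaining parameters (`$git`, `$directory`, `$remote_name`, `$remote`) are assumed names, not taken from the real git::clone code:

```puppet
# Sketch of the git_set_origin_* exec inside git::clone, as discussed above.
# If something else (scap, a Jenkins job) rewrites origin to a URL that is
# not byte-for-byte equal to ${remote}, the 'unless' test never succeeds,
# the exec runs on every Puppet run, and the resulting corrective event on
# Git::Clone is what the rm -r exec in autoupdater.pp is subscribed to --
# so the extensions directory keeps getting deleted.
exec { "git_set_origin_${title}":
    cwd     => $directory,
    command => "${git} remote set-url ${remote_name} ${remote}",
    unless  => "[ \"\$(${git} remote get-url ${remote_name})\" = \"${remote}\" ]",
}
```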
[15:12:54] and I have been too lazy to refactor the code to have both handled the same way [15:12:55] yeah [15:13:16] Courage, mon brave, courage! [15:14:22] merci mon bon ami [15:14:45] * hashar runs puppet [15:21:19] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10hashar) I have cherry picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/931949 to the pup... [15:21:44] TheresNoTime: so I think git::clone will no more delete the extension directory. I have cherry picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/931949/3/modules/beta/manifests/autoupdater.pp and it seems happy now [15:22:00] by happy, I mean it no more deletes the extensions directory [15:22:02] hashar: It would be nice if "scap prep auto" was used to set up /srv/mediawiki/php-master [15:22:13] */srv/mediawiki-stating/php-master [15:24:52] possibly eyah [15:25:03] and move the logic from Puppet to scap? [15:25:21] I also considered filing a task to have Gerrit to craft a monorepository holding all the repositories we need [15:25:42] so then we would just git clone/git update --recurse-submodule [15:26:02] but I gave up half way through writing down the idea [15:26:19] Baby steps. [15:26:46] yeah [15:26:48] it is hard [15:30:17] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10jbond) > No clue WHY it does not work on beta though : I think it dose work on beta. ignore what i... [15:30:22] Having the staging area on the deploy servers be a manually curated mono-repo was something I pitched years ago as a possible way to speed up `scap fetch`. Computing the rsync deltas over and over and over as each node pulls from its closest mirror uses up a fair amount of IOPS [15:31:08] thank you hashar et al! :D [15:31:38] *now* the CI jobs can be enabled again? [15:34:37] hashar: do you want me to merge that cr you just cherry picked, afai it only affects beta anyway right? [15:34:37] TheresNoTime: I guess we can try :) [15:35:03] jbond: it looks like the cherry pick does what is intended and it should only affect beta :) [15:35:07] so +1 :] thank you! 
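The guard proposed earlier and cherry-picked as "beta: avoid erasing extensions when already present" could look roughly like this. This is a sketch of the approach described in the chat, not the actual content of change 931949, and the `onlyif` test and `path` value are assumptions:

```puppet
# Sketch: only wipe php-master/extensions when it is NOT already its own git
# checkout, i.e. when it is just the placeholder directory that ships with a
# plain mediawiki/core clone. If mediawiki/extensions.git is already cloned
# there (a .git directory exists), the refresh becomes a no-op.
exec { '/bin/rm -r /srv/mediawiki-staging/php-master/extensions':
    path        => ['/bin', '/usr/bin'],
    refreshonly => true,
    subscribe   => Git::Clone['beta-mediawiki-core'],
    onlyif      => 'test ! -d /srv/mediawiki-staging/php-master/extensions/.git',
}
```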
[15:35:14] ack will merge now [15:35:54] hashar: i also sent a message to the task with a more expansive explanation of what i think might have happened [15:36:02] !log Reenabling deployment-prep Jenkins jobs # T340030 [15:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [15:36:05] T340030: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 [15:36:24] (03CR) 10Hashar: [C: 03+2] Revert "jjb: disable beta cluster jobs" [integration/config] - 10https://gerrit.wikimedia.org/r/931941 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [15:37:02] jbond: ahh very welcome thank you, cause I always think you are the Hubble telescope while I am running blindfolded :] [15:37:18] lol :) [15:37:42] TheresNoTime: it is running at https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/449129/console [15:37:51] ^^ [15:38:27] (03Merged) 10jenkins-bot: Revert "jjb: disable beta cluster jobs" [integration/config] - 10https://gerrit.wikimedia.org/r/931941 (https://phabricator.wikimedia.org/T340030) (owner: 10Hashar) [15:39:50] Yippee, build fixed! [15:39:50] Project beta-code-update-eqiad build #449129: 09FIXED in 2 min 22 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/449129/ [15:40:16] woo, finally :) [15:40:24] 10GitLab (Auth & Access), 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review, 10User-brennen: Create bot to sync LDAP groups with related GitLab groups - https://phabricator.wikimedia.org/T319211 (10Jelto) >>! In T319211#8951404, @dancy wrote: > @jelto Our plan is to run `sync-gitlab-gro... [15:41:31] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10TheresNoTime) p:05Unbreak!→03Triage Looks [almost] resolved, no longer blocking beta deployment... [15:41:50] insert a relevant Monty Python quote here [15:42:05] then will scap prep auto work still? :) [15:43:53] Notice: /Stage[main]/Beta::Autoupdater/Exec[/bin/rm -r /srv/mediawiki-staging/php-master/extensions]: Triggered 'refresh' from 1 event [15:43:53] Notice: /Stage[main]/Beta::Autoupdater/Git::Clone[beta-mediawiki-extensions]/Exec[git_set_origin_beta-mediawiki-extensions]/returns: executed successfully (corrective) [15:43:55] :-( [15:44:56] but I don't think that got deleted [15:45:17] well one prep worked, the sync is taking a while (but that's expected).. so hopefully it's still okay? [15:46:20] then maybe https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/449129/console changes the git remote somehow [15:47:55] `scap prep auto` does have a call to `git remote set-url --push origin ...` [15:48:45] Code added by you. [15:49:06] yeah though that is for the push url [15:49:59] maybe I can add some `git remote get-url` print statements here and there [15:53:34] and the sync has some large bumps in running time https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/buildTimeTrend [15:55:06] l10n rebuilds [15:55:24] *and resync [15:56:21] Proposal: Remove the `"/bin/rm -r ${stage_dir}/php-master/extensions"` exec from autoupdater.pp [15:57:15] Yippee, build fixed!
[15:57:15] Project beta-scap-sync-world build #108440: 09FIXED in 17 min: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/108440/ [15:57:40] AH [15:57:51] so currently the remote is: https://gerrit.wikimedia.org/r/mediawiki/extensions [15:58:09] while the beta code update job is running [15:58:20] but puppet sets it to https://gerrit.wikimedia.org/r/mediawiki/core.git [15:58:24] err [15:58:29] https://gerrit.wikimedia.org/r/mediawiki/extensions.git [15:58:35] ah cripes. [16:01:10] Hey Releng - sec.team is using https://docker-registry.wikimedia.org/python3/tags/ and other images for Gitlab CI and it looks like some are still using stretch as a non-archived Debian release? Are there plans to push out new images soon? I looked around Phab for a bug but couldn’t find one. Anyhow, the last tag for the python3 image is 4/23/2023, which is around when stretch went to archive.debian.org. [16:01:19] 10Beta-Cluster-Infrastructure, 10Scap, 10Patch-For-Review, 10ci-test-error: beta-code-update-eqiad: fatal: No url found for submodule path {repo} in .gitmodules - https://phabricator.wikimedia.org/T340030 (10hashar) So while https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ is runn... [16:01:31] TheresNoTime: so yeah the git remote url is set differently by scap versus Puppet and they fight with each other :] [16:01:34] to be continued [16:02:41] sbassett: python3-{buster,bullseye} exist, but for more details you should ask serviceops [16:03:06] sbassett: they are maintained by SRE via operations/docker/production-images git repo. The new images have the debian distro in the name eg: `docker-registry.wikimedia.org/python3-buster` [16:03:33] or more recent `docker-registry.wikimedia.org/python3-bullseye` [16:03:36] Ok thanks! [16:07:58] If -buster,-bullseye are set up similarly to the old python3 image and considered stable, then that should work fine for us. [16:25:52] sbassett: yeah they should :] [16:26:10] TheresNoTime: I am off for virtual offsite, thanks for all the scap / puppet debugging runs! [16:26:41] No worries, thank you for getting it working in the end! [16:27:00] something something will need to be fixed though [16:35:21] 10Release-Engineering-Team, 10ci-test-error: Gerrit gives spurious V-1 Merge Failed in wikimedia/fundraising/tools repo - https://phabricator.wikimedia.org/T336902 (10Ejegg) Currently blocking merge of https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/tools/+/931268/ [16:35:51] 10Release-Engineering-Team, 10ci-test-error: Gerrit gives spurious V-1 Merge Failed in wikimedia/fundraising/tools repo - https://phabricator.wikimedia.org/T336902 (10AnnWF) happened couple times on https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/tools/+/931268 [16:45:50] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Zuul, 10ci-test-error: Gerrit gives spurious V-1 Merge Failed in wikimedia/fundraising/tools repo - https://phabricator.wikimedia.org/T336902 (10hashar) Yes that is a known issue in #zuul due to Gerrit repository names overlapping. We h... [16:55:33] maintenance-disconnect-full-disks build 502044 integration-agent-docker-1031 (/: 29%, /srv: 98%, /var/lib/docker: 53%): OFFLINE due to disk space [17:00:40] maintenance-disconnect-full-disks build 502045 integration-agent-docker-1031 (/: 29%, /srv: 47%, /var/lib/docker: 51%): RECOVERY disk space OK [18:33:35] i've been seeing a lot of jobs failing recently with various variants of "ENOSPC: no space left on device". is that a known problem?
example: https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74-docker/43081/console [18:35:23] MatmaRex: yes the infra is falling apart [18:35:37] oh D: [18:36:23] it started happening fairly recently, most probably cause mediawiki + extensions somehow reached some threshold which causes too much disk pressure [18:40:42] MatmaRex: https://phabricator.wikimedia.org/T338627#8922715 is the analysis I did last time [18:41:02] there is a Jenkins job which automatically unpools the agents, notifies this channel and repools them when they have enough disk again [18:42:00] the root cause is `/srv` is 36GB large, 4 or 5 are consumed by various things which leaves roughly 31G [18:42:06] which are shared by up to 3 concurrent builds [18:42:26] and a build of mediawiki+extensions (the wmf-quibble-* jobs) takes 10.5G once fully installed [18:42:37] so 10.5G * 3 > 31G available disk space => failure [18:42:43] so basically someone added too many node_modules? :> [18:42:44] gotta rebuild the whole fleet to larger instances I guess [18:42:53] possibly yeah I haven't investigated [18:43:14] I also added a few more repos to the git cache which end up on /srv, but that was a few weeks ago [18:43:36] and some repositories might have been added to the `wmf-quibble*` jobs [18:44:18] i feel like i've been seeing these errors more often since only a week or two ago [18:44:53] that can surely be figured out by parsing the log of this channel (Jenkins alerts here as `wmf-insecte`) [18:44:57] maybe we did add some ridiculous dependency somewhere. it would be interesting if you had a way to figure out when the error rate increased [18:45:02] (insecte is the French word for "bug") [18:45:31] I guess a grep through https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-releng/ :) [18:45:45] "maintenance-disconnect-full-disks"? [18:45:48] i'll grep just for fun [18:45:53] yes correct [18:46:05] there is an example above at 17:00:40 UTC [18:47:04] if you do the grep, paste it somewhere :] I will have to file a task about it and reach out to wmcs to ask for a larger flavor [18:47:21] then migrate everything (and I guess upgrade Debian OS, repool all agents etc, it is a bit tedious) [18:47:42] I am off! [18:47:53] :o good night [19:02:36] hashar: i guess it's been a problem for a couple of months, but it increased about 2 weeks ago. https://phabricator.wikimedia.org/F37111403 https://phabricator.wikimedia.org/P49465 [19:03:11] can't really pinpoint a specific date though [19:46:09] (03PS1) 10AikoChou: inference-services: add readability pipelines [integration/config] - 10https://gerrit.wikimedia.org/r/931994 (https://phabricator.wikimedia.org/T334182) [20:07:44] 10GitLab, 10Release-Engineering-Team: GitLab merge request pages show an error when logged out - https://phabricator.wikimedia.org/T340062 (10taavi) [20:10:32] maintenance-disconnect-full-disks build 502083 integration-agent-docker-1025 (/: 29%, /srv: 95%, /var/lib/docker: 45%): OFFLINE due to disk space [20:15:40] maintenance-disconnect-full-disks build 502084 integration-agent-docker-1025 (/: 29%, /srv: 39%, /var/lib/docker: 46%): RECOVERY disk space OK [20:26:54] 10GitLab (CI & Job Runners), 10Release-Engineering-Team (They Live 🕶️🧟), 10serviceops, 10Patch-For-Review: Provide ability to tag GitLab CI built images with a datetime format, set as default in pipeline-to-gitlab conversion - https://phabricator.wikimedia.org/T338224 (10CodeReviewBot) dancy closed https:/... [20:41:09] (03CR) 10Jon Harald Søby: "This change is ready for review."
[integration/config] - 10https://gerrit.wikimedia.org/r/932001 (owner: 10Jon Harald Søby) [20:42:42] 10GitLab (CI & Job Runners), 10Release-Engineering-Team (They Live 🕶️🧟), 10serviceops, 10Patch-For-Review: Provide ability to tag GitLab CI built images with a datetime format, set as default in pipeline-to-gitlab conversion - https://phabricator.wikimedia.org/T338224 (10CodeReviewBot) dancy opened https:/... [20:51:58] 10GitLab (CI & Job Runners), 10Release-Engineering-Team (They Live 🕶️🧟), 10serviceops, 10Patch-For-Review: Provide ability to tag GitLab CI built images with a datetime format, set as default in pipeline-to-gitlab conversion - https://phabricator.wikimedia.org/T338224 (10CodeReviewBot) kharlan closed https... [21:42:28] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Rebuild WMCS insegration instances to larger flavor - https://phabricator.wikimedia.org/T340070 (10hashar) [21:43:26] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Machine-Learning-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (10hashar) As a follow up, I have filed {T340070} [21:43:33] 10Continuous-Integration-Infrastructure, 10ci-test-error: integration-agent-docker-1032 out of disk space (was: Frequent Selenium failures) - https://phabricator.wikimedia.org/T338627 (10hashar) As a follow up, I have filed {T340070} [21:49:39] 10GitLab (Auth & Access), 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review, 10User-brennen: Create bot to sync LDAP groups with related GitLab groups - https://phabricator.wikimedia.org/T319211 (10dancy) >>! In T319211#8953270, @Jelto wrote: > gitlab1003 is gitlab-replica-old.wikimedia... [21:59:06] 10GitLab (Upstream pit of despair 🕳️), 10Release-Engineering-Team: GitLab loses track which link you clicked when session expires - https://phabricator.wikimedia.org/T340011 (10brennen) I'm not totally sure, but I think this is general GitLab behavior that I've run into on gitlab.com repeatedly. I agree that... [22:02:51] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Run Jenkins jobs from tmpfs rather than disk - https://phabricator.wikimedia.org/T340073 (10hashar) [22:03:04] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Rebuild WMCS insegration instances to larger flavor - https://phabricator.wikimedia.org/T340070 (10hashar) [22:04:54] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: Rebuild WMCS insegration instances to larger flavor - https://phabricator.wikimedia.org/T340070 (10hashar) //I have moved the bits about moving the builds to use `tmpfs` to a standalone task T340073//. [22:05:04] now I can sleep :] [22:28:08] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Zuul, 10ci-test-error: Gerrit gives spurious V-1 Merge Failed in wikimedia/fundraising/tools repo - https://phabricator.wikimedia.org/T336902 (10Ejegg) @hashar Thanks for finding that, and thanks for the temporary fix! We really want t... 
[22:32:46] 10Gerrit, 10Release-Engineering-Team, 10SRE, 10serviceops-collab, 10Patch-For-Review: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) [22:52:15] gerrit1001 is now history (boot sector nuked) [22:52:37] but more data than ever is in bacula [22:52:51] also, even data from cobalt is still copied to gerrit1003 now [23:33:07] 10Phabricator, 10Release-Engineering-Team (They Live 🕶️🧟), 10serviceops-collab, 10Patch-For-Review, 10User-brennen: Migrate phabricator.wikimedia.org to Phorge as upstream - https://phabricator.wikimedia.org/T333885 (10Dzahn) [23:33:12] 10Phabricator, 10DBA, 10Data-Persistence-Backup, 10serviceops-collab, 10Patch-For-Review: phabricator->phorge migration - database handling - https://phabricator.wikimedia.org/T335080 (10Dzahn) 05Open→03In progress p:05Medium→03High [23:55:02] 10Phabricator, 10DBA, 10Data-Persistence-Backup, 10serviceops-collab, 10Patch-For-Review: phabricator->phorge migration - database handling - https://phabricator.wikimedia.org/T335080 (10Dzahn) @brennen I am still working on creating the VM.. unfortunately I am running into issues such as: ` Wed Jun 21...