[00:04:33] Ok!
[07:04:00] Scap, Parsoid, SRE, serviceops: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (Joe) yeah, +1 to killing with fire :)
[08:33:53] Scap, Parsoid, SRE, serviceops: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff I've removed the Puppet class from the bastions, the existing files will vanish with ongoing reimages.
[08:43:09] GitLab (Infrastructure), Release-Engineering-Team, serviceops-collab, Patch-For-Review: GitLab minor version upgrade 15.7.3 - https://phabricator.wikimedia.org/T326815 (Jelto) I updated replicas to `gitlab-ce` to `15.7.3` and `gitlab-runner` to `15.7.2`. Test instance: [x] gitlab-prod-1001.devto...
[08:49:18] GitLab needs a short restart at around 9:30 UTC
[09:43:02] GitLab (Infrastructure), serviceops-collab, Patch-For-Review: Automate GitLab version upgrade process - https://phabricator.wikimedia.org/T323569 (Jelto)
[09:43:54] GitLab (Infrastructure), Release-Engineering-Team, serviceops-collab, Patch-For-Review: GitLab minor version upgrade 15.7.3 - https://phabricator.wikimedia.org/T326815 (Jelto) Open→Resolved I all instances updated to `gitlab-ce` `15.7.3` and `gitlab-runner` to `15.7.2`. Test instance: [x...
[11:03:58] Diffusion, Phabricator: Incorrectly attributed as the author of a random commit I have nothing to do with in a repository I've never touched - https://phabricator.wikimedia.org/T327105 (Lucas_Werkmeister_WMDE) How bizarre, thanks a lot @Dylsss!
[11:40:57] GitLab needs another short restart at around 12:15 UTC
[12:17:12] Beta-Cluster-Infrastructure, ContentTranslation: Special:ContentTranslation raise internal error on beta cluster - https://phabricator.wikimedia.org/T323417 (Nikerabbit) @YedidyaPopper You should create a a new task for your issue and follow the instructions in https://en.wikipedia.org/wiki/Wikipedia:Rep...
[12:46:06] Beta-Cluster-Infrastructure, ContentTranslation: Special:ContentTranslation raise internal error on beta cluster - https://phabricator.wikimedia.org/T323417 (YedidyaPopper) @Nikerabbit The issue resolved itself after waiting a few days, thanks anyway
[12:57:05] Hi, we couldn't deploy mobileapps yesterday because of the codfw network outage/broken CI. Is it OK if we deploy to production now outside of regular deployment window?
[14:05:17] Project beta-scap-sync-world build #86684: FAILURE in 10 min: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86684/
[14:11:27] Project beta-scap-sync-world build #86685: STILL FAILING in 2 min 0 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86685/
[14:16:58] Project beta-scap-sync-world build #86686: STILL FAILING in 1 min 59 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86686/
[14:28:28] Project beta-scap-sync-world build #86687: STILL FAILING in 1 min 46 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86687/
[14:36:37] Project beta-scap-sync-world build #86688: STILL FAILING in 1 min 31 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86688/
[14:46:17] Project beta-scap-sync-world build #86689: STILL FAILING in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86689/
[14:57:42] Project beta-scap-sync-world build #86690: STILL FAILING in 1 min 37 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86690/
[15:07:02] Project beta-scap-sync-world build #86691: STILL FAILING in 1 min 25 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86691/
[15:08:55] GitLab (Infrastructure), serviceops-collab, Patch-For-Review: Automate GitLab version upgrade process - https://phabricator.wikimedia.org/T323569 (Jelto) The new cookbook was used in the last two updates in T327230 and T326815. The cookbook worked good. I added some docs under https://wikitech.wikime...
[15:10:06] GitLab (Infrastructure), serviceops-collab, Patch-For-Review: Automate GitLab version upgrade process - https://phabricator.wikimedia.org/T323569 (Jelto)
[15:17:25] Project beta-scap-sync-world build #86692: STILL FAILING in 1 min 49 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86692/
[15:25:46] > deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud: rsync: write failed on "/srv/mediawiki/php-master/cache/l10n/upstream/l10n_cache-zu.cdb.json": No space left on device
[15:25:48] bah
[15:28:21] thcripriani: I was just looking into that, dir `php-master/cache/l10n/upstream` on the jobrunner is only ~9G in size
[15:28:33] Project beta-scap-sync-world build #86693: STILL FAILING in 1 min 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86693/
[15:28:40] oops, that was all mangled
[15:29:14] I meant `/srv/mediawiki-staging/php-master` is 15G and the garget partition is just ~9G
[15:30:09] *target
[15:30:41] Beta-Cluster-Infrastructure, Release-Engineering-Team (Doing), Patch-For-Review: Replace deployment-imagescaler03 (stretch) with deployment-imagescaler04 (buster) - https://phabricator.wikimedia.org/T294148 (Andrew) This VM is shut off. We'll see what comes.
[15:30:46] thcipriani: I can't type today
[15:30:54] ^^^
[15:32:33] we seem to have a lot more extensions in the beta deployment server that seem to take most of the space, I don't know how this used to look but it's definitely way bigger than what we have in i.e. the prod deployment server
[15:33:09] yeah, we have a ton more extensions in beta
[15:33:13] let's see
[15:36:44] Project beta-scap-sync-world build #86694: STILL FAILING in 1 min 41 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86694/
[15:42:23] !log deployment-jobrunner04: sudo rm -rf /srv/mediawiki/php-master/cache/l10n/upstream/.~tmp~
[15:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:43:09] 15GB is pretty enormous
[15:43:25] Probably a lot of that in .git directories which are not rsync'd
[15:45:46] php-master/extensions/.git is 6.5GB.
[15:46:02] Joy.
[15:46:40] I think that .git dir covers all of the submodules.
[15:46:55] s/think/know/
[15:48:08] Project beta-scap-sync-world build #86695: STILL FAILING in 1 min 54 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86695/
[15:54:07] It's always sad to run into space issues and then to find 16GB of idle space in the root filesystem.
[15:54:24] yeah :(
[15:56:46] Project beta-scap-sync-world build #86696: STILL FAILING in 1 min 46 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86696/
[15:57:17] We could change the clone to be shallow and discard old git data?
[15:57:35] James_F: That wouldn't fix this problem (which doesn't involve the .git dir)
[15:57:35] It's not like we use git to actually do anything.
[15:57:56] Ah, never mind then.
[15:57:57] .git is big on the deploy server, but there's no space problem there.
[15:58:06] with delayed updates we're storing 4 copies of the l10n cache during sync
[15:58:16] Oh, this is on the jobrunner? Meh.
[15:58:28] I was hoping some old data was hanging out in .~tmp~, but I guess not :(
[15:58:51] er, well, 2 copies of the 2 copies in .~tmp~ that is
[15:58:57] Ack.
[16:01:19] yeah, doesn't look like any space-saving tricks will save this one :\
[16:08:06] Project beta-scap-sync-world build #86697: STILL FAILING in 1 min 33 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86697/
[16:19:08] Crazy idea: mv /srv/mediawiki-staging/php-master /somewhere/on/root/filesystem, then symlink
[16:19:33] of course the rsync will undo that.
[16:19:41] disregard. :-)
[16:20:55] Will rsync undo it if it's a hard link?
[16:21:04] Can't hard link a directory
[16:21:07] Ah, yeah.
[16:21:09] Nevermind.
[16:21:21] ooh.. but a bind mount..
[16:21:44] bind mount a root filesystem directory into /srv/mediawiki-staging/php-master.. that would be unaffected by rsync
[16:25:05] Project beta-scap-sync-world build #86698: STILL FAILING in 10 min: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86698/
[16:30:34] Release-Engineering-Team: Fix up libs/metrics-platform Gerrit permissions - https://phabricator.wikimedia.org/T327301 (phuedx)
[16:31:19] Project beta-scap-sync-world build #86699: STILL FAILING in 2 min 4 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86699/
[16:34:15] Release-Engineering-Team: Fix up libs/metrics-platform Gerrit permissions - https://phabricator.wikimedia.org/T327301 (phuedx)
[16:36:53] Project beta-scap-sync-world build #86700: STILL FAILING in 1 min 55 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86700/
[16:37:49] I'm starting a copy to test that idea. That said it would only be a temporary workaround.
[16:43:34] copy completed, bind mount created.
[16:49:17] Yippee, build fixed!
[16:49:17] Project beta-scap-sync-world build #86701: FIXED in 3 min 10 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86701/
[16:49:22] and beta-scap-sync-world is happy now.
[16:49:40] So, this was all just hacking that will be undone when the system reboots.
[16:50:10] thcipriani, James_F: ^^
[16:50:25] Neat.
[16:50:28] Now to delete a hundred notifications
[16:50:33] Ha.
[16:50:47] Should we make this 'hack' permanent, or is there a better solution?
[16:51:14] The best solution is a properly provisioned VM
[16:51:45] Hmm, yes.
[16:51:50] but.... I think an edit to /etc/fstab will keep it alive across reboots.
[16:51:53] so I'll do that.
[16:52:07] +1 /etc/fstab
[16:52:18] but, also +1 rebuilding the jobrunner
[16:52:30] +1
[16:52:32] which, in theory, is well-managed by puppet :D
[16:52:37] I mean, how hard could it be? ;-)
[16:52:40] har.
[16:52:46] bind mount: clever.
[16:52:47] beta cluster is filled with dragons
[16:53:12] and, yeah, I wouldn't have thought of a bind mount
[16:54:29] I modified /etc/fstab. I wouldn't mind another set of eyes.. .and we'll need to reboot the host to verify
[16:54:50] thcipriani: imagine what an awesome dragon army one would have if they spent the past couple of years taming those dragons! just sign up here: T215217 :D
[16:54:51] T215217: deployment-prep: Code stewardship request - https://phabricator.wikimedia.org/T215217
[16:55:03] hehe
[16:55:13] Imagine Dragons indeed
[16:55:41] very sneaky taavi, well played :)
[16:59:48] Release-Engineering-Team: Fix up libs/metrics-platform Gerrit permissions - https://phabricator.wikimedia.org/T327301 (thcipriani) Hrm, so I see the `mediawiki-libs-metrics-platform` group probably has the rights you seek. Currently only @Jdlrobson and @Mholloway which seems out-of-date. We could either re...
[17:01:11] Release-Engineering-Team: Fix up libs/metrics-platform Gerrit permissions - https://phabricator.wikimedia.org/T327301 (taavi) To me the proper fix would be to fix the CI so that a manual +2 and submit is not a part of the "normal" workflow.
[17:15:43] Release-Engineering-Team: Fix up libs/metrics-platform Gerrit permissions - https://phabricator.wikimedia.org/T327301 (hashar) The fault lies in the CI workflow (Zuul): The Gerrit project is configured with: ` lang=yaml - name: mediawiki/libs/metrics-platform test: - trigger-metrics-platform-pi...
[17:18:34] Continuous-Integration-Config, Release-Engineering-Team: Fix up libs/metrics-platform Gerrit permissions - https://phabricator.wikimedia.org/T327301 (hashar)
[17:19:20] Continuous-Integration-Config, Release-Engineering-Team: Fix up libs/metrics-platform Gerrit permissions - https://phabricator.wikimedia.org/T327301 (phuedx) >>! In T327301#8535735, @thcipriani wrote: > We could either revert to having this owned by the `MediaWiki` group (all MediaWiki +2ers can verify)...
[17:19:36] Continuous-Integration-Config, Release-Engineering-Team: mediawiki/libs/metrics-platform CI jobs files filters have no match for some change: no CI job running - https://phabricator.wikimedia.org/T327301 (hashar)
[17:19:41] phuedx: the fix is probably easy: add some filters ;)
[17:26:50] Continuous-Integration-Config, Release-Engineering-Team: mediawiki/libs/metrics-platform CI jobs files filters have no match for some change: no CI job running - https://phabricator.wikimedia.org/T327301 (hashar) As for Gerrit permissions: submitting a change requires the label {nav Code-Review} and {nav...
[17:51:07] (PS1) Hashar: zuul: mediawiki/libs/metrics-platform trigger for tests [integration/config] - https://gerrit.wikimedia.org/r/881457 (https://phabricator.wikimedia.org/T327301)
[17:53:42] (CR) Hashar: [C: +2] "I have talked about it with Phuedx. Once deployed we can `recheck` the faulty patch." [integration/config] - https://gerrit.wikimedia.org/r/881457 (https://phabricator.wikimedia.org/T327301) (owner: Hashar)
[17:54:47] (Merged) jenkins-bot: zuul: mediawiki/libs/metrics-platform trigger for tests [integration/config] - https://gerrit.wikimedia.org/r/881457 (https://phabricator.wikimedia.org/T327301) (owner: Hashar)
[17:55:15] !log Reloaded Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/881457 | T327301
[17:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[17:55:18] T327301: mediawiki/libs/metrics-platform CI jobs files filters have no match for some change: no CI job running - https://phabricator.wikimedia.org/T327301
[17:58:13] Continuous-Integration-Config, Release-Engineering-Team, Patch-For-Review: mediawiki/libs/metrics-platform CI jobs files filters have no match for some change: no CI job running - https://phabricator.wikimedia.org/T327301 (hashar) With CI adjusted, a `recheck` on the change now triggers the `js` and...
[18:06:09] looks like deployment-jobrunner04 has a too small `/srv`
[18:06:24] it is 9.3G and full which earlier today caused a scap deployment failure
[18:08:50] most of that is the l10n cache
[18:38:38] (PS1) Subramanya Sastry: Add more explanatory comments to Parsoid custom CSS [integration/visualdiff] - https://gerrit.wikimedia.org/r/881459
[18:49:50] GitLab (Infrastructure), Toolforge, cloud-services-team, serviceops-collab, Kubernetes: gitlab: enable agent server for kubernetes (KAS) - https://phabricator.wikimedia.org/T320483 (fnegri)
[18:50:19] GitLab (Administration, Settings & Policy), Release-Engineering-Team (Priority Backlog 📥), cloud-services-team: gitlab: consider enabling docker container registry - https://phabricator.wikimedia.org/T304845 (fnegri)
[18:52:04] Phabricator, Cloud-Services, cloud-services-team: Organize additional projects as #cloud-services subprojects - https://phabricator.wikimedia.org/T177787 (fnegri)
[18:53:54] Release-Engineering-Team (Priority Backlog 📥), cloud-services-team, wikitech.wikimedia.org, LDAP: Request rename of "Alangi derick" to "Alangi Derick" on wikitech/LDAP/Gerrit - https://phabricator.wikimedia.org/T171417 (fnegri)
[18:57:22] hashar I filed https://phabricator.wikimedia.org/T327329
[19:01:32] Release-Engineering-Team (Seen), cloud-services-team, Release Pipeline (Blubber): Intermittent package download failures during jenkins/pipeline tests - https://phabricator.wikimedia.org/T291135 (fnegri)
[19:09:40] Continuous-Integration-Config, Toolforge Build Service, cloud-services-team, Cloud-Services-Origin-Team, and 2 others: [tbs] Set up CI for cloud/toolforge/buildpacks repository - https://phabricator.wikimedia.org/T265685 (fnegri)
[19:20:32] Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud inadequate storage resources - https://phabricator.wikimedia.org/T327329 (dancy)
[19:37:25] Continuous-Integration-Infrastructure, Cloud-VPS, Wikidata, cloud-services-team, and 3 others: Wikibase selenium tests timeout, seemingly due to "memory compaction" events on CI VMs - https://phabricator.wikimedia.org/T281122 (fnegri)
[20:24:55] I'm trying to find when and why closed wikis aren't part of group0 anymore. E.g. aawiki I always thought of as closed and group0, but it's actually in group2.
[20:25:23] I would have thought group0 is defined by an expression that includes closed.dblist or something, but it seems to be a manual list with group1 and group2 defined relative to it.
[20:25:49] https://wikitech.wikimedia.org/wiki/Deployments/Train#Groups still says it, but I wrote that probably so maybe it was never true?
[20:43:13] for what it's worth (which is very little), this hasn't really been part of my mental model, but i also don't recall it really coming up.
[20:43:32] looking at the git logs it seems like aawiki was only group0 for like 6 hours back in 2017. But the remainder of closed wikis are part of group0 (when they were manually added in 2017).
[20:44:05] not sure they've ever been derived from the closed dblist, tho
[21:23:21] Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud inadequate storage resources - https://phabricator.wikimedia.org/T327329 (hashar) The instance flavor is `g3.cores4.ram8.disk20` so only 20G which is allocated to `/`. The 10G /dev/sdb c...
[21:23:54] Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud inadequate storage resources - https://phabricator.wikimedia.org/T327329 (hashar) Side question, /srv/mediawiki/php-master/cache/l10n is almost 5G: * 2.4G for the `l10n_cache-XXX.cdb` file...
[21:24:00] dancy: thanks for T327329 filing :]
[21:24:01] T327329: deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud inadequate storage resources - https://phabricator.wikimedia.org/T327329
[21:25:13] ooh, so we have a resize opportunity. I was wondering about that.
[21:27:14] I'll give it a try.
[21:31:07] `extend volume:Compute service failed to extend volume.` :-(
[21:31:58] hmm. I think that just means it somehow failed to make the VM notice the change in volume size
[21:33:22] Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud inadequate storage resources - https://phabricator.wikimedia.org/T327329 (dancy) I extended the size of the volume to 25GB but the change did not register in the VM. The Horizon UI says `...
[21:33:48] !log Rebooting deployment-jobrunner04 for T327329
[21:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:33:51] T327329: deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud inadequate storage resources - https://phabricator.wikimedia.org/T327329
[21:36:23] Project beta-scap-sync-world build #86730: FAILURE in 1 min 28 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86730/
[21:36:59] ^ that's me
[21:37:17] yeah probably need the instance to be rebooted then unmount /srv and manually resize it (ext2fsresize or something?)
[21:37:34] Post reboot it still sees /dev/sdb as 10GB
[21:37:37] so trying something else now
[21:39:13] maybe the volume needs to be detached first
[21:40:03] kudos :]
[21:40:05] I shut down, detached, reattached, and restarted.. Now the new volume size is seen. Progress
[21:40:18] now resizing the filesystem
[21:40:39] and undoing prior hacks
[21:42:19] looks on track; I am going to bed :]
[21:42:27] Goodnight! Thanks for the hint!
[21:42:35] :unicor
[21:42:41] bah my emojis are broken
[21:43:10] happy day and nights &
[21:47:16] Yippee, build fixed!
[21:47:17] Project beta-scap-sync-world build #86731: FIXED in 1 min 14 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/86731/
[21:49:00] (PS1) Legoktm: Add Izno to the CI allowlist [integration/config] - https://gerrit.wikimedia.org/r/881481
[21:53:40] Beta-Cluster-Infrastructure, Release-Engineering-Team: deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud inadequate storage resources - https://phabricator.wikimedia.org/T327329 (dancy) Open→Resolved a:dancy Rebooting didn't work. Ultimately I had to shut down the VM, detach the...