[00:10:20] (PS1) Brian Wolff: Make composer-php80 run on gate-and-submit for MW core [integration/config] - https://gerrit.wikimedia.org/r/816062 (https://phabricator.wikimedia.org/T300463)
[00:12:59] (CR) CI reject: [V: -1] Make composer-php80 run on gate-and-submit for MW core [integration/config] - https://gerrit.wikimedia.org/r/816062 (https://phabricator.wikimedia.org/T300463) (owner: Brian Wolff)
[00:16:02] (CR) Reedy: Make composer-php80 run on gate-and-submit for MW core (2 comments) [integration/config] - https://gerrit.wikimedia.org/r/816062 (https://phabricator.wikimedia.org/T300463) (owner: Brian Wolff)
[00:17:45] (PS2) Brian Wolff: Make composer-php80 run on gate-and-submit for MW core [integration/config] - https://gerrit.wikimedia.org/r/816062 (https://phabricator.wikimedia.org/T300463)
[00:35:07] (PS3) Brian Wolff: Make composer-php80 run on gate-and-submit for MW core [integration/config] - https://gerrit.wikimedia.org/r/816062 (https://phabricator.wikimedia.org/T300463)
[01:38:16] Continuous-Integration-Config, PHP 8.0 support, Patch-For-Review: Make PHP 8.0 voting on MW master - https://phabricator.wikimedia.org/T300463 (Bawolff) Well looks like it does not pass on 1.35 yet.
[08:38:42] (CR) Jaime Nuche: [C: +2] deploy-promote: Terminate line after jenkins has merged the patch [tools/scap] - https://gerrit.wikimedia.org/r/816015 (owner: Ahmon Dancy)
[08:43:09] (Merged) jenkins-bot: deploy-promote: Terminate line after jenkins has merged the patch [tools/scap] - https://gerrit.wikimedia.org/r/816015 (owner: Ahmon Dancy)
[08:56:55] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, MediaWiki-SettingsBuilder, ci-test-error: beta-update-databases-eqiad failing due to invalid MediaWiki configuration parameters - https://phabricator.wikimedia.org/T313128 (daniel) >>! In T313128#8096152, @RhinosF1 wrote: > We sp...
[08:58:55] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, MediaWiki-SettingsBuilder, ci-test-error: beta-update-databases-eqiad failing due to invalid MediaWiki configuration parameters - https://phabricator.wikimedia.org/T313128 (RhinosF1) There is a copy of the code somewhere that can...
[09:06:13] (PS1) Jaime Nuche: deploy-promote: abort process if version check fails [tools/scap] - https://gerrit.wikimedia.org/r/816113
[09:23:13] (PS1) Hashar: POST events asynchronously [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115
[09:23:46] Deployments, Release-Engineering-Team (Doing), SRE, bacula, Parsoid (Tracking): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (jcrespo) @elukey We didn't receive any bad reports so far, should we be good to close this task or...
[09:32:53] Deployments, Release-Engineering-Team (Doing), SRE, bacula, Parsoid (Tracking): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (RhinosF1) T309162 is still actionable from the incident.
[10:02:17] Release-Engineering-Team (The Decommission Mission 💀), SRE, SRE-Access-Requests, Patch-For-Review: Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (Vgutierrez) p:Triage→Medium
[12:11:07] GitLab (CI & Job Runners), serviceops, serviceops-collab, Patch-For-Review: DNS/networking not working on Trusted Runners - https://phabricator.wikimedia.org/T311241 (Jelto) p:High→Medium >>! In T311241#8091812, @dduvall wrote: > > The primary reason for the custom docker network is to h...
[12:34:36] GitLab (Project Migration), Release-Engineering-Team: Create new GitLab project group: Community Resources Team - https://phabricator.wikimedia.org/T313593 (Osnard)
[12:36:32] Phabricator (Upstream), Release-Engineering-Team, Upstream, User-brennen: Uploaded files via the drag-and-drop are defaulting to private-access - https://phabricator.wikimedia.org/T310833 (Esanders) This also happens when editing comments.
[13:48:29] (PS1) Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816172
[13:48:44] (PS2) Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816172
[14:07:38] (PS3) Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816172
[14:40:13] (CR) Ahmon Dancy: [C: +2] deploy-promote: abort process if version check fails [tools/scap] - https://gerrit.wikimedia.org/r/816113 (owner: Jaime Nuche)
[14:46:59] (Merged) jenkins-bot: deploy-promote: abort process if version check fails [tools/scap] - https://gerrit.wikimedia.org/r/816113 (owner: Jaime Nuche)
[14:51:28] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, MediaWiki-SettingsBuilder, ci-test-error: beta-update-databases-eqiad failing due to invalid MediaWiki configuration parameters - https://phabricator.wikimedia.org/T313128 (RhinosF1) > 16:23:04 RhinosF1: James_F: we can...
[15:14:54] Phabricator (Upstream), Release-Engineering-Team, Upstream, User-brennen: Uploaded files via the drag-and-drop are defaulting to private-access - https://phabricator.wikimedia.org/T310833 (DLynch) Granted, my understanding is that "automatically enabling access to files that're in edited-content"...
[15:26:15] Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, MediaWiki-SettingsBuilder, ci-test-error: beta-update-databases-eqiad failing due to invalid MediaWiki configuration parameters - https://phabricator.wikimedia.org/T313128 (hashar) Validating configuration remembered me of MediaW...
[15:50:53] (CR) Jforrester: "I don't think it's acceptable for us to have divergent PHP support criteria for vendor and composer jobs for the master branch. Otherwise " [integration/config] - https://gerrit.wikimedia.org/r/816062 (https://phabricator.wikimedia.org/T300463) (owner: Brian Wolff)
[19:38:51] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), serviceops, serviceops-collab: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 (Dzahn) > (must. save. @mmodell's bash history.) I made a phab1001-home-twentyafterfour.tar.gz so the entire home and...
[19:42:25] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), serviceops, serviceops-collab: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 (Dzahn) a:Dzahn
[19:45:25] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), serviceops, serviceops-collab: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 (Dzahn) from syncing data last time back in 2019 https://gerrit.wikimedia.org/r/c/operations/puppet/+/554628
[19:48:35] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), User-brennen: Deploy Phabricator with scap - https://phabricator.wikimedia.org/T313259 (brennen) `scap deploy -v -l 'phab2001.codfw.wmnet'` fails from deploy1002 - ` ... Received disconnect from 10.192.32.147 port 22:2: Too many aut...
[20:11:26] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), User-brennen: Deploy Phabricator with scap - https://phabricator.wikimedia.org/T313259 (Dzahn) 10.64.32.28 is deploy1002 in the logs on phab2001, looking for connections from deploy1002: ` Jul 19 14:23:02 phab2001 sshd[13278]: Con...
[20:15:20] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), User-brennen: Deploy Phabricator with scap - https://phabricator.wikimedia.org/T313259 (Dzahn) root@deploy1002:/home/dzahn# ssh -i /etc/keyholder.d/phabricator scap@phab2001.codfw.wmnet Jul 22 20:12:39 phab2001 sshd[27629]: Failed...
[20:20:59] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), User-brennen: Deploy Phabricator with scap - https://phabricator.wikimedia.org/T313259 (Dzahn) For the scap user it would be: `ssh -i /etc/keyholder.d/scap scap@phab2001.codfw.wmnet`. scap key for scap user. but that one has: Load...
[20:30:32] brennen: SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/phabricator phab-deploy@phab2001.codfw.wmnet
[20:30:37] this is how it _should_ work
[20:30:53] it says so at https://wikitech.wikimedia.org/wiki/Keyholder#Hints
[20:31:01] and the user is "phab-deploy"
[20:31:19] I only get an 'sign_and_send_pubkey: signing failed: agent refused operation'
[20:31:33] but that only happens when the rest is correct,heh
[20:32:55] [deploy1002:~] $ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/phabricator phab-deploy@phab2001.codfw.wmnet
[20:32:57] Linux phab2001 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64
[20:33:01] ^ works
[20:34:52] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), User-brennen: Deploy Phabricator with scap - https://phabricator.wikimedia.org/T313259 (Dzahn) This is how it actually works, using the AUTH_SOCK from keyholder, and using the correct "phab-deploy" user
and not trying it as root: `...
[20:38:23] Phabricator, Release-Engineering-Team (The Decommission Mission 💀), User-brennen: Deploy Phabricator with scap - https://phabricator.wikimedia.org/T313259 (Dzahn) @brennen Seems to me the issue is it's trying to connect as "scap" but it should use "phab-deploy" user. Then it should work together with...
[20:39:12] hmm, yeah, wrong user would make sense, i think - though i don't know why it's not using the one in the config file...
[20:43:10] which one, /etc/scap.cfg ?
[20:44:04] /srv/deployment/phabricator/deployment/scap/scap.cfg
[20:44:06] the one from the repo
[20:44:47] i see. yea, that has phab-deploy
[20:47:09] "Uses local .scaprc as config for each host in cluster
[20:47:24] but that is just general scap help text
[20:53:40] brennen: mutante: maybe scap log has some more details?
[20:54:15] there is something funky which will cause ssh to try every single keys instead of the one for the user
[20:54:28] so it tries each of the keys in the keyholder one after the others
[20:54:38] until the remote sshd bails out cause there was too many auth failures
[20:55:23] eg https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.7.0-5-2022.29?id=bDinJ4IB86RsLKL31MDN
[20:55:55] ran as phab-deploy@phab2001.codfw.wmnet (correct user?)
[20:56:08] then it lists a long list of keys
[20:56:38] it tries the 6 first then the remote bails out
[20:57:23] hmm: 20:44:44 Unable to find keyholder key for phab_deploy
[20:57:36] ...is it converting phab-deploy to phab_deploy or something?
[20:57:50] yeah scap has a `get_keyholder_key()` which iirc is being passed the user (so would be phab-deploy)
[20:57:57] I also noticed earlier in one place it was underscore and in the other it was -
[20:57:57] it iterates through the key comment names
[20:58:03] i think
[20:58:18] i just saw something about underscores
[20:58:19] we had the issue with trainbranchbot
[20:58:22] * brennen digs through open tabs
[20:58:42] is that key new? - 2048 SHA256:QpALwrv9ZQnSiC42TDpwfHSHuMxqNgxDv1M7MOP1I30 /etc/keyholder.d/phabricator (RSA)
[20:58:55] or is that cause you are using a new username?
[20:59:19] i haven't changed either
[20:59:26] key is not new
[21:01:12] /etc/ssh/userkeys/phab-deploy is also same on phab1001 and phab2001
[21:02:11] and it also has the same checksum as /etc/keyholder.d/phabricator.pub on deploy1002
[21:02:22] maybe the scap.cfg needs the name `keyholder_key: phabricator`
[21:02:55] reading scap code it looks like it checks for the existence of `/etc/keyholder.d/{self.config["ssh_user"]}`
[21:03:05] which would be `/etc/keyholder.d/phab-deploy`
[21:03:22] which does not exist
[21:03:24] yea, this sounds like a good guess
[21:03:37] phabricator vs phab-deploy
[21:03:40] so maybe on the deployment server manually amend /srv/deployment/phabricator/deployment/scap/scap.cfg
[21:03:41] and add
[21:03:45] keyholder_key: phab-deploy
[21:03:53] ERROR
[21:04:00] `keyholder_key: phabricator`
[21:04:08] * brennen tries that
[21:04:10] $ ls -la /etc/keyholder.d/phabricator
[21:04:10] -r--r----- 1 root keyholder 1766 Nov 30 2020 /etc/keyholder.d/phabricator
[21:04:19] I don't know why it would have broken
[21:04:31] maybe due to some codechange done recently in scap
[21:05:04] seems like it might be working
[21:05:14] I might be the one to blame
[21:05:32] cause I know close to nothing about scap code and if I know about that get_keyholder_key method it must be that I have altered it recently
[21:05:50] that did it - thanks hasharAway!
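(Editor's note) The lookup discussed above can be sketched roughly as follows. This is a minimal illustration, not scap's actual code: the function name `resolve_keyholder_key`, the exception class, and the injectable `exists` check are all hypothetical. It assumes only what the conversation states: an explicit `keyholder_key` config value is preferred, otherwise the key file is named after `ssh_user`, and keys live under `/etc/keyholder.d/`. The sketch also fails fast when no key file is found, rather than letting ssh try every key in the agent.

```python
import os
from typing import Callable, Mapping


class KeyholderKeyNotFound(Exception):
    """Hypothetical error: no key file matches the configuration."""


def resolve_keyholder_key(
    config: Mapping[str, str],
    key_dir: str = "/etc/keyholder.d",
    exists: Callable[[str], bool] = os.path.exists,
) -> str:
    """Return the path of the key to use, or raise if none can be found.

    Assumption (from the chat, not from scap's source): an explicit
    keyholder_key wins, otherwise the key is named after ssh_user.
    """
    name = config.get("keyholder_key") or config.get("ssh_user")
    if not name:
        raise KeyholderKeyNotFound("neither keyholder_key nor ssh_user is set")

    path = os.path.join(key_dir, name)
    if not exists(path):
        # Fail fast instead of falling back to every key in the agent,
        # which ends in "Too many authentication failures" on the remote.
        raise KeyholderKeyNotFound(f"no key file at {path}")
    return path
```

With `ssh_user: phab-deploy` alone this looks for `/etc/keyholder.d/phab-deploy`, which does not exist on the deployment host; adding `keyholder_key: phabricator` points it at the existing `/etc/keyholder.d/phabricator` file.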
[21:05:53] well
[21:05:57] great :]
[21:05:59] :) nice win for Friday afternoon/night
[21:06:08] Using key: /etc/keyholder.d/phabricator
[21:06:10] from the scap log
[21:06:54] while previously we had:
[21:06:57] `Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'phabricator/deployment', '-g', 'default', 'fetch', '--refresh-config']`
[21:07:03] `Unable to find keyholder key for phab_deploy`
[21:07:16] `['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'phabricator/deployment', '-g', 'default', 'fetch', '--refresh-config'] (ran as phab-deploy@phab2001.codfw.wmnet) `
[21:07:49] if we can't find a keyholder key using the ssh_user or the keyholder_key config value if it is set
[21:07:56] then I think scap should abort entirely
[21:08:09] else it tries to do every single keys from the keyholder ( see above logstash link)
[21:08:17] right, and then just fails on too many auth attempts
[21:08:23] and fails unless you deploy with one of the first 6 keys
[21:08:30] my guess is that at one time the fallback might have worked because there weren't very many keys to try
[21:08:31] which sounds like it can be filed as a task
[21:08:34] yea, that explains "too many authentication failures"
[21:08:48] I am sure I have encountered the same issue with jnuche a few weeks ago
[21:08:51] i can file a task
[21:08:57] there is a max number
[21:09:02] cause at 11pm there is no way I can figure that out of thin air
[21:09:20] I bet $7 or a drink that the faulty code would blame me :]
[21:09:25] haha
[21:09:44] for the task you can copy the few lines I have pasted above
[21:09:57] what's that line - "debugging is like solving a mystery in which you are simultaneously the detective, the murderer, and the victim"
[21:10:03] and the ssh debug log showing up the list of keys attempted (that is the message in https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.7.0-5-2022.29?id=bDinJ4IB86RsLKL31MDN )
[21:10:30] fun 
thing
[21:10:42] when I had regular 1/1 hacking sessions with thcipriani
[21:10:59] we often resorted to google search to figure out about a cryptic faults we encountered during the error
[21:11:07] only to find out the first hit is a phabricator task filed a few years ago
[21:11:17] with PAGES of debugging about it often authored by one of us
[21:11:25] fun
[21:11:39] cause years later we encountered the exact same issue and were about to do the whole debugging step
[21:11:57] but were thanksful to have extensively captured the debugging sessions and the solution founds a few years back
[21:12:00] time saver! :]
[21:12:27] what I am wondering is whether people in a century will still resort on those tales and lore to fix up the future infra
[21:12:49] or maybe by that time the singularity AI will spurt the non sense we have been writing since January 1st 1970
[21:13:06] * hasharAway shuts up
[21:15:02] the other day I searched for something like "NodeSet syntax wildcard" and the result was a page where v.olans is talking to upstream clustershell project what syntax we could use for host selection in cumin
[21:16:39] it has a few bugs iirc
[21:17:43] brennen: so phab can be deployed again isn't it?
[21:19:31] hasharAway: i am unblocked in getting phab deploy to work with scap
[21:19:38] \o/
[21:20:04] jnuche mentioned moving Phabricator to a docker image and potentially toward k8s
[21:20:22] it is probably a good thing to do, then that is unrelated to the above or current sprint :-]
[21:22:40] Release-Engineering-Team, Scap, User-brennen: scap should fail if it can't find a keyholder key using ssh_user or keyholder_key values - https://phabricator.wikimedia.org/T313624 (brennen)
[21:22:44] if we ever do that then we should do that with phorge.it
[21:22:56] Release-Engineering-Team, Scap, User-brennen: scap should fail if it can't find a keyholder key using ssh_user or keyholder_key values - https://phabricator.wikimedia.org/T313624 (brennen) p:Triage→Low
[21:27:13] ideally we would want to invest some engineering time to assist phorge.it
[21:27:23] or get involved in the community effort
[21:27:56] anyway I have E_TOO_MANY_IDEAS
[21:30:02] we already have our own patches that upstream phab doesnt have. so we need to get those into phorge then
[21:30:14] but the benefit is we don't have wmf-form
[21:30:16] fork
[21:32:04] yeah forking has a price
[21:32:18] I am super happy to be able to deploy Gerrit straight from the upstream release
[21:32:34] and would love to achieve that for plugins as well
[21:44:50] GitLab (Project Migration), Release-Engineering-Team (The Decommission Mission 💀), Striker, Tools: Figure out workflow for programatically adding GitLab users - https://phabricator.wikimedia.org/T313366 (demon) a:demon
[21:48:06] maybe its time to review our custom patches and see if there is any that we could potentially ditch