[00:13:49] 10Diffusion: Diffusion not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Yaron_Koren) [00:27:09] 10Diffusion, 10Gerrit: Diffusion not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10bd808) My first thought was that {T313250} may have been involved in this, but it looks like that was completed prior to the mirror breaking. The configured origin URL for the MediaWiki cor... [00:27:29] 10Diffusion, 10Gerrit: Diffusion mirros of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10bd808) [00:27:43] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10bd808) [00:56:44] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [00:58:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [00:59:47] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) While https://gerrit.wikimedia.org/r/mediawiki/core" is a 404 in a browser, a "git clone "https://gerrit.wikimedia.org/r/mediawiki/core" or "git clone "https:/... [01:03:57] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) I cloned core from gerrit-replica and git log ends on August 17th. Then I cloned core from gerrit and git log is already at August 23rd. So what failed here... 
[01:21:26] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) p:05Triage→03High gerrit replication between gerrit servers is broken due to: `Caused by: org.apache.sshd.common.SshException: KeyExchange signature verif... [01:48:41] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) I double checked the key is in place, identical between gerrit1001 and gerrit2002. Then I manually became user gerrit2 and connected with ssh from gerrit1001... [01:52:34] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) [01:52:36] 10Release-Engineering-Team, 10Gerrit (Gerrit 3.4): Upgrade Gerrit to 3.4.5 - https://phabricator.wikimedia.org/T315408 (10Dzahn) [01:52:41] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) Gerrit was upgraded on August 17th (T315408). [01:57:34] 10Release-Engineering-Team, 10Gerrit (Gerrit 3.4): Upgrade Gerrit to 3.4.5 - https://phabricator.wikimedia.org/T315408 (10Dzahn) Since this day the replication from gerrit1001 to gerrit2002 has stopped working. T315942 Did the sshd (mina) version change with this? That would explain it if it is indeed: htt... [02:00:13] 10Release-Engineering-Team, 10Gerrit (Gerrit 3.4): Upgrade Gerrit to 3.4.5 - https://phabricator.wikimedia.org/T315408 (10Dzahn) https://groups.google.com/g/repo-discuss/c/sIhjyblUh4A "I added new configuration option to re-enabled deprecated kex algorithms: sshd.enableDeprecatedKexAlgorithms = true" ^ we sh... 
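For reference, the option quoted from the repo-discuss thread above is a Gerrit `gerrit.config` setting (git-config syntax); a minimal sketch, with the `[sshd]` section name inferred from the option's `sshd.` prefix:

```ini
# gerrit.config fragment (sketch): re-enable deprecated key-exchange
# algorithms, per the repo-discuss thread linked above.
[sshd]
	enableDeprecatedKexAlgorithms = true
```

As the rest of the log shows, a Gerrit service restart is needed for this to take effect, since remote plugin/config reloading is disabled in this deployment.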
[02:02:06] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) seems a lot like this: https://groups.google.com/g/repo-discuss/c/sIhjyblUh4A https://www.gerritcodereview.com/3.4.html#jcraft-jsch-client-library-is-disabl... [02:04:38] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) I will try adding the "sshd.enableDeprecatedKexAlgorithms = true" and restarting gerrit tomorrow morning unless someone beats me to it. [02:16:22] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:18:44] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:39:36] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:46:40] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 11 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:46:40] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:49:02] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:01:53] 10Phabricator, 10Release-Engineering-Team: Flapping Phabricator processes monitoring - https://phabricator.wikimedia.org/T315962 (10hashar) [08:02:05] I have filed a task for those flappy alarms [08:08:05] 10Phabricator, 10Release-Engineering-Team: Flapping Phabricator processes monitoring - 
https://phabricator.wikimedia.org/T315962 (10Peachey88)
[08:08:11] 10Phabricator: phd has stopped working a few times on August 14 (and before) - https://phabricator.wikimedia.org/T315184 (10Peachey88)
[08:35:04] (03PS1) 10Jaime Nuche: WIP: small PoC Change-Id: I731fb64b551775122ba29a53f8b651e7d77ee8ce [blubber] - 10https://gerrit.wikimedia.org/r/825708
[08:37:08] hashar: do the phd logs show anything?
[08:40:47] RhinosF1: I haven't looked, nor do I plan to investigate. I filed it so we don't forget about it
[08:40:57] Daniel / Brennen probably know about it already ;)
[08:41:20] hashar: I think brennen saw it
[08:45:26] 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Doing), 10Patch-For-Review: Relocate Jenkins agents root directory to /srv/jenkins - https://phabricator.wikimedia.org/T309698 (10hashar) 05Open→03Resolved @jbond and I have successfully migrated the Puppet compiler agents. The last r...
[09:20:57] Hello releng! Why did https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php73-noselenium-docker/57125/console fail with a lot of `Permission denied` errors, please?
[10:21:19] urbanecm: good question! :-(
[10:21:26] the files are supposedly owned by nobody
[10:21:32] 00:00:01.261 rsync: recv_generator: mkdir "/cache/composer" failed: Permission denied (13)
[10:21:41] I blame docker
[10:22:54] urbanecm: I am pretty sure it is a one-off error
[10:23:14] on the host we do `mkdir -m 2777 -p cache` which creates the `cache` dir
[10:23:37] then the cache artifacts are populated with `docker run --volume "$(pwd)/cache:/cache" ...`
[10:23:42] so it should be writable
[10:28:43] urbanecm: it is a mystery and should not happen.
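The cache setup hashar describes can be sketched like this (the workspace path is illustrative, not the actual Jenkins layout):

```shell
# Sketch of the CI cache setup described above; the workspace path is
# illustrative, not the actual Jenkins layout.
workspace=$(mktemp -d)
cd "$workspace"

# On the host: create the cache dir world-writable with the setgid bit set.
mkdir -m 2777 -p cache

# It is then bind-mounted into the build container, e.g.:
#   docker run --volume "$(pwd)/cache:/cache" ...
# With mode 2777, any UID inside the container (even an unprivileged one
# such as `nobody`) should be able to mkdir /cache/composer.
stat -c '%a' cache
```

Since mode 2777 should make the directory writable from any container UID, the `Permission denied (13)` seen in the job console really is surprising, which is why it was written off as a one-off.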
I am assuming it was a one-off error
[10:28:59] okay, thanks hashar
[10:29:17] there is no reason for ` mkdir "/cache/composer" failed: Permission denied (13)`
[10:51:05] 10Release-Engineering-Team, 10Fresh, 10MediaWiki-Core-Tests, 10Browser-Tests, and 2 others: Fresh problem when running Selenium tests - https://phabricator.wikimedia.org/T313899 (10zeljkofilipin) 05Open→03In progress a:03zeljkofilipin
[11:10:55] Amir1: Krinkle: I am going to add php-excimer to the quibble CI image ( https://gerrit.wikimedia.org/r/c/integration/config/+/748312 )
[11:11:03] given it is not enabled by default and is rather small
[11:11:10] not sure why I never processed it :D
[11:15:43] (03PS2) 10Hashar: dockerfiles: Add php-excimer to quibble [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup)
[11:16:57] (03CR) 10Hashar: [C: 03+2] "I am not sure how I have missed this change but here it is finally. php-excimer is only installed for our PHP flavor.
Debian has a package" [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup) [11:17:36] (03CR) 10CI reject: [V: 04-1] dockerfiles: Add php-excimer to quibble [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup) [11:21:09] (03PS3) 10Hashar: dockerfiles: Add php-excimer to quibble [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup) [11:21:21] (03CR) 10Hashar: [C: 03+2] dockerfiles: Add php-excimer to quibble [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup) [11:22:19] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:23:25] (03Merged) 10jenkins-bot: dockerfiles: Add php-excimer to quibble [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup) [11:24:39] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:54:00] 10Release-Engineering-Team, 10docker-pkg: docker-pkg / docker downloads all versions of parent image upon building - https://phabricator.wikimedia.org/T310458 (10hashar) I have tried again building an image named `releng/quibble-buster-php72` which depends on `releng/quibble-buster` ` [docker-pkg-build] INFO -... 
[11:54:29] !log Manually applied a `docker-pkg` fix on contint2001 to prevent it from downloading unrelated images T310458
[11:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[11:54:31] T310458: docker-pkg / docker downloads all versions of parent image upon building - https://phabricator.wikimedia.org/T310458
[12:13:55] (03CR) 10Hashar: [C: 03+2] "The php74 image has been built and published:" [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup)
[12:14:57] (03PS1) 10Hashar: dockerfiles: fix php-excimer package name for php7.2 [integration/config] - 10https://gerrit.wikimedia.org/r/825733
[12:15:33] (03CR) 10Hashar: [C: 03+2] dockerfiles: fix php-excimer package name for php7.2 [integration/config] - 10https://gerrit.wikimedia.org/r/825733 (owner: 10Hashar)
[12:18:22] (03Merged) 10jenkins-bot: dockerfiles: fix php-excimer package name for php7.2 [integration/config] - 10https://gerrit.wikimedia.org/r/825733 (owner: 10Hashar)
[12:24:01] (03CR) 10Hashar: [C: 03+2] "After fixing the package name with https://gerrit.wikimedia.org/r/c/integration/config/+/825733 :" [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup)
[12:35:17] brennen: the existence of https://github.com/moabualruz/docker-arm-wikimedia-dev-images suggests that current images don't work on M1/M2 Macs. Is this known?
[12:35:18] M1: MediaWiki Userpage - https://phabricator.wikimedia.org/M1
[12:36:10] There's a few tasks about stuff
[12:36:12] Maybe we can switch to general upstream Debian instead of matching prod. We can keep the prod-ish config for CI I suppose, but we already have custom images for docker dev; that might solve it. Not sure exactly what it takes to have it work for arm.
but whatever it takes, I guess it's easier to do without using the prod base image for Debian, given SRE declined making that work for ARM.
[12:45:27] 10Project-Admins: Create project tag for Apple Silicon support - https://phabricator.wikimedia.org/T315424 (10hashar) Looks like we might want something more generic than just Apple M1. What if we go with a tag named `ARM Support` and add hashtag aliases as needed (`arm64`, `apple-m1`, `m1-mac` etc)?
[12:45:44] Krinkle: the (declined) task is https://phabricator.wikimedia.org/T274140
[12:46:09] that was for SRE base images
[12:46:15] for dev images there is https://phabricator.wikimedia.org/T272500
[12:46:41] surely we could get some kind of basic official image
[12:46:57] then there is a long tail of various custom packages we would have to build both for amd64 and arm64
[12:48:12] which custom packages do we require in the dev environment?
[12:48:54] maybe we can inline their source steps in the Dockerfile, which presumably makes it more or less work regardless of architecture and thus removes the need for upstreaming to debian.org and/or adding arm to the wmf debian repo.
[12:48:54] so we could change the base image to an arm64-based one and rebuild the fleet
[12:49:04] docker-pkg supports that via the seed image parameter
[12:49:30] then a bunch of Dockerfiles would need adjustments since some packages would not be available (i.e. php 7.2) or whatever custom things we had
[12:49:47] there will probably be a few glitches here and there whenever using a binary of some sort
[12:49:54] so I suspect we would have to copy-paste the dockerfiles
[12:50:08] something like that
[12:50:17] we can make this php74+ only, which Debian will provide directly
[12:50:28] yes we might
[12:50:58] but I don't think anyone will work on that
[12:54:28] at least I did ask for a tag to be created in Phabricator for M1 Mac / arm https://phabricator.wikimedia.org/T315424
[12:54:29] M1: MediaWiki Userpage - https://phabricator.wikimedia.org/M1
[13:01:43] I'll find out today how well the current images work (if at all) under M1 with emulation
[13:01:44] M1: MediaWiki Userpage - https://phabricator.wikimedia.org/M1
[13:02:02] I know that at least for some releng images they don't work at all on ARM, e.g. chromium
[13:02:55] performance could probably be improved a lot if we didn't write log files to a mounted directory, and if we used mysql instead of sqlite.
[13:03:23] using mysql/mariadb would probably benefit us more with regards to being "prod like" than using the exact php/debian image version
[13:39:20] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [extensions/UnlinkedWikibase] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/825748
[13:39:24] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [extensions/UnlinkedWikibase] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/825748 (owner: 10QChris)
[13:39:28] (03PS1) 10QChris: Import done. Revoke import grants [extensions/UnlinkedWikibase] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/825749
[13:39:32] (03CR) 10QChris: [V: 03+2 C: 03+2] Import done.
Revoke import grants [extensions/UnlinkedWikibase] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/825749 (owner: 10QChris)
[14:50:56] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10dancy) p:05Triage→03High
[14:52:45] qchris: hi, I was wondering if you can execute "gerrit plugin reload replication"
[14:53:03] I can send the command but get "remote plugin administration is disabled"
[14:53:27] wondering if it's just privileges or global
[14:55:20] mutante: It might be that we've turned off reloading in the config. Let me check...
[14:55:37] thank you
[14:56:27] I am trying to fix gerrit replication from 1001 to 2002
[14:56:49] I am hoping it might be fixed after the reload because meanwhile I did add the new host to known_hosts... by manually connecting
[14:57:50] We'd need `plugins.allowRemoteAdmin` to be true. It's unset in our config (as far as I can see) and defaults to false.
[14:57:58] So no plugin reloading for us.
[14:58:00] I also have some vague memory of turning off reloading the replication config
[14:58:16] alright, so I guess then I need to do a gerrit service restart
[14:58:23] yeah :(
[14:58:35] I think if the plugins change on disk gerrit is smart enough to autoreload them
[14:58:46] but manual reloading doesn't work
[14:58:50] ok, I will do it, now I just need to decide whether to do the _other attempt to fix_ as well or not at the same time
[15:00:04] one is "add to known_hosts", the other is "use new config option sshd.enableDeprecatedKexAlgorithms = true"
[15:00:30] my hunch leans towards the latter being the fix
[15:00:33] but it's a hunch
[15:00:53] also, I don't know if mina ssh and openssh share known_hosts(?)
[15:01:04] my hunch is both are needed, heh
[15:01:09] they don't
[15:01:20] mina has its own known_hosts in its home dir
[15:01:28] but I did a manual ssh connection -as- gerrit2
[15:01:31] ah
[15:01:33] which added it there
[15:01:46] that makes sense. I could believe they're both needed
[15:03:07] so I manually added the "DeprecatedKexAlgo" option to the config and puppet is disabled
[15:03:23] (fwiw this is also about an old ticket that we have RSA keys)
[15:03:33] let me just do the service restart
[15:05:06] Krinkle: re: m1/m2 macs, we haven't done anything about this in releng.
[15:07:24] thcipriani: success, replication log is SUPER busy now
[15:07:31] oh good :)
[15:07:38] well done mutante
[15:07:39] now I just need to puppetize the config
[15:07:54] wait.. there is more :(
[15:08:05] there always is
[15:08:20] it's actually like before :/
[15:08:28] after a minute
[15:08:45] it says it starts replication and then it fails
[15:08:46] ah, kicked off replication, but replication is still failing
[15:08:52] it might actually be the MINA upstream bug
[15:09:02] if the mina version changed in the August 17th upgrade
[15:09:12] do you know how to check the version of mina?
[15:09:40] it's probably set in the maven/build file
[15:16:14] (although I'm failing to find it and I'm in a meeting. Also, unzipping the jar would probably work)
[15:16:24] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) Tried that, both the new host has been added to known_hosts and the above config option has been added which was made in reaction to the upstream bug. Unfortu...
[15:16:47] yea, I tried to find it in operations/software/gerrit but haven't yet [15:16:55] trying the jar in a moment [15:17:09] .war [15:19:56] ./WEB-INF/lib/sshd-mina-2.7.0.jar [15:19:57] ./WEB-INF/lib/mina-core-2.0.21.jar [15:20:49] Affects Version/s: [15:20:50] 2.7.0 [15:21:01] Priority: [15:21:01] Major [15:21:07] ^ basically confirmed upstream bug [15:21:15] blames https://issues.apache.org/jira/browse/SSHD-1163 [15:21:44] https://github.com/apache/mina-sshd/pull/195 [15:22:12] serverKey is "rsa-sha2-256" and "rsa-sha2-512" only return "ssh-rsa" , this lead to a error called "KeyExchange signature verification failed for key type= ssh-rsa". [15:23:00] well. that sounds like the symptoms :\ [15:23:27] yes, that's the one [15:23:33] it matches the version and all [15:23:56] we can try those other settings for the KexAlgos [15:26:08] 10Diffusion, 10Gerrit: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) I downloaded the Gerrit 3.4.5 .war file from the download site, unpacked it and could confirm the mina version: ` ./WEB-INF/lib/sshd-mina-2.7.0.jar ./WEB-IN... 
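The jar listing above is how the bundled MINA version can be confirmed: the version is encoded in the jar filenames under `WEB-INF/lib/` inside the release war. A sketch of extracting it from such a listing (the filenames are the ones pasted above; producing the listing with `unzip -l` is an assumption about tooling, any archive lister works):

```shell
# The two jar names pasted above, as produced by e.g.:
#   unzip -l gerrit-3.4.5.war | grep -E 'sshd-mina|mina-core'
libs='WEB-INF/lib/sshd-mina-2.7.0.jar
WEB-INF/lib/mina-core-2.0.21.jar'

# The version is the trailing dotted number in each filename.
echo "$libs" | sed -E 's/.*-([0-9.]+)\.jar$/\1/'
```

This yields 2.7.0 for sshd-mina, which is exactly the range affected by SSHD-1163 below (fixed upstream in 2.8).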
[15:43:41] I tried configuring the kex algos in the ssh client config
[15:43:46] this changed the error message:
[15:43:48] Unable to negotiate key exchange for kex algorithms (client: ecdh-sha2-nistp256,ext-info-c / server: curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256
[16:00:32] (03PS1) 10Ahmon Dancy: Wait until the new commit is merged before exiting [tools/release] - 10https://gerrit.wikimedia.org/r/825828 (https://phabricator.wikimedia.org/T315452)
[16:01:50] (03CR) 10CI reject: [V: 04-1] Wait until the new commit is merged before exiting [tools/release] - 10https://gerrit.wikimedia.org/r/825828 (https://phabricator.wikimedia.org/T315452) (owner: 10Ahmon Dancy)
[16:09:40] (03PS2) 10Ahmon Dancy: Wait until the new commit is merged before exiting [tools/release] - 10https://gerrit.wikimedia.org/r/825828 (https://phabricator.wikimedia.org/T315452)
[16:16:59] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10hashar) The build got triggered by https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/823724/1//C...
[16:18:18] 10Phabricator, 10Release-Engineering-Team: Flapping Phabricator processes monitoring - https://phabricator.wikimedia.org/T315962 (10Dzahn) I think we should prioritize the broken replication issue first.
[16:20:09] 10Release-Engineering-Team, 10Gerrit (Gerrit 3.4): Upgrade Gerrit to 3.4.5 - https://phabricator.wikimedia.org/T315408 (10Dzahn) 05Resolved→03Open I tried a bunch of things at T315942 and restarted gerrit a couple times but it's not solved yet.
[16:21:15] hrm, all I found upstream is https://groups.google.com/g/repo-discuss/c/S0TA0n4icOQ which it says was fixed in gerrit 3.4 (but we're running 3.4.5...)
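The client-side attempt above amounts to a `KexAlgorithms` override in the replication user's ssh config (the path `/var/lib/gerrit2/.ssh/config` appears later in the log). A sketch; the FQDN is an assumption, and the algorithm list is the one from the negotiation error above. Note that MINA's support for OpenSSH client config is partial: later in the log it is seen ignoring `curve25519-sha256@libssh.org` as an unknown algorithm.

```
# /var/lib/gerrit2/.ssh/config (sketch; the host name is an assumption)
Host gerrit2002.wikimedia.org
    # Offer only what the target sshd advertises, per the error above:
    KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256
```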
[16:22:15] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10TheresNoTime) >>! In T315897#8178810, @hashar wrote: > The build got triggered by https://gerrit.wikimedi...
[16:27:01] thcipriani:
[16:27:03] 16:24 < paladox> The issue you linked to links to https://issues.apache.org/jira/plugins/servlet/mobile#issue/SSHD-1163
[16:27:06] 16:24 < paladox> Which has a fix
[16:27:09] 16:25 < paladox> It seems fixed in 2.8
[16:27:11] 16:25 < paladox> Gerrit 3.5 uses 2.7
[16:27:14] 16:25 < mutante> yea, I saw that. ok, mina sshd 2.8
[16:27:16] 16:26 < mutante> does any Gerrit version use 2.8?
[16:27:19] 16:26 < paladox> 3.6
[16:27:21] (03CR) 10Ladsgroup: "Thank you!" [integration/config] - 10https://gerrit.wikimedia.org/r/748312 (https://phabricator.wikimedia.org/T225730) (owner: 10Ladsgroup)
[16:28:21] well. I know hash.ar is planning an upgrade to 3.5 Soon™.
[16:28:28] thcipriani: there is already https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/824200 but maybe we can go further
[16:28:37] 3.6 though.. hmm
[16:28:39] I don't get how this only started happening when we moved servers
[16:28:46] maybe we have to revert and then jump to 3.6
[16:28:48] no versions changed
[16:28:54] it started happening when Gerrit was upgraded
[16:28:55] on Aug 17
[16:28:58] to 3.5.4
[16:29:22] don't think it was actually related to the server switch
[16:29:27] you mean 3.4.5?
[16:29:33] it fits the Aug 17 upgrade [16:29:35] yes, I do [16:29:51] both gerrit1001 and gerrit2002 /srv/deployment were changed that day [16:30:22] There was no package update for sshd/mina tho [16:31:13] we went 3.4.4 → 3.4.5 on Aug 17 https://sal.toolforge.org/log/laoRq4IBa_6PSCT95ciU [16:31:25] we could try a revert if that's when replication stopped [16:32:51] yes, that is when it stopped [16:33:00] at this point I would like to do that and confirm it works again [16:33:05] but it seems surprising. Nothing really suspicious in that upgrade: https://www.gerritcodereview.com/3.4.html#345 [16:33:23] well, I think there is actually something pretty suspicious in it [16:33:30] hashar: ^ any objection to trying a revert of gerrit? [16:33:47] this https://www.gerritcodereview.com/3.4.html#jcraft-jsch-client-library-is-disabled-per-default [16:33:57] using different client lib [16:34:19] Deprecated JCraft JSch client library is replaced with MINA SSHD client library per default. There is still option to switch to using JCraft JSch client library. Support for JCraft JSch will be removed in the next gerrit release. [16:34:53] except that option was supposed to be the "allow deprecated kex algos" [16:34:56] afaict [16:35:02] and it did not fix it yet [16:36:18] was that specific to the 3.4.5 release? Or the 3.4.0 release? We went 3.4.4 → 3.4.5 on the 17th [16:39:17] anyway, I can revert the latest commit on the deploy server and re-deploy to see if the problem fixes itself? [16:39:49] if the timing matches up, it is probably worth trying. [16:41:45] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10hashar) @TheresNoTime : sorry to rephrase, Jenkins find the account disabled, we apparently have disabled... 
[16:41:53] (03PS1) 10Ahmon Dancy: Call tasks.clear_message_blobs after restarting php-fpm [tools/scap] - 10https://gerrit.wikimedia.org/r/825835 (https://phabricator.wikimedia.org/T263872) [16:41:57] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team, 10Upstream: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10hashar) [16:42:39] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team, 10Upstream: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10hashar) [16:42:53] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team, 10Upstream: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10hashar) [16:43:00] 10Release-Engineering-Team (Bonus Level 🕹️), 10Scap, 10MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), 10Patch-For-Review, and 2 others: Localisation cache must be purged after or during train deploy, not (just) before - https://phabricator.wikimedia.org/T263872 (10Krinkle) [16:43:04] 10Release-Engineering-Team (Radar), 10Scap, 10MediaWiki-Internationalization, 10Performance-Team: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 (10Krinkle) [16:43:09] thcipriani: the upgrade ticket did not really tell me what the previous version was [16:43:11] I am too bad at editing tasks [16:43:12] (03CR) 10Krinkle: [C: 03+1] Call tasks.clear_message_blobs after restarting php-fpm [tools/scap] - 10https://gerrit.wikimedia.org/r/825835 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy) [16:43:24] thcipriani: yes I am objecting revert of gerrit [16:43:24] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team, 10Upstream: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - 
https://phabricator.wikimedia.org/T315897 (10hashar) [16:43:39] mutante: thanks for jumping on T315942 so quickly :) [16:43:40] T315942: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 [16:43:48] 10Phabricator, 10Epic: [EPIC] Gather requirements from teams for Phab project management feature requests - https://phabricator.wikimedia.org/T105404 (10Pols12) [16:43:57] 10Phabricator, 10Discovery-Search, 10Elasticsearch: Fix provided search results in Wikimedia Phabricator - https://phabricator.wikimedia.org/T75854 (10Pols12) [16:44:03] 10Phabricator (2016-11-16), 10Upstream: Phabricator project auto-complete is arbitrary (in a bad way) - https://phabricator.wikimedia.org/T99739 (10Pols12) 05Resolved→03Open Currently, there is no easy way to get #translate. “Translate” is an alias, but if you type even “translat”, you don’t get it in sear... [16:44:08] but I would have to look up what ever is going on I guess ;) [16:44:24] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team, 10Upstream: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10TheresNoTime) Ah, thank you for the explanation 😌 [16:45:00] hashar: see https://phabricator.wikimedia.org/T315942#8176568 and fff [16:45:06] 10Phabricator, 10Upstream: Phabricator project auto-complete is arbitrary (in a bad way) - https://phabricator.wikimedia.org/T99739 (10Pols12) [16:45:09] bd808: sure, it seemed High prio [16:45:41] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team, 10Upstream: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10dancy) @TheresNoTime Apologies on behalf of Jenkins for targeting you. :-) [16:46:22] so that would be gerrit replication broken [16:46:23] hashar: what's your objection? 
What's happening: https://phabricator.wikimedia.org/T315942 seems it aligns with the upgrade, but we haven't figured out why, exactly. [16:46:30] 10Phabricator, 10Upstream: Phabricator project auto-complete is arbitrary (in a bad way) - https://phabricator.wikimedia.org/T99739 (10Krinkle) a:05ksmith→03None [16:46:44] I note that 3.4.5 came with a jgit upgrade [16:46:56] yes, it is gerrit replication broken. I pasted the errors, the upstream bug, the supposed fix, what I tried ... [16:46:59] that's the only thing that looks suspect in that [16:48:45] and that would have started immediately after upgrading to 3.4.5? [16:48:50] what was the version we upgraded frm? [16:49:00] it started on the same day, August 17 [16:49:06] matches the git log on replica [16:49:32] we upgraded from 3.4.4 (according to git logs) [16:50:08] looks like 3.4.4 went out 2022-04-28 judging by scap tags [16:50:54] 10Phabricator, 10Epic: [EPIC] Gather requirements from teams for Phab project management feature requests - https://phabricator.wikimedia.org/T105404 (10Krinkle) [16:50:57] 10Phabricator, 10Discovery-Search, 10Elasticsearch: Fix provided search results in Wikimedia Phabricator - https://phabricator.wikimedia.org/T75854 (10Krinkle) [16:51:09] 10Phabricator, 10Upstream: Phabricator project auto-complete is arbitrary (in a bad way) - https://phabricator.wikimedia.org/T99739 (10Krinkle) 05Open→03Resolved a:03ksmith Re-closing. Please file a new bug for new issues. @Pols12 The suggestions I get for `translat` seem quite reasonable. There has to... 
[16:52:02] I am trying to find out what is wrong with the codfw replica
[16:52:14] given the replication to github works at least
[16:54:25] thank you jgit
[16:54:26] whether github works or not depends on which of the attempted fixes is currently applied
[16:54:46] puppet is disabled and I edited both the gerrit config and the sshd client config
[16:54:57] based on which combo, replication to github fails as well
[16:55:12] sshd_config options are set on both source and target
[16:55:22] client config only on the source
[16:56:01] if you want to try it yourself then I can revert the changes
[16:56:59] I think I fixed github replication
[16:57:16] Caused by: org.apache.sshd.common.SshException: KeyExchange signature verification failed for key type=ssh-rsa
[16:57:33] could you see the updates I pasted?
[16:58:02] so that would be the rsa key pair being rejected either by jgit on the client side or by the ssh daemon on the remote side
[16:58:04] I already spent a couple hours on that
[16:58:12] and there are all the known details
[16:58:17] including the upstream bug link
[17:08:06] I finished reading
[17:08:32] so I don't think that is kex exchange related and `sshd.enableDeprecatedKexAlgorithms` would not be needed
[17:08:35] well maybe it could
[17:08:51] but the error is about ssh-rsa, which looks like jgit no longer accepts an ssh-rsa key pair
[17:09:04] do you want me to revert the changes I made and re-enable puppet?
[17:09:04] which sounds similar to https://phabricator.wikimedia.org/T276486 debug1: send_pubkey_test: no mutual signature algorithm
[17:09:39] it will also need another service restart probably
[17:09:47] we can't reload the replication config by itself
[17:10:43] changes made are in gerrit.config and /var/lib/gerrit2/.ssh/config
[17:11:08] I tried different combinations of KexAlgorithms
[17:11:20] have your changes modified something in the Gerrit log output?
[17:11:43] no, I have not edited logs or logging config [17:11:46] I see things such as `ignoring unknown algorithm 'curve25519-sha256@libssh.org' in KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256` [17:11:52] I have edited sshd and ssh client config [17:12:01] and tried different combos of kex algos [17:12:01] yea [17:12:30] that's one of the things that was mentioned by other users to work-around the mina bug [17:12:52] it would change again after the next service restart [17:13:02] Gerrit/Jgit use a java client and I don't think they honor everything from ssh config or might well come with their own config file [17:13:06] or once I revert the edit of ssh client config [17:13:17] this IS its own config file [17:13:27] in the gerrit home [17:14:45] we should either sync on the status or I should revert it [17:17:45] so... which one is it [17:19:04] that jgit update is massive [17:19:44] I think it would make more sense if I simply revert everything and bring it back to the "before" state. [17:19:45] wouldn't be the first time a jgit update caused some unintended knock-on effects [17:21:13] so yeah hmm [17:21:32] I think we could theoretically regenerate the ssh key pair used for replication which is apparently using ssh-rsa [17:21:37] to well something else [17:21:48] with no guarantee it actually fixes it [17:22:21] https://issues.apache.org/jira/browse/SSHD-1163 [17:22:30] or figure out some jgit parameters [17:22:43] or well as suggested rollback to 3.4.4 to unbreak the replication [17:25:17] if I had at least looked at the replication dashboard or at the gerrit logs in kibana I would have caught it last week :-\ [17:25:47] mutante: may you re-enable puppet again to restore the manual changes made? 
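[Editor's note] The key regeneration floated above, replacing the gerrit2 user's ssh-rsa pair with a newer algorithm, would look roughly like this. The ed25519 choice, the target directory, and the file name are assumptions, and the new public key would still need to be authorized on the replica side before the old key is retired.

```shell
# Sketch: generate a fresh ed25519 key pair for the gerrit2 replication
# user instead of the old ssh-rsa one.
# $1: directory to write the new key pair into (e.g. /var/lib/gerrit2/.ssh)
regen_replication_key() {
    ssh-keygen -t ed25519 -N '' -C 'gerrit2 replication' \
        -f "$1/id_ed25519_replication" -q
}
```

As noted later in the log, this may not help while mina 2.7.0 is in use, so it is only one of the candidate fixes.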
[17:26:04] hashar: yes, re-enabling puppet [17:26:47] - enableDeprecatedKexAlgorithms = true [17:27:05] this is removed from config on 1001 and 2002 [17:27:15] now cleaning up ssh client config [17:28:21] well, that was rm /var/lib/gerrit2/.ssh/config on 1001 [17:28:31] note how that is the gerrit ssh client [17:28:37] I will roll it back cause there are no obvious fixes :-\ [17:28:49] can you do a service restart then? [17:28:59] unless that is automatically part of it [17:29:08] I did like 3 earlier and then stopped myself [17:30:06] one thing left to note, I added gerrit2002 to /var/lib/gerrit2/.ssh/known_hosts and that was needed as well but unrelated to the kex issue [17:30:08] and later we can try regenerating the ssh key used for the replication (that is a key pair for the gerrit2 user) [17:30:21] using a different algo than rsa [17:30:39] the known_hosts should be maintained by puppet [17:30:50] as long as we are using mina 2.7.0 I think it's not going to fix it [17:30:56] but we can still try [17:33:02] if we could downgrade just mina that would be nice [17:33:17] or upgrade..as long as it's not the current one [17:34:57] confirmed known_hosts file with gerrit2002 in it is in puppet repo [17:35:14] (03PS1) 10Hashar: Revert "Gerrit v3.4.5 and rebuild plugins" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) [17:35:25] mutante: that is the rollback ^ [17:36:45] (03CR) 10Dzahn: [C: 03+1] "let's try this if even just to confirm the kex algo error is gone and replication works again" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:36:57] hashar: even if just to know it's really gone with that [17:37:24] (03CR) 10Hashar: [C: 03+2] Revert "Gerrit v3.4.5 and rebuild plugins" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 
(https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:37:45] (03Merged) 10jenkins-bot: Revert "Gerrit v3.4.5 and rebuild plugins" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:38:52] 10GitLab (Project Migration), 10Product-Analytics, 10wmfdata-python: Move Wmfdata-Python from Github to Gitlab - https://phabricator.wikimedia.org/T304544 (10Milimetric) a:05Milimetric→03None [17:40:36] !log Stopping Gerrit [17:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [17:41:05] err wrong channel ) [17:43:02] Project beta-code-update-eqiad build #405782: 04FAILURE in 0.8 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/405782/ [17:44:56] the revert fixed it indeed, it looks like [17:45:15] a bunch of replication starting and don't see the exceptions anymore so far [17:45:57] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/extensions/WikimediaEvents.git completed in 42579ms [17:46:00] nice [17:46:30] (03CR) 10Dzahn: [C: 03+1] "it did indeed fix replication:" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/825845 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:46:53] hashar: by the way, paladox said "it's fixed in 3.6" earlier [17:47:11] possibly that is the other bug [17:47:27] hashar: can you see what is the mina version now? [17:47:34] I mean https://phabricator.wikimedia.org/T276486 [17:47:35] earlier I got it from the .war file [17:47:38] Yippee, build fixed! [17:47:38] Project beta-code-update-eqiad build #405783: 09FIXED in 1 min 51 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/405783/ [17:48:05] I think we should regenerate the ssh key pair used for replication (with user `gerrit2`) [17:48:10] using a different algo [17:48:12] hashar: yea, I had that old ticket in mind too.. 
maybe it's both [17:48:15] ensure it works with 3.4.4 [17:48:17] then upgrade to 3.4.5 [17:48:36] there is another ticket about changing the host key, but that one surely has a fairly large impact [17:48:56] what I don't get is which key that `KeyExchange signature verification failed for key type=ssh-rsa` is complaining about [17:49:04] is that the ssh key pair for the gerrit2 user [17:49:13] hashar: did you see the mina upstream bug? it's that exact error message [17:49:19] or is that the remote host server signature [17:49:22] and we have the exact version [17:49:40] that is why I asked what the version is now [17:50:59] 10Diffusion, 10Gerrit, 10Patch-For-Review: Diffusion mirrors of Gerrit repos not showing commits made since August 17 - https://phabricator.wikimedia.org/T315942 (10Dzahn) replication works again after gerrit was reverted to 3.4.4 Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/ex... [17:51:09] ./WEB-INF/lib/sshd-mina-2.7.0.jar [17:51:10] ./WEB-INF/lib/mina-core-2.0.21.jar [17:51:15] that's how it was before the revert [17:51:26] and https://issues.apache.org/jira/browse/SSHD-1163 [17:51:30] Affects Version/s: [17:51:31] 2.7.0 [17:51:43] $ git grep "SSHD_VERS =" v3.4.4 v3.4.5 v3.5.2 v3.6.1 [17:51:43] v3.4.4:tools/nongoogle.bzl: SSHD_VERS = "2.6.0" [17:51:43] v3.4.5:tools/nongoogle.bzl: SSHD_VERS = "2.7.0" [17:51:43] v3.5.2:tools/nongoogle.bzl: SSHD_VERS = "2.7.0" [17:51:43] v3.6.1:tools/nongoogle.bzl: SSHD_VERS = "2.8.0" [17:52:04] well, this matches again :) [17:52:09] " always identify like ssh-rsa)" [17:52:21] same error..same version.. 
only 3.4.4 vs 3.4.5 [17:53:07] it seems like we could have gerrit 3.4.5 if only we made sure mina was 2.6 or 2.8 [17:54:25] see now why I think it's that and not the fact that it's an RSA key [17:56:11] I have updated our task with some info [17:56:37] 10Diffusion, 10Gerrit, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10hashar) [18:00:46] mutante: thank you for all the investigations ;) [18:01:37] 10Diffusion, 10Gerrit, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn) To summarize from my side again: There is an upstream bug in mina sshd. It causes the exact error message we... [18:02:54] hashar: I left an update as well. thanks for reverting. now just waiting for mediawiki-core to be replicated and then if it shows up here: https://phabricator.wikimedia.org/source/mediawiki/history/master/ [18:03:26] that was the original user report [18:03:42] of course we had to fix Gerrit replication.. whether we had to fix Phabricator showing it.. is another story, heh [18:04:02] even the bug reporter already assumed it's just because we removed it from phab [18:05:16] mutante: Phabricator polls from gerrit-replica.wikimedia.org [18:05:32] which offloads the primary gerrit [18:06:16] hashar: yea, known. it's how all this started. user noticed it's not on phab, I start looking and see gerrit-replica and gerrit doesn't have the same "git log" status [18:06:30] for mw-core [18:07:04] I think it is worth trying regenerating the ssh key pair for `gerrit2` user from ssh-rsa to whatever is more modern [18:07:09] hashar: but since you also fixed the reason for the offloading... 
[18:07:17] ed231352 [18:07:20] update those, restart Gerrit 3.4.4 and verify replication still works [18:07:24] hashar: does it mean we should go back to sending all clients to the main gerrit only? [18:07:28] then try upgrade to 3.4.5 again [18:07:32] because there is no reason anymore to offload [18:07:39] and then we know the replica is not "production" [18:07:45] or maybe theoretically I can reproduce locally using an openssh daemon with the same config we use in prod [18:08:10] the replica is production! [18:08:31] it became production once we put clients on it [18:09:08] I think it's pretty clear this is the mina 2.7.0 upstream bug and changing the version of that in any way, up or down, would be more worthwhile. [18:12:22] 10Release-Engineering-Team (Bonus Level 🕹️), 10Scap: scap: add progress reporting to php-fpm-restarts - https://phabricator.wikimedia.org/T302631 (10dancy) 05Resolved→03Open I've run into two cases of hangs here in train-dev since merging https://gerrit.wikimedia.org/r/824774 ` 17:10:46 Started php-fpm-res... [18:13:49] 10Diffusion, 10Gerrit, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn) This is expected to work again per: ` [2022-08-23 17:44:49,216] Replication to gerrit2@gerrit2002.wikimedia.... [18:16:12] mutante: we cannot upgrade mina [18:16:28] short of speed upgrading to Gerrit 3.6 but I am not ready yet for that one [18:16:42] though I wanted to do 3.5 this week or next week but with the replication bug that is not possible ;) [18:18:46] gotcha! ok. 
so there is yet another path [18:19:01] that is changing the ssh client config and telling it to use specific kex algos [18:19:20] one user gave an example of what fixed it for them [18:19:34] that is why I had edited the .ssh/config in the gerrit home dir to try that [18:21:35] that does change the error message, for example to: Unable to negotiate key exchange for kex algorithms (client: ecdh-sha2-nistp256,ext-info-c / server: curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256 [18:21:41] sorry it is too late for me to follow up [18:21:57] yea, that's why it's all on the ticket. good night [18:21:58] what I don't get is whether it is a problem with the user ssh key pair (which apparently is ssh-rsa) [18:22:06] it's not [18:22:11] or with the kex algo (which are well various) [18:22:33] cause our openssh on port 22 of gerrit2002 should certainly support a wide range of algo [18:22:35] it's "rsa-sha2-256 and rsa-sha2-512 always identify like ssh-rsa" [18:22:40] that's the bug [18:23:06] then there was another thread mentioning ssh-rsa key pair being an issue as well [18:23:09] gerrit 3.6 should fix it [18:25:54] let's stop here, you literally just said it's too late to follow and my answer to your next question would be that thing:) [18:26:51] I think it is mixing up things yeah [18:27:44] and it looks like the issue can be fixed by upgrading the key pair from `ssh-rsa` [18:27:47] I made my point on the ticket and on IRC a couple times. I can't deal with ad-hoc anymore now [18:28:03] because you said to stop and then you ask that exact thing [18:28:31] yeah [18:28:49] I am going to have dinner ;-] thanks for the debugging! 
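[Editor's note] The per-host override tried in /var/lib/gerrit2/.ssh/config looked roughly like this. The exact algorithm list is an assumption reconstructed from the error messages quoted above, not the file that was actually in place; per the discussion, jgit's Apache sshd client reads this file from the gerrit home directory.

```
Host gerrit2002.wikimedia.org
  # force kex algorithms the replica's openssh advertises; this changed
  # the error message in the attempts above but did not fix replication
  KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256
```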
[18:28:55] good night, I also need to take a walk and get a coffee, cya [18:29:10] enjoy dinner, thanks [18:47:37] 10Diffusion, 10Gerrit, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn) Told Phabricator to reschedule updating the repo, via web UI diffusion -> .. -> manage repo,..-> update now (s... [18:53:37] 10Phabricator, 10Release-Engineering-Team: Flapping Phabricator processes monitoring - https://phabricator.wikimedia.org/T315962 (10Dzahn) Since replications is fixed... now.. back to this. Note how it never says "0 processes", so it's not about phd failing. Instead it is sometimes 2, then 4 then 11(!) proce... [18:53:59] (03PS1) 10Ahmon Dancy: Add "scap php-fpm-restart" [tools/scap] - 10https://gerrit.wikimedia.org/r/825873 (https://phabricator.wikimedia.org/T302631) [18:55:26] 10Diffusion, 10Gerrit, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10hashar) >>! In T315942#8179284, @Dzahn wrote: > Told Phabricator to reschedule updating the repo, via web UI diffusio... [18:58:24] 10Phabricator, 10Release-Engineering-Team: Flapping Phabricator processes monitoring - https://phabricator.wikimedia.org/T315962 (10Dzahn) The original NRPE command was/is: ` nrpe::monitor_service { 'check_phab_taskmaster': description => 'PHD should be supervising processes',... [19:00:13] 10Phabricator, 10Release-Engineering-Team, 10observability: Flapping Phabricator processes monitoring - https://phabricator.wikimedia.org/T315962 (10Dzahn) [19:02:35] 10Phabricator, 10Release-Engineering-Team, 10observability: Flapping Phabricator processes monitoring - https://phabricator.wikimedia.org/T315962 (10Dzahn) something is at odds here, see the existing puppet config above. 
It's how it's always been and that creates this file locally: ` @phab1001:/etc/nagios/... [19:05:25] 10Phabricator, 10Release-Engineering-Team, 10observability: Flapping Phabricator processes monitoring - https://phabricator.wikimedia.org/T315962 (10Dzahn) ` commit 132db90ecdb9a499a05a2d346b843616d6af61ef Author: Filippo Giunchedi Date: Mon Jul 11 13:13:33 2022 +0200 phabri... [19:08:21] 10Diffusion, 10Gerrit, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn) Alright, thanks. That seemed to be at odds with the "`mediawiki/core.git completed in 86808ms`" but will just... [19:19:15] (03PS1) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) [19:23:05] (03PS2) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) [19:24:01] (03CR) 10Jeena Huneidi: "printed message to console in train-dev, just need to see it on irc" [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [19:26:54] (03CR) 10CI reject: [V: 04-1] scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [19:27:01] (03PS3) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) [19:29:14] (03PS4) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) [19:39:00] (03CR) 10Eileen: "that looks good - how do we co-ordinate" [integration/config] - 
10https://gerrit.wikimedia.org/r/823221 (https://phabricator.wikimedia.org/T314995) (owner: 10Thcipriani) [19:39:12] 10Release-Engineering-Team, 10Gerrit (Gerrit 3.4): Upgrade Gerrit to 3.4.5 - https://phabricator.wikimedia.org/T315408 (10Dzahn) [19:40:12] 10Diffusion, 10Gerrit, 10serviceops, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn) 05Open→03Resolved a:03Dzahn @Yaron_Koren https://phabricator.wiki... [19:40:19] 10Release-Engineering-Team, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 10Fundraising Sprint Overused petting Zoo Memetics, 10Patch-For-Review: Releng - please help us decommission our crm/civicrm git repo - https://phabricator.wikimedia.org/T314995 (10greg) Moving back to Backlog now that Ty... [19:40:52] 10Diffusion, 10Gerrit, 10serviceops, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn) [19:40:56] 10Release-Engineering-Team (Bonus Level 🕹️), 10Scap: Scap backport: Do not exit with error when a sync is cancelled - https://phabricator.wikimedia.org/T316045 (10jeena) [19:48:46] (03CR) 10Thcipriani: [C: 04-1] CiviCRM: Decommission crm/civicrm (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/823221 (https://phabricator.wikimedia.org/T314995) (owner: 10Thcipriani) [19:58:43] (03CR) 10Eileen: CiviCRM: Decommission crm/civicrm (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/823221 (https://phabricator.wikimedia.org/T314995) (owner: 10Thcipriani) [20:02:09] (03PS1) 10Ahmon Dancy: Revert "Add progress reporting to php-fpm-restarts" [tools/scap] - 10https://gerrit.wikimedia.org/r/825888 [20:11:04] (03CR) 10Eileen: "And here is the CiviCRM patch https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/crm/+/825881" 
[integration/config] - 10https://gerrit.wikimedia.org/r/823221 (https://phabricator.wikimedia.org/T314995) (owner: 10Thcipriani) [20:20:23] (03CR) 10Ahmon Dancy: [C: 03+2] Revert "Add progress reporting to php-fpm-restarts" [tools/scap] - 10https://gerrit.wikimedia.org/r/825888 (owner: 10Ahmon Dancy) [20:42:34] (03Merged) 10jenkins-bot: Revert "Add progress reporting to php-fpm-restarts" [tools/scap] - 10https://gerrit.wikimedia.org/r/825888 (owner: 10Ahmon Dancy) [20:50:48] (03CR) 10Ahmon Dancy: scap backport: IRC notify upon testserver deploy (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [20:51:05] (03CR) 10Jeena Huneidi: [C: 03+2] Call tasks.clear_message_blobs after restarting php-fpm [tools/scap] - 10https://gerrit.wikimedia.org/r/825835 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy) [20:52:30] (03CR) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [20:54:28] (03PS5) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) [20:55:28] (03CR) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [20:56:46] (03PS6) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) [20:57:06] (03PS2) 10Thcipriani: CiviCRM: Decommission crm/civicrm [integration/config] - 10https://gerrit.wikimedia.org/r/823221 (https://phabricator.wikimedia.org/T314995) [20:57:11] (03CR) 10Thcipriani: [C: 03+2] CiviCRM: Decommission crm/civicrm [integration/config] - 
10https://gerrit.wikimedia.org/r/823221 (https://phabricator.wikimedia.org/T314995) (owner: 10Thcipriani) [21:00:43] hrm zuul doesn't seem to be...doing anything [21:03:20] (03CR) 10Ahmon Dancy: scap backport: IRC notify upon testserver deploy (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [21:03:53] 10Diffusion, 10Gerrit, 10serviceops, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10QChris) Great debugging @Dzahn! As there was talk about mitigating the MINA u... [21:03:54] lazy [21:03:59] ^ [21:04:01] I guess [21:04:14] maybe zuul merge took a long time? [21:04:18] kinda what the logs looked like [21:05:30] zuul merger on contint1001 is working hard. ¯\_(ツ)_/¯ [21:05:30] (03Merged) 10jenkins-bot: CiviCRM: Decommission crm/civicrm [integration/config] - 10https://gerrit.wikimedia.org/r/823221 (https://phabricator.wikimedia.org/T314995) (owner: 10Thcipriani) [21:07:15] (03Merged) 10jenkins-bot: Call tasks.clear_message_blobs after restarting php-fpm [tools/scap] - 10https://gerrit.wikimedia.org/r/825835 (https://phabricator.wikimedia.org/T263872) (owner: 10Ahmon Dancy) [21:09:33] 10Release-Engineering-Team (Radar), 10Scap, 10MediaWiki-Internationalization, 10Performance-Team: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 (10dancy) [21:09:47] 10Release-Engineering-Team (Bonus Level 🕹️), 10Scap, 10MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), 10Patch-For-Review, and 2 others: Localisation cache must be purged after or during train deploy, not (just) before - https://phabricator.wikimedia.org/T263872 (10dancy) 05Open→03Resolved All set. 
[21:13:34] * thcipriani curses jjb [21:15:35] if the jjb command is different from the one in my bash_history it always takes me like five tries to get the new command right [21:17:34] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team, 10Upstream: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10hashar) Rereading the #upstream issue [[ https://issues.jenkins.io/browse/JENKINS-67981 | J... [21:19:55] (03PS7) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) [21:20:57] 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team, 10Upstream: Jobs started failing on https://releases-jenkins.wikimedia.org on 2022-08-21 - https://phabricator.wikimedia.org/T315897 (10hashar) Or well grab the build from their CI at https://ci.jenkins.io/job/Plugins/job/git-p... [21:21:05] maintenance-disconnect-full-disks build 415132 integration-agent-docker-1024 (/: 28%, /srv: 99%, /var/lib/docker: 47%): OFFLINE due to disk space [21:21:11] (03CR) 10Jeena Huneidi: scap backport: IRC notify upon testserver deploy (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [21:25:59] maintenance-disconnect-full-disks build 415133 integration-agent-docker-1024 (/: 28%, /srv: 78%, /var/lib/docker: 46%): RECOVERY disk space OK [21:28:07] thcipriani: integration/config has some helpers to invoke jjb `./jenkins-jobs` `./jjb-test` `./jjb-update` [21:28:08] ;) [21:28:46] hashar: that assumes I put my configuration ini in the right spot --- I'm way too old-school to have anything set up in a standard way :P [21:30:20] * thcipriani bespoke computing hipster [21:31:19] is jenkins in a maintenance window? it looks like it does, right? 
[21:31:41] thcipriani: guess you can improve the script to complain to the user about the ini file location ;) [21:32:07] mutante: if it is in maintenance (restarting) it says so on the web ui [21:33:36] so no not in maintenance [21:33:37] hashar: ok, ACK, it's not then. I think I can see what I am waiting for [21:34:11] the merger was acting a little sluggish. There's a giant civi change that I just rechecked that I think gummed up the works a bit. [21:34:38] contint2001:~$ zuul-gearman.py status|grep merger:merge [21:34:38] merger:merge 125 2 2 [21:34:41] correct [21:35:21] is the merger only active on 1001? [21:35:27] on both [21:35:42] it's alright now, got V+2 [21:35:50] the numbers above read as: 125 in queue, 2 being processed, 2 workers willing to take that function [21:36:30] oh [21:36:40] but they're both doing the same checkout [21:37:02] hmm [21:37:48] https://phabricator.wikimedia.org/P32857 [21:37:51] that's strange [21:38:13] yeah [21:38:19] ps -u zuul f [21:38:24] would give a bit of details [21:39:08] one is doing stuff such as git checkout-index -u -f -- civicrm/bower_components/jquery-ui/themes/ui-lightness/images/ui-bg_glass_100_f6f6f6_1x400.png [21:39:22] because the merge is done file by file [21:39:53] the other does something similar: git checkout-index -u -f -- civicrm/vendor/ezyang/htmlpurifier/library/HTMLPurifier/HTMLModule/Image.php [21:39:59] tl;dr the disks are hell of slow [21:40:04] or they are dying [21:40:25] and the latter would not surprise me since they are good old HDD [21:40:46] that have been crunching data over and over to generate those patches or write all those jenkins build logs and artifacts [21:40:56] but should they both be checking out the same commit? Or are there multiple jobs in the gearman queue that require duplicate work? 
so the slowness might be a sign of a defect, or raid has an issue [21:41:22] it is a merge request per pipeline iirc [21:42:00] and if there are dependencies, the same change can be fetched twice [21:42:05] aka if you send a chain of A > B [21:42:10] there is a merge request for A [21:42:23] and one for A + B (which results in a `git merge A` AND a `git merge B`) [21:42:35] so in that scenario A is indeed merged twice [21:43:15] ah [21:43:16] right [21:43:55] but it's in its own queue in its own pipeline, so I would still think one job [21:44:00] from `iotop` disk writes are at 60/70 M/s [21:46:43] that seems slow, but internet tells me it's nominal. Darn spinny disks. [21:47:11] 2022-08-23 20:37:38,492 DEBUG zuul.Scheduler: Adding merge complete event for build set: in test-prio> #builds: 0 merge state: PENDING> [21:47:11] 2022-08-23 20:44:53,880 DEBUG zuul.Scheduler: Adding merge complete event for build set: in test> #builds: 0 merge state: PENDING> [21:47:11] 2022-08-23 21:45:38,599 DEBUG zuul.Scheduler: Adding merge complete event for build set: in test-prio> #builds: 0 merge state: PENDING> [21:47:35] that is all for the same crm patch [21:47:57] from three different QueueItem (which more or less represent a change in the queue) [21:48:07] but there isn't one in test-prio [21:48:25] ah [21:49:51] 2022-08-23 21:22:34,708 INFO zuul.Scheduler: Adding wikimedia/fundraising/crm, to [21:49:51] 2022-08-23 21:22:34,709 INFO zuul.Scheduler: Adding wikimedia/fundraising/crm, to [21:50:13] and in some previous log entries it tries other pipelines as well [21:50:49] that's interesting [21:51:56] or maybe at some point there was another change for crm with a depends-on from/to a change in another project [21:52:33] and somehow the scheduler would have kept that state and kept attempting to add new changes to the other pipelines [21:52:37] who knows really :-\ [21:53:33] that's strange that it's adding a queue for test-prio. 
Makes me wonder about all the different pipelines. [21:53:39] and how zuul-merger sees that [21:53:51] or how zuul sends events to gearman for that, rather [21:55:31] the zuul-merger is very basic [21:55:43] it is a function / a lambda [21:55:56] it has some parameters, returns some result (a ref to fetch) [21:58:41] yeah, I wonder if the implementation of test-prio means the zuul merge job always runs for test-prio, even if a repo has no job in the test-prio queue? [21:58:56] since we don't actually do any additional filters for test-prio [21:59:47] hmm [22:00:15] test vs test-prio that defines the precedence set by the Zuul scheduler when triggering a job [22:00:26] and I don't think the merge requests are subject to that [22:00:29] right, but the triggers are the same for both [22:00:38] so, like, zuul doesn't yet realize it shouldn't do a separate merge operation for test-prio [22:00:44] so the change gets received, the merge requests are issued [22:00:51] yeah... [22:00:56] and I would guess they are all processed with the same precedence [22:01:10] then once the result from the merge is received, the jobs get triggered by calling the function [22:01:17] and the precedence is set (high, normal, low) [22:01:41] so do we do two merge operations for everything in the test queue? 
[22:01:44] :D [22:02:08] and thus there are two stages: 1) zuul-merger, everyone races for the two slots 2) trigger to jenkins with high precedence being run as long as there are any waiting, then normal, and finally low [22:02:26] (03PS1) 10Ahmon Dancy: Add progress reporting to php-fpm-restarts (v2.0) [tools/scap] - 10https://gerrit.wikimedia.org/r/825915 (https://phabricator.wikimedia.org/T302631) [22:02:37] for that CRM change triggering two merges, there is one for `test` and another for `test-prio` [22:02:44] and that second one, I have no idea why it exists [22:02:46] well [22:03:00] the scheduler has some arcane logic or a good reason to do that maybe [22:03:19] maybe because it somehow thinks something else requires that merge [22:04:00] contint2001 disk got IO starved https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=contint2001&var-datasource=thanos&var-cluster=ci&from=now-6h&to=now&viewPanel=6 [22:04:50] and contint1001 shows something more or less similar https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=contint1001&var-datasource=thanos&var-cluster=ci&from=now-6h&to=now&viewPanel=6 though it seems it was at 60% rather than the 90% on contint2001 [22:05:18] and now it is midnight :@ [22:05:38] and you're looking at grafana :D [22:07:31] (03CR) 10Ahmon Dancy: "This is a redo since I reverted https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/824774 after noticing hangs in train-dev." 
[tools/scap] - 10https://gerrit.wikimedia.org/r/825915 (https://phabricator.wikimedia.org/T302631) (owner: 10Ahmon Dancy) [22:14:30] sleep & [22:20:06] (03CR) 10Ahmon Dancy: [C: 04-1] "Holding for logging testing/improvements" [tools/scap] - 10https://gerrit.wikimedia.org/r/825873 (https://phabricator.wikimedia.org/T302631) (owner: 10Ahmon Dancy) [22:29:55] (03CR) 10Ahmon Dancy: [C: 03+2] "Tested in train-dev" [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [22:32:28] (03CR) 10Ahmon Dancy: "Tested in train-dev" [tools/release] - 10https://gerrit.wikimedia.org/r/825828 (https://phabricator.wikimedia.org/T315452) (owner: 10Ahmon Dancy) [22:36:38] 10Release-Engineering-Team, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 10Fundraising Sprint Overused petting Zoo Memetics, 10Patch-For-Review: Releng - please help us decommission our crm/civicrm git repo - https://phabricator.wikimedia.org/T314995 (10Eileenmcnaughton) Woohoo - [22:37:08] 10Release-Engineering-Team, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 10Fundraising Sprint Overused petting Zoo Memetics, 10Patch-For-Review: Releng - please help us decommission our crm/civicrm git repo - https://phabricator.wikimedia.org/T314995 (10Eileenmcnaughton) 05Open→03Resolved... [22:37:19] 10Release-Engineering-Team, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 10Fundraising Sprint Overused petting Zoo Memetics, 10Patch-For-Review: Releng - please help us decommission our crm/civicrm git repo - https://phabricator.wikimedia.org/T314995 (10Eileenmcnaughton) Thank you very much @... 
[22:45:52] 10Release-Engineering-Team, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 10Fundraising Sprint Overused petting Zoo Memetics, 10Patch-For-Review: Releng - please help us decommission our crm/civicrm git repo - https://phabricator.wikimedia.org/T314995 (10thcipriani) \o/ thanks @Eileenmcnaughton [22:50:02] (03Merged) 10jenkins-bot: scap backport: IRC notify upon testserver deploy [tools/scap] - 10https://gerrit.wikimedia.org/r/825876 (https://phabricator.wikimedia.org/T314613) (owner: 10Jeena Huneidi) [22:55:42] maintenance-disconnect-full-disks build 415151 integration-agent-docker-1024 (/: 28%, /srv: 100%, /var/lib/docker: 47%): OFFLINE due to disk space [23:01:13] 10Release-Engineering-Team, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 10Fundraising Sprint Overused petting Zoo Memetics, 10Patch-For-Review: Releng - please help us decommission our crm/civicrm git repo - https://phabricator.wikimedia.org/T314995 (10Eileenmcnaughton) Instructions to fr-tec... [23:05:36] maintenance-disconnect-full-disks build 415153 integration-agent-docker-1024 (/: 28%, /srv: 78%, /var/lib/docker: 46%): RECOVERY disk space OK [23:34:20] 10Release-Engineering-Team, 10Fundraising-Backlog, 10Wikimedia-Fundraising-CiviCRM, 10Fundraising Sprint Overused petting Zoo Memetics, 10Patch-For-Review: Decommission Fundraising's crm/civicrm git repo - https://phabricator.wikimedia.org/T314995 (10Aklapper)