[02:44:05] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:45:59] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:24:29] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[03:24:48] Release-Engineering-Team (Doing), Quibble: Establish communication channel for Quibble development (plot twist: Slack channel) - https://phabricator.wikimedia.org/T286770 (Legoktm) Use of Slack is exclusionary as it forces users to agree to their privacy-infringing TOS and use non-free software. IRC is u...
[03:26:23] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[08:49:54] (PS2) Zoranzoki21: Add Juan90264 to the CI allowlist [integration/config] - https://gerrit.wikimedia.org/r/704455
[09:30:07] (CR) Hashar: [C: +2] "Well done :)" [integration/config] - https://gerrit.wikimedia.org/r/705186 (https://phabricator.wikimedia.org/T286869) (owner: Samtar)
[09:30:39] (CR) Hashar: [C: +2] Add Juan90264 to the CI allowlist [integration/config] - https://gerrit.wikimedia.org/r/704455 (owner: Zoranzoki21)
[09:31:13] (Merged) jenkins-bot: Update email address for samtar in layout.yaml [integration/config] - https://gerrit.wikimedia.org/r/705186 (https://phabricator.wikimedia.org/T286869) (owner: Samtar)
[09:31:47] (Merged) jenkins-bot: Add Juan90264 to the CI allowlist [integration/config] - https://gerrit.wikimedia.org/r/704455 (owner: Zoranzoki21)
[09:34:04] (CR) Hashar: [C: +2] zuul: Add tox-docker for cloud/metricsinfra/prometheus-manager [integration/config] -
https://gerrit.wikimedia.org/r/705174 (owner: Majavah)
[09:35:05] (Merged) jenkins-bot: zuul: Add tox-docker for cloud/metricsinfra/prometheus-manager [integration/config] - https://gerrit.wikimedia.org/r/705174 (owner: Majavah)
[10:07:52] legoktm: How often does LibUp refresh its knowledge of repos' dependencies? We landed a patch dropping sentry/sentry from mediawiki/services/function-schemata on Tuesday but LibUp still thinks it depends on it: https://libraryupgrader2.wmcloud.org/r/mediawiki/services/function-schemata?branch=main Is it just slow, or should I file a bug?
[10:10:08] Continuous-Integration-Config, Patch-For-Review: Update email address for samtar in layout.yaml - https://phabricator.wikimedia.org/T286869 (Samtar) Open→Resolved
[10:10:34] (CR) Samtar: Update email address for samtar in layout.yaml (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/705186 (https://phabricator.wikimedia.org/T286869) (owner: Samtar)
[12:30:51] Phabricator, Infrastructure-Foundations, SRE, CAS-SSO, User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (Majavah) Phabricator doesn't seem to offer this functionality to anyone else than the user itself. Also I don't think there is a way to get this...
[12:34:18] Phabricator, Infrastructure-Foundations, SRE, CAS-SSO, User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (Majavah) >>! In T286904#7221038, @Majavah wrote: > Phabricator doesn't seem to offer this functionality to anyone else than the user itself. Tur...
[12:40:33] Phabricator, Infrastructure-Foundations, SRE, CAS-SSO, User-jbond: Add logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (RhinosF1)
[14:52:27] Gerrit, Infrastructure-Foundations, SRE, CAS-SSO, User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (hashar)
[15:22:40] dancy: hi; is deployment-deleteme needed anymore (it seems to have some notes etc), or can it be deleted?
[15:23:45] Hmm. I was sharing that with someone so I'll need to ask them if they're done with it. If so, I'll delete it within a few hours. Does that work for you?
[15:27:13] sure, thanks
[15:35:27] good morning
[15:35:46] Hey Antoine
[15:36:10] I forgot to send the gerrit upgrade announce last week but did so about an hour ago ;D
[15:36:33] Hopefully that'll be sufficient
[15:36:43] I am not too worried about it
[15:37:43] addshore: seems like puppet is failing on gitlab-runner-addshore-1001.integration.eqiad1.wikimedia.cloud
[15:39:32] D:
[15:39:42] with any particular error?
[15:39:56] that machine has very minimal things applied to it
[15:41:06] (PS4) Ahmon Dancy: Prototype of incremental image build process [tools/release] - https://gerrit.wikimedia.org/r/705003 (https://phabricator.wikimedia.org/T286505)
[15:45:16] legoktm: does LibUp supports packaging file being in subdirectories? mediawiki/libs/metrics-platform.git hosts java/php/nodejs/swift code each under a standalone directory so we get eg php/composer.json https://gerrit.wikimedia.org/r/c/mediawiki/libs/metrics-platform/+/676443 . Might need some config tweak in libup if that repo is eligible to be added to libup
[15:50:42] hashar: that sounds like a more general version of https://phabricator.wikimedia.org/T228527
[15:51:00] (CR) Hashar: "Michael do you still need this change?
The mediawiki/libs/metrics-platform.git repo has some pending change that add ./php ./js ./swift di" [integration/config] - https://gerrit.wikimedia.org/r/697980 (owner: Mholloway)
[15:51:20] Lucas_WMDE: ah yes definitely
[15:52:33] (PS5) Ahmon Dancy: Prototype of incremental image build process [tools/release] - https://gerrit.wikimedia.org/r/705003 (https://phabricator.wikimedia.org/T286505)
[15:55:10] (PS6) Ahmon Dancy: Prototype of incremental image build process [tools/release] - https://gerrit.wikimedia.org/r/705003 (https://phabricator.wikimedia.org/T286505)
[16:28:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:30:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:44:28] Project beta-code-update-eqiad build #354469: FAILURE in 1 min 27 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/354469/
[16:44:58] 16:44:27 fatal: unable to access 'https://gerrit.wikimedia.org/r/wikimedia/portals.git/': The requested URL returned error: 503
[16:45:10] gerrit deploy related, presumably
[16:46:12] there will be more alerts.. all the hosts pulling from gerrit..
depending how lucky
[16:48:32] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:54:27] Project beta-code-update-eqiad build #354470: STILL FAILING in 1 min 26 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/354470/
[16:58:10] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:32] Yippee, build fixed!
[17:04:33] Project beta-code-update-eqiad build #354471: FIXED in 1 min 32 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/354471/
[17:10:31] Gerrit, Release-Engineering-Team (Doing), Patch-For-Review: Upgrade Gerrit to 3.2.11 - https://phabricator.wikimedia.org/T278990 (hashar) @dancy @brennen and I paired the update and we have updated the deployment notes while doing it: https://wikitech.wikimedia.org/wiki/Gerrit/Upgrade#Deploying The...
[17:19:21] Gerrit is back up
[17:20:57] hashar brennen dancy kudos on the gerrit eventful gerrit upgrade!
[17:21:10] er...one too many gerrits
[17:21:13] in that sentence
[17:22:03] more or less :D
[17:22:15] we went to encounter a glitch in ipv4 vs ipv6 dns resolution
[17:22:21] that prevented gerrit to load
[17:22:30] maybe something changed somewhere in the stack. Anyway: don't use fqdn
[17:22:51] also our Gerrit hosts have 2 IPv4 each.
I don't know why
[17:23:24] because there is the service IP gerrit.wikimedia.org (which can be switched between hosts) and the server name, gerrit1001.wikimedia.org (or 2001)
[17:24:11] fwiw IPv6 has been resolving and working before
[17:24:18] since quite a long time
[17:24:25] yeah I don't know what happened
[17:24:37] either it preferred ipv4 when doing DNS
[17:24:52] and thus listenAddress = gerrit.wikimedia.org always returned the ipv4
[17:25:08] seems to be a problem with the ssh java library specifically, maybe something changed in that logic if nothing changed with the ip config on the host
[17:25:09] I guess something changed in gerrit or our stack somewhere which yields the ipv6 more often
[17:25:25] so I have send https://gerrit.wikimedia.org/r/c/operations/puppet/+/705431
[17:25:29] to listen on the IPv4
[17:25:37] now I am not sure that is the proper one :D
[17:27:50] I have pasted the puppet compiler output on it
[17:27:55] $sshd_host is not simply set to a single hostname
[17:28:10] it is set to "first of the replica hosts if on a replica" or to $host
[17:28:22] this allows for more than one replica
[17:28:42] OH truue
[17:29:22] so it seems like it's wrong to set the listenAddress to that (which the patch fixes), is that right?
[17:29:30] then each host has a profile::gerrit::ipv4 which has the service ip so I guess that is fine
[17:29:36] I wonder how this worked before?
[17:29:46] hosts/gerrit1001.yaml:profile::gerrit::ipv4: '208.80.154.137'
[17:29:46] hosts/gerrit2001.yaml:profile::gerrit::ipv4: '208.80.153.107'
[17:30:10] "Address already in use" sounds more like the problem is that (through puppet?)
the sshd was already running and using that address
[17:30:21] and then tried to bind it a second time when you started the service
[17:30:42] that would be kind of unrelated to it being v6
[17:31:29] maybe stop/kill sshd service, then start gerrit again
[17:31:37] could have been just bad luck in some race
[17:31:57] the 29418 sshd service is part of the gerrit war controlled via that service
[17:32:15] so stopping that service stops the jvm which stops ssh
[17:32:25] did puppet start the service and then stopping it did stop gerrit but not the sshd ?
[17:32:37] hmm, ok
[17:33:03] the alternative would be to listen to '*:29418' aka listen on any available address
[17:33:09] did the same thing happen in eqiad and codfw or just one?
[17:33:27] the sshd service is managed by gerrit itself
[17:33:45] from their notes it looks like it happened in eqiad and codfw
[17:33:57] so if gerrit is stopped..we can confirm nothing listens on 29418 anymore?
[17:34:07] because "already in use" sounds like it still was
[17:34:18] so if it has listenAddress = gerrit.wikimedia.org listenAddress = [ipV6] , if it resolved the fqdn to an ipv6 it spawns the sshd there, and then fail/explode when trying to setup the second listen since it has the same ip
[17:34:24] the root cause really: I don't know ;D
[17:34:28] yes
[17:34:44] port 29418 is listened by the Gerrit java process
[17:34:48] I think gerrit tried to start sshd on the same ipv4 interface twice somehow
[17:35:12] is my suspicion based on the solution that worked (i.e., specifying one specific ip)
[17:35:14] last time I restarted gerrit it I guess managed to always resolve the fqdn to the ipv4
[17:35:26] so maybe it is a change in gerrit or a change in our dns stack somewhere
[17:35:42] the FQDN always resolves to both IPv4 and IPv6
[17:36:18] I think the part "trying to bind a port twice" seems separate from DNS resolution
[17:36:20] so I wonder if the fqdn responded with ipv4 and ipv6 and then we explicitly define ipv6 and so
it was doing ipv6 twice?
[17:36:37] or attempting to bind ipv6 twice
[17:37:24] it was the ipv6 that was showing up in the error log fwiw
[17:38:21] however you achieved it but right now it's listening on both v4 and v6 and only on the service IP, as it should.. and same on eqiad and codfw
[17:38:39] so that seems like it's solved
[17:38:57] per hashar 's email puppet is disabled on those hosts pending merging the patch he posted
[17:39:07] ^ is that right hashar ?
[17:39:14] (I haven't confirmed it's disabled)
[17:39:28] oh, but the v6 IP is not the expected one
[17:39:29] yeah
[17:39:38] but it seems the ip address I filed are correct
[17:39:39] it does listen on _a_ v6 IP on 29418
[17:40:06] but that does not look like the right one
[17:41:01] interesting...
[17:41:07] it should be using 2620:0:861:2:208:80:154:137
[17:41:44] for eqiad
[17:42:15] ah, wait, it does
[17:42:22] telnet 2620:0:861:2:208:80:154:137 29418
[17:42:27] SSH-2.0-GerritCodeReview_3.2.11 (APACHE-SSHD-2.4.0)
[17:42:32] works
[17:42:47] sure you need to fix anything?
[17:43:40] I think the gerrit.config is live-hacked with https://gerrit.wikimedia.org/r/c/operations/puppet/+/705431
[17:43:47] to get it to start at all
[17:44:05] it's live-hacked with a hard-coded IP.
[17:44:29] it was doing the ipv6 twice, previously.
[17:44:42] if you stop gerrit and start it again and it says that "address already in use" .. and then you run netstat.. I wonder what it shows that is supposed to listen on it
[17:44:47] (if gerrit is really stopped)
[17:45:28] I have updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/705431 , the @ipv4 from puppet does match the A dns records for each service.
So I guess it is good
[17:45:49] we confirmed gerrit was stopped and checked `sudo lsof -i :29418`, didn't seem like there was anything
[17:45:56] mutante: when gerrit is stopped, there is nothing listening on 29418
[17:46:01] I have confirmed that
[17:46:16] what happens is that Gerrit starts sshd very early in its start process
[17:46:31] sshd.listenAddress = gerrit.wikimedia.org:29418 cause it to resolve the DNS entry
[17:46:43] that somehow yields an IPv6 or the ipv6 is preferred
[17:46:58] gerrit starts sshd on that ipv6 address
[17:47:13] it then encounters the config sshd.listenAddress = [ipv6 here]:29418
[17:47:39] get sshd to start listening on that ipv6 address but since it previously already started listening on that, the ip/port is not available
[17:47:49] and the socket bind fail: address already in use
[17:48:01] maybe the part that changed is that APACHE-SSHD-2.4.0 is used now instead of some other sshd (version) before and this listens on both IPv4 and IPv6 if given a hostname..unlike the previous versions
[17:48:27] which would explain why previously it needed the separate <%- if @ipv6 %> right below
[17:48:39] if they have upgraded/changed sshd between 3.2.7 and 3.2.11 yeah that could be an explanation
[17:48:48] maybe now it would also work with _just_ the hostname.. and doesnt need listenAddress = [<%= @ipv6 %>]:29418 anymore
[17:49:30] I think you could just drop that other part
[17:49:32] but that is the same sshd used
[17:49:34] anyway
[17:50:21] I don't think there is a need to spend hours figuring out the exact root cause.
The fix is stop relying on a fqdn that resolve to v6/v4 and just use the v4
[17:50:53] +1: if the fix works, let's call it fixed :)
[17:51:16] and I will file up a follow up task to check the impact on simply listening on all available ip
[17:51:35] we have a separate service IP to avoid doing that part
[17:52:07] but you could argue the firewall blocking it is enough
[17:53:19] and potentially we could drop the service ips entirely ?
[17:53:35] and thus just have gerrit.wikimedia.org being a CNAME to gerrit1001 or gerrit2001
[17:53:46] but that is a different topic :]
[17:53:55] that will require a lot of work to remove a lot of previous work
[17:55:23] random example: https://phabricator.wikimedia.org/rOPUP192f511d387404687553236be4d5bb3124a557d1
[17:55:41] cert issues when https://gerrit1001 works.. and that's just one thing
[17:55:48] (Restored) Mholloway: Run per-language pipelines for mediawiki/libs/metrics-platform [integration/config] - https://gerrit.wikimedia.org/r/697980 (owner: Mholloway)
[17:55:56] yeah I guess there is a good reason for having those service ip
[17:56:46] also iptables is open to any destination: 0.0.0.0/0 tcp dpt:29418
[17:57:10] so we gotta listen to explicit list of v4/v6
[17:57:19] https://phabricator.wikimedia.org/T165631
[17:57:29] and Icinga monitoring and more
[17:58:02] from 2017 < paravoid> or let's add a service IP to gerrit2001, but also move them in private/behind LVS like phab[12]001-vcs, and also support port 22 for Gerrit
[17:58:24] this would be moving back to before that, fwiw
[17:59:17] so the service IP got added to normalize the gerrit machines to use VIP and have the host moved to 10/8
[17:59:20] hashar: yea, if the theory is right that giving it the hostname makes it listen on both v4 AND v6 then it should work like you suggest or to just drop the v6 line below.
yea
[17:59:43] something changed
[17:59:46] we cant drop the v6 lines, cause we need to listen on Ipv6 and Ipv4
[18:00:01] but you say that using the hostname does exactly that
[18:00:13] my patch just resolve the ambiguity that the hostname can end up being picked as an ipv6 leading to a dupe
[18:00:18] so that would be what is desired.. except the second time is not needed anymore
[18:00:19] (and lack of ipv4 listening)
[18:02:40] yea, just try if it works then. I am just wondering if it was that simple to use $ipv4 why there is that logic to set sshd_host to the right one on each replica.. but maybe there really is none
[18:02:54] we'll see what it does in cloud
[18:03:32] I ran it through the puppet compiler, the ipv4 is correct for each hosts
[18:03:59] oh right.. of course IPv4 doesnt work in cloud.. so that's why it needed that switch to turn it off
[18:04:05] IPv6 i meant
[18:04:35] hashar: ACK
[18:05:23] (PS1) TK-999: Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438
[18:05:28] `hieradata/cloud.yaml:profile::gerrit::ipv6: ~` ?
[18:05:37] is that magic to state it is undefined?
[18:05:56] yea
[18:06:05] that's a way to avoid errors
[18:06:18] because IPv6 is not supported there unfortunately
[18:06:33] (CR) jerkins-bot: [V: -1] Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438 (owner: TK-999)
[18:07:22] (CR) Dzahn: "Line 5: Line should be <=72 characters :)" [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438 (owner: TK-999)
[18:07:48] and what we did in production:
[18:08:00] on gerrit-replica just stop/start the service until it resolved to the ipv4
[18:08:11] so it is now listening on v6 and v4
[18:08:29] for gerrit1001 we manually hacked the config to replace the gerrit.wikimedia.org with the ip address
[18:08:35] which is what my patch is doing ;)
[18:09:21] all I can say is that "stop/start the service until it resolve to the ipv4" is not something that had to be done in the past
[18:09:33] yes
[18:09:58] but I am not going to spend time figuring out the root cause when the fix is just to use the ipv4 instead of the hostname :]
[18:10:11] (PS2) TK-999: Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438
[18:11:11] (CR) jerkins-bot: [V: -1] Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438 (owner: TK-999)
[18:15:34] going to watch the kids, they are asking me
[18:17:38] (PS3) TK-999: Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438
[18:17:54] (CR) Mholloway: "Hi hashar, thanks for these suggestions.
I experimented late last week on a couple of outstanding patches to use a single pipeline, but th" [integration/config] - https://gerrit.wikimedia.org/r/697980 (owner: Mholloway)
[18:18:38] (CR) jerkins-bot: [V: -1] Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438 (owner: TK-999)
[18:20:07] mutante: I don't want to leave puppet in its current state on gerrit1001, what do we need to do to either get https://gerrit.wikimedia.org/r/c/operations/puppet/+/705431 merged or a different solution so we can make sure puppet is running there?
[18:24:29] thcipriani: ah, you need someone to +2 it? it needs someone to test if it works
[18:24:52] It misbehaved on the Gerrit replica so we could test there first.
[18:24:57] I don't want to merge something and then walk away but I have to in a few minutes
[18:25:08] yep, needs a +2, I could test it on the replica
[18:25:49] hmm, that means I have to be here until after the tests
[18:26:05] rushing seems bad but I guess we have to then
[18:26:27] could loop in a different serviceopsen if you have a good one to point me to :)
[18:26:54] I don't
[18:26:59] :(
[18:27:08] is puppet disabled on both?
[18:27:23] it doesn't look like it's disabled on gerrit2001
[18:27:33] someone who has IPv6 at home?
[18:27:44] I...don't :(
[18:27:46] small town isp
[18:27:52] ^ dancy or brennen ?
[18:28:03] note that the puppet change doesn't affect how ipv6 is bound (which was already correct). It affects ipv4
[18:29:07] I can test ipv4 ;) Also, I think the primary problem was that gerrit wouldn't start, which I can definitely test
[18:29:25] we saw weirdness on gerrit2001, but it did eventually start.
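The "address already in use" failure discussed above can be reproduced outside Gerrit. The following is a minimal Python sketch that assumes nothing about the internals of Gerrit or Apache SSHD: it opens a listening socket on a loopback address and then tries to bind the same address and port a second time, which is the duplicate-listen situation described for sshd.listenAddress. The function name `second_bind_errno` and the loopback address are illustrative choices, not taken from the Gerrit configuration.

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind host:port twice and return the errno raised by the second bind.

    Returns None if the second bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))           # port 0: let the OS pick a free port
        first.listen(1)                 # the first listener now owns host:port
        port = first.getsockname()[1]
        try:
            second.bind((host, port))   # same address and port again
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = second_bind_errno()
    print("second bind failed with EADDRINUSE:", err == errno.EADDRINUSE)
```

If a hostname entry and a literal-address entry end up naming the same address and port, the second listener hits this error, which is consistent with the theory in the log; whether that is what actually happened inside Gerrit was not confirmed.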
[18:29:36] I can verify via IPv6
[18:29:47] ok, first thing that needs to be solved is ..there are 4 pending changes on master..sigh
[18:30:28] (PS2) Mholloway: Run per-language pipelines for mediawiki/libs/metrics-platform [integration/config] - https://gerrit.wikimedia.org/r/697980
[18:30:42] trying to ping people
[18:30:55] I always forget that mutante is dzahn. :-)
[18:30:55] define "weirdness" ?
[18:31:23] weirdness = multiple restart attempts because it couldn't bind the sshd server address.
[18:31:28] ^
[18:33:04] alright! we got merge..running puppet in codfw
[18:33:13] it changed gerrit.config
[18:33:25] woot!
[18:33:47] ok, it's done. please test
[18:34:23] ok, i will test restart on 2001
[18:34:30] <3
[18:34:37] (PS4) TK-999: Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438
[18:34:40] (if it was just bad luck or works after a couple times it might be hard to test)
[18:34:55] but yea, should work
[18:35:00] seems to work :)
[18:35:16] well, it wasn't restarted yet
[18:35:40] (CR) jerkins-bot: [V: -1] Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438 (owner: TK-999)
[18:37:41] mutante: restarted gerrit-replica and it came up first try!
[18:38:09] (PS5) TK-999: Set maximum line length to 72 characters [integration/commit-message-validator] - https://gerrit.wikimedia.org/r/705438
[18:39:05] thcipriani: cool! (maybe it would have also done that without the change but I agree this makes it clear)
[18:39:14] should I enable puppet on 1001 then?
[18:39:26] ah brennen already does that
[18:42:24] !log gerrit1001: ran puppet; noted that quotes were added to jvm configuration values
[18:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:44:20] it feels like more changed on this puppet run than should have; we're investigating
[18:48:47] mutante: we ran puppet and restarted on gerrit1001 as well just for good measure: looks good
[18:48:51] it all _seems_ fine, the restart was fast
[18:48:52] new gerrit comes back pretty fast
[18:49:04] I watched the log and it took ~20seconds
[18:49:47] thcipriani: cool, thanks for confirming
[18:51:11] also seems to work for me (https on gerrit-replica, ssh to gerrit -p 29418 )
[18:51:22] 20 seconds is a marked improvement from the days when I could sing https://www.youtube.com/watch?v=zSGWoXDFM64&t=51s in its entirety waiting for it to restart
[18:51:55] the time it took to restart was defined as "just long enough to make you start worrying if it comes back but then it does 5 seconds later"
[18:52:03] so yea :)
[18:52:07] ^ exactly
[18:53:26] alright, i'll go afk then. text me if you need something and can't find another merger
[18:53:45] mutante: as always, thanks much for the assist.
[18:53:45] <3 thanks mutante
[18:53:57] yep, np:)
[18:54:28] for a moment I had forgotten you needed me to +2 it, tbh , laters
[18:54:54] thcipriani: sidebar, i'm just impressed you know the words to i am the very model of a modern major general.
[18:55:39] relevant content: https://www.youtube.com/watch?v=TbQ-y589mx8
[18:56:35] the reason for that is because of frequent gerrit restarts
[19:19:46] (PS1) TrainBranchBot: Update state/train-versions.json [tools/release] - https://gerrit.wikimedia.org/r/705487
[19:19:48] (CR) TrainBranchBot: [C: +2] Update state/train-versions.json [tools/release] - https://gerrit.wikimedia.org/r/705487 (owner: TrainBranchBot)
[19:20:58] (Merged) jenkins-bot: Update state/train-versions.json [tools/release] - https://gerrit.wikimedia.org/r/705487 (owner: TrainBranchBot)
[19:25:56] mutante: brennen dancy thcipriani thank you for the config tweak :D
[19:33:02] Release-Engineering-Team (Doing), GerritBot, Developer Productivity, Regression: Gerritbot turns "+" into space, thus breaking most Gerrit URLs - https://phabricator.wikimedia.org/T280197 (hashar)
[19:33:04] Release-Engineering-Team (Doing), Gerrit (Gerrit 3.3): Upgrade Gerrit to 3.3 - https://phabricator.wikimedia.org/T262241 (hashar)
[19:33:48] Gerrit, Release-Engineering-Team (Doing), Patch-For-Review: Upgrade Gerrit to 3.2.11 - https://phabricator.wikimedia.org/T278990 (hashar) Open→Resolved Solved with @brennen and @dancy !
[19:35:23] ahhhh gerrit upstream and their multiple branches merge dance :/
[19:36:38] Release-Engineering-Team (Next), MW-on-K8s, serviceops: Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (dduvall)
[19:37:04] Release-Engineering-Team (Next), MW-on-K8s, serviceops: Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (dduvall)
[19:39:43] Release-Engineering-Team (Radar), MW-on-K8s, SRE, serviceops: The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (dduvall) >>!
In T285232#7215756, @dduvall wrote: > Helm supports hooks. What if we define pre-install hook and a k...
[19:45:58] (PS7) Ahmon Dancy: Prototype of incremental image build process [tools/release] - https://gerrit.wikimedia.org/r/705003 (https://phabricator.wikimedia.org/T286505)
[19:58:10] Release-Engineering-Team (Next), MW-on-K8s, serviceops: Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (dduvall)
[19:59:18] (PS8) Ahmon Dancy: Prototype of incremental image build process [tools/release] - https://gerrit.wikimedia.org/r/705003 (https://phabricator.wikimedia.org/T286505)
[20:00:12] (PS9) Ahmon Dancy: Prototype of incremental image build process [tools/release] - https://gerrit.wikimedia.org/r/705003 (https://phabricator.wikimedia.org/T286505)
[20:13:46] (PS1) TrainBranchBot: Update state/train-versions.json [tools/release] - https://gerrit.wikimedia.org/r/705496
[20:13:48] (CR) TrainBranchBot: [C: +2] Update state/train-versions.json [tools/release] - https://gerrit.wikimedia.org/r/705496 (owner: TrainBranchBot)
[20:14:40] (Merged) jenkins-bot: Update state/train-versions.json [tools/release] - https://gerrit.wikimedia.org/r/705496 (owner: TrainBranchBot)
[20:29:47] (PS1) Hashar: [WMF] its-phabricator: Urlencode POST to conduit [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/705499 (https://phabricator.wikimedia.org/T280197)
[20:49:33] (CR) Hashar: "The update is a single commit from upstream :]" [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/705499 (https://phabricator.wikimedia.org/T280197) (owner: Hashar)
[21:16:15] Release-Engineering-Team, GitLab, User-brennen: Document long-term requirements for GitLab job runners - https://phabricator.wikimedia.org/T286958 (brennen)
[22:03:10] (PS10) Ahmon Dancy: Prototype of incremental image build process [tools/release] -
https://gerrit.wikimedia.org/r/705003 (https://phabricator.wikimedia.org/T286505)
[22:13:15] (PS11) Ahmon Dancy: Prototype of incremental image build process [tools/release] - https://gerrit.wikimedia.org/r/705003 (https://phabricator.wikimedia.org/T286505)
[22:35:54] Release-Engineering-Team (Deployment Training Requests): Deployment training request for **cjming** - https://phabricator.wikimedia.org/T285898 (cjming) Thanks @thcipriani -- I'm on the schedule to attend more trainings and initiated the process for shell access at T286961. I wasn't sure who should be assign...
[23:11:23] Release-Engineering-Team (Doing), GitLab (Initialization), User-brennen: Create a GitLab settings script / repo - https://phabricator.wikimedia.org/T284336 (brennen) Open→Resolved a:brennen Multiple-value settings: https://gitlab.wikimedia.org/releng/gitlab-settings/-/merge_requests/1
[23:12:03] Release-Engineering-Team (Next), GitLab, User-brennen: gitlab-settings: Figure out how to handle options that take an array of strings - https://phabricator.wikimedia.org/T285907 (brennen)
[23:12:05] Release-Engineering-Team (Doing), GitLab (Initialization), User-brennen: Create a GitLab settings script / repo - https://phabricator.wikimedia.org/T284336 (brennen)
[23:12:30] Release-Engineering-Team (Next), GitLab, User-brennen: gitlab-settings: Figure out how to handle options that take an array of strings - https://phabricator.wikimedia.org/T285907 (brennen) Open→Resolved
[23:12:32] Release-Engineering-Team (Doing), GitLab (Initialization), User-brennen: Create a GitLab settings script / repo - https://phabricator.wikimedia.org/T284336 (brennen)
[23:30:36] Phabricator: Add a Herald rule for User-MediaJS - https://phabricator.wikimedia.org/T286077 (Aklapper) Stalled→Declined Declining as per my previous comment.
Please feel free to reopen once there is a less broad scope, plus an explanation which specific problem(s) you would like to see solved - thank...
[23:46:34] Release-Engineering-Team (Deployment Training Requests): Deployment training request for **cjming** - https://phabricator.wikimedia.org/T285898 (thcipriani) >>! In T285898#7222956, @cjming wrote: > Thanks @thcipriani -- I'm on the schedule to attend more trainings and initiated the process for shell access a...