[06:54:49] FYI, netmon is being rebooted
[06:56:36] and also rebooting bast6003 now
[07:38:41] do we have puppet-lint settings that are able to enforce our indent style? aka 4 spaces rather than 2?
[07:52:46] vgutierrez: git grep chars_per_indent4 should guide you, I didn't check if it's included only in some cases
[07:53:05] s/chars_per_indent4/chars_per_indent/
[07:56:03] yeah.. so using .puppet-lint.rc from our puppet repo should be enough, thx
[09:21:57] moritzm: FYI openjdk-17-jre-headless unattended update (17.0.12+7-2~deb12u1, 17.0.13+11-2~deb12u1) broke puppetservers in WMCS
[09:22:12] see T377803
[09:22:17] T377803: Cloud VPS: cloud-wide puppet problem related to puppet-enc 2024-10-22 - https://phabricator.wikimedia.org/T377803
[09:25:56] arturo: o/ yes we are aware of this issue, when openjdk is installed we need to immediately restart puppetserver :(
[09:26:02] IIUC a restart fixed it right?
[09:26:18] elukey: apparently yes
[09:26:22] yeah, you should exempt openjdk-17 from unattended-upgrades for the puppetserver role
[09:26:40] puppetserver needs an immediate restart after the upgrade
[09:27:35] I think it's related to jruby, jruby-compiled artefacts generated by two different JREs don't mix
[09:30:17] in modules/profile/manifests/puppetserver/wmcs.pp
[09:30:23] I see
[09:30:25] # To ensure the server is restarted on unattended java upgrades
[09:30:25] profile::auto_restarts::service { 'puppetserver': }
[09:30:31] I guess that's not working as expected?
[09:32:22] it's working as expected, but doesn't help here
[09:32:37] the auto_restarts are splayed across the day
[09:33:04] but once openjdk-17 is upgraded, puppetserver fails with the next catalogue compile
[09:33:43] so it would remain broken for up to 23:59 hours in the worst case scenario
[09:34:12] ah, I see
[09:34:21] we already have some exceptions in the unattended-upgrades config, best to simply add openjdk-17 to it for the puppetserver role
[09:34:58] or deploy some script which checks if java updates are available
[09:35:14] and which, if that's the case, installs them along with a subsequent puppetserver.service restart
[09:35:46] it's really surprising the puppetserver even stumbles over minor updates like 17.0.x
[09:36:24] I'm having difficulties finding that setting to prevent openjdk-17 from auto-updating
[09:36:28] do you have a pointer?
[09:36:46] and Puppet Enterprise possibly bundles Java like any good Enterprise application :-)
[09:38:41] hmmh, all I can find is apt::unattendedupgrades which doesn't appear to have exceptions
[09:39:07] but we definitely had these in the past, as some updates caused issues on toolforge in the gridengine days
[09:39:29] maybe that config vanished when the old tools setup was retired, not sure
[09:44:56] if the exceptions were via normal apt pinnings, then most likely yes, we deleted all these
[09:55:05] arturo: alternatively just ping openjdk-17-jre-headless to a specific version for now within the role? and when we can periodically bump that along with puppetserver.service restarts via cloudcumin?
[09:55:13] /ping/pin
[09:55:54] yeah, I'll cook a patch soon and let you know
[09:56:09] sgtm
[09:57:18] I found what I mentioned earlier:
[09:57:45] profile::toolforge::apt_pinning has the setting for the packages which were pinned to avoid toolforge errors
[10:02:43] oh, I thought we had dropped that file
[11:32:59] TIL about git blame -C (and -CC and -CCC): it makes git try and find the original author of a line. Very useful if e.g. a section was moved between files. It will even annotate what file the line was originally committed to.
[11:40:10] thanks for the pointer, I never heard about that before!
[11:44:30] https://blog.gitbutler.com/git-tips-1-theres-a-git-config-for-that/ <- where I got it, also more
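A minimal sketch of the copy-detection flags mentioned above; the file path is just an example and not one the discussion names:

    # -C finds lines moved or copied from other files changed in the same commit;
    # each extra -C widens the search (first to the commit that created the file,
    # then to all commits).
    git blame -C -C -C modules/profile/manifests/puppetserver/wmcs.pp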
[12:50:06] moritzm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082201
[14:47:10] hnowlan, swfrench-wmf: re: T363996, I have a ~1 hr meeting in 15 mins, then one free hour, followed by another meeting of 30 mins in length
[14:47:10] T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996
[14:49:39] we should probably (have) set this up as a calendar event
[14:50:07] heh, true. Want to aim for the 1600 UTC window then?
[14:55:54] 16:00 sounds good to me. looks like there are patches scheduled for the puppet window, plus if things run over, we can always take some of the 17:00 infra window too.
[14:56:03] *there are no patches
[14:56:07] lol, typing
[15:17:27] hnowlan, swfrench-wmf: wfm (if an hour block is enough)
[15:23:04] anyone who knows pybal enough to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076420 please? that is ultimately blocking me from fixing a broken link in a totally unrelated script
[15:26:07] taavi: what's the unrelated script/error?
[15:27:11] sukhe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082226 :D
[15:28:11] taavi: in a meeting so forgive me but that is passing?
[15:29:28] sukhe: yes, since that depends on the pybal patch? if you rebased that patch to not depend on it then https://gerrit.wikimedia.org/r/c/operations/puppet/+/1082227 (which is needed for the original patch to pass) would start failing
[15:34:20] should be OK for this case, +1
[16:02:57] swfrench-wmf, hnowlan: ok, I have an hour! :)
[16:03:41] cool, I'm good to go :)
[16:03:54] am here as well
[16:05:12] heads up arnoldokoth, cdanis: we're doing some migration work on sessionstore. We're following a successful pattern that was done for the other kask deployment but sessionstore is obviously pretty sensitive
[16:05:18] ack!
[16:05:20] I'll start by depooling codfw
[16:05:25] thanks hnowlan
[16:05:37] hnowlan: are we starting with codfw then?
[16:05:44] we said eqiad in the ticket
[16:05:50] oh, my bad
[16:06:35] eluke.y did anyway
[16:07:03] they're roughly equivalent in terms of traffic level so it's much of a muchness but let's follow the ticket
[16:07:35] depooling eqiad
[16:07:52] hnowlan: Thank you.
[16:10:05] FYI, https://grafana.wikimedia.org/goto/bn-RNjmHR?orgId=1 is what I have up for looking at mediawiki -> sessionstore latency measured at the downstream envoy
[16:10:48] (units are ms)
[16:11:10] nice
[16:11:17] * swfrench-wmf wishes this could display as a semi-log plot
[16:11:33] For badness I'm also watching https://grafana.wikimedia.org/goto/YAPVNCiHg?orgId=1 https://grafana.wikimedia.org/goto/-KZSHCmNR?orgId=1
[16:12:23] ah, good idea
[16:13:08] if things start showing up in those it's definitive time to panic
[16:15:57] alright, we're close to 0 on eqiad. I'll merge and apply there
[16:16:01] wait!
[16:16:04] did you wait?
[16:16:09] hnowlan: wait?
[16:16:23] yep
[16:16:29] urandom: sup?
[16:16:30] ok, we should baseline with siege first, no?
[16:16:36] ah, go for it
[16:17:10] ok, running
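The baseline run started here might look roughly like the sketch below. The log only confirms `-t 5m` and a urls.txt of generated keys/values; the concurrency and delay values are illustrative, not the ones actually used:

    # Replay generated key/value URLs against sessionstore for 5 minutes,
    # with 25 concurrent workers and a short random delay between requests.
    siege -f urls.txt -c 25 -d 1 -t 5m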
[16:19:44] hnowlan: when we get to that part, after you merge but before you run destroy, try a helmfile diff - I almost forgot to do this when migrating echostore, and it was only at that point that I uncovered the values layering surprises
[16:20:21] I set `-t 5m` (5 minutes), at least for the first run. The urls.txt is using generated keys/values, so it needs some warmup (probably not 5 minutes, but...)
[16:20:50] swfrench-wmf: ack
[16:21:44] actually, I'm starting to wonder if 5m is enough, I expected the 404 rate to have fallen by now
[16:25:30] take your time, things look reasonably stable
[16:26:20] I upped concurrency, and dropped the delay. I want to see the 404s go away, so that we're apples-to-apples when comparing
[16:36:40] hnowlan: Ok, I think we're good to go
[16:37:48] alright, great
[16:41:12] swfrench-wmf: the diff looks reasonable enough to me - do you want to check it out before I apply?
[16:42:44] as long as it renders, it's probably fine :)
[16:42:48] happy to take a look, though
[16:42:51] (doing)
[16:44:19] that looks good!
[16:44:31] cool :)
[16:45:24] alright, as expected the apply fails. doing the destroy
[16:46:44] hnowlan: Hey! Any objections to deploy this one: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064013
[16:47:11] nemo-yiannis: might need to come back to me on this one, in the middle of something a bit scary
[16:47:25] urandom: you should be good to go
[16:48:09] nemo-yiannis: it might need to wait until we're done here, sorry
[16:48:28] ok, i will revisit it tomorrow early my day, no worries
[16:50:35] siege is running btw
[16:50:46] I never did get the 404s to hit zero... which bugs me
[16:51:20] because we definitely should have run through the entire urls file (multiple times)
[16:51:56] to be clear, this wouldn't be related to the work at hand
[16:52:36] would hitting an invalid key consistently return 404s?
[16:52:52] seeing like 1/s now out of a total transaction rate of ~630/s
[16:53:26] hrmm... maybe it's a result of the averaging
[16:55:05] from operations: `FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts`
[16:55:10] looking
[16:55:59] hum, I understand that error but not how to fix it
[16:56:07] have the taints gone away?
[16:56:10] same
[16:56:10] kill the pods, let them respawn on the right nodes
[16:56:23] oh, that simple :D
[16:56:33] if they don't, might be worth checking why
[16:56:52] two of them are in the wrong place. I'll wait until the siege run is finished
[16:56:56] ah, yes ... the best-effortness of scheduling
[16:57:19] (did confirm the node affinities are still there)
[16:57:23] swfrench-wmf: yeah, save for an actual separate cluster, there's no real way to *force* pods to be scheduled on a specific set of nodes
[16:57:44] It will try to comply with your request, but if it can't it'll just put the pods where it wants to
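The fix discussed for the alert above is simply bouncing the mis-scheduled pods so the scheduler retries against the node affinity. A rough sketch; the namespace and pod name are placeholders, not taken from the log:

    # See which nodes the sessionstore pods landed on
    kubectl get pods -n sessionstore -o wide
    # Delete a mis-placed pod; the deployment recreates it, ideally on a dedicated node
    kubectl delete pod sessionstore-production-<pod-id> -n sessionstore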
[16:57:54] hnowlan, swfrench-wmf: so the latency histograms show a bit of regression
[16:58:03] claime: indeed, yeah
[16:58:21] urandom: oh dear. How bad a regression?
[16:58:36] https://grafana-rw.wikimedia.org/d/000001590/sessionstore?orgId=1&viewPanel=50
[16:59:12] wait, that's measured _in_ kask ... how could that regress?
[16:59:20] oh, right
[16:59:30] oh... because of the deploy
[16:59:39] yeah, kask is cold
[16:59:43] nm
[16:59:48] A cold kask sounds good rn
[16:59:54] :D
[16:59:59] urandom: so we just do it again?
[17:00:00] https://phabricator.wikimedia.org/T363996
[17:00:28] the before and after shows a difference too, but probably a cold kask there as well
[17:01:02] https://phabricator.wikimedia.org/T363996#10251270 v. https://phabricator.wikimedia.org/T363996#10251366
[17:01:12] that's less concerning, but is it worth trying now that it's warm?
[17:01:28] I mean, I could keep siege running for a bit
[17:01:40] if you have the time, and want to watch it for a bit
[17:02:37] I've got another hour or so
[17:02:45] it's been running this whole time, and is improving fwiw
[17:02:46] if the difference is that small I am willing to see what repooling does though
[17:02:58] yeah, I think it's find
[17:03:02] s/find/fine/
[17:03:43] alright, I'm happy to repool unless there's other objections (swfrench-wmf?)
[17:03:50] looking at the histogram, it does seem like it's converging to pre-depool behavior (e.g., the >5ms bucket is now basically the same as before)
[17:03:56] no objections on my end!
[17:04:16] lessgo
[17:05:07] ahh damnit, didn't fix those pods first
[17:06:01] hnowlan, swfrench-wmf: are we doing codfw after, or letting eqiad marinate for a day?
[17:06:11] (i vote for marinating)
[17:06:50] I'm okay with marinating, but I will need a change to undo the codfw-level mesh config
[17:06:53] which is no biggie
[17:07:22] is siege still running and if so should it be stopped as we're repooled
[17:07:31] I'll kill it
[17:07:44] thanks
[17:08:24] no objections to marinating, as it does give us a "quick" out of depooling (though if the concern is a latency regression, depooling is only a very temporary solution, as it adds 30ms+ to every call)
[17:12:56] to whatever degree I can trust this envoy metric, eqiad p50 latency is back to pre-depool values, and p95 is continuing to settle (not quite there yet)
[17:17:51] login timings look reasonable also
[17:32:42] I'm ok if you both want to continue, I have pretty high confidence. I just assumed that putting 12ish hours between the steps wouldn't change much, and would be even greater confidence
[17:33:08] s/be/provide/
[17:35:08] I'm in no rush at this hour my time
[17:53:35] all done for now - a change is in place to keep codfw config the same and eqiad is using the mesh.
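For reference, the depool/repool steps in this session are typically driven with conftool against the service's discovery record; a hedged sketch only, since the exact object selector used here is an assumption and not shown in the log:

    # Depool sessionstore in eqiad at the discovery layer before the migration step...
    confctl --object-type discovery select 'dnsdisc=sessionstore,name=eqiad' set/pooled=false
    # ...and repool it once the deploy has been verified
    confctl --object-type discovery select 'dnsdisc=sessionstore,name=eqiad' set/pooled=true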