[05:39:39] <_joe_>	 re: last night, we had the same issue on kafka when going eqsin -> eqiad
[05:40:03] <_joe_>	 so there's definitely something that's not codfw-specific 
[05:40:25] <_joe_>	 topranks: ^^ for whenever you're online next (I hope you properly rest)
[06:58:58] <topranks>	 _joe_: huh ok
[06:59:50] <topranks>	 do you mean we had the same issue to eqiad and it also away when we made the transport path change
[07:00:09] <topranks>	 *went away
[07:09:54] <topranks>	 yep I can see a point mid-afternoon where we moved the purged backend to eqiad?  and yep same exact symptoms 
[07:10:16] <topranks>	 https://grafana.wikimedia.org/goto/jfZcCqsHR
[07:11:33] <topranks>	 thanks for the info, probably doesn't radically change the thinking but an interesting data-point 
[07:13:32] <topranks>	 eqiad -> eqsin will route back via codfw and traverse the same link 
[07:13:32] <_joe_>	 ah ok
[07:13:32] <_joe_>	 that's the one thing I wasn't sure about
[07:13:32] <topranks>	 but it is a different set of IPs, so if it's an ECMP thing with the carrier that is interesting we see the same 
[08:03:40] <Krinkle>	 I could use a pointer or two on writing a varnish test for GeoIP. I left a comment at https://gerrit.wikimedia.org/r/1168038 
[09:21:42] <jynus>	 btullis: not causing issues on production, but I belive https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167878 may have made puppet fail on your hosts
[09:22:48] <jynus>	 (I was debugging why there was widespread puppet failures, but it seems it is only on hadoop hosts)
[09:23:17] <btullis>	 jynus: Ah, yes. Sorry about that. I have silenced the 52 systemd failure alerts, but I had forgotten that there would be puppetfailure alerts, too. Thanks for the pointer.
[09:23:51] <jynus>	 oh, no need to be sorry
[09:24:05] <jynus>	 just in case you weren't aware
[09:24:12] <btullis>	 I had hoped to fix puppet so that it wouldn't alert when setting these hosts to decommisioning, but it was tricky, so I skipped it. :-)
[09:24:36] <btullis>	 Thanks for the heads-up, though.
[09:54:38] <hnowlan>	 I'm going to deploy a changeprop change to deal with an issue that's affecting mobile content freshness. not ideal for a friday but necessary (cc tappof akosiaris)
[09:55:35] <tappof>	 ack hnowlan 
[10:02:22] <akosiaris>	 ack
[11:33:41] <andrewbogott>	 I upgraded several 5 ceph servers to Boookworm yesterday, and and least three of them have locked up with messages like:
[11:33:43] <andrewbogott>	 md: resync of RAID array md1
[11:34:06] <andrewbogott>	 in this case md1 is a sw raid 1 containing the swap partition
[11:34:12] <andrewbogott>	 Is that issue familiar to anyone?
[11:41:04] <Emperor>	 locked up because doing rebuild of the swap is killing swap performance? Do you have headroom to disable swap until the rebuilt finishes?
[11:43:14] <andrewbogott>	 Yes, but I don't trust it to not happen again so the 'fix' is probably to roll back to bullseye
[12:07:33] <Emperor>	 I checked my bookworm ceph nodes, but they don't RAID the swap partition (/ is on a sw-raid, but the swap partitions aren't RAIDed together)
[12:40:12] <andrewbogott>	 that's interesting, in theory I'm using the same partman recipe as you I think...
[12:40:45] <andrewbogott>	 which speaking of partman, I'm trying to revert this host to bullseye but now the debian installer won't pass the partitioning phase :(
[12:41:08] <andrewbogott>	 in my experience this recipe only works with a new drive, it fails if it finds existing partitions or sw raids
[12:44:20] <Emperor>	 it's tricky if you've got systems where you need to definitely-not-wipe some drives
[12:46:34] <Emperor>	 andrewbogott: I don't think so - my ceph nodes are using partman/custom/cephadm_raid1_leavelvm.cfg or partman/custom/boss_leavelvm.cfg and you've got (I think) partman/custom/boss_leavelvm.cfg or partman/custom/cephosd.cfg
[12:47:04] <andrewbogott>	 oh yeah, I'm using partman/custom/cephosd.cfg
[13:50:02] <akosiaris>	 andrewbogott: how can I help?
[13:51:38] <andrewbogott>	 I think we have a fix underway, but it involves partman so testing is tedious. Discussion is mostly in #wikimedia-ceph.
[13:52:00] <akosiaris>	 ok, joining there
[13:52:00] <andrewbogott>	 Meanwhile beta is down which I suspect is unrelated but I'm about to start looking at that.
[13:52:03] <andrewbogott>	 thx!
[15:25:33] <sukhe>	 heads-up: topranks and I are going to depool eqsin so that topranks can gather some network data to report to Arelion (the primary eqsin -> codfw link).
[15:26:00] <sukhe>	 this is to debug the purged issue we saw yesterday in T399221
[15:28:09] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[15:28:09] <sukhe>	 topranks: ^ done
[15:28:09] <sukhe>	 TTL is 180s but let's give it another 10mins or so 
[15:28:09] <topranks>	 sure 
[15:28:11] <topranks>	 thanks :) 
[15:28:21] <sukhe>	 $deityspeed!
[15:31:56] <topranks>	 hahaha 
[15:55:13] <andrewbogott>	 how can I bypass a cookbook reimage lockfile? (and bonus points, is that documented someplace? I cannot find it if so)
[15:57:01] <andrewbogott>	 ah, it's --no-locks but you have to pass it before the cookbook name
[16:50:29] <sukhe>	 topranks has finished his testing; we will repool eqsin shortly
[17:10:50] <sukhe>	 (repooled)
[17:34:49] <Krinkle>	 sukhe: are fridays fine for beta-specific puppet merges? I don't mean the varnish change which is prod code that should be a no-op for prod but is still prod code. I mean the other ones here: https://gerrit.wikimedia.org/r/q/topic:%22beta-wmcloud%22 
[17:38:07] <sukhe>	 Krinkle: I don't think we have a defined policy yet at least for beta && varnish changes. (we are actively working on that and including the ownership of that in beta). 
[17:38:48] <Krinkle>	 The other ones don't change varnish or other prod code. Each touched file is exclusively loaded in beta.
[17:38:52] <sukhe>	 it's pretty much a case by case basis for now; we typically don't do these on Friday, or any other CDN stuff, unless UBN or other emergency (such as fr-tech requests, even that was once)
[17:39:53] <sukhe>	 Krinkle: ah ok, I misread your "varnish change" thing above
[17:40:03] <sukhe>	 yeah not sure about those tbh.
[17:42:36] <sukhe>	 I think the ownership of beta is the question here. at least for that and how it relates to the CDN changes, we are trying to come up with a more formal policy for T358887
[17:42:37] <stashbot>	 T358887: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes - https://phabricator.wikimedia.org/T358887
[17:42:56] <sukhe>	 for the rest of the stuff in that change -- only one of which is a varnish prod change -- I am really not sure
[17:43:01] <sukhe>	 s/change/chain
[19:48:59] <Krinkle>	 They've been deployed in beta for a few days already. This is mostly a permission issue, not policy or ownership. I've cherry picked them on the puppetmaster but need review/merge in the repo since we generally limit puppet.git to production roots. I do have root in a number of production services (perf-roots, varnish, memc, mw) but not as an SRE root. And I do appreciate code review of course :)
[19:49:16] <Krinkle>	 the beta puppetserver, that is.
[20:03:07] <sukhe>	 yeah I meant ownership in the context of reviews/merges too. happy to take care of it on Monday, as I am heading out now.
[20:03:41] <sukhe>	 (will do Monday morning)