[05:39:39] <_joe_> re: last night, we had the same issue on kafka when going eqsin -> eqiad
[05:40:03] <_joe_> so there's definitely something that's not codfw-specific
[05:40:25] <_joe_> topranks: ^^ for whenever you're online next (I hope you properly rest)
[06:58:58] _joe_: huh ok
[06:59:50] do you mean we had the same issue to eqiad, and it also went away when we made the transport path change?
[07:09:54] yep, I can see a point mid-afternoon where we moved the purged backend to eqiad? and yep, same exact symptoms
[07:10:16] https://grafana.wikimedia.org/goto/jfZcCqsHR
[07:11:33] thanks for the info, probably doesn't radically change the thinking but it's an interesting data point
[07:13:32] eqiad -> eqsin will route back via codfw and traverse the same link
[07:13:32] <_joe_> ah ok
[07:13:32] <_joe_> that's the one thing I wasn't sure about
[07:13:32] but it is a different set of IPs, so if it's an ECMP thing with the carrier it's interesting that we see the same symptoms
[08:03:40] I could use a pointer or two on writing a varnish test for GeoIP. I left a comment at https://gerrit.wikimedia.org/r/1168038
[09:21:42] btullis: not causing issues on production, but I believe https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167878 may have made puppet fail on your hosts
[09:22:48] (I was debugging why there were widespread puppet failures, but it seems it is only on hadoop hosts)
[09:23:17] jynus: Ah, yes. Sorry about that. I have silenced the 52 systemd failure alerts, but I had forgotten that there would be puppet failure alerts, too. Thanks for the pointer.
[09:23:51] oh, no need to be sorry
[09:24:05] just in case you weren't aware
[09:24:12] I had hoped to fix puppet so that it wouldn't alert when setting these hosts to decommissioning, but it was tricky, so I skipped it. :-)
[09:24:36] Thanks for the heads-up, though.
[09:54:38] I'm going to deploy a changeprop change to deal with an issue that's affecting mobile content freshness. not ideal for a Friday but necessary (cc tappof akosiaris)
[09:55:35] ack hnowlan
[10:02:22] ack
[11:33:41] I upgraded 5 ceph servers to Bookworm yesterday, and at least three of them have locked up with messages like:
[11:33:43] md: resync of RAID array md1
[11:34:06] in this case md1 is a sw raid 1 containing the swap partition
[11:34:12] Is that issue familiar to anyone?
[11:41:04] locked up because the rebuild of the swap array is killing swap performance? Do you have headroom to disable swap until the rebuild finishes?
[11:43:14] Yes, but I don't trust it not to happen again, so the 'fix' is probably to roll back to bullseye
[12:07:33] I checked my bookworm ceph nodes, but they don't RAID the swap partition (/ is on a sw-raid, but the swap partitions aren't RAIDed together)
[12:40:12] that's interesting, in theory I'm using the same partman recipe as you, I think...
[12:40:45] which, speaking of partman, I'm trying to revert this host to bullseye but now the debian installer won't get past the partitioning phase :(
[12:41:08] in my experience this recipe only works with a new drive, it fails if it finds existing partitions or sw raids
[12:44:20] it's tricky if you've got systems where you need to definitely-not-wipe some drives
[12:46:34] andrewbogott: I don't think so - my ceph nodes are using partman/custom/cephadm_raid1_leavelvm.cfg or partman/custom/boss_leavelvm.cfg and you've got (I think) partman/custom/boss_leavelvm.cfg or partman/custom/cephosd.cfg
[12:47:04] oh yeah, I'm using partman/custom/cephosd.cfg
[13:50:02] andrewbogott: how can I help?
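(side note on the partman point just above: the recipe wanting a "new drive" is, per the 12:41 message, about it failing when it finds existing partitions or software RAIDs. Below is a minimal sketch of how leftover md metadata and signatures could be cleared by hand before a reinstall, assuming the disks in question are safe to wipe. The device names /dev/md1, /dev/sda2, /dev/sdb2, /dev/sda, /dev/sdb are placeholders, not the actual layout of these ceph hosts.)

    # Show any existing md arrays and their member devices (read-only check).
    cat /proc/mdstat

    # Stop the old array so its member partitions are released.
    mdadm --stop /dev/md1

    # Erase the md superblock from the former member partitions.
    mdadm --zero-superblock /dev/sda2 /dev/sdb2

    # Remove remaining partition-table/filesystem signatures from the disks
    # that are safe to reuse (destructive -- double-check the device names).
    wipefs -a /dev/sda /dev/sdb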
[13:51:38] I think we have a fix underway, but it involves partman so testing is tedious. Discussion is mostly in #wikimedia-ceph.
[13:52:00] ok, joining there
[13:52:00] Meanwhile beta is down, which I suspect is unrelated, but I'm about to start looking at that.
[13:52:03] thx!
[15:25:33] heads-up: topranks and I are going to depool eqsin so that topranks can gather some network data to report to Arelion (the primary eqsin -> codfw link).
[15:26:00] this is to debug the purged issue we saw yesterday in T399221
[15:28:09] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[15:28:09] topranks: ^ done
[15:28:09] TTL is 180s but let's give it another 10 mins or so
[15:28:09] sure
[15:28:11] thanks :)
[15:28:21] $deityspeed!
[15:31:56] hahaha
[15:55:13] how can I bypass a cookbook reimage lockfile? (and bonus points, is that documented someplace? I cannot find it if so)
[15:57:01] ah, it's --no-locks, but you have to pass it before the cookbook name
[16:50:29] topranks has finished his testing; we will repool eqsin shortly
[17:10:50] (repooled)
[17:34:49] sukhe: are Fridays fine for beta-specific puppet merges? I don't mean the varnish change, which should be a no-op for prod but is still prod code. I mean the other ones here: https://gerrit.wikimedia.org/r/q/topic:%22beta-wmcloud%22
[17:38:07] Krinkle: I don't think we have a defined policy yet, at least for beta && varnish changes. (we are actively working on that, including the question of beta ownership)
[17:38:48] The other ones don't change varnish or other prod code. Each touched file is loaded exclusively in beta.
[17:38:52] it's pretty much case by case for now; we typically don't do these on Friday, or any other CDN stuff, unless it's a UBN or other emergency (such as fr-tech requests, and even that was only once)
[17:39:53] Krinkle: ah ok, I misread your "varnish change" thing above
[17:40:03] yeah, not sure about those tbh.
[17:42:36] I think the ownership of beta is the question here. at least for that, and how it relates to the CDN changes, we are trying to come up with a more formal policy in T358887
[17:42:37] T358887: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes - https://phabricator.wikimedia.org/T358887
[17:42:56] for the rest of the stuff in that chain -- only one of which is a varnish prod change -- I am really not sure
[19:48:59] They've been deployed in beta for a few days already. This is mostly a permissions issue, not policy or ownership. I've cherry-picked them on the puppetmaster but need review/merge in the repo, since we generally limit puppet.git to production roots. I do have root in a number of production services (perf-roots, varnish, memc, mw) but not as an SRE root. And I do appreciate code review of course :)
[19:49:16] the beta puppetserver, that is.
[20:03:07] yeah, I meant ownership in the context of reviews/merges too. happy to take care of it on Monday, as I am heading out now.
[20:03:41] (will do Monday morning)
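(side note on the 15:55 lock question: per the 15:57 answer, --no-locks is a flag of the cookbook runner itself, so it has to come before the cookbook name rather than among the cookbook's own arguments. A hedged example of what such an invocation could look like -- the cookbook name, the --os value and the hostname are placeholders, not taken from this log.)

    # --no-locks goes before the cookbook name; everything after it is
    # illustrative and depends on the specific cookbook being run.
    sudo cookbook --no-locks sre.hosts.reimage --os bookworm example1001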
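(side note on the 19:48 cherry-picks: for reference, one common Gerrit pattern for pulling an unmerged change onto a local checkout, such as the puppet repo on the beta puppetserver, is fetch-by-ref plus cherry-pick. The checkout path and the refs/changes value below are placeholders; the real refs would come from the Gerrit topic linked at 17:34.)

    # Placeholder path -- change to wherever the local puppet checkout lives.
    cd /srv/puppet
    # Fetch the change by its Gerrit ref (NN = last two digits of the change
    # number, NNNNNN = change number, P = patchset) and apply it locally.
    git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/NN/NNNNNN/P
    git cherry-pick FETCH_HEAD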