[08:52:44] I'll be removing the firewall connection tracking alert from Icinga. Sadly that will create a little noise (unknown conntrack_table_size service) in Icinga. It's expected to go away after a few Puppet runs. I'll attempt to help it along to shorten the time span.
[08:57:08] slyngs: can you downtime it first?
[08:57:24] It's on pretty much every single host with a firewall
[08:57:28] but if not, it's not a big deal either
[08:58:41] I talked to o11y last week, and sadly there's not really a good way to remove the check without rewriting parts of the Puppet code.
[09:31:10] All done, I think.
[13:00:29] heads up, in ~30 min I'll be moving k8s prometheus instances from prometheus2005 to prometheus2007 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126934
[13:00:35] no impact expected
[13:19:03] ack
[13:47:54] jclark-ctr: db1257 is in site.pp
[14:02:02] ok I'm trying to debug why prometheus2007 / 10.192.9.11 can't talk to k8s pods e.g. thumbor at 10.194.162.84:8084
[14:03:10] actually nevermind, double checking
[14:17:47] that was my bad btw, all good
[15:05:17] I have a reimage slightly stuck at [16/50, retrying in 48.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title ms-be2075 not found yet
[15:05:28] ...is there something amiss?
[15:06:22] does the role exist in site.pp?
[15:06:44] Emperor: https://puppetboard.wikimedia.org/report/ms-be2075.codfw.wmnet/2c26286d3c416560c5c0baca67dd5026805199de
[15:06:54] puppet fails
[15:07:32] volans: that looks like a failure of puppetserver1001?
[15:08:06] sukhe: yeah, all ms-be2* nodes have role(swift::storage)
[15:08:15] if you ssh on the host via install_console from a cumin host
[15:08:16] this isn't a new node, it's just one that has had a lot of h/w work done
[15:08:17] try to run this:
[15:09:00] puppet agent -t --noop &> /dev/null
[15:09:15] (or without the redirect if it fails quickly, depending how much output you need)
[15:09:25] * volans in a meeting too
[15:10:10] presumably host in codfw should be able to talk to puppetserver in eqiad, yes?
[15:10:22] [that puppet agent rune has been sitting for a minute or so now]
[15:11:46] finished, exit code 4
[15:12:49] Emperor: did the cookbook resume?
[15:13:10] I see it on puppetboard as a normal noop run
[15:13:28] Applied catalog in 29.87 seconds
[15:13:38] ah, yes now the cookbook has continued
[15:14:24] thanks volans, nice magic :)
[15:14:26] so yeah I guess a transient failure of puppet, never seen it
[16:04:57] hi folks, do you have any concerns about this patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1128435: GlobalContributions: Use higher concurrency for foreign wiki requests
[16:08:40] <_joe_> mszabo: help me understand
[16:08:59] <_joe_> because in abstract I'd say "a lot of concerns"
[16:09:06] <_joe_> where is this code executed?
[16:09:26] <_joe_> in a live request? in a job? and it can fan out up to 250 concurrent requests?
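[editor's note] The concern raised above is about unbounded fan-out from a single request. As a rough illustration only (this is not the CheckUser code, which is PHP), here is a minimal Python sketch of capping in-flight lookups with a semaphore. The names `fetch_permissions` and `fetch_all`, the default limit of 10, and the stubbed lookup are all assumptions made for the example.

```python
import asyncio


async def fetch_permissions(wiki: str) -> dict:
    # Hypothetical stand-in for a real cross-wiki permission lookup.
    await asyncio.sleep(0.01)
    return {"wiki": wiki, "allowed": True}


async def fetch_all(wikis, max_concurrency=10, fetch=fetch_permissions):
    # The semaphore caps how many lookups are in flight at once,
    # regardless of how many wikis are requested.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(wiki):
        async with sem:
            return await fetch(wiki)

    # gather() returns results in the same order as the input list.
    return await asyncio.gather(*(bounded(w) for w in wikis))


if __name__ == "__main__":
    results = asyncio.run(fetch_all([f"wiki{i}" for i in range(25)]))
    print(len(results), "lookups completed")
```

With a cap like this, a request touching 250 wikis still issues 250 lookups in total, but never more than `max_concurrency` at a time; whether that is acceptable in a live request is the open question in the discussion.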
[16:29:56] _joe_: so this is the backing system for https://meta.wikimedia.org/w/index.php?title=Special:GlobalContributions/Mr._Tamarize
[16:30:40] it presently needs to fetch local permissions for each wiki being looked up because there is currently no other mechanism to perform permission checks in the context of a foreign wiki
[16:31:51] we'll likely be adding MC around these as a next step, although the efficacy of that may be limited since it'd be a per-user per-wiki cache
[16:32:16] <_joe_> mszabo: I have so many questions :)
[16:32:33] <_joe_> but I'm in a meeting, ping me tomorrow "morning"
[16:32:51] <_joe_> simply put: as written, yes that is an absolute no-no
[16:32:53] _joe_: I can type out some more details now and you can process them / get back at your convenience if that works
[16:33:25] <_joe_> but the problem is deeper, I think we need to re-think how that's implemented
[16:33:29] <_joe_> and, maybe not here
[20:48:57] in attempting to add a package, it seems reprepro is in a bad state "Error: packages database contains unused 'bullseye-wikimedia|component/pcre2|amd64' database.". Is this a known issue?
[20:59:30] urandom, swfrench-wmf: ^ related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125539 maybe?
[21:00:07] urandom: ah, whoops - that is definitely my fault!
[21:00:40] I didn't realize there was a timing constraint on running the clearvanished
[21:00:44] I can do that now
[21:01:16] no worries, I added my package
[21:01:21] rzl: urandom: are either of you around in case this explodes in my face?
[21:01:31] :)
[21:01:34] swfrench-wmf: to render first aid?
[21:01:48] heh, I just mean for assistance with apt repo debugging
[21:02:12] yeah, happy to help (capable might be another issue)
[21:02:16] urandom: just to confirm, the include succeeded, but emitted the above error?
[21:03:05] I ignored it using --ignore=undefinedtarget
[21:03:15] I'm around but I'm definitely more afraid of reprepro than it is of me
[21:04:41] lol
[21:04:46] urandom: ah, that makes sense
[21:04:55] cool, well I'll give this a try now
[21:05:08] anyone happen to know if `clearvanished` has a dry-run mode?
[21:07:59] aaand nevermind
[21:08:04] all done
[21:08:34] (I realized that if there _were_ changes other than my component/pcre2 deletion lurking, we would already have been seeing errors)
[21:09:23] verified that fixed the error with a no-op `checkupdate`
[21:09:31] nice
[21:09:32] apologies again for the noise :)
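[editor's note] The fix described above was two reprepro steps: `clearvanished` to drop package databases no longer listed in the distribution config, then `checkupdate` as a no-op verification. A minimal Python sketch of that sequence follows, assuming reprepro's `-b` base-directory option; the base directory path and the helper names are hypothetical, and the real host may need a different user or environment.

```python
import subprocess


def reprepro_cleanup_commands(basedir):
    # Build the two invocations mentioned in the log, in order:
    # first clearvanished, then checkupdate as a verification pass.
    base = ["reprepro", "-b", basedir]
    return [base + ["clearvanished"], base + ["checkupdate"]]


def run_all(commands, runner=subprocess.run):
    # Run each command in order; check=True makes subprocess.run raise
    # on a non-zero exit, so a failed clearvanished stops the sequence.
    for cmd in commands:
        runner(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical base directory; only print the commands here,
    # since clearvanished has no dry-run mode confirmed in the log.
    for cmd in reprepro_cleanup_commands("/srv/example-apt-repo"):
        print(" ".join(cmd))
```

Passing the runner in as a parameter keeps the sequencing testable without touching a real repository.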