[08:00:44] hi, the beta cluster is not working anymore due to the Etcd TLS certificate having expired yesterday
[08:01:02] that got filed as https://phabricator.wikimedia.org/T393855
[08:01:53] I don't know how Etcd is configured or how the certificate can be regenerated :-\
[08:02:34] maybe that is automatically regenerated by Puppet but it is not running anymore on that instance (missing `prometheus::instances_defaults`), which is the sub task
[08:05:14] sub task is prometheus::instances_defaults
[08:05:18] err, copy paste
[08:05:23] https://phabricator.wikimedia.org/T393866
[08:24:05] federico3: The CI job in your link is stuck because it does not come from a protected branch. The commit was made to the "devel-ci" branch. You should double-check https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/settings/repository#js-protected-branches-settings and make sure to merge into a protected branch, or use a protected tag, if you want to run jobs on the Trusted Runners.
[08:24:22] ah, thank you!
[08:58:36] hashar: o/ IIUC from your output, the TLS error part for etcd05 should be fixed when puppet runs (so it gathers the new cert etc.)
[11:45:29] I got kicked out of _security and -private (possibly because I changed my primary nick?), could someone please re-invite me?
[12:59:22] <_joe_> Raine: {{done}}
[12:59:27] <_joe_> sorry I didn't see it earlier
[13:23:25] ty _joe_ <3
[13:27:49] godog: fyi, it seems the cardinality reduction on editResponseTime isn't working yet, and the recording rules don't cover the dashboard needs. https://phabricator.wikimedia.org/T391677#10811591
[13:30:07] Krinkle: ack, I'll check and report back
[14:21:37] has there been some kind of rate-limiting applied to the LVS endpoints recently? I have a script I run against the Elastic endpoints (search.svc.eqiad.wmnet) before I reimage, and it does 5 or 6 API calls... I've noticed that I have to run it a few times to get it to finish
[14:22:42] not that I know of
[14:36:30] ACK, sounds like a "me" problem then ;)
[14:37:08] I'll hit one of the cirrussearch hosts directly and see if I get a different result
[14:47:36] this is up to date, right? https://wikitech.wikimedia.org/wiki/Maintenance_scripts i.e. mwscript-k8s is the preferred way to run a MediaWiki maintenance job, right?
[14:49:43] AFAIK yes
[14:50:01] I just want to call importImages, but I rarely use the CLI, so I wanted a double check
[14:50:49] I am the person that will recover images if I break them, so I was doubtful about the wrapping, as last time I did it we didn't have k8s
[14:50:56] *only
[15:13:45] I was able to run it, but now I have to understand why the maintenance script doesn't detect any files
[15:14:29] There is a helpful "If you receive errors about files not existing, try making the file world-readable.", but that didn't work for me
[15:17:58] jynus: it would be helpful to have the full invocation you tried
[15:19:05] I did: mwscript-k8s --comment="Reupload due to missing file - T393049" -- importImages.php --wiki=commonswiki --sleep=1 --comment-ext="Reupload due to missing file - T393049" --user="JCrespo_(WMF)" /tmp/T393049 --overwrite
[15:19:06] T393049: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049
[15:19:34] I got: "v7a-app\nImporting Files\nNo suitable files could be found for import"
[15:19:47] maybe it is looking for those inside the container?
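One way to check that suspicion is to look at the filesystem the maintenance pod actually sees. A minimal sketch, assuming kubectl access to the cluster where mwscript-k8s schedules its jobs; the `mw-script` namespace and the pod name placeholder are assumptions, not details from the log:

```bash
# Sketch only: confirm whether the path passed to importImages.php exists inside the pod.
# Namespace and pod name are assumptions; use whatever mwscript-k8s reports for your job.
kubectl get pods -n mw-script                                # find the pod created for the job
kubectl exec -n mw-script <pod-name> -- ls -l /tmp/T393049   # the path from the invocation above
```

If the directory is empty or absent inside the pod, the "No suitable files could be found for import" message is expected: the files only exist on the host where mwscript-k8s was invoked, which is what the next reply confirms.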
[15:20:20] yeah, that won't work for now, because the image file is not in the container
[15:20:43] any suggestion for forcing an overwrite of a file on commons?
[15:20:44] We have a way to pass a file (text) through stdin, but I'd make a phab task for us to support that use case (cc rzl)
[15:21:08] I don't really need to run anything, just force the write - other than writing to swift directly
[15:21:13] the plan of record is to not do this with mwscript-k8s but build something else - https://phabricator.wikimedia.org/T377497
[15:21:23] that's ok
[15:21:24] we haven't talked about it in a while though, we should probably revisit
[15:21:47] not needing it, any hack for a one-time fix?
[15:21:54] that you can think of?
[15:22:22] jynus: for today, use mwscript on mwmaint despite the deprecation warning
[15:22:22] I can handle swift, I just would prefer not to
[15:22:26] ah, I see
[15:22:43] I thought it was hard-deprecated, that helps!
[15:22:44] a medium-term patch might be to dump it into a persistentvolume and let the script read from that, but I still don't love it
[15:22:54] yep, no worries :-D
[15:23:04] thank you
[15:23:29] it will be harder-deprecated later today, but we'll post an answer to this
[15:23:54] It's just that I rarely use these scripts, so I catch up on a lot of things when I do, and I was a bit overwhelmed
[15:24:01] thanks for the help
[15:26:15] it worked nicely now
[15:27:04] actually, it didn't, but it is not an infra issue
[15:36:05] wait, maybe it did work?
[15:37:53] could someone from another continent confirm they can see https://upload.wikimedia.org/wikipedia/commons/2/22/Yankees_Baseball_%282%29_%2810561961695%29.jpg ?
[15:38:38] the script said "skipped" but either someone else did something, or the "hitting it hard until it works" worked, despite no logs
[15:39:20] jynus: a good use case for tunnelencabulator :D
[15:39:45] you can see the file, then?
[15:40:03] I can, but like you, I also hit eqiad
[15:40:08] ah, true
[15:40:28] let me confirm on swift, just to be sure
[15:40:29] if you have the wmf-sre-laptop package installed you can `tunnelencabulator -d codfw` and try it
[15:41:07] I think the mw script complained, but it actually did the thing, which has happened in the past (a bug that does the right thing)
[15:42:00] (and also would explain why the images disappear)
[15:52:56] FWIW I can see that file and I'm hitting esams (not sure which continent jynus is on)
[15:54:08] yeah, I actually meant hitting codfw, my bad, but no worries, I went directly to swift and checked on the backend
[15:54:24] ok :)
[15:54:24] on both DCs (eqiad and codfw)
[15:54:30] thanks for the help everybody
[17:36:12] cwhite: the new grafana version supports switching panels from X to Mixed and from Mixed back to X without losing individual metrics :)
[17:36:30] e.g. graphite -> Mixed (to start a conversion), or Mixed -> Thanos (to finish a conversion)
[17:54:04] Wonderful!
[21:40:34] `cirrussearch2091` has been down for at least 2 weeks, and PyBal didn't depool it (ref https://www.irccloud.com/pastebin/9NLBjlPS/ ). We just manually depooled it, but has anyone seen that happen before? We're just using the default health check (ref https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#2158 )
[21:40:38] cc ryankemper
[21:41:00] We're still investigating, but we think there may be other hosts w/ the same issue
[22:25:07] inflatador: can you expand what you mean by "PyBal didn't depool it"?
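For reference, the manual depool mentioned at 21:40 is normally done through conftool, whose etcd state is what config-master mirrors. A minimal sketch, assuming standard confctl selectors; nothing here was run as shown in the log:

```bash
# Sketch only: manually depool a backend in etcd via conftool (run from a cluster management host).
# The selector is an assumption based on the hostname discussed above; inspect state before changing it.
sudo confctl select 'name=cirrussearch2091.codfw.wmnet' get            # show current pooled/weight state
sudo confctl select 'name=cirrussearch2091.codfw.wmnet' set/pooled=no  # mark the host depooled in etcd
```

Note that this only changes the desired state stored in etcd; whether PyBal actually excludes a failing-but-pooled host is a separate, health-check-driven decision, which is exactly the distinction clarified in the next messages.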
[22:26:09] swfrench-wmf I'd expect it would be set to `enabled: false` on config-master if it were automatically depooled by a failed healthcheck. At least that's what I think I've seen in the past?
[22:28:49] inflatador: ah, got it. so, what's on config-master is just a snapshot of etcd state (i.e., only reflects what you've explicitly done w/ `confctl`)
[22:30:05] if the host is down, but was inadvertently left pooled, then the fact that pybal will exclude the host if health checks are failing (if permitted by depool_threshold) will not be reflected there
[22:30:06] Damn, I guess I made a bad assumption there. How can I see which nodes are considered healthy, then?
[22:33:18] so, at least as of right now, it _seems_ like they're all healthy given that there's no `PyBal backends health check` alert firing
[22:34:34] Do you see any older alerts for `cirrussearch2091`? I must not be getting pybal alerts at all
[22:35:27] 2091 is not pooled, so the alert would not fire for that
[22:35:53] ah, you mean to ask, are there any hosts that are down (from the standpoint of health checks) but pooled?
[22:35:58] It was pooled and hard down for at least a couple of weeks
[22:36:30] yeah, I assumed that when a server failed health checks, that would be reflected in config-master
[22:37:24] but I was wrong about that and I don't seem to be getting any alerts for PyBal backends (at least based on email and IRC highlights)
[22:38:44] anyway, nothing urgent. I have to step out but I appreciate you setting me straight re: config-master
[22:40:32] swfrench-wmf now that I think about it, cirrussearch2091 alerts have been suppressed for a while, so that might explain it. Sorry to bug you on this
[22:42:36] inflatador: you're good - thanks for noticing in the first place :) alas, the best info I have at the moment is that there are 4 cirrussearch hosts in codfw consistently failing pybal health checks, but none are pooled
[22:43:07] (so pybal is happy with that, and has disabled those hosts as backends)
[22:43:40] * swfrench-wmf needs to drop to run an errand
[22:44:54] Thanks again! I'll take a look at the 'down but not pooled' alerts and see if we can use those concepts in a dashboard or something
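Following up on the "down but not pooled" idea, one rough way to compare the two views is sketched below; the pool/service names, the LVS hostname placeholder, and the instrumentation port are assumptions, not details confirmed in the discussion above:

```bash
# Sketch only: compare desired state (etcd, as mirrored by config-master) with PyBal's live health view.
# Pool name "search"/"search_9200", the LVS host, and port 9090 are assumptions; adjust to the actual service definition.
curl -s https://config-master.wikimedia.org/pybal/codfw/search | grep cirrussearch2091    # desired state only (pooled/weight)
curl -s http://<lvs-host>:9090/pools/search_9200 | grep cirrussearch2091                  # PyBal's per-backend enabled/up/pooled status
```

A dashboard or alert built on the PyBal-side view, rather than on the config-master snapshot, would catch the pooled-but-failing case that started this thread.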