[08:00:44] hi, the beta cluster is not working anymore due to the Etcd TLS certificate having expired yesterday
[08:01:02] that got filed as https://phabricator.wikimedia.org/T393855
[08:01:53] I don't know how Etcd is configured or how the certificate can be regenerated :-\
[08:02:34] maybe that is automatically regenerated by Puppet but it is not running anymore on that instance (missing `prometheus::instances_defaults`), which is the sub task
[08:05:14] sub task is prometheus::instances_defaults
[08:05:18] err, copy paste
[08:05:23] https://phabricator.wikimedia.org/T393866
[08:24:05] federico3: The CI job in your link is stuck because it does not come from a protected branch. The commit was made to the "devel-ci" branch. You should double-check https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/settings/repository#js-protected-branches-settings and make sure to merge into a protected branch, or use a protected tag, if you want to run jobs on the Trusted Runners.
[08:24:22] ah, thank you!
[08:58:36] hashar: o/ IIUC from your output, the TLS error part for etcd05 should be fixed when puppet runs (so it gathers the new cert etc.)
[11:45:29] I got kicked out of _security and -private (possibly because I changed my primary nick?), could someone please re-invite me?
[12:59:22] <_joe_> Raine: {{done}}
[12:59:27] <_joe_> sorry I didn't see it earlier
[13:23:25] ty _joe_ <3
[13:27:49] godog: fyi, it seems the cardinality reduction on editResponseTime isn't working yet, and the recording rules don't cover the dashboard needs. https://phabricator.wikimedia.org/T391677#10811591
[13:30:07] Krinkle: ack, I'll check and report back
[14:21:37] has there been some kind of rate-limiting applied to the LVS endpoints recently? I have a script I run against the Elastic endpoints (search.svc.eqiad.wmnet) before I reimage, and it does 5 or 6 API calls... I've noticed that I have to run it a few times to get it to finish
[14:22:42] not that I know of
[14:36:30] ACK, sounds like a "me" problem then ;)
[14:37:08] I'll hit one of the cirrussearch hosts directly and see if I get a different result
[14:47:36] this is up to date, right? https://wikitech.wikimedia.org/wiki/Maintenance_scripts i.e. mwscript-k8s is the preferred way to run a MediaWiki maintenance job, right?
[14:49:43] AFAIK yes
[14:50:01] I just want to call importImages, but I rarely use the CLI, so I wanted a double check
[14:50:49] I am the person that will recover images if I break them, so I was doubtful about the wrapping, as last time I did it we didn't have k8s
[14:50:56] *only
[15:13:45] I was able to run it, but now I have to understand why the maintenance script doesn't detect any files
[15:14:29] There is a helpful "If you receive errors about files not existing, try making the file world-readable.", but that didn't work for me
[15:17:58] jynus: it would be helpful to have the full invocation you tried
[15:19:05] I did: mwscript-k8s --comment="Reupload due to missing file - T393049" -- importImages.php --wiki=commonswiki --sleep=1 --comment-ext="Reupload due to missing file - T393049" --user="JCrespo_(WMF)" /tmp/T393049 --overwrite
[15:19:06] T393049: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049
[15:19:34] I got: "v7a-app\nImporting Files\nNo suitable files could be found for import"
[15:19:47] maybe it is looking for those inside the container?
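One way to check that suspicion is to look at the filesystem the maintenance pod actually sees. A minimal sketch, assuming kubectl access to the cluster where mwscript-k8s schedules its jobs; the `mw-script` namespace and the pod name placeholder are assumptions, not details from the log:

```bash
# Sketch only: confirm whether the path passed to importImages.php exists inside the pod.
# Namespace and pod name are assumptions; use whatever mwscript-k8s reports for your job.
kubectl get pods -n mw-script                                # find the pod created for the job
kubectl exec -n mw-script <pod-name> -- ls -l /tmp/T393049   # the path from the invocation above
```

If the directory is empty or absent inside the pod, the "No suitable files could be found for import" message is expected: the files only exist on the host where mwscript-k8s was invoked, which is what the next reply confirms.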
[15:20:20] yeah, that won't work for now, because the image file is not in the container
[15:20:43] any suggestion for forcing an overwrite of a file on commons?
[15:20:44] We have a way to pass a file (text) through stdin, but I'd make a phab task for us to support that use case (cc rzl)
[15:21:08] I don't really need to run anything, just force the write - other than writing to swift directly
[15:21:13] the plan of record is to not do this with mwscript-k8s but build something else - https://phabricator.wikimedia.org/T377497
[15:21:23] that's ok
[15:21:24] we haven't talked about it in a while though, we should probably revisit
[15:21:47] not needing it, any hack for a one-time fix?
[15:21:54] that you can think of?
[15:22:22] jynus: for today, use mwscript on mwmaint despite the deprecation warning
[15:22:22] I can handle swift, I just would prefer not to
[15:22:26] ah, I see
[15:22:43] I thought it was hard-deprecated, that helps!
[15:22:44] a medium-term patch might be to dump it into a persistentvolume and let the script read from that, but I still don't love it
[15:22:54] yep, no worries :-D
[15:23:04] thank you
[15:23:29] it will be harder-deprecated later today, but we'll post an answer to this
[15:23:54] It's just that I rarely use these scripts, so I catch up on a lot of things when I do, and I was a bit overwhelmed
[15:24:01] thanks for the help
[15:26:15] it worked nicely now
[15:27:04] actually, it didn't, but it is not an infra issue
[15:36:05] wait, maybe it did work?
[15:37:53] could someone from another continent confirm they can see https://upload.wikimedia.org/wikipedia/commons/2/22/Yankees_Baseball_%282%29_%2810561961695%29.jpg ?
[15:38:38] the script said "skipped" but either someone else did something, or the "hitting it hard until it works" worked, despite no logs
[15:39:20] jynus: a good use case for tunnelencabulator :D
[15:39:45] you can see the file, then?
[15:40:03] I can, but like you, I also hit eqiad
[15:40:08] ah, true
[15:40:28] let me confirm on swift, just to be sure
[15:40:29] if you have the wmf-sre-laptop package installed you can `tunnelencabulator -d codfw` and try it
[15:41:07] I think the mw script complained, but it actually did the thing, which has happened in the past (a bug that does the right thing)
[15:42:00] (and also would explain why the images disappear)
[15:52:56] FWIW I can see that file and I'm hitting esams (not sure which continent jynus is on)
[15:54:08] yeah, I actually meant hitting codfw, my bad, but no worries, I went directly to swift and checked on the backend
[15:54:24] ok :)
[15:54:24] on both DCs (eqiad and codfw)
[15:54:30] thanks for the help everybody
[17:36:12] cwhite: the new grafana version supports switching panels from X to Mixed and from Mixed back to X without losing individual metrics :)
[17:36:30] e.g. graphite -> Mixed (to start a conversion), or Mixed -> Thanos (to finish a conversion)
[17:54:04] Wonderful!
[21:40:34] `cirrussearch2091` has been down for at least 2 weeks, and PyBal didn't depool it (ref https://www.irccloud.com/pastebin/9NLBjlPS/ ). We just manually depooled it, but has anyone seen that happen before? We're just using the default health check (ref https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#2158 )
[21:40:38] cc ryankemper
[21:41:00] We're still investigating, but we think there may be other hosts w/ the same issue
[22:25:07] inflatador: can you expand what you mean by "PyBal didn't depool it"?
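For reference, the manual depool mentioned at 21:40 is normally done through conftool, whose etcd state is what config-master mirrors. A minimal sketch, assuming standard confctl selectors; nothing here was run as shown in the log:

```bash
# Sketch only: manually depool a backend in etcd via conftool (run from a cluster management host).
# The selector is an assumption based on the hostname discussed above; inspect state before changing it.
sudo confctl select 'name=cirrussearch2091.codfw.wmnet' get            # show current pooled/weight state
sudo confctl select 'name=cirrussearch2091.codfw.wmnet' set/pooled=no  # mark the host depooled in etcd
```

Note that this only changes the desired state stored in etcd; whether PyBal actually excludes a failing-but-pooled host is a separate, health-check-driven decision, which is exactly the distinction clarified in the next messages.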
[22:26:09] swfrench-wmf I'd expect it would be set to `enabled: false` on config-master if it were automatically depooled by a failed healthcheck. At least that's what I think I've seen in the past?
[22:28:49] inflatador: ah, got it. so, what's on config-master is just a snapshot of etcd state (i.e., only reflects what you've explicitly done w/ `confctl`)
[22:30:05] if the host is down, but was inadvertently left pooled, then the fact that pybal will exclude the host if health checks are failing (if permitted by depool_threshold) will not be reflected there
[22:30:06] Damn, I guess I made a bad assumption there. How can I see which nodes are considered healthy, then?
[22:33:18] so, at least as of right now, it _seems_ like they're all healthy given that there's no `PyBal backends health check` alert firing
[22:34:34] Do you see any older alerts for `cirrussearch2091`? I must not be getting pybal alerts at all
[22:35:27] 2091 is not pooled, so the alert would not fire for that
[22:35:53] ah, you mean to ask, are there any hosts that are down (from the standpoint of health checks) but pooled?
[22:35:58] It was pooled and hard down for at least a couple of weeks
[22:36:30] yeah, I assumed that when a server failed health checks, that would be reflected in config-master
[22:37:24] but I was wrong about that and I don't seem to be getting any alerts for PyBal backends (at least based on email and IRC highlights)
[22:38:44] anyway, nothing urgent. I have to step out but I appreciate you setting me straight re: config-master
[22:40:32] swfrench-wmf now that I think about it, cirrussearch2091 alerts have been suppressed for a while, so that might explain it. Sorry to bug you on this
[22:42:36] inflatador: you're good - thanks for noticing in the first place :) alas, the best info I have at the moment is that there are 4 cirrussearch hosts in codfw consistently failing pybal health checks, but none are pooled
[22:43:07] (so pybal is happy with that, and has disabled those hosts as backends)
[22:43:40] * swfrench-wmf needs to drop to run an errand
[22:44:54] Thanks again! I'll take a look at the 'down but not pooled' alerts and see if we can use those concepts in a dashboard or something
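Following up on the "down but not pooled" idea, one rough way to compare the two views is sketched below; the pool/service names, the LVS hostname placeholder, and the instrumentation port are assumptions, not details confirmed in the discussion above:

```bash
# Sketch only: compare desired state (etcd, as mirrored by config-master) with PyBal's live health view.
# Pool name "search"/"search_9200", the LVS host, and port 9090 are assumptions; adjust to the actual service definition.
curl -s https://config-master.wikimedia.org/pybal/codfw/search | grep cirrussearch2091    # desired state only (pooled/weight)
curl -s http://<lvs-host>:9090/pools/search_9200 | grep cirrussearch2091                  # PyBal's per-backend enabled/up/pooled status
```

A dashboard or alert built on the PyBal-side view, rather than on the config-master snapshot, would catch the pooled-but-failing case that started this thread.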