[07:44:47] gehel: Yes I did, the back pressure was increasing so I tried to ease that.
[07:49:44] pfischer: thanks for the reactivity!
[07:50:00] I was surprised. But good!
[14:03:41] my monitor doesn't want to go back to work
[15:15:04] \o
[15:15:07] o/
[15:15:27] Happy New Year!
[15:16:00] Happy NY!
[15:16:01] indeed, you too! Everything go well over December?
[15:18:48] Thanks! With my SUP hat on, I’d say yes. However, it seems like the consumer is troubled by the number of update records: back pressure coming from the busy fetch operator. Nonetheless, the apps ran smoothly all December.
[15:20:23] sounds pretty good :) Will have to figure out something with the fetch, but should be solvable
[15:21:33] Yes, my current approach: partition the update topic and increase the task manager replicas, and therefore the parallelism.
[15:27:27] I also looked at the connection pool utilisation, but that metric appears rather volatile (Prometheus polls every 10s and that gauge changes within milliseconds). I wanted to test a larger HTTP connection pool, but annoyingly, helmfile --set does not encode numbers as strings, which in turn leads to an error that the args array should only consist of strings. So I haven’t tested that yet.
[15:30:23] Maybe some extra parameter that is then encoded via templating could work, like {{- range .Values.cli.args }} {{ . | quote }} {{- end }}
[15:31:02] hmm, ya perhaps something like that would work. Certainly seems like we ought to be able to inject a stringification in there somewhere.
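(For context, the quoting workaround suggested above could look roughly like this in a chart template. This is a hedged sketch, not the actual chart: the `.Values.cli.args` key is taken from the snippet in the chat, and the surrounding container spec is assumed.)

```yaml
# Hypothetical chart template fragment: render every CLI arg through `quote`,
# so values passed via `helmfile --set` that parse as numbers (e.g. a pool
# size of 50) still land in the args array as strings.
args:
  {{- range .Values.cli.args }}
  - {{ . | quote }}
  {{- end }}
```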
[15:41:50] yay, golang templating ;P
[15:42:58] Yeah, other helm CLI binaries accept --set-string, but not helmfile :-(
[16:25:42] forgot about this script... should probably fix/polish https://gitlab.wikimedia.org/repos/search-platform/sre/decom-tix
[16:32:11] inflatador: there is an official form for decommissioning, you shouldn't hardcode it elsewhere, see https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission
[16:42:01] volans thanks, sounds like we should dynamically load that form instead
[16:44:28] if you have a specific need that could be useful to others, get in touch with dcops
[16:45:26] will do... mainly it's to save copy/pasting and the manual errors that come with it ;)
[17:08:40] dinner
[17:20:12] * ebernhardson has finally clicked through all the mail from the last few weeks, now a dozen new tabs to review :P
[17:29:01] late to the party I'm sure, but I just signed up for Kagi... highly recommended by my tech nerd friends
[17:46:53] started a ticket for migrating the ES plugins repo to GitLab... more details at https://phabricator.wikimedia.org/T353275 . Please let me know if you have any concerns/suggestions
[18:47:37] lunch, back in ~1h
[19:17:57] Pharmacy’s taking a while, will be ~5-10 mins late to pairing
[19:25:24] back
[21:12:41] I realized the "job queue lag too high" alert is getting incorrect values. It's supposed to alert at >6h of backlog, and it reported 8h16m of backlog; I'm not sure where that comes from though.
The dashboard shows ~8 minutes of backlog at that point
[21:13:11] I thought maybe it was in ms, but 8h16m is ~30k seconds; if it were ms, it would be ~30s, which is too small
[21:34:54] weird
[21:41:04] applying an arbitrary /180 gets the two graphs pretty close, but I've no clue why that number :P Working up a patch to use histogram_quantile (which the dashboard has moved to), just trying to figure out how to make promtool pass
[21:41:44] ah OK, was just looking at the git history around that alert
[21:42:35] I'm pretty sure we copied the rule out of the dashboard to create it, maybe :)
[21:43:38] you're already ahead... the histogram change is here https://gerrit.wikimedia.org/r/c/operations/alerts/+/936070
[21:45:00] oh, interesting. Looks like it's changed again since then
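(Aside: the arithmetic behind the "maybe it's ms" hypothesis above checks out; a minimal sketch, using only the numbers quoted in the chat:)

```python
# 8h16m expressed in seconds, as discussed above.
reported = 8 * 3600 + 16 * 60
print(reported)  # 29760 -- i.e. ~30k seconds

# If the raw metric were actually milliseconds, the true backlog would be:
print(reported / 1000)  # 29.76 -- ~30 s, too small to match either graph
```

So neither a plain seconds reading (8h16m) nor a milliseconds reading (~30s) matches the ~8 minutes shown on the dashboard, which is why the unexplained /180 factor remained a mystery.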