[07:29:38] hi folks!
[07:29:54] o/
[07:30:02] can we do something about all the wmf-auto-restart failed units on testreduce1001?
[07:30:17] (reviewing the alerts :)
[07:33:52] I am not even sure what these are about
[07:34:02] I think we own this on paper, but in reality...
[07:34:50] gonna take a look
[07:35:02] there's even something about a php service with php not being installed ...
[08:31:40] yeah I was confused as well
[09:58:10] welp, we hit a 2h backlog on parsoidCachePreWarm during the night, with a bonus dip way under 50% idle for the jobrunners
[09:58:38] https://grafana.wikimedia.org/goto/lglDTWr4z?orgId=1
[09:58:49] I think we may need to add yet more capacity after all
[10:25:22] some template edit again?
[10:26:31] Maybe, but if we're dipping to 30% idle every time a template gets edited, we may have a problem long term, no?
[10:26:43] s/template/big template/
[10:27:22] well, depends? Editing big templates isn't a new issue. It has always existed and has even caused incidents
[10:28:40] since this is the jobrunner cluster, we don't have to provide the same level of resilience for idle php-fpm workers that we do for the appserver cluster
[10:28:45] it's jobs, they will be retried
[10:29:23] so, we can definitely survive with small backlogs
[10:29:52] huge backlogs will probably cause secondary issues that we might want to avoid, mostly because it's going to be a mess to debug the incidents they will cause
[10:30:14] yeah
[10:30:21] and all of the above is utterly qualitative and not quantitative
[10:30:30] I haven't checked if it caused backlogs on other jobs yet
[10:30:53] I'd expect that it did not since we didn't run out of workers
[10:31:02] but it would if we had
[10:33:00] past history in grafana isn't yielding any other recent-ish (6 months) situations
[10:33:50] aside from one ~2023-05-08
[10:34:06] have we added enwiki yet?
[10:34:11] yep
[10:34:17] ah, so we are essentially done
[10:34:18] we were planning to add all the other wikis today
[10:34:40] yeah, go ahead. dewiki, frwiki and enwiki are huge compared to everything else (wikidata aside)
[10:35:57] jobrunners/parsoid are about the same CPU usage currently
[10:36:13] so, if we are to steal some more capacity from the parsoid cluster, it can't be many hosts
[10:36:16] Maybe 1 or 2?
[10:37:03] the API cluster is actually faring better
[10:37:26] maybe we can re-image some of those 62 nodes and add them to the jobrunner cluster instead
[10:37:43] yeah, that's what I was thinking, parsoid has its hands pretty full
[10:38:03] Now that the major internal API consumer isn't around anymore (parsoid-nodejs), maybe we can trim down that cluster a bit
[11:11:02] serviceops, SRE: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (Clement_Goubert) Reverted because rsync::quickdatacopy wants fqdns, we're giving it IPs, nothing gets deployed. I will prepare a fix and we can try again.
[11:42:11] serviceops, SRE: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (Clement_Goubert) a:Clement_Goubert
[11:42:49] serviceops, SRE: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (Clement_Goubert) Open→In progress
[13:12:47] serviceops, Beta-Cluster-Infrastructure, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Install wikidiff2 1.14.0 deb on deployment-prep & test - https://phabricator.wikimedia.org/T340542 (dom_walden) I tested API:Compare on most of the beta wikis[1], just checking that it cou...
[13:23:22] serviceops, SRE, Traffic, envoy: Refactor envoy.filters.http.router and envoy.filters.listener.tls_inspector - https://phabricator.wikimedia.org/T337405 (JMeybohm) Open→Resolved
[13:23:31] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[15:13:28] serviceops, Data-Engineering, Event-Platform Value Stream: Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments - https://phabricator.wikimedia.org/T340059 (Ottomata) p:Triage→High
[15:44:05] elukey: maybe I just fixed testreduce for now
[15:45:06] akosiaris: thanks!