[02:54:18] inflatador: Looks like 2088 is still having issues. Back to being ssh unreachable. A few hours ago I was able to get ssh'd in and run puppet I believe successfully (I had to kill my local tmux session when ssh later hung, but pretty sure), but now the host is back to behaving badly and not ssh reachable. https://phabricator.wikimedia.org/T361525#9721137
[02:54:30] Moving it back to insetup
[08:15:53] reading the aws-sdk source (the S3 client that's used by the flink-presto-s3 connector), it appears that it does not retry on SocketTimeoutException...
[08:54:25] actually the presto client does seem to wrap the client with its own retry strategy, but it's only done for read operations; I suppose that retrying a write is too dangerous
[10:47:54] lunch
[13:11:31] o/
[13:22:29] dcausse: is https://www.wikidata.org/wiki/User:DCausse_(WMF)/WDQS_Split_Refinement ready to be moved to the main namespace?
[13:22:37] gehel: yes
[13:22:59] could you do that while I prepare the communication? Please?
[13:23:07] sure
[13:23:11] thanks!
[13:25:34] how's codfw wdqs doing? It looks like lag is dropping for at least some of the hosts
[13:25:41] gehel: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/WDQS_Split_Refinement
[13:27:12] inflatador: indeed, seems like the older hosts are past the corrupted backlog and are now catching up on "normal" lag
[13:27:33] dcausse: thanks!
[13:29:49] dcausse ACK... you were saying that the new hosts are about 66% as fast? Might be time to look at T336443 again
[13:29:53] T336443: Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443
[13:31:16] gehel: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/WDQS_Split_Refinement#Feedback has a placeholder "[link to announcement]" that will have to change (and also remove the draft template) once you have a link
[13:32:07] inflatador: yes, the slowdown is very noticeable under heavy writes (data-reload and backfills)
[13:32:34] during normal updates it's not much of a problem, I think
[13:32:40] O
[13:32:57] I'll get a patch up to set the CPU performance governor. Not sure if that's our problem, but now's a good time to try it
[13:33:58] no clue :/
[13:36:27] Update published on https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/April_2024_scaling_update !
[13:37:31] gehel: the split refinement link is pointing to the example queries
[13:38:37] thanks! correcting
[13:43:34] dcausse https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020834 is up if you wanna take a look
[13:49:44] dcausse I've disabled puppet on all codfw hosts except wdqs2022; that will be our test case for the performance governor settings
[13:56:57] I've also set wdqs2023 to have "performance/maximum performance" in the BIOS and rebooted, so we can see what effect (if any) these have
[13:57:18] thanks!
[15:02:01] Wednesday meeting is open: https://meet.google.com/eki-rafx-cxi
[15:02:12] and we want to talk a bit about future planning / projects
[15:02:25] dcausse, ebernhardson, Trey314159, inflatador, ryankemper
[15:02:44] need 5 min
[16:38:24] back
[17:35:45] realized I don't know when March is ... all of the March stats earlier were for Apr 1-11
[17:49:03] I suppose on the upside, saneitizer is helping to get through all of the bad responses earlier instead of having the long tail of these failures over time
[18:02:20] lunch, back in ~40
[18:19:25] shipped saneitizer updates, it hasn't failed in the last 15m but will keep an eye on it
[18:31:44] inflatador: we're starting, we'll record
[19:08:24] Subject: Search Platform roadmap brainstorm, April 2024 - Google Form sent
[19:15:39] ebernhardson: While running T358349-number-of-searches I do get a lot of warnings like
[19:15:40] T358349: Search Metrics - Number of Searches - https://phabricator.wikimedia.org/T358349
[19:15:42] > Utils: Service 'sparkDriver' could not bind on port 12015. Attempting port 12016
[19:16:01] After block 3
[19:16:03] pfischer: yea, spark likes to be very verbose. It's almost all ignorable
[19:16:13] Alright.
[19:16:20] pfischer: that basically says that multiple people are using jupyterlab at the same time; it always starts at the same port # and counts up, iirc
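For context on the sparkDriver warnings above: when several notebooks run on the same host, Spark tries its configured driver port and, if it is already taken, increments the port and retries, up to spark.port.maxRetries times (16 by default), which matches the 12015 → 12016 message in the log. Below is a minimal sketch of the knobs involved, assuming a plain pyspark session rather than whatever helper the notebook actually uses; the app name is just borrowed from the log.

```python
# Minimal sketch, not taken from the notebook in question: tuning the port-retry
# behaviour behind "Service 'sparkDriver' could not bind on port ...".
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("T358349-number-of-searches")   # name borrowed from the log; any name works
    .config("spark.port.maxRetries", "64")   # default is 16; each retry bumps the port by 1
    # .config("spark.driver.port", "0")      # let Spark pick a random free port instead;
    #                                        # may not suit environments that pin a port range
    .getOrCreate()
)

# The bind messages are emitted at startup and are harmless; lowering the log level
# afterwards only quiets subsequent driver-side noise.
spark.sparkContext.setLogLevel("ERROR")
```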
[19:33:28] pfischer: re pool counter, indeed we wouldn't be able to define a particular throughput. It seems, though, that a particular throughput isn't what's important; it's that we have a clear maximum on the amount of concurrent load we put on the mw api cluster.
[19:34:38] The fetch limits do that today, but it is certainly crude. I suppose I hadn't considered running a poolcounter, I'm still thinking in the mediawiki side where you just use the one that's running. But not clear they would want a separate app talking to that cluster
[19:40:19] at the point where we start running it in a sidecar pod, probably not worth it
[20:55:49] hmm, looks like wikitech and labtestwiki might not be available over the regular proxy we use
[20:57:08] ebernhardson: they do not live in the prod hosting cluster (metal or k8s) so that seems likely
[20:58:21] wikitech lives on cloudweb100* hosts. labtestwikitech lives on one host in codfw that I'm not remembering the name of at the moment.
[21:06:33] bd808: hmm, yea that makes sense. I'll have to poke around for the right service that accesses them
[21:06:56] or perhaps none, and have to set up some k8s egress rules I guess
[21:08:36] maybe labweb-ssl, but that's only wt
[22:06:30] looks like the tide has turned for wdqs2022 (our experimental host w/ the CPU changes) https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-9h&to=now&viewPanel=8
[22:07:31] it took about ~7h for wdqs2007 (the old, faster host) to catch up after finishing the crazy backlog from this weekend
[22:12:23] ryankemper just re-downtimed the wdqs20* hosts for the next 24h... but you can probably cancel downtime/repool the working hosts. Will need to use confctl to repool the entire DC, a la https://wikitech.wikimedia.org/wiki/Conftool#Depool_all_nodes_in_a_specific_datacenter
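On the pool counter exchange above ([19:33:28]–[19:40:19]): the point was a hard cap on concurrent requests to the mw api cluster rather than a target throughput. PoolCounter enforces such a cap cluster-wide; the sketch below is only the in-process analogue of that idea using a semaphore, and every name and limit in it is illustrative rather than anything from the actual updater.

```python
# Illustrative only: cap *concurrency* against an API, not throughput.
# MAX_CONCURRENT_FETCHES and fetch_all are hypothetical names, not updater code.
import asyncio

import aiohttp

MAX_CONCURRENT_FETCHES = 10  # hard upper bound on in-flight requests


async def fetch_all(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT_FETCHES)

    async with aiohttp.ClientSession() as session:

        async def fetch(url: str) -> str:
            # At most MAX_CONCURRENT_FETCHES requests are in flight at any moment,
            # whatever the overall request rate ends up being.
            async with sem:
                async with session.get(url) as resp:
                    return await resp.text()

        return list(await asyncio.gather(*(fetch(u) for u in urls)))
```

A per-process semaphore (or the current fetch limits) only bounds one worker; a shared service like PoolCounter is what would make the same guarantee hold across all of them at once.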