[08:53:48] Weekly update: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-04-19
[09:20:28] ryankemper: I've closed a few of the SPARQL federation requests. I'm leaving that one to you: T346455
[09:20:28] T346455: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455
[13:13:50] nikerabbit mentioned some translatewiki query performance issues: https://docs.google.com/document/d/1G7OTmBmzl5GVwoanCrzQNHMMXIrAvROpmKumdWtZfuo/edit?disco=AAABLL_Cbew . Is there anything we could do to help? Mainly I'm asking for a lesson on how to be a better "Elasticsearch DBA"
[13:25:56] o/
[13:27:39] inflatador: we should look at the queries I think, iirc ttm uses 2 different queries, one that is fairly slow because it is a "fuzzy" version of the morelike query and the other uses facets, knowing which one is having issues would be a good start I think
[13:28:13] inflatador: note that ttm might need our plugins
[13:28:36] so this means that we need to port them to opensearch if you want to include ttm
[13:28:56] dcausse ah, thanks for catching that. I'll add that to our plan
[13:29:55] As far as the TTM queries, do we need to look at their upstream code for that? We don't have a query log, right?
[13:30:29] I don't think they measure their query performance but I could be wrong
[13:31:00] I was just wondering how to see what queries they're actually making
[13:31:20] We could sniff them with tcpdump but I'm guessing there is a better way
[13:32:34] ah sure, as for the queries, we might know where they are; lemme find the source code
[13:34:47] that's the one that's being slow I guess: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Translate/+/refs/heads/master/ttmserver/ElasticSearchTTMServer.php#81
[13:35:52] * dcausse realizes that visualizing an elasticsearch query by looking at the php code that generates it might not be the best idea
[14:12:45] Yeah, DTO-building code tends to be a lot more bloated than the resulting DTO…
[14:48:41] experimenting with spark Dataset and Encoder seems like a nice way to strongly type your DataFrame
[15:05:10] \o
[15:10:08] o/
[15:12:01] ebernhardson: SUPtime: ~2 days 🎉 shall we start migrating wikis from cirrus to SUP? Or would you rather observe a full saneitizer loop first?
[15:12:21] strongly typed spark seems tempting, but i have no clue how well that will work with the type system...types in spark would be super hairy
[15:12:43] pfischer: probably soon, i suppose the only thing it's not doing right now is the old document updates, i had pondered cherry-picking those forward to this week's train but didn't manage
[15:13:03] pfischer: the old document updates are where every 8th doc is flagged as a rerender by saneitizer
[15:13:13] but we are reasonably certain it handles the capacity (with tuning, perhaps)
[15:16:19] * ebernhardson separately wishes defining additional charts in superset was as simple as in a jupyter notebook, where i write one function and invoke it in different ways
[15:44:29] early lunch, back in ~1h
[16:44:38] finding superset's implementation of template variables a bit tedious...we can use variables in a chart's sql. But we can't provide any example values so the chart can't work in the chart editor ui. And we can't add a chart multiple times to a dashboard with different variables...
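A minimal sketch of the notebook pattern ebernhardson contrasts with Superset in the 15:16 and 16:44 messages above: one aggregation/plotting function invoked with different columns. The DataFrame and its column names (wiki, update_type, lag_seconds) are invented for illustration and are not an actual Search Platform dataset.

```python
# Hypothetical sketch of the "one function, many charts" notebook pattern.
# The data and column names below are assumptions made for illustration only.
import pandas as pd
import matplotlib.pyplot as plt

def plot_agg(df: pd.DataFrame, group_col: str, value_col: str, agg: str = "mean"):
    """Aggregate value_col by group_col and draw a bar chart of the result."""
    series = df.groupby(group_col)[value_col].agg(agg).sort_values()
    ax = series.plot.bar(title=f"{agg}({value_col}) by {group_col}")
    ax.set_ylabel(value_col)
    return ax

# The same function, invoked with different columns to aggregate over:
# plot_agg(updates, "wiki", "lag_seconds", agg="median")
# plot_agg(updates, "update_type", "lag_seconds", agg="max")
# plt.show()
```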
[16:45:30] not really looking to click around in a ui and create all the bits for what was, in python, a single function that you invoke with different columns to aggregate over :S
[16:57:12] 🙄 dinner
[17:08:15] back
[17:22:33] * ebernhardson now remembers the horrors of creating superset dashboards before...fun things like defining dropdown options with a saved sql query: SELECT * FROM (VALUES ('Option 1'), ('Option 2'))
[17:31:51] Seems a long shot, but I'm wondering if the WDQS probedown alerts we're getting are related to T362977
[17:31:52] T362977: WDQS updater missed some updates - https://phabricator.wikimedia.org/T362977
[17:48:43] We're getting some prometheus alerts for WDQS...looks like some hosts are sending a lot of 403s
[17:58:33] Will take a look
[18:10:27] ryankemper ACK, LMK if you wanna pair on it. I'm still trying to compare whether or not the healthy hosts are also returning 403s to prometheus. wdqs2013, 2016 and 2020 seem to have the most 403s. Will depool them
[18:13:59] working out of T363004
[18:13:59] T363004: Investigate WDQS ProbeDown alerts - https://phabricator.wikimedia.org/T363004
[18:15:48] ryankemper I have a feeling that we'll have to do a data-transfer on them, if you wanna focus on T362983 I can look at the alerting hosts
[18:15:50] inflatador: seeing some uptick of failed requests in grafana, and a corresponding increase in qps. https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=now-2d&to=now&refresh=1m&viewPanel=43 (failed queries) & https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=now-2d&to=now&refresh=1m&viewPanel=18 (qps)
[18:15:50] T362983: Investigate/fix WDQS data-transfer cookbook - https://phabricator.wikimedia.org/T362983
[18:16:02] It's not an insanely strong signal but looks like a real one
[18:18:42] looks like it's counting 4xx responses as failed queries, so the 403s sent to prometheus are probably part of that metric. Need to check if it's just prom getting the 403s
[18:22:13] inflatador: I can hop on a meet in 5', need to take care of something real quick
[18:23:22] ACK
[18:26:00] FWIW, at least on 2013, it's only prom and pybal that are getting 403s from nginx
[18:34:54] Looks like prom and pybal got banned by the throttling filter on those hosts
[19:09:48] Have there been any changes to the WDQS throttling filter lately? I'm not sure why prom and pybal are being throttled/banned
[19:15:07] not that i'm aware of, but haven't been paying that close attention
[19:18:15] inflatador: if the system is overloaded and slow enough, the monitoring requests might start taking enough time that they get throttled. Throttling is based on query time and query time is related to load.
[19:18:52] This is actually a somewhat desirable effect. Hosts are depooled when they get too overloaded
[19:19:16] the systems aren't overloaded
[19:20:05] at least not from what we can tell...almost all of this is happening in the passive DC
[19:20:20] Then I have no idea :/
[19:20:32] we would get throttled if a single user agent was over a certain % of traffic?
[19:20:42] It's already depooled, so we don't have user impact?
[19:21:00] b/c the prom and pybal UAs would be a huge % of traffic in passive
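Related to the 18:26 and 19:21 observations that only the prom and pybal user agents seem to be getting 403s: a rough sketch for tallying 403 responses per user agent straight from an nginx access log. The log path and the assumption of nginx's default "combined" log format are mine; the actual WDQS nginx configuration may differ.

```python
# Hypothetical helper: count 403 responses per user agent in an nginx access log.
# Assumes the default "combined" log format; adjust the regex if the real config
# uses a custom format. The log path is an assumption.
import re
from collections import Counter

LINE_RE = re.compile(r'"[A-Z]+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

def count_403s_by_ua(path: str = "/var/log/nginx/access.log") -> Counter:
    counts: Counter = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            m = LINE_RE.search(line)
            if m and m.group("status") == "403":
                counts[m.group("ua")] += 1
    return counts

if __name__ == "__main__":
    for ua, n in count_403s_by_ua().most_common(10):
        print(f"{n:8d}  {ua}")
```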
[19:21:21] Maybe dr0ptp4kt has an idea. Or we might need to wait until Tuesday and have dcausse take a look
[19:21:25] but that doesn't explain why it just started happening
[19:21:40] yeah, we can stabilize the service by banning CODFW again
[19:21:48] > It's already depooled, so we don't have user impact?
[19:21:51] no it's not depooled
[19:22:03] if we keep seeing bans we will depool, right now we just restarted BG to clear out the bans
[19:22:04] I don't think we have actual user impact right now though
[19:22:23] We've cleared out the throttle state and currently nothing is banned. We're monitoring to see if that changes
[19:22:24] and the hosts are removed by pybal when they get banned, so user impact should be minimal
[19:22:34] There have been alerts during the whole day for a few servers, see #wikimedia-analytics-alerts
[19:22:56] yeah, I saw them start yesterday, we didn't restart services until about an hour ago
[19:23:24] ebernhardson: the discolytics CI pipeline is running into the following errors: Err:7 http://mirrors.wikimedia.org/debian buster-backports Release 404 Not Found [IP: 208.80.154.139 80]; E: The repository 'http://mirrors.wikimedia.org/debian buster-backports Release' does not have a Release file; full log at https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/jobs/244551
[19:23:42] but yeah, no worries either way, there's plenty we can do to keep stable thru Tues
[19:24:39] inflatador: any idea if we removed the Buster backports?
[19:24:59] * gehel will start the weekend for real. Scream if you need me
[19:25:04] gehel: upstream did that, it's a thing..
[19:25:24] Y was gonna say, I remember ebernhardson hitting that too, maybe?
[19:25:40] i suppose the generous interpretation is that debian dropped the -backports 6 months before ending LTS support on june 30th
[19:25:45] kind of a reminder that you are running old things
[19:26:11] i guess that's not even 6 months, that's a month and a half
[19:26:46] So we have to update our blubber file to no longer use that repo, or do we require it?
[19:27:33] for temporary hax, i basically used `grep -v` to remove -backports from sources.list and used an updated container. But for real fixes we have to migrate off buster, probably to bookworm
[19:28:13] oh i guess i'm reading this wrong, it would be bullseye i guess
[19:36:15] Hm, which base image would I choose? I tried docker-registry.wikimedia.org/bookworm:20240414 with http://apt.wikimedia.org/wikimedia/dist/bookworm-wikimedia but that fails as it can resolve neither conda nor openjdk-11-jdk
[19:37:42] https://apt.wikimedia.org/wikimedia/dists/bookworm-wikimedia/thirdparty/ for sure does not have conda
[19:38:18] https://apt.wikimedia.org/wikimedia/dists/bullseye-wikimedia/thirdparty/ does…
[19:40:06] unfortunately it probably means we need the packages built and tested with newer debian versions
[19:41:01] pfischer: as a short-term fix, change https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/blob/main/.pipeline/blubber.yaml?ref_type=heads#L3 to `docker-registry.wikimedia.org/buster:latest`
[19:44:40] for the wider fix, probably coordinate with the other dse teams, i suppose part of the hope of using their conda abstraction is we could use their fixes too :)
[19:48:20] taavi: thanks! So far, the build keeps running against bullseye…
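A quick way to confirm what is discussed around 19:23-19:27 (buster-backports dropped upstream, so its Release file now 404s) is to probe the mirror for each suite's Release file. The suite list below is illustrative; the dists/<suite>/Release layout is the standard Debian repository structure.

```python
# Hypothetical check: which suites on mirrors.wikimedia.org still serve a
# Release file? buster-backports returning 404 is what broke the CI image.
import urllib.error
import urllib.request

MIRROR = "http://mirrors.wikimedia.org/debian/dists"
SUITES = ["buster", "buster-backports", "bullseye", "bullseye-backports", "bookworm"]

def release_status(suite: str) -> int:
    """Return the HTTP status code for the suite's Release file."""
    req = urllib.request.Request(f"{MIRROR}/{suite}/Release", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

for suite in SUITES:
    print(f"{suite:20s} {release_status(suite)}")
```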
[20:05:44] Running to lunch. Will take another look at the throttling/banning of pybal and/or prometheus. If we keep seeing issues I don't have a great idea of the path forward currently; seems unlikely that depooling codfw would help the situation although it's still something to try
[20:12:07] y, I think depooling codfw would mainly just reduce alert noise
[21:00:58] Alright, things are still looking good currently. No banned requests (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1713474720461&to=1713560421827&viewPanel=23), and no failing probes (https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&from=1713474289962&to=1713560443218)
[21:01:25] Won't be surprised if we see it resurface at some point over the weekend...if so first response should be to do a rolling restart, then monitor from there
[21:29:00] ACK, good to hear everything's looking OK
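For the weekend watch described at 21:00 and 21:01 (see whether banned requests resurface before reaching for a rolling restart), here is a sketch of polling the Prometheus HTTP API. Both the Prometheus endpoint URL and the metric name in the query are placeholders I made up; substitute whatever actually backs the "banned requests" Grafana panel.

```python
# Hypothetical weekend check: ask Prometheus whether any wdqs host has started
# banning requests again. The endpoint and metric name are placeholders, not
# the real ones behind the Grafana dashboard.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder endpoint
QUERY = "sum by (instance) (rate(banned_queries_total[5m])) > 0"  # placeholder metric

def banned_instances() -> list[str]:
    """Return instances currently reporting a non-zero ban rate."""
    url = f"{PROMETHEUS}?{urllib.parse.urlencode({'query': QUERY})}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return [r["metric"]["instance"] for r in data["data"]["result"]]

if __name__ == "__main__":
    hosts = banned_instances()
    print("bans resurfacing on:", ", ".join(hosts) if hosts else "none")
```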