[10:23:48] process_sparql_query_hourly runtime went from ~3h to 20+h since 2023-09-21T16:00
[10:45:53] lunch
[13:31:41] o/ dcausse: have you already picked a back-port window for your page_rerender config change? I’d have another stream-config-related patch for update and fetch_error events: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/960616
[13:32:52] pfischer: no I did not yet
[13:35:59] the rerender is not using the "development" folder nor a version suffix, should I adapt it?
[13:43:52] errand, back in 20'
[13:48:20] We seem to have quite a few 95-percentile alerts on Cirrus. Should we be worried? They seem to recover fairly quickly...
[13:49:54] we should go back in history and count these alerts, but it's not completely unusual to see some from time to time
[14:12:48] errand, back in ~1h
[14:21:33] pfischer: did you see the message from joal about the JVM Stewardship presentation?
[14:21:43] Let me know if you can take it over, or I'll improvise something
[14:22:02] message = slack message in #wmf-java
[14:35:30] gehel: not yet. 👀
[14:38:00] dcausse: regarding the schema: I’d only opt for development if you expect the schema for re-render to change.
[14:38:54] gehel: Hm, I’d have to improvise too. This is just to inform about the effort and what we’re currently working on? Is there a shared slide deck?
[14:54:28] Just to inform about the effort, that we will expect help from everyone, and that we have a phab board.
[14:54:46] And maybe a word about what you want to achieve with this effort
[14:55:17] It seems that the tasks currently selected are about documentation and a shared parent pom
[15:00:44] actually, agenda is too full already, we'll talk JVM next time :/
[17:04:04] dinner
[17:51:04] Patch to (hopefully) fix the WDQS endpoint: https://gerrit.wikimedia.org/r/c/operations/puppet/+/960664 is ready for review
[18:19:08] WDQS LDF endpoint, that is
[18:19:16] re: https://phabricator.wikimedia.org/T347284
[18:19:53] inflatador: do you also need to repoint whatever routes the queries?
[18:20:47] ebernhardson I think that's already taken care of, it's in hieradata/common/profile/trafficserver/backend.yaml ... what I've seen so far is that the nginx config is missing the LDF stuff
[18:21:31] https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/profile/trafficserver/backend.yaml
[18:21:37] inflatador: ahh ok, i see there was already a patch. yea seems like it could work
[18:22:19] not 100% sure, but so far it just looks like the enable_ldf value is getting set to false somewhere
[18:23:38] curiously it's set to be generally true in hieradata/role/common/wdqs/public.yaml
[18:25:39] ebernhardson good point. Then the problem is likely that we map LDF traffic to wdqs1016 only, which is NOT a public host
[18:26:05] oh, we probably shouldn't point the public address at a non-public host :)
[18:27:19] Y, I'll have to check again, but my guess is that wdqs1003 (previous LDF host) was a public host. In which case, we'd probably just want to point to a public host rather than enable it for internal
[18:27:32] i mean it's probably not a problem, the general idea was that the internal hosts serve known traffic that won't have crazy queries in it, giving hopefully better latency consistency. The LDF endpoint can't make crazy queries, so it shouldn't hurt
[18:27:51] but probably better to keep them separate anyways
[18:28:58] gehel: 3' late to 1:1
[18:32:38] I think it makes more sense to use a public host considering we already enable it there anyway. I'm not sure why we even have a separate traffic server mapping just for that endpoint, but I haven't looked at the traffic stuff much
[18:33:29] Well, the separate mapping is so it just routes to 1 host, I assume
[18:33:46] I'll make a new patch with the traffic change shortly
[18:33:49] probably something like pagination requires state or something, i'm not sure exactly
[18:34:59] yes I remember this as well, pagination might be heavily backend dependent and a different host might decide to return triples in a slightly different order
[18:38:20] filed T347333 for the problem with the airflow task process_sparql_query_hourly, tried to restart it but it did not help, so I guess we'll have to tune this a bit more, it's already given 16G tho...
[18:38:21] T347333: Tune process_sparql_query_hourly so that it does not get killed by yarn - https://phabricator.wikimedia.org/T347333
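
An aside on T347333: when yarn kills a container even though the executor heap is already at 16G, the usual suspect is the off-heap memory overhead rather than the heap itself. The snippet below is only a sketch of that kind of tuning — it assumes the task is launched through Airflow's stock SparkSubmitOperator, which is not necessarily how the real DAG is wired up, and the DAG id, application path and overhead value are placeholders:

    import pendulum
    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="sparql_query_example",          # hypothetical DAG, not the real one
        start_date=pendulum.datetime(2023, 9, 1, tz="UTC"),
        schedule="@hourly",
        catchup=False,
    ) as dag:
        process_sparql_query_hourly = SparkSubmitOperator(
            task_id="process_sparql_query_hourly",
            application="hdfs:///path/to/job.jar",   # placeholder artifact path
            executor_memory="16g",                   # heap reportedly already at 16G
            conf={
                # yarn kills containers that exceed heap + overhead, so raising the
                # off-heap overhead is usually the first knob to try
                "spark.executor.memoryOverhead": "4g",
            },
        )
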
[18:39:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/960687 OK, patch for the LDF change
[18:41:10] will look after my 1:1
[18:41:15] and, another alert for CirrusSearch percentiles...
[18:42:25] oops, got rid of the Hosts: line, since this change no longer touches wdqs hosts
[18:51:33] one curious thing, elastic2037-2054 are showing elevated disk utilization, > 50%. 2055-2086 are showing ~15%
[18:53:52] from per-node latency metrics, 2037-2054 are giving avg p95's of ~200ms, whereas 2055+ are seeing ~150ms
[18:54:19] could be real hardware differences, could be configuration
[18:56:01] was also noticing that the avg query latency for eqiad has been super high for the last ~90m except for a brief dip
[18:56:18] oh, i totally didn't realize eqiad was alerting
[18:56:26] with so few queries running through eqiad the numbers are useless
[18:57:01] Ah. I do think the codfw stuff you mentioned is worth tracking down. The last incident was mitigated by banning 2044
[18:57:56] eqiad reports qps numbers of like 0.0002, I wonder if we can tune the alerts so there has to be at least 1 qps (or some arbitrary low number)
[18:58:07] otherwise we are really reporting that a single query was issued and it took a while
[18:58:29] yeah, that makes the ol' SLO look kinda bad ;)
[19:03:35] wrote a small comment explaining how we/I messed up the 1016 ldf thingy here: https://phabricator.wikimedia.org/T347284#9196905
[19:03:56] looking at the patch now, and then we can look at altering the prometheus metrics for alerts to account for qps
[19:05:21] inflatador: gave my +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/960687, I will go ahead and merge that
[19:05:30] ryankemper nice, thanks
[19:13:28] https://phabricator.wikimedia.org/T347338 is up for the codfw performance differences
[19:19:19] Made a ticket about only alerting if we have enough qps as well: T347341
[19:19:20] T347341: Only alert for high latency if there is enough data to make a sensible average - https://phabricator.wikimedia.org/T347341
[19:19:38] looks like the LDF patch fixed https://query.wikidata.org/bigdata/ldf
[19:19:53] \o/
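
A quick way to confirm the endpoint is actually serving again once a patch like the one above is merged — a minimal sketch that assumes the public https://query.wikidata.org/bigdata/ldf URL and treats any non-empty 200 response as success:

    import sys
    import requests

    LDF_URL = "https://query.wikidata.org/bigdata/ldf"

    def check_ldf(url: str = LDF_URL, timeout: float = 10.0) -> bool:
        """Return True if the LDF endpoint answers with a non-empty 200 response."""
        resp = requests.get(url, headers={"Accept": "text/turtle"}, timeout=timeout)
        print(f"{url} -> HTTP {resp.status_code}, {len(resp.content)} bytes, "
              f"content-type={resp.headers.get('content-type')}")
        return resp.status_code == 200 and len(resp.content) > 0

    if __name__ == "__main__":
        sys.exit(0 if check_ldf() else 1)
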
[19:20:21] wrt performance; old servers: https://phabricator.wikimedia.org/T198169 new servers: https://phabricator.wikimedia.org/T237571
[19:21:20] 2x 1.6TB & 128GB RAM vs 2x 1.9TB & 256GB RAM. It would not surprise me if the RAM largely explains the performance gap, with some contribution perhaps from slightly faster SSDs
[19:21:59] hmm, yea the extra ram directly reduces the amount it needs to go out to disk, makes sense that we see higher utilization then
[19:22:04] so, basically hardware differences :)
[19:23:45] I'm looking into https://downloadmoreram.com/ but the website has a graphical interface so I'm not sure how to get it to work from curl when ssh'd into the hosts directly
[19:23:55] :)
[19:24:08] `/s` just in case any onlookers make the mistake of taking me seriously :D
[19:29:12] on the upside, i suppose this reinforces that the decision to upgrade to 256G was the right call :)
[19:29:29] too bad it takes 4-ish years to implement a decision though
[19:37:40] ebernhardson: So the `cirrussearch_eqiad_95th_percentile` alert is graphite-based. Do we have graphite metrics that expose the qps or something similar that can derive qps? Not seeing anything relevant under the `MediaWiki.CirrusSearch.eqiad` "directory"
[19:39:23] ryankemper: yea, sec
[19:40:13] ryankemper: can use MediaWiki.CirrusSearch.eqiad.requestTimeMs.prefix.sample_rate as a proxy, that's only the autocomplete rate but if we are live that should be seeing traffic
[19:40:36] oh wait not prefix though, you want comp suggest
[19:40:50] ack
[19:40:57] just s/prefix/comp_suggest/
[19:47:50] As an aside, looking at `modules/role/manifests/elasticsearch/alerts.pp` I'm noticing that `cirrussearch_codfw_95th_percentile` means something different from `cirrussearch_eqiad_95th_percentile`; the former is looking at `more_like` specifically
[19:50:00] https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/role/manifests/elasticsearch/alerts.pp#17
[19:50:42] huh, that doesn't seem intentional
[19:51:11] (For posterity here's the link with the commit hash hardcoded https://gerrit.wikimedia.org/g/operations/puppet/+/11654eca9fc8c6c6d92b148132e91edd5bfc3d49/modules/role/manifests/elasticsearch/alerts.pp#17)
[19:53:52] Ah, so it's actually written that way to get around an issue similar to (but I think not quite the same as) the one we're now seeing in eqiad: https://gerrit.wikimedia.org/g/operations/puppet/+/a28b126b67d8ab247fd182cbc8316688e728db4e
[19:54:17] oh, interesting. yea sounds like it
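
For reference, the gating idea behind T347341 ("only alert when there is enough traffic for the percentile to mean anything") boils down to something like the sketch below, expressed against graphite's render API. The comp_suggest sample_rate path is the one named above; the graphite host, the p95 metric path and the thresholds are placeholders, and the real check lives in the puppet alert definitions rather than in a script like this:

    import requests

    GRAPHITE = "https://graphite.example.org"  # placeholder; not the real graphite host
    QPS_METRIC = "MediaWiki.CirrusSearch.eqiad.requestTimeMs.comp_suggest.sample_rate"
    P95_METRIC = "MediaWiki.CirrusSearch.eqiad.requestTimeMs.comp_suggest.p95"  # hypothetical path

    def latest_value(target, window="-10min"):
        """Fetch a series via graphite's render API and return its most recent datapoint."""
        resp = requests.get(
            f"{GRAPHITE}/render",
            params={"target": target, "from": window, "format": "json"},
            timeout=10,
        )
        resp.raise_for_status()
        series = resp.json()
        if not series:
            return None
        # datapoints are [value, timestamp] pairs; keep the last non-null value
        values = [v for v, _ in series[0]["datapoints"] if v is not None]
        return values[-1] if values else None

    def should_alert(min_qps=1.0, p95_limit_ms=500.0):
        """Alert on high p95 only when the query rate is high enough to be meaningful."""
        qps = latest_value(QPS_METRIC)
        p95 = latest_value(P95_METRIC)
        if qps is None or qps < min_qps:
            return False  # too little traffic for the percentile to mean anything
        return p95 is not None and p95 > p95_limit_ms

    if __name__ == "__main__":
        print("would alert" if should_alert() else "ok (or not enough data)")
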
[20:05:08] back in ~15
[20:24:09] back
[21:04:14] ebernhardson: inflatador and I are gonna talk the alerting stuff https://meet.google.com/fde-tbpf-wqh?authuser=1
[21:04:24] kk, sec
[21:15:08] ryankemper: how many servers do we have for wdqs in each cluster/DC?
[21:16:11] Not urgent, but I was trying to get that from confctl and I had numbers that did not make sense to me
[21:16:41] Must be too late for my brain
[21:17:03] gehel looking at Cumin, there seem to be 11 in eqiad and 19 in CODFW
[21:17:57] Why the imbalance? Is that the new servers we were planning for the split-the-graph experiment?
[21:18:27] Or the expansion when we did not have enough budget for all?
[21:18:33] gehel all of the above
[21:18:57] Or do some of those need to be decommed?
[21:19:02] plus our delayed refresh of some hosts in eqiad
[21:19:28] we've already started decomming hosts in eqiad, I don't think any in codfw are up for replacement though
[21:19:34] The servers for the experiment are already in service?
[21:20:17] we have not explicitly set them aside, but I've told d-causse we can offer him 3 hosts in CODFW at least
[21:20:50] And those numbers are for both public and internal servers, right?
[21:21:09] Y
[21:22:19] Thanks! I'll stop bothering you for today!
[21:22:30] no bother for me. Get some sleep!
[21:31:15] based on chatter in operations, looks like more eqiad wdqs hosts are coming online soon
[22:47:04] Alright, we resolved the issue with the eqiad cirrus alert firing despite low qps. We also brought the codfw metric into line with how we do eqiad, so now there's no difference between how the two metrics work. Fixing the former problem also fixed the issue that led to switching codfw's alert to be solely based on more_like
[22:47:17] https://phabricator.wikimedia.org/T347341#9197646 for the explanation of how the current graphite metric works
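
As a footnote to the earlier question about what lives under the `MediaWiki.CirrusSearch.eqiad` "directory": graphite's /metrics/find endpoint can list a subtree directly. A small sketch, with a placeholder graphite host and defensive field access since the exact treejson fields may vary by graphite version:

    import requests

    GRAPHITE = "https://graphite.example.org"  # placeholder; not the real graphite host

    def list_children(prefix):
        """Print what graphite exposes directly under a metric 'directory'."""
        resp = requests.get(
            f"{GRAPHITE}/metrics/find",
            params={"query": f"{prefix}.*"},
            timeout=10,
        )
        resp.raise_for_status()
        for entry in resp.json():
            # treejson entries carry a display name and a leaf flag
            kind = "metric" if entry.get("leaf") else "dir"
            print(f"{kind:6} {entry.get('text')}")

    if __name__ == "__main__":
        list_children("MediaWiki.CirrusSearch.eqiad.requestTimeMs")
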