[07:42:41] Hi! SUP: Saneitizer is using 100% of its connection pool since 6:00 UTC and puts that part of the graph under pressure. The saneitizer events dropped from 6/s to 5/s and then down to 2/s and I wonder why. Envoy does not show an increased error rate that could explain the fully stretched connection pool.
[07:42:47] https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?orgId=1&var-k8sds=eqiad%20prometheus%2Fk8s&var-opsds=eqiad%20prometheus%2Fops&var-site=eqiad&var-app=flink-app-consumer-search&var-operator_name=rerender_fetch_failure_sink:_Writer&var-fetch_operator_name=fetch&var-fetch_operator_name=rerender_fetch&var-quartile=0.5&var-fetch_by=authority&var-top_k=1&from=now-3h&to=now
[07:43:11] o/
[07:43:15] pfischer: looking
[07:52:33] pfischer: is this causing undesirable effects? I see that this pool might have been saturated in the past in codfw as well
[07:59:39] perhaps latencies of the cirrus check api have dropped... unsure
[08:02:51] loop lateness is slowly increasing for multiple wikis, but nothing alarming, around ~4h, just ffwiki around 20h which perhaps is the one misbehaving?
[08:06:16] elastic latencies are quite bad actually (https://grafana-rw.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&from=now-6h&to=now)
[08:10:43] Missing triage, will be at the SRE meeting after
[08:12:24] pfischer: I think the sanitizer is suffering from the latency issue on the search cluster, unsure what's going on but it seems like the search clusters (both eqiad & codfw) are struggling
[08:27:36] dcausse: okay, thank you for looking into it!
[08:28:33] dcausse: Is there anything we can do about the ES clusters?
[08:29:20] pfischer: for now I guess it's about investigating the cause of the surge in load/latencies
[08:50:55] unrelated, but logrotate does not seem to clean up logs on the elastic nodes
[08:56:27] I *think* (not sure yet) that backpressure from search is what's causing the php-fpm saturation issue on mw-api-ext
[08:57:06] claime: yes I think so too, the php-fpm graph correlates pretty well with the latency increase in elastic@eqiad
[08:59:58] do you have an SRE on hand / can we help?
[09:01:35] claime: still trying to figure out the cause, perhaps trying to see if there's a new client doing weird stuff related to search on mw-api-ext might help?
[09:02:00] I know we can use turnilo/superset but I'm not super familiar with these tools
[09:08:51] seeing an increase in GeoData spatial search from 30 qps to ~60 qps
[09:10:37] down to ~20 qps since 20 mins and latencies have stabilized since then
[09:10:54] *10 mins
[09:11:22] yeah, saturation seems to have stabilized as well
[09:11:33] I didn't see a smoking gun in turnilo or superset btw
[09:12:35] will look further into this GeoData API, the doubling of the query rate seems suspicious, it's one of these slow APIs that are not super heavy traffic but are quite expensive
[09:13:05] might need some protection with a dedicated pool counter
[09:13:37] dcausse / pfischer: I don't have much to help, but scream if you think I should do anything!
[09:22:04] the latency issue in codfw started at 5am and correlates exactly with an increase from 30 qps to 60 qps of the geodata searches
[09:22:32] in eqiad it started at 6am and correlates perfectly with the increase of geodata searches from 30 qps to 60 qps as well
[09:23:04] going to add a new pool for this type of search
[09:41:06] dcausse: do you have a task we can sub to?
[09:41:25] claime: just finishing one, will tag you
[09:41:45] also did you see two bumps in the query rps just now? 0932 and 0938
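(A minimal sketch of what the dedicated pool mentioned at [09:13:05] and [09:23:04] could look like, assuming the standard $wgPoolCounterConf mechanism used for the other CirrusSearch pools; the 'CirrusSearch-GeoData' name and the numeric limits are illustrative assumptions, not the values actually deployed.)

    // Hypothetical config sketch: a dedicated PoolCounter pool so that expensive
    // GeoData spatial searches cannot saturate the shared search capacity.
    // Pool name and limits are assumptions for illustration only.
    $wgPoolCounterConf['CirrusSearch-GeoData'] = [
        'class' => 'PoolCounter_Client', // network client from the PoolCounter extension
        'timeout' => 15,  // seconds a request may wait for a slot before giving up
        'workers' => 20,  // concurrent spatial searches allowed per key
        'maxqueue' => 40, // reject immediately once this many requests are queued
    ];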
[09:43:00] claime: yes
[09:43:18] the client is still making requests I think
[09:43:25] ok, so it lines up perfectly with two other bumps in fpm worker sat
[09:43:58] I think I have a workaround, will try to prep that asap
[10:15:46] lunch
[10:56:17] lunch
[12:46:02] sounds like we had some "fun" this morning
[12:50:40] o/
[12:52:30] yes... a search API that happily served us without issues for the last 5+ years... but started to be hammered by a single client, causing the whole cluster to slow down
[12:53:37] Where did you find the info about the GeoData API? Guessing logstash?
[12:55:08] inflatador: no, logstash was not super helpful, almost all kinds of searches were having issues, I saw a bump in https://grafana-rw.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1 (section Per Query Type Metric)
[12:55:34] but had to change the "QPS By type" panel to display the top-20 instead of the top-5 shown by default
[13:01:45] do we have other APIs like this? Just wondering if there are any follow-up actions around monitoring/docs/etc
[13:08:27] inflatador: all searches should be guarded by the poolcounter, this one was somehow missed (because it's in another extension named GeoData)
[13:11:26] dcausse ACK, I guess all the other APIs come thru the Elastica/CirrusSearch extensions?
[13:12:21] inflatador: yes, given we missed this one I cannot guarantee that there are no other instances of this tho... we might have to check Translate
[13:13:37] maybe codesearch for the elastic endpoints? But the endpoints would probably be in config as opposed to code
[13:14:08] inflatador: yes... codesearch might be hard to use in that case... :(
[13:18:06] https://codesearch.wmcloud.org/things/?q=elasticsearch&files=&excludeFiles=&repos=
[13:18:41] dunno what BlueSpice or GrowthExperiments are
[13:19:02] probably not an exhaustive list either
[13:21:39] BlueSpice is not installed on wmf wikis, GrowthExperiments is but I think it should connect directly to elastic
[13:21:54] should *not*
[13:28:12] ACK, just spitballing
[13:44:44] \o
[13:45:12] it's always curious how I expect new servers to not be that much faster than old servers, but then whenever there is an issue it's quite clear in the load graphs that the new ones handle it better
[13:45:34] o/
[13:46:02] that was the opposite for wdqs, but we found why in the end :)
[13:46:09] :)
[13:47:01] thanks for finding the fix, surprised it's taken this long for someone to hammer that api
[13:47:10] and that it didn't take that much qps :S
[13:47:22] yes...
[13:47:54] haven't looked deeply, but IIRC that search uses a sort, so it's possibly quite costly...
[13:49:01] also the qps a month ago was more around 9 qps, then increased to ~30ish, then 60...
[13:49:52] oh interesting, I suppose I didn't even have an idea of what normal qps was there. 30 seemed plausible
[13:51:03] looks like we just lost 6 elasticsearch nodes in codfw. See #wikimedia-data-platform-alerts
[13:51:15] Note that we haven't enabled the performance governor for ES hosts like we did for WDQS. We got the OK from DC Ops but I didn't want to use the extra electricity if we didn't have to
[13:51:22] Do you think it's a good idea?
[13:51:27] gehel just saw that, looking now
[13:51:34] incident started in #wikimedia-operations
[13:51:41] so probably something more general than just search
[13:52:09] guessing these are on the same switch but will check
[13:52:34] inflatador: yea, on-site bumped something iiuc
[13:53:46] we're yellow on the main cluster but everything seems OK
[13:55:43] inflatador: for the perf governor, I dunno. I suspect everything is fine without it, but without hard numbers it would be difficult to say. I suppose it might be nice to run a test and at least know if we have some trick to pull later if needed
[13:57:53] * ebernhardson sees the SSO emails for wikitech and regrets having a different username on wikitech than everywhere else :P
[13:58:02] Ebernhardson vs EBernhardson (WMF)
[14:16:20] looks like that email has a typo... got confused by that a bit
[15:03:28] dcausse, pfischer: search triage: https://meet.google.com/eki-rafx-cxi
[15:50:35] dr0ptp4kt: just created: https://phabricator.wikimedia.org/T370661
[15:51:27] thx gehel
[15:53:21] ebernhardson: same on https://phabricator.wikimedia.org/T370662
[16:06:19] I totally read things wrong, beta cluster deploy is two weeks away :(
[17:24:05] dinner
[17:49:53] lunch, back in ~40
[19:16:07] sorry, been back a while
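(For the guard discussed at [13:08:27] — "all searches should be guarded by the poolcounter" — a rough sketch of how an extension can put its Elasticsearch call behind a pool, assuming MediaWiki core's PoolCounterWorkViaCallback; the pool name, key, and runSpatialSearch() helper are placeholders for illustration, not the actual GeoData code.)

    // Hypothetical guard around an expensive spatial search. The pool name must
    // match an entry in $wgPoolCounterConf (see the config sketch earlier in the log).
    $work = new PoolCounterWorkViaCallback(
        'CirrusSearch-GeoData',       // assumed pool name
        'geodata-search:' . $ipKey,   // per-client key so one caller cannot take every slot (assumption)
        [
            'doWork' => function () use ( $searchParams ) {
                // Runs only once a slot is available: the actual Elasticsearch query.
                return runSpatialSearch( $searchParams ); // hypothetical helper
            },
            'error' => function ( $status ) {
                // Pool exhausted or timed out: fail fast instead of piling onto the cluster.
                return $status;
            },
        ]
    );
    $result = $work->execute();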