[04:12:15] if anyone shows up responding to the page: There have been two spikes of Wikimedia\Rdbms\DBUnexpectedError affecting a significant portion of logged-in page loads
[04:15:15] o/
[04:15:21] taking a look now
[11:29:58] jelto, tappof - let's talk in here
[11:30:06] ack
[11:30:08] ack
[11:30:31] so the kubernetes deployment of the ratelimiting service looks ok to me, all containers ready and resources also ok
[11:31:02] I checked `kubectl logs api-gateway-production-5fc4d86886-d4rgd -n api-gateway production-ratelimit` for wikikube-eqiad and I see a lot of references related to liftwing's revertrisk, and the IP mentioned likely belongs to Enterprise
[11:31:15] but the logs are not super clear
[11:31:22] and what is weird to me is https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57
[11:31:36] rate_limit (the envoy cluster) is returning HTTP 504s?
[11:31:49] that is why we get the alarm, and it confuses me
[11:32:02] ftr: the pag.e resolved just now
[11:32:05] at the beginning I thought it was "too much throttling" or similar
[11:32:18] but I'd have expected 429s
[11:32:39] isn't this the same problem as last week that hu.gh was looking at, where one of the outcomes was that our limit is probably too low?
[11:32:50] sorry if I'm missing newer context (also in a meeting right now)
[11:32:57] (last week might be 2 weeks ago)
[11:33:18] no idea, it may be that, yes
[11:33:39] there are some useful logs on the production-ratelimit container
[11:33:44] (see above)
[11:33:57] but why does the graph say HTTP 504s are being returned?
[11:34:40] the client IP mentioned in the log is from Enterprise (at least that's the IP they mentioned in the email)
[11:34:47] ack perfect
[11:35:04] those are errors, and IIRC possibly returned by the service behind it, not the rate limit itself
[11:35:25] if hnowlan is around they might know more
[11:36:00] from https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&from=now-6h&to=now&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revertrisk&var-component=All&var-model_name=revertrisk-language-agnostic I don't see a big change in traffic
[11:38:05] ahh wait I see lw_inference_reference_need
[11:38:34] they had problems with that backend
[11:39:02] the right dashboard is https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&from=now-6h&to=now&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revision-models&var-component=All&var-model_name=reference-need
[11:39:07] but nothing big stands out
[11:41:27] I'm currently reading about the https://wikitech.wikimedia.org/wiki/Ratelimit service (which I had not heard of before). The logs state "applying limit: 250000 requests per HOUR" and I'm not sure if this applies to WM Enterprise
[11:42:08] but I can see the same traffic pattern in superset from WME: https://superset.wikimedia.org/superset/dashboard/p/5eEB8JVO9Y1/
[11:42:24] yes they call Lift Wing with authentication, otherwise the "anonymous" tier wouldn't be enough
[11:44:53] I was reading https://github.com/envoyproxy/envoy/issues/990, and maybe there is nothing really wrong except our timeout to the rate_limit cluster?
[11:45:08] like more traffic causes an extra bit of latency and 504s
[11:46:02] the default is 20ms
[11:48:11] ah wait, we have .Values.main_app.ratelimiter.envoy_timeout
[11:48:23] I'm not sure if raising the timeout makes sense, or if we have to increase the limit requests_per_unit (for WME)?
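For context, a minimal sketch of the kind of envoy HTTP rate limit filter being discussed, using the standard envoy.filters.http.ratelimit fields; only the 0.5s timeout, the 20ms default, and the fail-open behaviour mentioned in the chat are taken from it, while the domain and cluster names are placeholders, not the real api-gateway chart:

```yaml
# Sketch only, NOT the actual api-gateway configuration.
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: api_gateway              # placeholder descriptor domain
      timeout: 0.5s                    # timeout of the call to the rate limit service (envoy default: 20ms)
      failure_mode_deny: false         # fail open: if the rate limit call fails or times out, the request goes through
      rate_limit_service:
        transport_api_version: V3
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit   # the cluster whose 504s show up on the dashboard
```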
[11:48:44] We fail open anyway
[11:49:27] And ratelimit is supposed to be for internal calls, not for rate limiting external calls
[11:49:45] The only thing it's implemented for is mw-api-int
[11:50:11] we use it for the api-gateway to rate limit anonymous and authenticated traffic, no?
[11:51:37] like, in my head envoy goes through the internal call to the rate_limit cluster before reaching the backends
[11:51:51] ugh yeah
[11:52:08] and it's considered an internal call because it's api-gateway to mw-api-int
[11:52:15] and 504s to the rate limit may mean that the rate-limit cluster itself is somehow not capable of processing all the requests
[11:52:25] We can bump it up
[11:52:31] in https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ratelimit/v3/rate_limit.proto "timeout" seems to be in ms
[11:52:38] it's 3 replicas, we can bump it to 4 and see what happens
[11:52:43] I can also reach out to WME folks and let them know they are hitting a threshold and should reduce the rps
[11:52:49] ah okok, we have 0.5s for the timeout
[11:52:52] in our configs
[11:52:54] although again, we fail open
[11:53:25] yes the replicas are just 3, we can try that. But resource usage looks quite reasonable https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s&var-namespace=ratelimit&var-pod=ratelimit-main-79c85f9554-2lv8w&var-container=All&from=now-6h&to=now
[11:54:23] ahhh TIL that it is not something internal to envoy, it is actually a deployed service
[11:54:39] okok let's bump it to 6 manually, it is easy to revert in case
[11:54:47] ok if I do it?
[11:54:56] sounds good to me, let's try that
[11:55:02] go ahead
[11:55:50] done, pods up
[11:55:53] all 6 are running and ready
[11:56:06] let's see if the 5xx go down :)
[11:58:13] I can't see a drop in 5xx so far
[11:58:49] yep, probably not it
[12:01:00] mmm, from the envoy config the ratelimiter seems to be on localhost:8081
[12:01:47] that seems to be a local container
[12:01:51] not the rate limit pods
[12:02:10] it's a local container in the api-gateway pod I think
[12:02:12] (no idea if they then connect to the rate-limit pods)
[12:03:39] hmmm no they don't
[12:03:48] they have a completely different config
[12:05:23] the 5xx are going down
[12:05:27] from https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ratelimit/v3/rate_limit.proto "timeout" is in ms, we set "0.5s"; afaics that should be ok, but..
[12:08:24] going to revert the rate-limit pods back to 3 if everybody agrees
[12:08:29] +1
[12:08:34] +1
[12:08:43] yeah go ahead
[12:08:49] done
[12:09:01] I'm seeing a lot of "descriptor does not match any limit, no limits applied" in the ratelimiter
[12:09:09] (the api-gateway internal one)
[12:09:51] which to my knowledge should not happen, it should default to the "default" (ofc) ratelimit class
[12:10:31] totally ignorant about this bit
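The "descriptor does not match any limit" and "applying limit: 250000 requests per HOUR" log lines come from the rate limit service's per-domain descriptor config. As a rough illustration only — the domain name, descriptor keys, and the second tier below are invented, and only the 250000/hour figure appears in the quoted logs — a settings file of that shape looks like:

```yaml
# Illustrative ratelimit service settings; keys and values are assumptions.
domain: apigw                      # assumed domain name
descriptors:
  - key: user_class
    value: wme                     # assumed label for the WM Enterprise tier
    rate_limit:
      unit: hour
      requests_per_unit: 250000    # matches "applying limit: 250000 requests per HOUR"
  - key: user_class                # no value: catch-all for other clients sending this key
    rate_limit:
      unit: hour
      requests_per_unit: 50000     # assumed default tier
# A request whose descriptors match none of the entries above is what produces
# "descriptor does not match any limit, no limits applied".
```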
[12:11:25] the 504s dropped, they may page again though
[12:11:47] from looking at the graphs it happens every 2 to 3 hours, on the hour
[12:11:58] another test could be to increase the api-gw pods
[12:12:07] more load spread across more rate-limiters
[12:12:20] assuming it is not a backend issue (IIRC they use redis)
[12:12:53] we have nutcracker on the pods
[12:13:05] If we want to try the same thing with the rate-limit containers relevant to the API gateway, should we scale up the API gateway itself if the page is triggered again?
[12:13:13] eh ok elukey ... same thing
[12:13:19] :)
[12:13:51] it uses redis, and redis is fine
[12:13:51] we can try that, but the pods look pretty bored to me as well? https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s&var-namespace=api-gateway&var-pod=api-gateway-production-5fc4d86886-d4rgd&var-container=All&from=now-30m&to=now
[12:13:52] let's bump it to more replicas; if it works it should reduce even the super low ~1 rps of 504s
[12:15:04] Would you mind waiting for the next page before increasing the replicas, to see what happens?
[12:19:31] definitely yes, I am going out for lunch now :D
[12:19:36] I also filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1130574
[12:19:48] this is surely not the issue, but leaving it there anyway
[12:20:54] o/ ttl
[12:23:19] anyway I agree with you jelto that the pods don't seem to be really 'stressed', but we can try the same approach we used with the "wrong" rate-limit containers and see what happens
[12:24:02] speaking of changes, there is also an open change from Hugh to make the alert less noisy: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1127564
[12:26:17] jelto: we should deploy that
[12:28:13] Should I take care of that?
[12:28:26] jelto, claime: shouldn't we wait for the next event before adjusting the threshold, so we can try changing the number of replicas?
[12:29:17] I'm happy with both options, although we should be able to track the 5xx rps in https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=api-gateway&from=now-7d&to=now regardless of the alert
[12:29:24] tappof: yes, but that will fire a crit at the former threshold
[12:29:30] it just won't p.age
[12:29:52] aaaah okok, jelto claime thank you
[12:34:48] so should I merge and deploy the alert nerf from Hugh? Or should we wait for the next spike?
[12:35:37] for me, it's good to merge
[12:43:00] ok I'll proceed with merging and deploying the change for the alert, one sec
[12:56:25] the change should now be merged and deployed (I triggered puppet on the prometheus hosts in eqiad)
[13:19:11] hnowlan: When is codfw being repooled again?
[14:33:40] marostegui: we plan to repool codfw Thursday 14:00 UTC
[14:36:06] ymksci: thanks
[16:49:47] hi, I have a patch to set the lookup_option of the `abuse_networks` Hiera key to "deep_merge", which lets us easily extend it on WMCS: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128859
[16:49:48] I cherry-picked it last week on deployment-prep and that made Varnish reject a bot that was crawling deployment-prep.
[16:51:01] I could use a review & merge to resolve the task
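For anyone unfamiliar with Hiera merge behaviours, the patch above amounts to something along these lines — a sketch only, with assumed file paths, hash structure, and example networks (in Hiera 5 the deep-merge behaviour is spelled `merge: deep` under `lookup_options`):

```yaml
# hieradata/common.yaml (illustrative path, not the actual layout in operations/puppet)
lookup_options:
  abuse_networks:
    merge: deep            # deep-merge the hash across hierarchy levels instead of
                           # taking only the highest-priority value

# hieradata/cloud.yaml (illustrative WMCS-side override): extra entries get merged
# into the production abuse_networks hash rather than replacing it wholesale
abuse_networks:
  deployment_prep_blocked_nets:   # assumed key name and structure
    - 192.0.2.0/24                # example network (TEST-NET-1), not a real abuse range
```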
[17:08:36] any idea why I can't edit wikitech anymore? (username "AYounsi (WMF)")
[17:13:19] XioNoX: I think https://phabricator.wikimedia.org/T389433 is likely
[17:13:32] there were issues with cookies
[17:13:44] arnaudb: how'd you fix it for you?
[17:14:35] XioNoX: try deleting all the cookies for .wikitech and wikitech
[17:16:20] tgr_: ^^ your fix may not have worked :(
[17:17:11] I tried clearing cookies and even local storage with no luck
[17:17:43] weirdly, the move/delete buttons are there
[17:18:15] You do not have permission to edit this page, for the following reason:
[17:18:15] You must confirm your email address before editing pages. Please set and validate your email address through your user preferences.
[17:19:19] alright, solved :)
[17:19:35] thanks for the replies
[17:20:05] haha
[17:20:44] I think there's a task about turning that off
[21:43:14] Reedy: at least I'll add a note to the permission error to make sure people confirm their emails
[21:43:30] I will do it asap