[04:12:15] if anyone shows up responding to the page: There have been two spikes of Wikimedia\Rdbms\DBUnexpectedError affecting a significant portion of logged-in page loads
[04:15:15] o/
[04:15:21] taking a look now
[11:29:58] jelto, tappof - let's talk in here
[11:30:06] ack
[11:30:08] ack
[11:30:31] so the kubernetes deployment of the ratelimiting service looks ok to me, all containers ready and resources also ok
[11:31:02] I checked `kubectl logs api-gateway-production-5fc4d86886-d4rgd -n api-gateway production-ratelimit` for wikikube-eqiad and I see a lot of references related to liftwing's revertrisk, and the IP mentioned likely belongs to Enterprise
[11:31:15] but the logs are not super clear
[11:31:22] and what is weird to me is https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57
[11:31:36] rate_limit (the envoy cluster) is returning HTTP 504s?
[11:31:49] that is why we get the alarm, and it confuses me
[11:32:02] ftr: the pag.e resolved just now
[11:32:05] at the beginning I thought it was "too much throttling" or similar
[11:32:18] but I'd have expected 429s
[11:32:39] isn't this the same problem as last week that hu.gh was looking at, where one of the outcomes was that our limit is probably too low?
[11:32:50] sorry if I'm missing newer context (also in a meeting right now)
[11:32:57] (last week might be 2 weeks ago)
[11:33:18] no idea, it may be that, yes
[11:33:39] there are some useful logs on the production-ratelimit container
[11:33:44] (see above)
[11:33:57] but why does the graph say HTTP 504s are being returned?
[11:34:40] the client IP mentioned in the log is from Enterprise (at least that's the IP they mentioned in the email)
[11:34:47] ack perfect
[11:35:04] those are errors, and IIRC possibly returned by the service behind it, not the rate limit itself
[11:35:25] if hnowlan is around they might know more
[11:36:00] from https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&from=now-6h&to=now&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revertrisk&var-component=All&var-model_name=revertrisk-language-agnostic I don't see a big change in traffic
[11:38:05] ahh wait I see lw_inference_reference_need
[11:38:34] they had problems with that backend
[11:39:02] the right dashboard is https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&from=now-6h&to=now&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revision-models&var-component=All&var-model_name=reference-need
[11:39:07] but nothing big stands out
[11:41:27] I'm currently reading about the https://wikitech.wikimedia.org/wiki/Ratelimit service (which I had not heard of before). The logs state "applying limit: 250000 requests per HOUR" and I'm not sure if this applies to WM Enterprise
[11:42:08] but I can see the same traffic pattern in superset from WME: https://superset.wikimedia.org/superset/dashboard/p/5eEB8JVO9Y1/
[11:42:24] yes they call Lift Wing with authentication, otherwise the "anonymous" tier wouldn't be enough
[11:44:53] I was reading https://github.com/envoyproxy/envoy/issues/990, and maybe there is nothing really wrong except our timeout to the rate_limit cluster?
[11:45:08] like more traffic causes an extra bit of latency and 504s
[11:46:02] the default is 20ms
[11:48:11] ah wait, we have .Values.main_app.ratelimiter.envoy_timeout
[11:48:23] I'm not sure if raising the timeout makes sense, or if we have to increase the limit requests_per_unit (for WME)?
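For context, a minimal sketch of the kind of envoy HTTP rate limit filter being discussed, using the standard envoy.filters.http.ratelimit fields; only the 0.5s timeout, the 20ms default, and the fail-open behaviour mentioned in the chat are taken from it, while the domain and cluster names are placeholders, not the real api-gateway chart:

```yaml
# Sketch only, NOT the actual api-gateway configuration.
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: api_gateway              # placeholder descriptor domain
      timeout: 0.5s                    # timeout of the call to the rate limit service (envoy default: 20ms)
      failure_mode_deny: false         # fail open: if the rate limit call fails or times out, the request goes through
      rate_limit_service:
        transport_api_version: V3
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit   # the cluster whose 504s show up on the dashboard
```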
[11:48:44] We fail open anyway
[11:49:27] And ratelimit is supposed to be for internal calls, not for rate limiting external calls
[11:49:45] The only thing it's implemented for is mw-api-int
[11:50:11] we use it for the api-gateway to rate limit anonymous and authenticated traffic, no?
[11:51:37] like, in my head envoy goes through the internal call to the rate_limit cluster before reaching the backends
[11:51:51] ugh yeah
[11:52:08] and it's considered an internal call because it's api-gateway to mw-api-int
[11:52:15] and 504s to the rate limit may mean that the rate-limit cluster itself is somehow not capable of processing all the requests
[11:52:25] We can bump it up
[11:52:31] in https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ratelimit/v3/rate_limit.proto "timeout" seems to be in ms
[11:52:38] it's 3 replicas, we can bump it to 4 and see what happens
[11:52:43] I can also reach out to WME folks and let them know they are hitting a threshold and should reduce the rps
[11:52:49] ah okok, we have 0.5s for the timeout
[11:52:52] in our configs
[11:52:54] although again, we fail open
[11:53:25] yes the replicas are just 3, we can try that. But resource usage looks quite reasonable https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s&var-namespace=ratelimit&var-pod=ratelimit-main-79c85f9554-2lv8w&var-container=All&from=now-6h&to=now
[11:54:23] ahhh TIL that it is not something internal to envoy, it is actually a deployed service
[11:54:39] okok let's bump it to 6 manually, it is easy to revert in case
[11:54:47] ok if I do it?
[11:54:56] sounds good to me, let's try that
[11:55:02] go ahead
[11:55:50] done, pods up
[11:55:53] all 6 are running and ready
[11:56:06] let's see if the 5xx go down :)
[11:58:13] I can't see a drop in 5xx so far
[11:58:49] yep, probably not it
[12:01:00] mmm, from the envoy config the ratelimiter seems to be on localhost:8081
[12:01:47] that seems to be a local container
[12:01:51] not the rate limit pods
[12:02:10] it's a local container in the api-gateway pod I think
[12:02:12] (no idea if they then connect to the rate-limit pods)
[12:03:39] hmmm no they don't
[12:03:48] they have a completely different config
[12:05:23] the 5xx are going down
[12:05:27] from https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ratelimit/v3/rate_limit.proto "timeout" is in ms, we set "0.5s"; afaics that should be ok, but..
[12:08:24] going to revert the rate-limit pods back to 3 if everybody agrees
[12:08:29] +1
[12:08:34] +1
[12:08:43] yeah go ahead
[12:08:49] done
[12:09:01] I'm seeing a lot of "descriptor does not match any limit, no limits applied" in the ratelimiter
[12:09:09] (the api-gateway internal one)
[12:09:51] which to my knowledge should not happen, it should default to the "default" (ofc) ratelimit class
[12:10:31] totally ignorant about this bit
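The "descriptor does not match any limit" and "applying limit: 250000 requests per HOUR" log lines come from the rate limit service's per-domain descriptor config. As a rough illustration only — the domain name, descriptor keys, and the second tier below are invented, and only the 250000/hour figure appears in the quoted logs — a settings file of that shape looks like:

```yaml
# Illustrative ratelimit service settings; keys and values are assumptions.
domain: apigw                      # assumed domain name
descriptors:
  - key: user_class
    value: wme                     # assumed label for the WM Enterprise tier
    rate_limit:
      unit: hour
      requests_per_unit: 250000    # matches "applying limit: 250000 requests per HOUR"
  - key: user_class                # no value: catch-all for other clients sending this key
    rate_limit:
      unit: hour
      requests_per_unit: 50000     # assumed default tier
# A request whose descriptors match none of the entries above is what produces
# "descriptor does not match any limit, no limits applied".
```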
[12:11:25] the 504s dropped, they may page again though
[12:11:47] from looking at the graphs it happens every 2 to 3 hours, on the hour
[12:11:58] another test could be to increase the api-gw pods
[12:12:07] more load spread across more rate-limiters
[12:12:20] assuming it is not a backend issue (IIRC they use redis)
[12:12:53] we have nutcracker on the pods
[12:13:05] If we want to try the same thing with the rate-limit containers relevant to the API gateway, should we scale up the API gateway itself if the page is triggered again?
[12:13:13] eh ok elukey ... same thing
[12:13:19] :)
[12:13:51] it uses redis, and redis is fine
[12:13:51] we can try that, but the pods look pretty bored to me as well? https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s&var-namespace=api-gateway&var-pod=api-gateway-production-5fc4d86886-d4rgd&var-container=All&from=now-30m&to=now
[12:13:52] let's bump it to more replicas; if it works it should reduce even the super low ~1 rps of 504s
[12:15:04] Would you mind waiting for the next page before increasing the replicas, to see what happens?
[12:19:31] definitely yes, I am going out for lunch now :D
[12:19:36] I also filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1130574
[12:19:48] this is surely not the issue, but leaving it there anyway
[12:20:54] o/ ttl
[12:23:19] anyway I agree with you jelto that the pods don't seem to be really 'stressed', but we can try the same approach we used with the "wrong" rate-limit containers and see what happens
[12:24:02] speaking of changes, there is also an open change from Hugh to make the alert less noisy: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1127564
[12:26:17] jelto: we should deploy that
[12:28:13] Should I take care of that?
[12:28:26] jelto, claime: shouldn't we wait for the next event before adjusting the threshold, so we can try changing the number of replicas?
[12:29:17] I'm happy with both options, although we should be able to track the 5xx rps in https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=api-gateway&from=now-7d&to=now regardless of the alert
[12:29:24] tappof: yes, but that will fire a crit at the former threshold
[12:29:30] it just won't p.age
[12:29:52] aaaah okok, jelto claime thank you
[12:34:48] so should I merge and deploy the alert nerf from Hugh? Or should we wait for the next spike?
[12:35:37] for me, it's good to merge
[12:43:00] ok I'll proceed with merging and deploying the change for the alert, one sec
[12:56:25] the change should now be merged and deployed (I triggered puppet on the prometheus hosts in eqiad)
[13:19:11] hnowlan: When is codfw being repooled again?
[14:33:40] marostegui: we plan to repool codfw Thursday 14:00 UTC
[14:36:06] ymksci: thanks
[16:49:47] hi, I have a patch to set the lookup_option of the `abuse_networks` Hiera key to "deep_merge", which lets us easily extend it on WMCS: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128859
[16:49:48] I cherry-picked it last week on deployment-prep and that made Varnish reject a bot that was crawling deployment-prep.
[16:51:01] I could use a review & merge to resolve the task
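For anyone unfamiliar with Hiera merge behaviours, the patch above amounts to something along these lines — a sketch only, with assumed file paths, hash structure, and example networks (in Hiera 5 the deep-merge behaviour is spelled `merge: deep` under `lookup_options`):

```yaml
# hieradata/common.yaml (illustrative path, not the actual layout in operations/puppet)
lookup_options:
  abuse_networks:
    merge: deep            # deep-merge the hash across hierarchy levels instead of
                           # taking only the highest-priority value

# hieradata/cloud.yaml (illustrative WMCS-side override): extra entries get merged
# into the production abuse_networks hash rather than replacing it wholesale
abuse_networks:
  deployment_prep_blocked_nets:   # assumed key name and structure
    - 192.0.2.0/24                # example network (TEST-NET-1), not a real abuse range
```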
[17:08:36] any idea why I can't edit wikitech anymore? (username "AYounsi (WMF)")
[17:13:19] XioNoX: I think https://phabricator.wikimedia.org/T389433 is likely
[17:13:32] there were issues with cookies
[17:13:44] arnaudb: how'd you fix it for you?
[17:14:35] XioNoX: try deleting all the cookies for .wikitech and wikitech
[17:16:20] tgr_: ^^ your fix may not have worked :(
[17:17:11] I tried clearing cookies and even local storage with no luck
[17:17:43] weirdly, the move/delete buttons are there
[17:18:15] You do not have permission to edit this page, for the following reason:
[17:18:15] You must confirm your email address before editing pages. Please set and validate your email address through your user preferences.
[17:19:19] alright, solved :)
[17:19:35] thanks for the replies
[17:20:05] haha
[17:20:44] I think there's a task about turning that off
[21:43:14] Reedy: at least I'll add a note to the permission error to make sure people confirm their emails
[21:43:30] I will do it asap