[08:16:37] o/ folks
[08:21:32] I need edit permissions for the LW API portal docs https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_goodfaith_prediction. Can any1 point me towards the right place to file a request? (phabricator search isn't helping me)
[08:26:26] o/
[08:26:43] in theory we all should have it, I requested it from the API portal's team
[08:26:44] lemme see
[08:27:12] I thought I had checked it at the time, but probably I was wrong as I don't have access
[08:27:16] https://phabricator.wikimedia.org/T330634
[08:29:16] hmm I used to have two users attached to my phabricator account (WMF and personal one) and at some point I got a request to remove my personal one. perhaps that one was attached
[08:29:26] thanks Luca I'll follow up on that
[08:30:40] ahhh yes maybe something like that happened
[08:35:13] one interesting thing that I am checking now
[08:35:49] if I made calls from ores-legacy I'd expect them to show up on the istio gateway's access log dashboard
[08:35:59] and what I find is usually a single HTTP/2 connection logged
[08:36:24] now I am wondering if we multiplex multiple calls on http/2, logging only one
[08:36:33] and maybe the timeouts could be related to http/2 limits
[08:39:47] yeah I see some bugs related to envoy setting max streams 100 for http/2 conns
[08:40:12] aha! I would have never thought about this
[08:41:13] nice catch. can we override this though only for LW/ores-legacy?
[08:42:09] quick (unrelated) question: the host header in LW is set from the MODEL_NAME/INFERENCE_NAME right?
[08:42:34] isaranto: yes yes the one that kserve needs
[08:43:00] it is a blind spot for me where that is set
[08:43:32] I think it uses the inference name
[08:44:09] for the http2 vs http1 conns - maybe we could disable http2 in the istio gateway, and deal only with http1.1 conns..
[08:44:54] e\o
[08:45:37] elukey: HTTP/2 vs HTTP/1.1 seemed to improve the 503 situation on RR-LA as well, but it's very close to the noise floor in that regard. It also reduced performance quite a bit.
[08:45:55] i.e. with 1.1, the 503s seemed to go away completely
[08:46:05] But that might also be because the qps dropped
[08:48:11] klausman: o/ so http1.1 is set by the go client that you are using, right?
[08:51:10] klausman: also, do you have any limit on the max http2 streams used?
[08:51:25] I am wondering what happens if you use http/2 and increase that value
[08:55:31] in the ores-legacy case it is probably the tls-proxy that tries to upgrade from http1 to http2
[08:56:50] I am looking for a way to force http 1.1 on the istio gw but it seems not straightforward
[09:07:35] elukey: yes, the Go http client defaults to HTTP/2 with servers that support it.
[09:07:49] as for # of streams, I haven't checked yet if that is configurable
[09:09:58] I _think_ I am using one stream, since I create a new Request object (and thus client object) for every request I make
[09:10:30] The library I am using (Go's stdlib `net/http`) does not seem to expose stream count as configurable.
[09:13:14] need to go afk for some errands, ttl!
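
A minimal sketch of what forcing HTTP/1.1 from the Go side can look like, assuming a plain net/http client along the lines of the load-test tool discussed above; the URL is a placeholder, not a real endpoint. Per the net/http documentation, setting Transport.TLSNextProto to a non-nil, empty map is the supported way to disable the client's HTTP/2 support.

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"
        "time"
    )

    func main() {
        // A non-nil, empty TLSNextProto map disables HTTP/2 in net/http, so every
        // request goes over plain HTTP/1.1 with no stream multiplexing.
        tr := &http.Transport{
            ForceAttemptHTTP2: false,
            TLSNextProto:      map[string]func(string, *tls.Conn) http.RoundTripper{},
        }
        client := &http.Client{Transport: tr, Timeout: 10 * time.Second}

        // Placeholder URL; substitute the Lift Wing endpoint under test.
        resp, err := client.Get("https://example.org/")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        log.Println(resp.Proto, resp.StatusCode) // resp.Proto should report "HTTP/1.1"
    }
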
[09:13:18] \o
[09:13:31] I'll do some code refactoring to reuse the client object, see what happens
[09:35:14] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for testing - https://phabricator.wikimedia.org/T342417 (LDlulisa-WMF)
[09:39:29] Machine-Learning-Team, Epic: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (Gehel)
[09:39:32] Machine-Learning-Team, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (Gehel) Open→Resolved
[10:36:52] * klausman lunch
[10:40:14] Machine-Learning-Team, Research: Update torch's settings in the Knowledge Integrity repo - https://phabricator.wikimedia.org/T325349 (CodeReviewBot) mnz closed https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/10 Remove torch dependency
[10:59:32] Machine-Learning-Team, Research: Index out of range in revert risk multi-lingual - https://phabricator.wikimedia.org/T340811 (MunizaA) > I managed to separate the headers from sections in the code so it's much cleaner now and seems to run fine for your list of revisions with timeout set to True. @Isaac...
[11:28:58] * isaranto lunch!
[12:25:51] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (Sgs) a: Sgs→Trizek-WMF @Trizek-WMF I can confirm all the wikis from this round have produced abundant re...
[12:26:51] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (Sgs)
[12:27:51] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (Sgs) Status update, as per today the maintenance script is processing `lbwiki`. Growth engineers are trying to come up with an agreem...
[13:11:04] very weird, if I make a lot of calls to the api-gateway for say goodfaith I can only see one request logged in logstash, even if it is http1.1
[13:11:16] it makes zero sense
[13:41:18] That is really weird. even with connection pooling, I'd expect more than one
[13:42:01] I am trying to figure out if the ingress gws do have the logs but we drop them somewhere or not
[13:43:36] klausman: I thought about setting http2 on your side - in theory it is fine, since we want http2 up to the CDN, but we'd probably prefer http/1.1 for internal conns
[13:43:50] and we have multiple envoys after the CDN
[13:43:53] api-gateway
[13:43:57] istio gateway
[13:44:01] sidecars
[13:44:02] etc..
[13:44:12] and they seem to upgrade to http/2
[13:44:14] Ack.
[13:44:29] nono I was trying to brainbounce, does it make sense? :D
[13:44:35] I haven't experimented with pooled HTTP/1.1 yet, still tweaking my code a bit.
[13:45:08] I think we want to support HTTP/2 in the long term anyway, so let's not chase that "solution" too much.
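
A rough sketch of the client-object reuse mentioned at 09:13:31, assuming a Go load generator along the lines described above; the URL, worker counts, and the X-Request-Id response header are assumptions rather than confirmed details.

    package main

    import (
        "io"
        "log"
        "net/http"
        "sync"
        "time"
    )

    // One shared client: the underlying Transport keeps a pool of HTTP/1.1
    // connections (or multiplexes streams if HTTP/2 is negotiated), instead of
    // opening a fresh connection for every Request as in the per-request setup.
    var client = &http.Client{Timeout: 10 * time.Second}

    func worker(id, n int, wg *sync.WaitGroup) {
        defer wg.Done()
        for i := 0; i < n; i++ {
            start := time.Now()
            // Placeholder URL; substitute the Lift Wing endpoint under test.
            resp, err := client.Get("https://example.org/")
            if err != nil {
                log.Printf("worker %d: %v", id, err)
                continue
            }
            body, _ := io.ReadAll(resp.Body)
            resp.Body.Close()
            if resp.StatusCode >= 500 {
                // X-Request-Id is only an assumption: envoy sets it on requests,
                // but whether it is echoed back depends on gateway config.
                log.Printf("worker %d: %d in %v req_id=%q body=%q",
                    id, resp.StatusCode, time.Since(start),
                    resp.Header.Get("X-Request-Id"), body)
            }
        }
    }

    func main() {
        var wg sync.WaitGroup
        for w := 0; w < 8; w++ {
            wg.Add(1)
            go worker(w, 1000, &wg)
        }
        wg.Wait()
    }

With a shared client the Transport pools connections (or multiplexes HTTP/2 streams), which would also explain seeing far fewer logged connections than requests on the gateway side.
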
[13:45:22] I will try a test run with 1.1 after the current one has completed
[13:46:22] we want to support http2 up to the CDN for sure, anything after it needs a discussion
[13:46:37] also: I wish the errors we get back client side would include the request_id
[13:47:25] the whole deal of multiplexing streams is a bit messy and it complicates all our metrics
[13:47:57] I _think_ envoy et al should be able to provide adequate metrics. They definitely have enough information.
[13:48:46] The errors that I still see look like this:
[13:48:56] 2023/07/21 13:48:33 id:063 503 time:48.190943ms
[13:48:58] H: map[Content-Length:[145] Content-Type:[text/plain] Date:[Fri, 21 Jul 2023 13:48:33 GMT] Server:[istio-envoy] X-Envoy-Upstream-Service-Time:[47]]
[13:49:00] B: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
[13:49:03] H: is header, B: is body
[13:49:32] yes I think we are hitting some bottleneck
[13:49:39] we have been seeing them in the past as well
[13:49:43] (this is http/1.1 enforced from the client. Dunno if envoy secretly upgrades to /2
[13:49:46] )
[13:50:15] this is my point - at any given stage of our proxy pipeline of envoys, any envoy can upgrade to http/2
[13:50:38] yeah, and that may make debugging and monitoring rather murky
[13:51:14] One option we might have is to set everything to 1.1, debug it until we're confident, then try /2 in select spots and see what happens.
[13:51:51] For this /1.1 run, out of 211870 requests in 5m, 325 (0.15%) were answered with 503s like the one above.
[13:52:57] klausman: are you hitting revert risk via api-gateway?
[14:00:22] No, I am using the internal endpoint of eqiad directly, to avoid as many rate limits as possible
[14:00:52] okok because I saw something weird in the logs, namely that the pod serving revert risk was ingress-gateway-services
[14:01:03] that is the one used only for non-kserve stuff, like ores-legacy
[14:01:15] Huh.
[14:01:19] just for confirmation, are you using port 30443?
[14:01:23] So that must be someone else
[14:03:14] I see kubernetes.labels.service_istio_io/canonical-name
[14:03:15] istio-ingressgateway-services
[14:03:25] https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-language-agnostic:predict and Host: revertrisk-language-agnostic.revertrisk.wikimedia.org
[14:03:33] okok perfect
[14:03:43] because in theory we use -services only on port 31443
[14:04:53] ah no I see it for others as well
[14:05:06] ack. I also wasn't running any tests for several minutes until 14:04:30
[14:06:01] ok I see logs of revert risk on ingress gateways services, that is not how it was intended
[14:06:40] fwiw, I am working from stat1004, so if that IP is in forwarded-for, it may be me
[14:07:18] mmm it seems all "upstream_cluster":"outbound|80||revertrisk-language-agnostic-predictor-default-00008.revertrisk.svc.cluster.local"
[14:07:29] this is usually the knative-internal-gateway
[14:08:03] maybe that one is deployed on all gateways, mmm
[14:08:05] will check
[14:09:43] of course, the selector is istio: ingressgateway and both gateway pods have it
[14:09:56] it shouldn't matter a lot for the current issues, but we want to keep the things split
[14:09:59] I thought I fixed it
[14:10:19] Can you clarify what the different traffic patterns are, and where they _should_ go?
[14:11:29] we have two istio gateways deployed, one called ingress-gateway and another one called ingress-gateway-services
[14:11:36] this is defined in the custom.d yaml file
[14:11:47] one listening on nodeport 30443 and one on 31443
[14:12:01] Machine-Learning-Team, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (MGerlach) @elukey thanks for the additional context. there are ongoing discussions in the Research Team around the level of commitment we ca...
[14:12:02] Ah, so the distinction criterion is ores-legacy vs everything else?
[14:12:17] the idea was to have the kserve stuff going through the former, and the serviceops-like services in the other one
[14:12:21] yes correct
[14:12:32] in -services I'd love to see flowing only ores-legacy for the moment
[14:12:35] and recommendation-api etc..
[14:13:06] the main issue is that in the Gateway resources (istio) that we define in the Knative chart, we use selector "istio: ingressgateway"
[14:13:10] that is common to both
[14:13:29] so when we deploy the Gateway resource, all istio gw pods are entitled to route requests for it
[14:13:42] (at least this is my current running theory)
[14:14:05] Ah, I see.
[14:14:22] Do you think that maybe we sometimes hit the -services IG and thus get 503s?
[14:17:32] in theory no, but let's fix it
[14:17:34] sending a patch
[14:17:37] ack.
[14:17:52] also, is there an easy way to see what pod a given IP:port routes to?
[14:18:08] nvm, that is logged alongside anyway
[14:19:35] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940361/
[14:20:12] of course -1 from ci sigh
[14:22:46] I think you may need to quote service.istio.io/canonical-name since it has a - in it
[14:23:07] But your YAML-fu is better than mine anyway
[14:23:37] nono good point lemme try
[14:24:53] also: the 503s I see always come in one or two big blobs in like 1-3 seconds and the rest of the run (minutes!) is quiet
[14:30:35] ok now it should work, I quoted + there was an extra "-" before Values
[14:43:00] reading
[14:43:21] LGTM!
[14:45:37] thanks :)
[14:46:13] lmk when you roll out, so I can pause experimenting (and maybe see if it makes a difference)
[14:52:54] I am going to test in staging first, should be a little change but I don't want to risk a big problem on a friday :)
[14:58:06] Yep, understood.
[14:58:38] of course deployment failed and left helm in a weird state
[14:58:40] lol
[14:59:10] Another observation I have had: if I don't test for a while (O(hours)), the 503s seem more common. Or put another way: warming up the pods makes the 503s less likely. But I don't have hard data
[14:59:19] anything I can do to help re: helm?
[15:01:29] lemme check wikitech first
[15:01:35] one weird thing, I noticed it by mistake
[15:01:53] NAME READY STATUS RESTARTS AGE
[15:01:56] activator-57c6dffbfd-2c7sd 1/1 Running 10 (15m ago) 24h
[15:01:59] activator-57c6dffbfd-l75z7 1/1 Running 8 (16m ago) 24h
[15:02:02] there are a ton of errors logged in logstash too
[15:02:08] I think they are related to your load tests
[15:03:10] Huh.
[15:03:17] What are the errors?
[15:04:20] I think some timeouts
[15:04:40] I'll have a look-see
[15:06:49] I see lotsa `Failed to probe clusterIP 10.67.12.88:80`
[15:08:02] I am trying to rollback via https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Advanced_use_cases:_using_helm
[15:08:03] what app did you see the errors in?
[15:08:48] ??
[15:09:08] You mentioned a lot of errors/timeouts in logstash.
[15:09:36] So I went there to find them, but I see nothing super unusual
[15:09:47] knative logs
[15:09:51] thx
[15:10:51] we have a dashboard in logstash but it seems broken
[15:10:52] sigh
[15:11:04] anyway, you can find them if you tail the logs of the activator
[15:11:27] all right new knative gateways up
[15:11:29] let's see
[15:13:25] The only errors I see on the logtail are for experimental/falcon-7b-instruct-predictor-default-*
[15:13:51] ah, my run had completed :D
[15:18:53] in the activator?
[15:19:38] weird..
[15:19:51] why did the activator pod restart?
[15:20:42] some readiness probes failed
[15:20:43] mmmm
[15:24:59] all the pod probes from activator that I saw were not talking about RR-LA
[15:25:18] yes yes but the activators restarted, not a good sign
[15:25:34] there is an option to remove the activator from the request handling path
[15:25:40] we should probably try it
[15:29:13] klausman: my theory is that the activators failed the readiness probes when they were under load
[15:32:02] the errors are not great but we can check them later
[15:33:06] klausman: we also lost all the metrics some days ago https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=All&from=now-7d&to=now
[15:33:10] what the hell today :D
[15:33:30] the 18th around 14 UTC
[15:34:30] and I caused it https://sal.toolforge.org/log/DFA9aYkBGiVuUzOduD9s
[15:35:14] but it was for the container limits - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939257
[15:35:53] Machine-Learning-Team, Research: Index out of range in revert risk multi-lingual - https://phabricator.wikimedia.org/T340811 (Isaac) @MunizaA -- I took a quick pass and left some comments but generally looking good. I didn't test locally but hopefully should give you some substantial speed-ups with very...
[15:38:53] That is weird as hell
[15:39:55] Hey, on a different topic, I got access to the API docs :), all good!
[15:39:59] I didn't start testing with high rates until after the 18th
[15:40:10] I'm logging off for the weekend, cu all!
[15:40:54] elukey: so the metrics issue will be resolved, I presume?
[15:41:06] as for the activators falling over, do we just need more replicas?
[15:41:59] isaranto: o/
[15:42:12] klausman: will be resolved?
[15:42:27] for the activators not sure, we need to inspect metrics
[15:42:51] can you take care of this part? Maybe we hit cpu limits when you were running the load test
[15:43:14] I checked an activator pod and I can see the metrics via nsenter
[15:43:16] But I didn't start testing in earnest back then, only later
[15:43:26] (IIRC)
[15:45:05] sure but if you could check what's wrong in there it would be great
[15:45:14] yeah, sure, I can do that
[15:45:15] it may also be related to your load tests
[15:45:17] or something else
[15:46:08] klausman: what I was referring to above is the activators falling over, not the metrics
[15:47:00] ack
[15:49:25] One thing that is a bit annoying is the constantly-failing probes for falcon-7b
[15:51:38] The activators falling over is due to OOMs
[15:52:00] cf. https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?forceLogin=&orgId=1&var-datasource=eqiad+prometheus%2Fk8s-mlserve&var-namespace=knative-serving&var-pod=activator-57c6dffbfd-2c7sd&var-pod=activator-57c6dffbfd-l75z7&var-container=All&from=1689911515483&to=1689954715483&viewPanel=15
[15:56:50] klausman: yes I suspected that, all traffic goes through it, so possibly some 503s are due to the ooms
[15:57:05] ack. Working on a quota increase patch
[15:59:46] super
[15:59:50] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940386 as well
[16:00:09] I didn't add them to one of the gateways
[16:00:56] grmbl, helmfile diff hates me
[16:04:58] elukey: when doing resources, is there a way to _not_ specify CPU? Or do it in a way that preserves whatever the default is?
[16:05:37] klausman: in theory just omitting it should work, but check the helmfile templates to be sure
[16:05:44] argh, fml. tab on the line.
[16:06:08] fixed the ml-staging-codfw's knative setup to split traffic between gateways, I'll deploy the changes to prod on monday
[16:06:18] It's the map problem: in ./helmfile.yaml: error during helmfile_namespaces.yaml.part.0 parsing: template: stringTemplate:128:41: executing "stringTemplate" at <.compute.requests.cpu>: map has no entry for key "cpu"
[16:07:00] yeah then it is needed, check helmfile_namespaces.yaml to be sure
[16:15:17] weird. So I hand-added my changes to charts/knative-serving/values.yaml on deploy1001 and diffed for ml-staging, but I don't see those changes (whole bunch of _different_ resource changes, though)
[16:15:44] (my change is just a doubling of the memory values in charts/knative-serving/values.yaml near line 73)
[16:16:35] klausman: yeah it makes sense, no chart bump so it is not picked up. You can override them in admin_ng/ml-serve.yaml and it should work
[16:17:03] ack. what about the other resource changes? Most or all of them are for revscoring stuff
[16:17:20] sorry, even better, there is a knative-specific yaml in admin_ng
[16:17:44] kserve/values.yaml?
[16:18:03] oh, nvm, found it
[16:21:50] going afk for the weekend folks
[16:21:57] I'll restart working on the metrics issue on monday
[16:22:11] (and I'll deploy the gateways traffic split too)
[16:22:12] can you review my quota thing, or should I wait for next week?
[16:22:23] yes definitely, let's not roll it out today
[16:22:42] ack! have a splendid weekend!
[16:22:48] you too!
[16:24:30] heading out now as well
[16:35:03] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (dancy) I built the image on contint1002 today. The image size is 4.6GB with one of the layers being 4.21GB. I tried pushing to the registry it ke...
[19:30:57] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (dancy) Digging up T288198 reveals that `/var/lib/nginx` is hosted on a tmpfs of size 2GB. That would explain the problem. I'll try to revive the...
[19:43:31] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (dancy)
[21:58:40] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 62 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
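
On the load tests discussed earlier: if the activator OOMs under bursts, one hypothetical client-side mitigation is to pace requests with a ticker instead of firing them as fast as possible. A sketch with made-up numbers (a 200 qps cap, placeholder URL), not part of the actual test tool; the real runs above were around 700 qps (211870 requests in 5 minutes).

    package main

    import (
        "log"
        "net/http"
        "time"
    )

    func main() {
        client := &http.Client{Timeout: 10 * time.Second}
        // Hypothetical cap of 200 requests/second.
        ticker := time.NewTicker(time.Second / 200)
        defer ticker.Stop()
        for i := 0; i < 10000; i++ {
            <-ticker.C // wait for the next slot before firing a request
            go func(id int) {
                // Placeholder URL; substitute the Lift Wing endpoint under test.
                resp, err := client.Get("https://example.org/")
                if err != nil {
                    log.Printf("req %d: %v", id, err)
                    return
                }
                resp.Body.Close()
                if resp.StatusCode >= 500 {
                    log.Printf("req %d: %d", id, resp.StatusCode)
                }
            }(i)
        }
        // Crude settle time for in-flight requests; a real tool would use a WaitGroup.
        time.Sleep(15 * time.Second)
    }
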