[08:16:37] o/ folks
[08:21:32] I need edit permissions for the LW API portal docs https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_goodfaith_prediction. Can any1 point me towards the right place to file a request? (phabricator search isn't helping me)
[08:26:26] o/
[08:26:43] in theory we all should have it, I requested it from the API portal's team
[08:26:44] lemme see
[08:27:12] I thought I had checked it at the time, but probably I was wrong as I don't have access
[08:27:16] https://phabricator.wikimedia.org/T330634
[08:29:16] hmm I used to have two users attached to my phabricator account (WMF and personal one) and at some point I got a request to remove my personal one. perhaps that one was attached
[08:29:26] thanks Luca I'll follow up on that
[08:30:40] ahhh yes maybe something like that happened
[08:35:13] one interesting thing that I am checking now
[08:35:49] if I made calls from ores-legacy I'd expect them to show up on the istio gateway's access log dashboard
[08:35:59] and what I find is usually a single HTTP/2 connection logged
[08:36:24] now I am wondering if we multiplex multiple calls on http/2, logging only one
[08:36:33] and maybe the timeouts could be related to http/2 limits
[08:39:47] yeah I see some bugs related to envoy setting max streams 100 for http/2 conns
[08:40:12] aha! I would have never thought about this
[08:41:13] nice catch. can we override this though only for LW/ores-legacy?
[08:42:09] quick (unrelated) question: the host header in LW is set from the MODEL_NAME/INFERENCE_NAME right?
[08:42:34] isaranto: yes yes the one that kserve needs
[08:43:00] it is a blind spot for me where that is set
[08:43:32] I think it uses the inference name
[08:44:09] for the http2 vs http1 conns - maybe we could disable http2 in the istio gateway, and deal only with http1.1 conns..
[08:44:54] e\o
[08:45:37] elukey: HTTP/2 vs HTTP/1.1 seemed to improve the 503 situation on RR-LA as well, but it's very close to the noise floor in that regard. It also reduced performance quite a bit.
[08:45:55] i.e. with 1.1, the 503s seemed to go away completely
[08:46:05] But that might also be because the qps dropped
[08:48:11] klausman: o/ so http1.1 is set by the go client that you are using, right?
[08:51:10] klausman: also, do you have any limit on the max http2 streams used?
[08:51:25] I am wondering what happens if you use http/2 and increase that value
[08:55:31] in the ores-legacy case it is probably the tls-proxy that tries to upgrade from http1 to http2
[08:56:50] I am looking for a way to force http 1.1 on the istio gw but it seems not straightforward
[09:07:35] elukey: yes, the Go http client defaults to HTTP/2 with servers that support it.
[09:07:49] as for # of streams, I haven't checked yet if that is configurable
[09:09:58] I _think_ I am using one stream, since I create a new Request object (and thus client object) for every request I make
[09:10:30] The library I am using (Go's stdlib `net/http`) does not seem to expose stream count as configurable.
[09:13:14] need to go afk for some errands, ttl!
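
A minimal sketch of what forcing HTTP/1.1 from the Go side can look like, assuming a plain net/http client along the lines of the load-test tool discussed above; the URL is a placeholder, not a real endpoint. Per the net/http documentation, setting Transport.TLSNextProto to a non-nil, empty map is the supported way to disable the client's HTTP/2 support.

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"
        "time"
    )

    func main() {
        // A non-nil, empty TLSNextProto map disables HTTP/2 in net/http, so every
        // request goes over plain HTTP/1.1 with no stream multiplexing.
        tr := &http.Transport{
            ForceAttemptHTTP2: false,
            TLSNextProto:      map[string]func(string, *tls.Conn) http.RoundTripper{},
        }
        client := &http.Client{Transport: tr, Timeout: 10 * time.Second}

        // Placeholder URL; substitute the Lift Wing endpoint under test.
        resp, err := client.Get("https://example.org/")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        log.Println(resp.Proto, resp.StatusCode) // resp.Proto should report "HTTP/1.1"
    }
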
[09:13:18] \o
[09:13:31] I'll do some code refactoring to reuse the client object, see what happens
[09:35:14] Machine-Learning-Team, Wikimedia Enterprise: Elevate LiftWing access to WME tier for testing - https://phabricator.wikimedia.org/T342417 (LDlulisa-WMF)
[09:39:29] Machine-Learning-Team, Epic: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (Gehel)
[09:39:32] Machine-Learning-Team, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (Gehel) Open→Resolved
[10:36:52] * klausman lunch
[10:40:14] Machine-Learning-Team, Research: Update torch's settings in the Knowledge Integrity repo - https://phabricator.wikimedia.org/T325349 (CodeReviewBot) mnz closed https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/10 Remove torch dependency
[10:59:32] Machine-Learning-Team, Research: Index out of range in revert risk multi-lingual - https://phabricator.wikimedia.org/T340811 (MunizaA) > I managed to separate the headers from sections in the code so it's much cleaner now and seems to run fine for your list of revisions with timeout set to True. @Isaac...
[11:28:58] * isaranto lunch!
[12:25:51] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (Sgs) a: Sgs→Trizek-WMF @Trizek-WMF I can confirm all the wikis from this round have produced abundant re...
[12:26:51] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (Sgs)
[12:27:51] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 11th round of wikis - https://phabricator.wikimedia.org/T308136 (Sgs) Status update, as per today the maintenance script is processing `lbwiki`. Growth engineers are trying to come up with an agreem...
[13:11:04] very weird, if I make a lot of calls to the api-gateway for say goodfaith I can only see one request logged in logstash, even if it is http1.1
[13:11:16] it makes zero sense
[13:41:18] That is really weird. even with connection pooling, I'd expect more than one
[13:42:01] I am trying to figure out if the ingress gws do have the logs but we drop them somewhere or not
[13:43:36] klausman: I thought about setting http2 on your side - in theory it is fine, since we want http2 up to the CDN, but we'd probably prefer http/1.1 for internal conns
[13:43:50] and we have multiple envoys after the CDN
[13:43:53] api-gateway
[13:43:57] istio gateway
[13:44:01] sidecars
[13:44:02] etc..
[13:44:12] and they seem to upgrade to http/2
[13:44:14] Ack.
[13:44:29] nono I was trying to brainbounce, does it make sense? :D
[13:44:35] I haven't experimented with pooled HTTP/1.1 yet, still tweaking my code a bit.
[13:45:08] I think we want to support HTTP/2 in the long term anyway, so let's not chase that "solution" too much.
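
A rough sketch of the client-object reuse mentioned at 09:13:31, assuming a Go load generator along the lines described above; the URL, worker counts, and the X-Request-Id response header are assumptions rather than confirmed details.

    package main

    import (
        "io"
        "log"
        "net/http"
        "sync"
        "time"
    )

    // One shared client: the underlying Transport keeps a pool of HTTP/1.1
    // connections (or multiplexes streams if HTTP/2 is negotiated), instead of
    // opening a fresh connection for every Request as in the per-request setup.
    var client = &http.Client{Timeout: 10 * time.Second}

    func worker(id, n int, wg *sync.WaitGroup) {
        defer wg.Done()
        for i := 0; i < n; i++ {
            start := time.Now()
            // Placeholder URL; substitute the Lift Wing endpoint under test.
            resp, err := client.Get("https://example.org/")
            if err != nil {
                log.Printf("worker %d: %v", id, err)
                continue
            }
            body, _ := io.ReadAll(resp.Body)
            resp.Body.Close()
            if resp.StatusCode >= 500 {
                // X-Request-Id is only an assumption: envoy sets it on requests,
                // but whether it is echoed back depends on gateway config.
                log.Printf("worker %d: %d in %v req_id=%q body=%q",
                    id, resp.StatusCode, time.Since(start),
                    resp.Header.Get("X-Request-Id"), body)
            }
        }
    }

    func main() {
        var wg sync.WaitGroup
        for w := 0; w < 8; w++ {
            wg.Add(1)
            go worker(w, 1000, &wg)
        }
        wg.Wait()
    }

With a shared client the Transport pools connections (or multiplexes HTTP/2 streams), which would also explain seeing far fewer logged connections than requests on the gateway side.
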
[13:45:22] I will try a test run with 1.1 after the current one has completed
[13:46:22] we want to support http2 up to the CDN for sure, anything after it needs a discussion
[13:46:37] also: I wish the errors we get back client side would include the request_id
[13:47:25] the whole deal of multiplexing streams is a bit messy and it complicates all our metrics
[13:47:57] I _think_ envoy et al should be able to provide adequate metrics. They definitely have enough information.
[13:48:46] The errors that I still see look like this:
[13:48:56] 2023/07/21 13:48:33 id:063 503 time:48.190943ms
[13:48:58] H: map[Content-Length:[145] Content-Type:[text/plain] Date:[Fri, 21 Jul 2023 13:48:33 GMT] Server:[istio-envoy] X-Envoy-Upstream-Service-Time:[47]]
[13:49:00] B: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
[13:49:03] H: is header, B: is body
[13:49:32] yes I think we are hitting some bottleneck
[13:49:39] we have been seeing them in the past as well
[13:49:43] (this is http/1.1 enforced from the client. Dunno if envoy secretly upgrades to /2
[13:49:46] )
[13:50:15] this is my point - at any given stage of our proxy pipeline of envoys, any envoy can upgrade to http/2
[13:50:38] yeah, and that may make debugging and monitoring rather murky
[13:51:14] One option we might have is to set everything to 1.1, debug it until we're confident, then try /2 in select spots and see what happens.
[13:51:51] For this /1.1 run, out of 211870 requests in 5m, 325 (0.15%) were answered with 503s like the one above.
[13:52:57] klausman: are you hitting revert risk via api-gateway?
[14:00:22] No, I am using the internal endpoint of eqiad directly, to avoid as many rate limits as possible
[14:00:52] okok because I saw something weird in the logs, namely that the pod serving revert risk was ingress-gateway-services
[14:01:03] that is the one used only for non-kserve stuff, like ores-legacy
[14:01:15] Huh.
[14:01:19] just for confirmation, are you using port 30443?
[14:01:23] So that must be someone else
[14:03:14] I see kubernetes.labels.service_istio_io/canonical-name
[14:03:15] istio-ingressgateway-services
[14:03:25] https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-language-agnostic:predict and Host: revertrisk-language-agnostic.revertrisk.wikimedia.org
[14:03:33] okok perfect
[14:03:43] because in theory we use -services only on port 31443
[14:04:53] ah no I see it for others as well
[14:05:06] ack. I also wasn't running any tests for several minutes until 14:04:30
[14:06:01] ok I see logs of revert risk on ingress gateways services, that is not how it was intended
[14:06:40] fwiw, I am working from stat1004, so if that IP is in forwarded-for, it may be me
[14:07:18] mmm it seems all "upstream_cluster":"outbound|80||revertrisk-language-agnostic-predictor-default-00008.revertrisk.svc.cluster.local"
[14:07:29] this is usually the knative-internal-gateway
[14:08:03] maybe that one is deployed on all gateways, mmm
[14:08:05] will check
[14:09:43] of course, the selector is istio: ingressgateway and both gateway pods have it
[14:09:56] it shouldn't matter a lot for the current issues, but we want to keep the things split
[14:09:59] I thought I fixed it
[14:10:19] Can you clarify what the different traffic patterns are, and where they _should_ go?
[14:11:29] we have two istio gateways deployed, one called ingress-gateway and another one called ingress-gateway-services
[14:11:36] this is defined in the custom.d yaml file
[14:11:47] one listening on nodeport 30443 and one on 31443
[14:12:01] Machine-Learning-Team, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (MGerlach) @elukey thanks for the additional context. there are ongoing discussions in the Research Team around the level of commitment we ca...
[14:12:02] Ah, so the distinction criterion is ores-legacy vs everything else?
[14:12:17] the idea was to have the kserve stuff going through the former, and the serviceops-like services in the other one
[14:12:21] yes correct
[14:12:32] in -services I'd love to see flowing only ores-legacy for the moment
[14:12:35] and recommendation-api etc..
[14:13:06] the main issue is that in the Gateway resources (istio) that we define in the Knative chart, we use selector "istio: ingressgateway"
[14:13:10] that is common to both
[14:13:29] so when we deploy the Gateway resource, all istio gw pods are entitled to route requests for it
[14:13:42] (at least this is my current running theory)
[14:14:05] Ah, I see.
[14:14:22] Do you think that maybe we sometimes hit the -services IG and thus get 503s?
[14:17:32] in theory no, but let's fix it
[14:17:34] sending a patch
[14:17:37] ack.
[14:17:52] also, is there an easy way to see what pod a given IP:port routes to?
[14:18:08] nvm, that is logged alongside anyway
[14:19:35] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940361/
[14:20:12] of course -1 from ci sigh
[14:22:46] I think you may need to quote service.istio.io/canonical-name since it has a - in it
[14:23:07] But your YAML-fu is better than mine anyway
[14:23:37] nono good point lemme try
[14:24:53] also: the 503s I see always come in one or two big blobs in like 1-3 seconds and the rest of the run (minutes!) is quiet
[14:30:35] ok now it should work, I quoted + there was an extra "-" before Values
[14:43:00] reading
[14:43:21] LGTM!
[14:45:37] thanks :)
[14:46:13] lmk when you roll out, so I can pause experimenting (and maybe see if it makes a difference)
[14:52:54] I am going to test in staging first, should be a little change but I don't want to risk a big problem on a friday :)
[14:58:06] Yep, understood.
[14:58:38] of course deployment failed and left helm in a weird state
[14:58:40] lol
[14:59:10] Another observation I have had: if I don't test for a while (O(hours)), the 503s seem more common. Or put another way: warming up the pods makes the 503s less likely. But I don't have hard data
[14:59:19] anything I can do to help re: helm?
[15:01:29] lemme check wikitech first
[15:01:35] one weird thing, I noticed it by mistake
[15:01:53] NAME READY STATUS RESTARTS AGE
[15:01:56] activator-57c6dffbfd-2c7sd 1/1 Running 10 (15m ago) 24h
[15:01:59] activator-57c6dffbfd-l75z7 1/1 Running 8 (16m ago) 24h
[15:02:02] there are a ton of errors logged in logstash too
[15:02:08] I think they are related to your load tests
[15:03:10] Huh.
[15:03:17] What are the errors?
[15:04:20] I think some timeouts
[15:04:40] I'll have a look-see
[15:06:49] I see lotsa `Failed to probe clusterIP 10.67.12.88:80`
[15:08:02] I am trying to rollback via https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Advanced_use_cases:_using_helm
[15:08:03] what app did you see the errors in?
[15:08:48] ??
[15:09:08] You mentioned a lot of errors/timeouts in logstash.
[15:09:36] So I went there to find them, but I see nothing super unusual
[15:09:47] knative logs
[15:09:51] thx
[15:10:51] we have a dashboard in logstash but it seems broken
[15:10:52] sigh
[15:11:04] anyway, you can find them if you tail the logs of the activator
[15:11:27] all right new knative gateways up
[15:11:29] let's see
[15:13:25] The only errors I see on the logtail are for experimental/falcon-7b-instruct-predictor-default-*
[15:13:51] ah, my run had completed :D
[15:18:53] in the activator?
[15:19:38] weird..
[15:19:51] why did the activator pod restart?
[15:20:42] some readiness probes failed
[15:20:43] mmmm
[15:24:59] all the pod probes from activator that I saw were not talking about RR-LA
[15:25:18] yes yes but the activators restarted, not a good sign
[15:25:34] there is an option to remove the activator from the request handling path
[15:25:40] we should probably try it
[15:29:13] klausman: my theory is that the activators failed the readiness probes when they were under load
[15:32:02] the errors are not great but we can check them later
[15:33:06] klausman: we also lost all the metrics some days ago https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=All&from=now-7d&to=now
[15:33:10] what the hell today :D
[15:33:30] the 18th around 14 UTC
[15:34:30] and I caused it https://sal.toolforge.org/log/DFA9aYkBGiVuUzOduD9s
[15:35:14] but it was for the container limits - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939257
[15:35:53] Machine-Learning-Team, Research: Index out of range in revert risk multi-lingual - https://phabricator.wikimedia.org/T340811 (Isaac) @MunizaA -- I took a quick pass and left some comments but generally looking good. I didn't test locally but hopefully should give you some substantial speed-ups with very...
[15:38:53] That is weird as hell
[15:39:55] Hey, on a different topic, I got access to the API docs :), all good!
[15:39:59] I didn't start testing with high rates until after the 18th
[15:40:10] I'm logging off for the weekend, cu all!
[15:40:54] elukey: so the metrics issue will be resolved, I presume?
[15:41:06] as for the activators falling over, do we just need more replicas?
[15:41:59] isaranto: o/
[15:42:12] klausman: will be resolved?
[15:42:27] for the activators not sure, we need to inspect metrics
[15:42:51] can you take care of this part? Maybe we hit cpu limits when you were running the load test
[15:43:14] I checked an activator pod and I can see the metrics via nsenter
[15:43:16] But I didn't start testing in earnest back then, only later
[15:43:26] (IIRC)
[15:45:05] sure but if you could check what's wrong in there it would be great
[15:45:14] yeah, sure, I can do that
[15:45:15] it may also be related to your load tests
[15:45:17] or something else
[15:46:08] klausman: what I was referring to above is the activators falling over, not the metrics
[15:47:00] ack
[15:49:25] One thing that is a bit annoying is the constantly-failing probes for falcon-7b
[15:51:38] The activators falling over is due to OOMs
[15:52:00] cf. https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?forceLogin=&orgId=1&var-datasource=eqiad+prometheus%2Fk8s-mlserve&var-namespace=knative-serving&var-pod=activator-57c6dffbfd-2c7sd&var-pod=activator-57c6dffbfd-l75z7&var-container=All&from=1689911515483&to=1689954715483&viewPanel=15
[15:56:50] klausman: yes I suspected that, all traffic goes through it, so possibly some 503s are due to the ooms
[15:57:05] ack. Working on a quota increase patch
[15:59:46] super
[15:59:50] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940386 as well
[16:00:09] I didn't add them to one of the gateways
[16:00:56] grmbl, helmfile diff hates me
[16:04:58] elukey: when doing resources, is there a way to _not_ specify CPU? Or do it in a way that preserves whatever the default is?
[16:05:37] klausman: in theory just omitting it should work, but check the helmfile templates to be sure
[16:05:44] argh, fml. tab on the line.
[16:06:08] fixed the ml-staging-codfw's knative setup to split traffic between gateways, I'll deploy the changes to prod on monday
[16:06:18] It's the map problem: in ./helmfile.yaml: error during helmfile_namespaces.yaml.part.0 parsing: template: stringTemplate:128:41: executing "stringTemplate" at <.compute.requests.cpu>: map has no entry for key "cpu"
[16:07:00] yeah then it is needed, check helmfile_namespaces.yaml to be sure
[16:15:17] weird. So I hand-added my changes to charts/knative-serving/values.yaml on deploy1001 and diffed for ml-staging, but I don't see those changes (whole bunch of _different_ resource changes, though)
[16:15:44] (my change is just a doubling of the memory values in charts/knative-serving/values.yaml near line 73)
[16:16:35] klausman: yeah it makes sense, no chart bump so it is not picked up. You can override them in admin_ng/ml-serve.yaml and it should work
[16:17:03] ack. what about the other resource changes? Most or all of them are for revscoring stuff
[16:17:20] sorry, even better, there is a knative-specific yaml in admin_ng
[16:17:44] kserve/values.yaml?
[16:18:03] oh, nvm, found it
[16:21:50] going afk for the weekend folks
[16:21:57] I'll restart working on the metrics issue on monday
[16:22:11] (and I'll deploy the gateways traffic split too)
[16:22:12] can you review my quota thing, or should I wait for next week?
[16:22:23] yes definitely, let's not roll it out today
[16:22:42] ack! have a splendid weekend!
[16:22:48] you too!
[16:24:30] heading out now as well
[16:35:03] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (dancy) I built the image on contint1002 today. The image size is 4.6GB with one of the layers being 4.21GB. I tried pushing to the registry it ke...
[19:30:57] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (dancy) Digging up T288198 reveals that `/var/lib/nginx` is hosted on a tmpfs of size 2GB. That would explain the problem. I'll try to revive the...
[19:43:31] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (dancy)
[21:58:40] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 62 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
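
On the load tests discussed earlier: if the activator OOMs under bursts, one hypothetical client-side mitigation is to pace requests with a ticker instead of firing them as fast as possible. A sketch with made-up numbers (a 200 qps cap, placeholder URL), not part of the actual test tool; the real runs above were around 700 qps (211870 requests in 5 minutes).

    package main

    import (
        "log"
        "net/http"
        "time"
    )

    func main() {
        client := &http.Client{Timeout: 10 * time.Second}
        // Hypothetical cap of 200 requests/second.
        ticker := time.NewTicker(time.Second / 200)
        defer ticker.Stop()
        for i := 0; i < 10000; i++ {
            <-ticker.C // wait for the next slot before firing a request
            go func(id int) {
                // Placeholder URL; substitute the Lift Wing endpoint under test.
                resp, err := client.Get("https://example.org/")
                if err != nil {
                    log.Printf("req %d: %v", id, err)
                    return
                }
                resp.Body.Close()
                if resp.StatusCode >= 500 {
                    log.Printf("req %d: %d", id, resp.StatusCode)
                }
            }(i)
        }
        // Crude settle time for in-flight requests; a real tool would use a WaitGroup.
        time.Sleep(15 * time.Second)
    }
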