[08:12:19] good morning :)
[08:12:20] https://github.com/istio/istio/issues/33105
[08:12:27] "TLS origination does not work as expected in 1.9.5"
[08:12:29] lol
[08:27:33] mmm I think I found a working config
[08:28:05] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=unknown&var-backend=api-ro.discovery.wmnet&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=now-1h&to=now
[08:48:01] the change is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/832609
[08:51:54] the main issue for the moment is that the new metrics are not separated by namespace
[09:02:34] morning!
[09:02:47] Re: no NS separation: that is something we should definitely work on, if feasible. Might be good enough now, but definitely not in the long term
[09:03:11] Looking at change 832609 now
[09:04:59] lgtm'd
[09:09:21] yeah it is annoying, I need to check the envoy metrics
[09:10:13] https://thanos.wikimedia.org/graph?g0.expr=istio_requests_total%7Bdestination_service%3D%22api-ro.discovery.wmnet%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[09:10:19] the good thing is that the pod label is there
[09:10:51] ah wait
[09:10:52] kubernetes_namespace="revscoring-editquality-goodfaith",
[09:10:55] yeah all good then
[09:11:06] I think it is only a matter of fixing the Grafana dashboard
[09:11:19] I have to go afk for an errand but will continue later on :)
[09:12:07] ack (both re errand and `kubernetes_namespace`)
[10:36:42] getting lunch as well :)
[13:46:46] started https://grafana-rw.wikimedia.org/d/zsdYRV7Vk/istio-sidecar
[13:46:52] that should in theory do what we need
[13:46:58] now merging my change, thanks for the review :)
[13:58:30] staging updated, doing some tests here and there
[13:58:41] the dashboard is a fork of the Istio one, needs some tuning
[14:00:08] checked the metrics reported by the istio-proxy container against the dashboard values, so far they look consistent
[14:00:26] very happy about the new metrics \o/
[14:23:16] moving to ml-serve-codfw, so we can start seeing some metrics
[14:31:40] mmmm of course in ml-serve-codfw it doesn't work
[14:33:49] ok, forgot a thing, sigh
[14:43:47] klausman: so from the metrics it seems that the connection to the mw api is not at fault
[14:43:53] Dang
[14:44:05] I mean, it is good :)
[14:44:23] sorta. it would have been "someone else's problem to fix" ;)
[14:44:24] the other thing that I have in mind, as what could be taking a lot of time, is the model computation itself
[14:44:45] But that shouldn't really vary that much between staging and prod?
[14:44:54] ah yes, in theory it shouldn't :D
[14:46:49] but I was worried about how tornado handles async HTTP coroutines alongside blocking code (the model, I mean)
[14:47:20] Hrm, yes. Do you happen to know whether the istio-reported latency includes e.g. DNS lookup?
[14:47:33] in theory yes
[14:47:41] Because I was only half joking when I mentioned it was DNS
[14:47:54] oh yes, it is always DNS
[14:48:17] but in this particular case I think it may be tornado
[14:50:26] instrumenting our code with timings could surely help
[14:50:33] yup
[14:50:41] maybe turning the logs to debug could also suffice
[14:50:45] Do we already have prom metrics on the services themselves?
[14:50:48] I'll test locally
[14:51:13] we have only metrics from the knative/kserve point of view, like requests and latencies
[14:51:32] but nothing revscoring-related
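(Aside on the tornado/blocking-code concern discussed above: a minimal sketch of how a CPU-bound, revscoring-style model call could be kept off the asyncio event loop that tornado 6 runs on, with a rough timing around it. All class and function names below are illustrative, not the actual kserve/revscoring code.)

```python
import asyncio
import time


class DemoModel:
    """Stand-in for a CPU-bound revscoring-style model (illustrative only)."""

    def predict(self, features):
        time.sleep(0.5)  # simulate a slow, blocking model computation
        return {"prediction": True, "features": features}


model = DemoModel()


async def handle_request(features):
    # Calling model.predict() directly here would block the event loop and
    # serialize every concurrent request behind the model computation.
    loop = asyncio.get_running_loop()
    start = time.monotonic()
    # Offload the blocking work to the default thread pool executor so that
    # other coroutines (e.g. async MW API calls) keep making progress.
    result = await loop.run_in_executor(None, model.predict, features)
    print(f"model predict took {(time.monotonic() - start) * 1000:.1f} ms")
    return result


async def main():
    # Two concurrent requests: with the executor they overlap instead of queueing.
    await asyncio.gather(handle_request({"rev_id": 1}), handle_request({"rev_id": 2}))


if __name__ == "__main__":
    asyncio.run(main())
```

With the executor the two requests overlap instead of queueing behind each other, which is the kind of stall being discussed above when blocking model code runs inside async handlers.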
[14:52:04] mh. I think model/service-internal metrics may become useful at some point anyway, but I am not sure how easy it would be to wedge into revscoring, and how much work it would be to make Prom actually scrape it in a sane way
[14:52:20] E.g. you'd want to scrape the replicas, not the LB endpoint
[14:55:45] ORES itself publishes metrics via statsd, and we have a local prometheus exporter that collects them and exposes them again in prom format
[14:56:04] but I am not super sure if they are revscoring-related, I think it is more ORES-service-related
[15:04:54] the other step that I want to take is to have a logstash dashboard for kserve's access logs
[15:05:08] so we can have a breakdown by UA like we do for ORES
[15:05:10] yes, you mentioned that before
[15:05:54] ETOOMANYTHINGS :D
[15:07:50] I feel like that alarm is firing all the time and we should just silence it ;)
[15:41:06] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Support pre-transformed inputs for Outlink topic model - https://phabricator.wikimedia.org/T315998 (10Isaac) Maybe we pick this up in a week or two when I'm back from research offsite and coordinate directly (meeting where I kick off sl...
[16:05:03] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (10elukey) I was finally able to have metrics from sidecars, but it was not one of the solutions highlighted above. The preprocess() call started a...
[16:17:31] all right, I think that I am done for today!
[16:17:36] klausman: see you in Prague!!
[16:17:48] have a nice weekend folks
[16:34:24] \o
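(Footnote on the service-internal metrics and scraping discussion above: a minimal sketch, using the prometheus_client Python library, of how each isvc replica could expose its own preprocess latency on a per-pod /metrics endpoint, so Prometheus scrapes replicas rather than the LB endpoint. The metric name, port, and preprocess stub are made up for illustration; this is not the existing ORES/revscoring instrumentation.)

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Latency of the (hypothetical) preprocess step, exported in Prometheus format.
PREPROCESS_SECONDS = Histogram(
    "isvc_preprocess_duration_seconds",
    "Time spent in the preprocess step of the inference service",
)


@PREPROCESS_SECONDS.time()
def preprocess(payload):
    # Stand-in for feature extraction / MW API calls.
    time.sleep(random.uniform(0.05, 0.2))
    return payload


if __name__ == "__main__":
    # Each replica serves its own /metrics endpoint; Prometheus would then be
    # pointed at the pods directly (e.g. via kubernetes_sd_configs or
    # prometheus.io/scrape pod annotations) instead of the LB endpoint.
    start_http_server(9090)
    while True:
        preprocess({"rev_id": 123})
```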