[08:12:19] good morning :)
[08:12:20] https://github.com/istio/istio/issues/33105
[08:12:27] "TLS origination does not work as expected in 1.9.5"
[08:12:29] lol
[08:27:33] mmm I think I found a working config
[08:28:05] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=unknown&var-backend=api-ro.discovery.wmnet&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=now-1h&to=now
[08:48:01] the change is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/832609
[08:51:54] the main issue for the moment is that the new metrics are not separated by namespace
[09:02:34] morning!
[09:02:47] Re: no NS separation: that is something we should definitely work on, if feasible. Might be good enough now, but definitely not in the long term
[09:03:11] Looking at change 832609 now
[09:04:59] lgtm'd
[09:09:21] yeah it is annoying, I need to check the envoy metrics
[09:10:13] https://thanos.wikimedia.org/graph?g0.expr=istio_requests_total%7Bdestination_service%3D%22api-ro.discovery.wmnet%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[09:10:19] the good thing is that the pod label is there
[09:10:51] ah wait
[09:10:52] kubernetes_namespace="revscoring-editquality-goodfaith",
[09:10:55] yeah all good then
[09:11:06] I think it is only a matter of fixing the Grafana dashboard
[09:11:19] I have to go afk for an errand but will continue later on :)
[09:12:07] ack (both re errand and `kubernetes_namespace`)
[10:36:42] getting lunch as well :)
[13:46:46] started https://grafana-rw.wikimedia.org/d/zsdYRV7Vk/istio-sidecar
[13:46:52] that should in theory do what we need
[13:46:58] now merging my change, thanks for the review :)
[13:58:30] staging updated, doing some tests here and there
[13:58:41] the dashboard is a fork of the Istio one, needs some tuning
[14:00:08] checked the metrics reported by the istio-proxy container against the dashboard values, so far they look consistent
[14:00:26] very happy about the new metrics \o/
[14:23:16] moving to ml-serve-codfw, so we can start seeing some metrics
[14:31:40] mmmm of course in ml-serve-codfw it doesn't work
[14:33:49] ok, forgot a thing, sigh
[14:43:47] klausman: so from the metrics it seems that the connection to the mw api is not at fault
[14:43:53] Dang
[14:44:05] I mean, it is good :)
[14:44:23] sorta. it would have been "someone else's problem to fix" ;)
[14:44:24] the other thing that I have in mind, as what could be taking a lot of time, is the model computation itself
[14:44:45] But that shouldn't really vary that much between staging and prod?
[14:44:54] ah yes, in theory it shouldn't :D
[14:46:49] but I was worried about how tornado handles async HTTP coroutines alongside blocking code (the model, I mean)
[14:47:20] Hrm, yes. Do you happen to know whether the istio-reported latency includes e.g. DNS lookup?
[14:47:33] in theory yes
[14:47:41] Because I was only half joking when I mentioned it was DNS
[14:47:54] oh yes, it is always DNS
[14:48:17] but in this particular case I think it may be tornado
[14:50:26] instrumenting our code with timings could surely help
[14:50:33] yup
[14:50:41] maybe turning the logs to debug could also suffice
[14:50:45] Do we already have prom metrics on the services themselves?
[14:50:48] I'll test locally
[14:51:13] we have only metrics from the knative/kserve point of view, like requests and latencies
[14:51:32] but nothing revscoring-related
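(Aside on the tornado/blocking-code concern discussed above: a minimal sketch of how a CPU-bound, revscoring-style model call could be kept off the asyncio event loop that tornado 6 runs on, with a rough timing around it. All class and function names below are illustrative, not the actual kserve/revscoring code.)

```python
import asyncio
import time


class DemoModel:
    """Stand-in for a CPU-bound revscoring-style model (illustrative only)."""

    def predict(self, features):
        time.sleep(0.5)  # simulate a slow, blocking model computation
        return {"prediction": True, "features": features}


model = DemoModel()


async def handle_request(features):
    # Calling model.predict() directly here would block the event loop and
    # serialize every concurrent request behind the model computation.
    loop = asyncio.get_running_loop()
    start = time.monotonic()
    # Offload the blocking work to the default thread pool executor so that
    # other coroutines (e.g. async MW API calls) keep making progress.
    result = await loop.run_in_executor(None, model.predict, features)
    print(f"model predict took {(time.monotonic() - start) * 1000:.1f} ms")
    return result


async def main():
    # Two concurrent requests: with the executor they overlap instead of queueing.
    await asyncio.gather(handle_request({"rev_id": 1}), handle_request({"rev_id": 2}))


if __name__ == "__main__":
    asyncio.run(main())
```

With the executor the two requests overlap instead of queueing behind each other, which is the kind of stall being discussed above when blocking model code runs inside async handlers.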
[14:52:04] mh. I think model/service-internal metrics may become useful at some point anyway, but I am not sure how easy it would be to wedge into revscoring, and how much work it would be to make Prom actually scrape it in a sane way
[14:52:20] E.g. you'd want to scrape the replicas, not the LB endpoint
[14:55:45] ORES itself publishes metrics via statsd, and we have a local prometheus exporter that collects them and exposes them again in prom format
[14:56:04] but I am not super sure if they are revscoring-related, I think it is more ORES-service-related
[15:04:54] the other step that I want to take is to have a logstash dashboard for kserve's access logs
[15:05:08] so we can have a breakdown by UA like we do for ORES
[15:05:10] yes, you mentioned that before
[15:05:54] ETOOMANYTHINGS :D
[15:07:50] I feel like that alarm is firing all the time and we should just silence it ;)
[15:41:06] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Support pre-transformed inputs for Outlink topic model - https://phabricator.wikimedia.org/T315998 (10Isaac) Maybe we pick this up in a week or two when I'm back from research offsite and coordinate directly (meeting where I kick off sl...
[16:05:03] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (10elukey) I was finally able to have metrics from sidecars, but it was not one of the solutions highlighted above. The preprocess() call started a...
[16:17:31] all right, I think that I am done for today!
[16:17:36] klausman: see you in Prague!!
[16:17:48] have a nice weekend folks
[16:34:24] \o
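(Footnote on the service-internal metrics and scraping discussion above: a minimal sketch, using the prometheus_client Python library, of how each isvc replica could expose its own preprocess latency on a per-pod /metrics endpoint, so Prometheus scrapes replicas rather than the LB endpoint. The metric name, port, and preprocess stub are made up for illustration; this is not the existing ORES/revscoring instrumentation.)

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Latency of the (hypothetical) preprocess step, exported in Prometheus format.
PREPROCESS_SECONDS = Histogram(
    "isvc_preprocess_duration_seconds",
    "Time spent in the preprocess step of the inference service",
)


@PREPROCESS_SECONDS.time()
def preprocess(payload):
    # Stand-in for feature extraction / MW API calls.
    time.sleep(random.uniform(0.05, 0.2))
    return payload


if __name__ == "__main__":
    # Each replica serves its own /metrics endpoint; Prometheus would then be
    # pointed at the pods directly (e.g. via kubernetes_sd_configs or
    # prometheus.io/scrape pod annotations) instead of the LB endpoint.
    start_http_server(9090)
    while True:
        preprocess({"rev_id": 123})
```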