[02:59:17] Machine-Learning-Team, ORES, MediaWiki-Core-Preferences, Moderator-Tools-Team (Kanban): 'Highlight likely problem edits' preference doesn't work in mobile web - https://phabricator.wikimedia.org/T314026 (eigyan) a: eigyan
[06:53:10] good morning :)
[07:44:17] ok so the first weird thing is that https://wikitech.wikimedia.org/wiki/Kubernetes#Running_a_rolling_restart_of_a_Helmfile_service doesn't work for us
[07:45:27] (I wanted to use it to redistribute pods)
[07:56:19] there is also https://logstash.wikimedia.org/app/discover#/?_g=h@89714be&_a=h@92557e0, a ton of logs from webhooks (istio/knative mostly)
[07:56:34] the observability team asked if we need all of those, sigh
[08:22:19] What happens with that helmfile commandline?
[08:22:37] also, morning :)
[08:24:43] nothing basically
[08:25:15] I tried with kubectl rollout restart deployment/... as suggested by Janis (in theory the helmfile command should call that under the hood)
[08:25:18] and something moves
[08:25:23] but the pods are not re-created
[08:25:38] I see in the deployment logs
[08:25:38] Normal ScalingReplicaSet 14m deployment-controller Scaled up replica set enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5d8 to 1
[08:25:41] Normal ScalingReplicaSet 14m deployment-controller Scaled down replica set enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5d8 to 0
[08:25:44] and in the rs logs
[08:25:54] Normal ScalingReplicaSet 20m deployment-controller Scaled up replica set enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5d8 to 1
[08:25:57] Normal ScalingReplicaSet 20m deployment-controller Scaled down replica set enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5d8 to 0
[08:26:22] sorry
[08:26:22] Normal SuccessfulCreate 27m replicaset-controller Created pod: enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5kqjqr
[08:26:25] Normal SuccessfulDelete 27m replicaset-controller Deleted pod: enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5kqjqr
[08:26:34] so afaics it creates a new pod and deletes it :D
[08:30:53] ok so there seems to be a better explanation, after a chat with Jaime
[08:31:01] err Janis sorry :)
[08:31:19] so the helmfile command creates a new replica set, scales it up and then deletes the old one
[08:31:24] https://logstash.wikimedia.org/app/dashboards#/view/d43f9bf0-17b5-11eb-b848-090a7444f26c?_g=h@1be52a8&_a=h@32e5a96
[08:31:34] but for some reason the new replica set fails
[08:39:59] the failed mounts seem to be the culprit, maybe something knative-related (with our dear ancient version)
[08:40:26] I can open a task about it, and collect all info
[08:41:09] I was derailed while checking perfs on deploy1002
[08:41:18] `wrk -c 10 -t 10 --timeout 5s -s inference-article.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict --latency
[08:41:21] Running 10s test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict`
[08:41:30] leads to ~ 29 rps
[08:41:38] if I run it for staging, 39 rps
[08:41:50] both in codfw, both one pod
[08:42:00] nice. And yes, I think the rolling-restart thing is not ultra urgent to fix.
[08:42:00] same cpu/memory allowance
[08:43:04] so a 10 rps difference is a lot
[08:43:31] and the latency distribution is different as well, p99 is way worse on ml-serve-prod
[08:43:36] I'd have expected the opposite
[08:43:47] Hmm.
[08:44:13] Is it maybe because we're not running all models in staging?
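A minimal sketch of the rolling-restart commands discussed above: the roll_restart=1 value appears later in this log, but the exact helmfile invocation and the namespace below are assumptions, and the deployment name is derived from the ReplicaSet events quoted above.

    # Rolling restart via helmfile, per the wikitech page linked above (invocation assumed):
    helmfile -e ml-serve-codfw --state-values-set roll_restart=1 sync

    # Fallback suggested by Janis: restart the Deployment directly
    # (namespace is hypothetical, deployment name taken from the ReplicaSet events above):
    kubectl -n revscoring-editquality-damaging rollout restart \
        deployment/enwiki-damaging-predictor-default-2hl58-deployment

    # Then inspect the ScalingReplicaSet / SuccessfulCreate / SuccessfulDelete events quoted above:
    kubectl -n revscoring-editquality-damaging describe deployment \
        enwiki-damaging-predictor-default-2hl58-deployment
    kubectl -n revscoring-editquality-damaging get events --sort-by=.lastTimestamp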
[08:44:56] could be a possibility, I started to check pod distribution across nodes and ml-serve-codfw is unbalanced, this is why I wanted to use roll_restart=1
[08:45:23] some of the nodes (checked via kubectl describe nodes) seem to be overcommitted in cpu (Limits)
[08:45:40] but the cpu/memory usage of the hosts is super low
[08:45:49] This may be a result of the rolling restarts I did for kernel updates
[08:46:32] The cookbook atm tries to avoid scheduling pods on nodes that soon will be restarted, resulting in everything piling into the first few
[08:47:21] could be yes, I do see though that the Requests section in the cpu details for every node is around 60%, while Limits is overcommitted, so probably it is something else
[08:47:34] I mean I don't see clear signs of problems in resource usage atm
[08:47:51] Might be the context switching that adds latency
[08:48:41] If you completely stop/delete a particular pod set, start it again, verify it's on a different host and then perf test it, that would (dis)prove that hypothesis
[08:48:48] or it could be that istio/knative routing (with our versions) is not great
[08:48:59] That is also a possibility of course
[08:50:18] if so it is very scary
[08:50:57] But at least we would've found out before going to proper prod
[08:52:21] but there is no clear solution if so, other than waiting for k8s 1.2x and knative 1.x probably
[08:52:24] that is not great
[08:54:25] Ack. But I would then also expect this to be more visible elsewhere
[08:54:30] aka a known issue
[08:55:02] mmmmmmmmm
[08:55:12] so something is strange
[08:55:59] if I run wrk with 5 clients on the ml-serve-codfw cluster, I see
[08:56:00] 50% 188.50ms
[08:56:00] 75% 216.53ms
[08:56:00] 90% 327.86ms
[08:56:00] 99% 945.10ms
[08:56:29] meanwhile the same on staging is way smoother
[08:56:38] 50% 173.41ms
[08:56:38] 75% 197.69ms
[08:56:38] 90% 213.70ms
[08:56:38] 99% 562.99ms
[08:56:55] the former reaches ~18 rps, the latter ~27
[08:56:56] Are you running both tests from the same place?
[08:56:59] yes
[08:57:21] I checked the kserve-container logs, and I sometimes see high latencies
[08:57:36] if it was istio/knative I wouldn't have expected to find them
[08:58:06] we should try the re-schedule approach, I think
[08:58:30] (stop completely, start and hope for an empty(~er) server)
[08:59:18] I tried yesterday but it didn't change much, even if the new pod didn't land on a super free server
[08:59:27] Hurm.
[08:59:37] How does serve in eqiad compare?
[09:02:08] similar to ml-serve-codfw
[09:02:33] So something on the staging cluster is working... better. Or it's really just a capacity thing.
[09:03:01] One (annoying) thing we could do is empty a cluster completely and start everything afresh, see if it makes a difference.
[09:03:18] But it's a lot of work and might not prove anything
[09:07:01] yeah
[09:54:49] one thing that we haven't built up to now is a dashboard for the istio-sidecars
[09:55:05] https://grafana.wikimedia.org/d/G7yj84Vnk/istio includes only the traffic to pods
[09:55:12] not their traffic to other services
[09:55:31] I am wondering if in prod, for some obscure reason, the latency to the mw api is worse/throttled/etc..
[10:06:55] ok so the metrics are already published, it should be a matter of building the dashboard
[10:11:02] mmm or maybe not
[10:30:40] * elukey lunch!
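A short sketch of how the per-node Requests/Limits figures mentioned above can be read out of kubectl; the node name is illustrative.

    # The "Allocated resources" section shows cpu/memory Requests and Limits percentages per node:
    kubectl describe nodes | grep -A 8 'Allocated resources'

    # For a single node, also list the pods scheduled on it with their requests/limits:
    kubectl describe node ml-serve2003.codfw.wmnet | grep -A 40 'Non-terminated Pods'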
[10:39:33] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Connect Outlink topic model to eventgate - https://phabricator.wikimedia.org/T315994 (achou) The schema of `mediawiki.revision-score` that ORES uses is as follows: ` { "awesomeness": { "model_name": "awesomeness",...
[13:21:33] Morning all!
[13:21:47] Yesterday sucked but I'm feeling great today
[13:23:44] nice :)
[13:54:01] https://istio.io/latest/docs/concepts/observability/#distributed-traces is very nice
[13:54:06] but it requires a lot of work
[14:19:40] I wasn't even aware k8s had that kinda functionality. Distributed traces are awesome, tho.
[14:19:58] Some types of debugging are downright impossible without it
[14:30:36] it needs a bit of work in our case I am afraid
[14:30:39] but we can think about it
[14:32:31] ack. definitely in the "nice to have" category, rather than "need now"
[14:33:07] I am currently playing with https://istio.io/latest/docs/ops/configuration/telemetry/envoy-stats/
[14:33:23] it would be sooo nice to be able to fit into https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry
[14:36:56] but istio currently doesn't cooperate with me
[14:45:16] It can be recalcitrant that way
[14:50:03] elukey@ml-serve2003:~$ sudo nsenter -t 3672660 -n curl localhost:15000/stats/prometheus -s | wc -l
[14:50:06] 210700
[14:50:09] from too few to too many
[14:51:55] well, you have to divide by 3, since there are likely HELP and TYPE lines for each metric line
[14:52:04] still, 70k metrics seems.... a lot
[14:52:25] not really, most of the metrics are listed after HELP and TYPE
[14:52:48] what do you mean?
[14:52:48] for some of them it lists errors/latencies/etc.. from the sidecar to all the other destinations/pods
[14:53:01] Oh, right
[14:53:08] you find TYPE and HELP, and then like one gigazillion metrics
[14:53:31] Yeah, I just figured. Their cardinality is in the labels, not base metrics.
[14:53:38] Either way, seems like way too many
[14:55:45] I can simply whitelist the ones that we use in the dashboard
[14:55:50] and see if we can work with those
[14:55:57] ack
[15:02:34] ok I have saved the metrics on ml-serve2003, and I am now restoring the old behavior before the weekend
[15:04:56] done :)
[15:08:43] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (elukey) Something really strange happened after the deployment to ml-serve-codfw: ` elukey@deploy1002:~$ wrk -c 10 -t 10 --timeout 5s -s inferen...
[15:08:45] going afk for a bit!
[15:28:41] Lift-Wing, Documentation, Machine-Learning-Team (Active Tasks): Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (achou)
[15:31:04] Lift-Wing, Documentation, Machine-Learning-Team (Active Tasks): Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (achou) That's a good idea, Luca :) Hi @Miriam @Diego @Isaac @fkaelin @MunizaA @Htriedman! ML team has been writing the Lift Wing documentation so people can g...
[16:11:57] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (elukey) Another note - Initially I thought it could have been the routing between istio and knative to add latency, somehow getting worse as more...
[16:17:44] going afk for the weekend folks!
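A sketch of how the 200k+ sidecar metric lines seen earlier in the afternoon could be narrowed down before building a dashboard, following the istio envoy-stats page linked above; the metric prefixes and the annotation mentioned below are assumptions based on Envoy/istio documentation, and the pid is the one from the log.

    # Query the Envoy admin endpoint from the pod's network namespace, as done on ml-serve2003 above:
    PID=3672660   # istio-proxy pid on the node, taken from the log
    sudo nsenter -t "$PID" -n curl -s localhost:15000/stats/prometheus | wc -l

    # Keep only series a dashboard would chart, e.g. upstream request counts and latency histograms
    # (metric names assumed from Envoy's Prometheus naming):
    sudo nsenter -t "$PID" -n curl -s localhost:15000/stats/prometheus \
        | grep -E '^envoy_cluster_upstream_(rq_total|rq_time_bucket)' | wc -l

    # The istio page above also documents a proxy-level allow-list (e.g. the
    # sidecar.istio.io/statsInclusionPrefixes annotation), which is probably what the
    # "whitelist the ones that we use" idea maps to.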
[16:17:50] have a nice weekend :)
[16:17:55] same to you
[16:18:03] See you Tuesday (Monday is a holiday here)
[16:18:09] ack!
[16:37:57] Have a nice weekend Luca and Tobias! :)
[16:55:25] you too, Aiko :)