[02:59:17] Machine-Learning-Team, ORES, MediaWiki-Core-Preferences, Moderator-Tools-Team (Kanban): 'Highlight likely problem edits' preference doesn't work in mobile web - https://phabricator.wikimedia.org/T314026 (eigyan) a: eigyan
[06:53:10] good morning :)
[07:44:17] ok so the first weird thing is that https://wikitech.wikimedia.org/wiki/Kubernetes#Running_a_rolling_restart_of_a_Helmfile_service doesn't work for us
[07:45:27] (I wanted to use it to redistribute pods)
[07:56:19] there is also https://logstash.wikimedia.org/app/discover#/?_g=h@89714be&_a=h@92557e0, a ton of logs from webhooks (istio/knative mostly)
[07:56:34] the observability team asked if we need all of those, sigh
[08:22:19] What happens with that helmfile commandline?
[08:22:37] also, morning :)
[08:24:43] nothing basically
[08:25:15] I tried with kubectl rollout restart deployment/... as suggested by Janis (in theory the helmfile command should call that under the hood)
[08:25:18] and something moves
[08:25:23] but the pods are not re-created
[08:25:38] I see in the deployment logs
[08:25:38] Normal ScalingReplicaSet 14m deployment-controller Scaled up replica set enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5d8 to 1
[08:25:41] Normal ScalingReplicaSet 14m deployment-controller Scaled down replica set enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5d8 to 0
[08:25:44] and in the rs logs
[08:25:54] Normal ScalingReplicaSet 20m deployment-controller Scaled up replica set enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5d8 to 1
[08:25:57] Normal ScalingReplicaSet 20m deployment-controller Scaled down replica set enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5d8 to 0
[08:26:22] sorry
[08:26:22] Normal SuccessfulCreate 27m replicaset-controller Created pod: enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5kqjqr
[08:26:25] Normal SuccessfulDelete 27m replicaset-controller Deleted pod: enwiki-damaging-predictor-default-2hl58-deployment-c87dcc5kqjqr
[08:26:34] so afaics it creates a new pod and deletes it :D
[08:30:53] ok so there seems to be a better explanation, after a chat with Jaime
[08:31:01] err Janis sorry :)
[08:31:19] so the helmfile command creates a new replica set, scales it up and then deletes the old one
[08:31:24] https://logstash.wikimedia.org/app/dashboards#/view/d43f9bf0-17b5-11eb-b848-090a7444f26c?_g=h@1be52a8&_a=h@32e5a96
[08:31:34] but for some reason the new replica set fails
[08:39:59] the failed mounts seem to be the culprit, maybe something knative-related (with our dear ancient version)
[08:40:26] I can open a task about it, and collect all info
[08:41:09] I was derailed while checking perfs on deploy1002
[08:41:18] `wrk -c 10 -t 10 --timeout 5s -s inference-article.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict --latency
[08:41:21] Running 10s test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict`
[08:41:30] leads to ~ 29 rps
[08:41:38] if I run it for staging, 39 rps
[08:41:50] both in codfw, both one pod
[08:42:00] nice. And yes, I think the rolling-restart thing is not ultra urgent to fix.
[08:42:00] same cpu/memory allowance
[08:43:04] so a 10 rps difference is a lot
[08:43:31] and the latency distribution is different as well, p99 is way worse on ml-serve-prod
[08:43:36] I'd have expected the opposite
[08:43:47] Hmm.
[08:44:13] Is it maybe because we're not running all models in staging?
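A minimal sketch of the rolling-restart commands discussed above: the roll_restart=1 value appears later in this log, but the exact helmfile invocation and the namespace below are assumptions, and the deployment name is derived from the ReplicaSet events quoted above.

    # Rolling restart via helmfile, per the wikitech page linked above (invocation assumed):
    helmfile -e ml-serve-codfw --state-values-set roll_restart=1 sync

    # Fallback suggested by Janis: restart the Deployment directly
    # (namespace is hypothetical, deployment name taken from the ReplicaSet events above):
    kubectl -n revscoring-editquality-damaging rollout restart \
        deployment/enwiki-damaging-predictor-default-2hl58-deployment

    # Then inspect the ScalingReplicaSet / SuccessfulCreate / SuccessfulDelete events quoted above:
    kubectl -n revscoring-editquality-damaging describe deployment \
        enwiki-damaging-predictor-default-2hl58-deployment
    kubectl -n revscoring-editquality-damaging get events --sort-by=.lastTimestamp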
[08:44:56] could be a possibility, I started to check pod distribution across nodes and ml-serve-codfw is unbalanced, this is why I wanted to use roll_restart=1
[08:45:23] some of the nodes (checked via kubectl describe nodes) seem to be overcommitted in cpu (Limits)
[08:45:40] but the cpu/memory usage of the hosts is super low
[08:45:49] This may be a result of the rolling restarts I did for kernel updates
[08:46:32] The cookbook atm tries to avoid scheduling pods on nodes that soon will be restarted, resulting in everything piling into the first few
[08:47:21] could be yes, I do see though that the Requests section in the cpu details for every node is around 60%, while Limits is overcommitted, so probably it is something else
[08:47:34] I mean I don't see clear signs of problems in resource usage atm
[08:47:51] Might be the context switching that adds latency
[08:48:41] If you completely stop/delete a particular pod set, start it again, verify it's on a different host and then perf test it, that would (dis)prove that hypothesis
[08:48:48] or it could be that istio/knative routing (with our versions) is not great
[08:48:59] That is also a possibility of course
[08:50:18] if so it is very scary
[08:50:57] But at least we would've found out before going to proper prod
[08:52:21] but there is no clear solution if so, other than waiting for k8s 1.2x and knative 1.x probably
[08:52:24] that is not great
[08:54:25] Ack. But I would then also expect this to be more visible elsewhere
[08:54:30] aka a known issue
[08:55:02] mmmmmmmmm
[08:55:12] so something is strange
[08:55:59] if I run wrk with 5 clients on the ml-serve-codfw cluster, I see
[08:56:00] 50% 188.50ms
[08:56:00] 75% 216.53ms
[08:56:00] 90% 327.86ms
[08:56:00] 99% 945.10ms
[08:56:29] meanwhile the same on staging is way smoother
[08:56:38] 50% 173.41ms
[08:56:38] 75% 197.69ms
[08:56:38] 90% 213.70ms
[08:56:38] 99% 562.99ms
[08:56:55] the former reaches ~18 rps, the latter ~27
[08:56:56] Are you running both tests from the same place?
[08:56:59] yes
[08:57:21] I checked the kserve-container logs, and I sometimes see high latencies
[08:57:36] if it was istio/knative I wouldn't have expected to find them
[08:58:06] we should try the re-schedule approach, I think
[08:58:30] (stop completely, start and hope for an empty(~er) server)
[08:59:18] I tried yesterday but it didn't change much, even if the new pod didn't land on a super free server
[08:59:27] Hurm.
[08:59:37] How does serve in eqiad compare?
[09:02:08] similar to ml-serve-codfw
[09:02:33] So something on the staging cluster is working... better. Or it's really just a capacity thing.
[09:03:01] One (annoying) thing we could do is empty a cluster completely and start everything afresh, see if it makes a difference.
[09:03:18] But it's a lot of work and might not prove anything
[09:07:01] yeah
[09:54:49] one thing that we haven't built up to now is a dashboard for the istio-sidecars
[09:55:05] https://grafana.wikimedia.org/d/G7yj84Vnk/istio includes only the traffic to pods
[09:55:12] not their traffic to other services
[09:55:31] I am wondering if in prod, for some obscure reason, the latency to the mw api is worse/throttled/etc..
[10:06:55] ok so the metrics are already published, it should be a matter of building the dashboard
[10:11:02] mmm or maybe not
[10:30:40] * elukey lunch!
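A short sketch of how the per-node Requests/Limits figures mentioned above can be read out of kubectl; the node name is illustrative.

    # The "Allocated resources" section shows cpu/memory Requests and Limits percentages per node:
    kubectl describe nodes | grep -A 8 'Allocated resources'

    # For a single node, also list the pods scheduled on it with their requests/limits:
    kubectl describe node ml-serve2003.codfw.wmnet | grep -A 40 'Non-terminated Pods'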
[10:39:33] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Connect Outlink topic model to eventgate - https://phabricator.wikimedia.org/T315994 (achou) The schema of `mediawiki.revision-score` that ORES uses is as follows: ` { "awesomeness": { "model_name": "awesomeness",...
[13:21:33] Morning all!
[13:21:47] Yesterday sucked but I'm feeling great today
[13:23:44] nice :)
[13:54:01] https://istio.io/latest/docs/concepts/observability/#distributed-traces is very nice
[13:54:06] but it requires a lot of work
[14:19:40] I wasn't even aware k8s had that kinda functionality. Distributed traces are awesome, tho.
[14:19:58] Some types of debugging are downright impossible without it
[14:30:36] it needs a bit of work in our case I am afraid
[14:30:39] but we can think about it
[14:32:31] ack. definitely in the "nice to have" category, rather than "need now"
[14:33:07] I am currently playing with https://istio.io/latest/docs/ops/configuration/telemetry/envoy-stats/
[14:33:23] it would be sooo nice to be able to fit into https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry
[14:36:56] but istio currently doesn't cooperate with me
[14:45:16] It can be recalcitrant that way
[14:50:03] elukey@ml-serve2003:~$ sudo nsenter -t 3672660 -n curl localhost:15000/stats/prometheus -s | wc -l
[14:50:06] 210700
[14:50:09] from too few to too many
[14:51:55] well, you have to divide by 3, since there are likely HELP and TYPE lines for each metric line
[14:52:04] still, 70k metrics seems.... a lot
[14:52:25] not really, most of the metrics are listed after HELP and TYPE
[14:52:48] what do you mean?
[14:52:48] for some of them it lists errors/latencies/etc.. from the sidecar to all the other destinations/pods
[14:53:01] Oh, right
[14:53:08] you find TYPE and HELP, and then like one gigazillion metrics
[14:53:31] Yeah, I just figured. Their cardinality is in the labels, not base metrics.
[14:53:38] Either way, seems like way too many
[14:55:45] I can simply whitelist the ones that we use in the dashboard
[14:55:50] and see if we can work with those
[14:55:57] ack
[15:02:34] ok I have saved the metrics on ml-serve2003, and I am now restoring the old behavior before the weekend
[15:04:56] done :)
[15:08:43] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (elukey) Something really strange happened after the deployment to ml-serve-codfw: ` elukey@deploy1002:~$ wrk -c 10 -t 10 --timeout 5s -s inferen...
[15:08:45] going afk for a bit!
[15:28:41] Lift-Wing, Documentation, Machine-Learning-Team (Active Tasks): Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (achou)
[15:31:04] Lift-Wing, Documentation, Machine-Learning-Team (Active Tasks): Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (achou) That's a good idea, Luca :) Hi @Miriam @Diego @Isaac @fkaelin @MunizaA @Htriedman! ML team has been writing the Lift Wing documentation so people can g...
[16:11:57] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (elukey) Another note - Initially I thought it could have been the routing between istio and knative to add latency, somehow getting worse as more...
[16:17:44] going afk for the weekend folks!
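A sketch of how the 200k+ sidecar metric lines seen earlier in the afternoon could be narrowed down before building a dashboard, following the istio envoy-stats page linked above; the metric prefixes and the annotation mentioned below are assumptions based on Envoy/istio documentation, and the pid is the one from the log.

    # Query the Envoy admin endpoint from the pod's network namespace, as done on ml-serve2003 above:
    PID=3672660   # istio-proxy pid on the node, taken from the log
    sudo nsenter -t "$PID" -n curl -s localhost:15000/stats/prometheus | wc -l

    # Keep only series a dashboard would chart, e.g. upstream request counts and latency histograms
    # (metric names assumed from Envoy's Prometheus naming):
    sudo nsenter -t "$PID" -n curl -s localhost:15000/stats/prometheus \
        | grep -E '^envoy_cluster_upstream_(rq_total|rq_time_bucket)' | wc -l

    # The istio page above also documents a proxy-level allow-list (e.g. the
    # sidecar.istio.io/statsInclusionPrefixes annotation), which is probably what the
    # "whitelist the ones that we use" idea maps to.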
[16:17:50] have a nice weekend :)
[16:17:55] same to you
[16:18:03] See you Tuesday (Monday is a holiday here)
[16:18:09] ack!
[16:37:57] Have a nice weekend Luca and Tobias! :)
[16:55:25] you too, Aiko :)