[01:54:49] viwiki noooo
[08:07:09] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9773668 (kevinbazira) Thank you for the confirmation, @elukey! Since we are working towards implementing the new trans...
[09:36:18] (PS1) Elukey: huggingface: upgrade base image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1028393 (https://phabricator.wikimedia.org/T362984)
[09:36:38] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9773856 (elukey) Hi Kevin! So https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/admin_n...
[09:36:41] (CR) CI reject: [V:-1] huggingface: upgrade base image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1028393 (https://phabricator.wikimedia.org/T362984) (owner: Elukey)
[10:38:33] Lift-Wing, Machine-Learning-Team, Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9773973 (elukey) >>! In T362984#9768972, @elukey wrote: > ` > == Step 2: publishing == > Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1....
[10:38:37] hello folks!
[11:28:49] Heyo Luca
[11:29:08] (still at the doc's. stationary bike broke. it wasn't me!)
[12:27:03] back!
[12:27:07] going to depool eqiad
[12:28:16] done
[12:31:43] Ack, keeping an eye on Grafana and logstash
[12:50:54] deploying
[12:51:57] first, coredns changes
[12:52:40] then knative/istio ones
[12:56:26] and now starting with the isvcs
[13:01:16] Morning all
[13:03:14] o/
[13:03:22] this rollout will also add the logging changes etc..
[13:07:21] hey chris
[13:10:34] elukey: I am seeing an increase in 0-code responses for RR in eqiad, but I suspect that's just residual traffic
[13:10:54] (~8qps 200s, ~4qps 0s)
[13:11:09] graph?
[13:11:12] sec
[13:11:25] https://grafana.wikimedia.org/goto/N4Ps8vLIg?orgId=1
[13:11:37] That is just the RR NS
[13:14:12] could be yes, there is also a MW API outage sigh
[13:15:07] Of course. Must be a Monday
[13:16:09] seems auto-resolved and probably not an issue, very weird
[13:16:22] I am trying to contact the isvcs in eqiad to force istio to get the new config
[13:17:21] Think it might be the same latency thing you saw in codfw (initially taking 30+s to work)?
[13:17:35] for sure yes
[13:23:39] Sending to inference.svc.eqiad.wmnet...
[13:23:40] PASS: 114 requests sent to inference.svc.eqiad.wmnet. All assertions passed.
[13:23:44] ok back in business :)
[13:24:03] if you folks want to spot-check I think we should be good to repool
[13:26:51] I'll do some poking and prodding, but you can repool anytime from my POV
[13:28:02] repooled!
[13:28:10] we are officially off api-ro :)
[13:30:36] Machine-Learning-Team, MW-on-K8s, serviceops, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9774366 (elukey) And eqiad migrated as well, all done :)
[13:30:52] Machine-Learning-Team, MW-on-K8s, serviceops, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9774368 (elukey)
[13:31:04] Nice work!
[13:31:15] aw shit!
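(The "PASS: 114 requests sent to inference.svc.eqiad.wmnet. All assertions passed." line above comes from an internal smoke-test tool. A minimal sketch of that kind of spot-check follows; the host, model name, payload shape, and the `/v1/models/<name>:predict` path are assumptions based on the log and common KServe conventions, not the actual tool.)

```python
# Hypothetical spot-check sketch: POST the same scoring request n times
# and report PASS only if every response is a 200. The `send` callable is
# injectable so the logic can be exercised without network access.
import json
import urllib.request

HOST = "inference.svc.eqiad.wmnet"  # assumed gateway name from the log


def check_isvc(model, payload, send=None, n=5):
    """Send n identical requests to an isvc and count non-200 responses."""
    def default_send(body):
        req = urllib.request.Request(
            f"https://{HOST}/v1/models/{model}:predict",  # assumed path
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status

    send = send or default_send
    body = json.dumps(payload).encode()
    failures = sum(1 for _ in range(n) if send(body) != 200)
    status = "PASS" if failures == 0 else "FAIL"
    print(f"{status}: {n} requests sent to {HOST}, {failures} failed.")
    return failures == 0
```

(A code-0 response in the Grafana graphs corresponds to the client never getting an HTTP status back, which is why the timeout matters here.)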
[13:31:18] so cool
[13:31:20] Machine-Learning-Team: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622#9774374 (elukey)
[13:31:39] Machine-Learning-Team: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622#9774376 (elukey) The changes have been successfully deployed on all Lift Wing clusters.
[13:31:58] and we have the transparent proxy settings etc..
[13:36:30] elukey: I think we have itwiki exhibiting the ruwiki-alike latency behavior
[13:36:39] rr-eq-damaging
[13:36:43] in eqiad
[13:37:12] ~40% of responses are code 0 and the avg latency is approaching 30s
[13:37:21] rr-eq-damaging?
[13:38:19] anyway I think it happens from time to time to a lot of revscoring isvcs, we'll need to improve our settings for usre
[13:38:22] *sure
[13:38:31] er revscoring, of course, not rr
[13:38:37] I checked the mw-api-int-ro status in eqiad and we are good afaics
[13:39:01] https://grafana.wikimedia.org/goto/vPNdlDYSg?orgId=1
[13:39:12] it seems to be recovering
[13:39:23] Currently poking at logstash to find a bad req
[13:40:47] there are the viwiki logs to check too, over the weekend it failed two times..
[13:41:03] still not sure if it is a single bad request, for sure the pattern seems to be CPU usage going high
[13:50:33] Lift-Wing, Machine-Learning-Team, Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9774417 (elukey) Ah wow my bad! I inspected the docker image and it contains a ton of Nvidia binaries. Will review again the install procedure, really sneaky.
[13:58:02] very sneaky, if the --extra-index-url is not precise then we install pytorch-nvidia
[13:58:37] the final image is smaller, but the docker layer with the packages is bigger than 4G
[13:58:42] so the docker registry doesn't allow it
[13:58:49] oof.
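(Context on the 13:58 messages: PyTorch publishes ROCm builds on a separate wheel index, e.g. `pip install torch --index-url https://download.pytorch.org/whl/rocmX.Y`; if the index URL is missing or imprecise, pip falls back to the default CUDA-bundled wheel, which drags in `nvidia-*` dependency packages and can push a Docker layer past the registry's 4G compressed limit. A hedged post-install sanity check might look like this; the `nvidia-` package-name convention is an assumption based on common PyPI naming.)

```python
# Sketch: after building the image, list installed distributions that look
# like NVIDIA CUDA wheels, i.e. evidence that pip resolved the wrong
# (CUDA) torch build instead of the intended ROCm one.
from importlib import metadata


def cuda_contamination(dists=None):
    """Return installed distribution names that look like NVIDIA CUDA wheels.

    `dists` is injectable for testing; by default it inspects the current
    environment via importlib.metadata.
    """
    if dists is None:
        dists = [d.metadata["Name"] for d in metadata.distributions()]
    return sorted(d for d in dists if d and d.lower().startswith("nvidia-"))
```

(Run inside the built container: an empty result suggests the ROCm index was used; any `nvidia-cublas-*`-style names mean the CUDA wheel sneaked in.)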
[13:59:17] How far over the limit is it?
[13:59:43] ~1G more, but I am rebuilding the image with the correct rocm version
[13:59:53] I'll check locally before sending the patch, that should work
[14:00:02] but the compressed 4G limit may bite us in the future
[14:00:33] rocm clearly superior to nvidia, because smaller ;)
[14:44:01] Machine-Learning-Team, Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9774554 (kevinbazira) As we work to enable the logo-detection model-server to access images from the upload stash using a k8s endpoint, @achou pointed out that files...
[15:10:51] Machine-Learning-Team, Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9774678 (mfossati) >>! In T362749#9774553, @kevinbazira wrote: > @achou pointed out that files might not be accessible since the [[ https://www.mediawiki.org/wiki/Up...
[15:26:39] (CR) Kevin Bazira: "Thank you for pointing this out, Aiko! I looped in Marco who shared the stash URLs: https://phabricator.wikimedia.org/T362749#9774553" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023542 (https://phabricator.wikimedia.org/T363449) (owner: Kevin Bazira)
[16:34:21] I filed a proposal to mitigate the current issue we have with 50x: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1028552
[16:34:48] summary: the revscoring isvcs are slow with mixed rev-id traffic, we need to scale up sooner
[16:34:56] we should have plenty of capacity in prod for this
[16:35:28] I am afraid we'll probably need to go lower than 10, but it seems a good start
[16:35:34] lemme know your thoughts :)
[16:36:11] logging off folks!
[16:36:16] have a nice rest of the day
[17:41:55] This is definitely worth discussing given that la-rr exists
[17:44:00] chrisalbon: definitely yes, but draining the current revscoring clients will be a long game, and the current status is not great.. I'd fear to drive community members away before we can move them to RR
[17:44:11] and of course even with scaling, RR performs way better than revscoring
[17:44:37] so no real chance that improving the current status will go against moving to RR
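(The "go lower than 10" at 16:35 presumably refers to the autoscaler's per-pod concurrency target. In a Knative/KServe setup, desired replicas are roughly `ceil(observed_concurrency / target)`, so lowering the target makes an isvc scale up sooner under the same load. A toy model of that arithmetic — the formula is the standard Knative KPA sizing rule; the numbers below are illustrative, not from the actual chart change:)

```python
# Sketch of Knative-KPA-style sizing: lowering the per-pod concurrency
# target increases the desired replica count for the same observed load.
import math


def desired_replicas(concurrency, target, min_replicas=1, max_replicas=8):
    """Replicas the autoscaler would want: ceil(concurrency / target),
    clamped to the configured min/max."""
    want = math.ceil(concurrency / target)
    return max(min_replicas, min(max_replicas, want))
```

At an observed concurrency of 25 in-flight requests, a target of 10 asks for 3 pods while a target of 5 asks for 5, which is the "scale up sooner" effect the proposal is after.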