[07:31:50] hello folks
[07:45:56] kevinbazira_: o/ you can deploy anytime :)
[07:46:21] I got an idea to simplify the revscoring_inference_services config further
[07:46:51] niiiice ...
[07:47:15] thanks for the merge. let me start the deployment process.
[07:59:04] both eqiad and codfw deployments have been completed successfully.
[07:59:13] gooood!!
[07:59:13] now checking pods ...
[07:59:30] once you are done I am going to start the os upgrade work in eqiad
[08:00:59] all 5 new pods are up and running:
[08:01:01] NAME READY STATUS RESTARTS AGE
[08:01:02] glwiki-reverted-predictor-default-fl4ph-deployment-66f98746tv2g 2/2 Running 0 9m18s
[08:01:02] hewiki-damaging-predictor-default-7dvx5-deployment-64f5799l64p6 2/2 Running 0 9m17s
[08:01:02] hewiki-goodfaith-predictor-default-s6hd6-deployment-7859f4rktlg 2/2 Running 0 9m15s
[08:01:02] hiwiki-damaging-predictor-default-h72rp-deployment-bcc77fdmz5cm 2/2 Running 0 9m14s
[08:01:02] hiwiki-goodfaith-predictor-default-8bmz7-deployment-98fc5b7rlp8 2/2 Running 0 9m12s
[08:01:41] elukey: please proceed with the eqiad os upgrade. thanks.
[08:04:40] kevinbazira: have you also tried to hit the endpoints?
[08:16:23] yes I did. glwiki's prediction:
[08:16:25] "Host: glwiki-reverted.revscoring-editquality.wikimedia.org"
[08:16:26] HTTP/2 200
[08:16:26] content-length: 114
[08:16:26] content-type: application/json; charset=UTF-8
[08:16:26] date: Tue, 22 Feb 2022 08:15:02 GMT
[08:16:28] server: istio-envoy
[08:16:30] x-envoy-upstream-service-time: 318
[08:16:32] {"predictions": {"prediction": false, "probability": {"false": 0.9948811817984956, "true": 0.005118818201504425}}}
[08:16:35] real 0m0.345s
[08:16:37] user 0m0.016s
[08:16:39] sys 0m0.008s
[08:18:42] hewiki's prediction:
[08:18:42] "Host: hewiki-damaging.revscoring-editquality.wikimedia.org"
[08:18:43] HTTP/2 200
[08:18:43] content-length: 84
[08:18:43] content-type: application/json; charset=UTF-8
[08:18:43] date: Tue, 22 Feb 2022 08:18:08 GMT
[08:18:45] server: istio-envoy
[08:18:47] x-envoy-upstream-service-time: 502
[08:18:49] {"predictions": {"prediction": false, "probability": {"false": 0.95, "true": 0.05}}}
[08:18:51] real 0m0.552s
[08:18:53] user 0m0.017s
[08:18:55] sys 0m0.009s
[08:20:06] "Host: hewiki-goodfaith.revscoring-editquality.wikimedia.org"
[08:20:06] HTTP/2 200
[08:20:06] content-length: 114
[08:20:06] content-type: application/json; charset=UTF-8
[08:20:06] date: Tue, 22 Feb 2022 08:19:49 GMT
[08:20:07] server: istio-envoy
[08:20:08] x-envoy-upstream-service-time: 424
[08:20:10] {"predictions": {"prediction": true, "probability": {"false": 6.503285107484214e-05, "true": 0.9999349671489252}}}
[08:20:13] real 0m0.477s
[08:20:15] user 0m0.022s
[08:20:17] sys 0m0.005s
[08:21:21] "Host: hiwiki-damaging.revscoring-editquality.wikimedia.org"
[08:21:21] HTTP/2 200
[08:21:21] content-length: 114
[08:21:21] content-type: application/json; charset=UTF-8
[08:21:21] date: Tue, 22 Feb 2022 08:21:04 GMT
[08:21:22] server: istio-envoy
[08:21:23] x-envoy-upstream-service-time: 516
[08:21:25] {"predictions": {"prediction": false, "probability": {"false": 0.9999536447655532, "true": 4.63552344468446e-05}}}
[08:21:30] real 0m0.566s
[08:21:32] user 0m0.018s
[08:21:34] sys 0m0.007s
[08:25:56] hahaha kevinbazira I trusted a simple "yep all good!" as well :D
[08:26:08] hihih :D
[08:38:39] need to run an errand for a bit, I'll start the os reimage afterwards!
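For reference, a check like the ones pasted above might be reproduced with curl. This is a minimal sketch: the gateway address and the rev_id payload are assumptions not shown in the log, while the Host header and response shape come from the paste, and /v1/models/<name>:predict is the standard kserve v1 data-plane path.

# ISTIO_GATEWAY is a placeholder for the cluster's ingress address;
# the rev_id value is illustrative.
time curl -s -i \
  -H "Host: glwiki-reverted.revscoring-editquality.wikimedia.org" \
  -H "Content-Type: application/json" \
  -d '{"rev_id": 12345}' \
  "https://ISTIO_GATEWAY/v1/models/glwiki-reverted:predict"

The time prefix accounts for the real/user/sys lines in the paste.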
[09:59:25] back :)
[09:59:37] trying to test my kserve-inference change, let's see if it works
[10:40:06] kevinbazira: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764741/1/helmfile.d/ml-services/revscoring-editquality/values.yaml is the idea
[10:40:11] lemme know if it is better or not
[10:40:42] great. let me check.
[10:45:21] this is a very nice idea, elukey. it keeps things DRY. i've +1'd.
[10:45:39] kevinbazira: thanks! I found a little issue, snap
[10:45:58] "enwiktionary" doesn't contain "wiki"
[10:46:06] so the s3 path gets changed
[10:46:17] I need to fix the regex
[10:46:18] sigh
[10:46:28] whoops ... could we take care of that in a condition?
[10:46:29] anyway, a little corner case, hopefully solved soon
[10:46:53] the important bit is whether you liked the idea of fewer variables etc..
[10:47:01] looks like it is the right direction
[10:47:12] yep ... the idea is fantastic!
[10:49:09] <3
[13:04:16] all ml-serve nodes on bullseye and overlay!
[13:14:06] :thumbsup:
[13:20:03] of course I missed some changes during the last refactoring https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764758
[13:20:06] uff
[13:28:43] mmmm and now I see some connection timeouts in storage initializers
[13:30:01] o/ sorry stupid question.. what is bullseye and overlay here?
[13:31:48] aiko: no stupid questions :) Bullseye is the codename for Debian 11 (the OS) and overlay is the storage driver that Docker now uses on our ml-serve nodes
[13:32:06] we upgraded from Debian 10 and devicemapper (another storage driver)
[13:32:20] it is part of a major migration of all k8s clusters to a new os and storage driver
[13:34:52] Thanks Luca :) I see
[13:46:07] ah interesting
[13:46:25] so on ml-serve-eqiad I tried to deploy the new refactoring and all the isvc are not ready now
[13:46:47] trying to get what's wrong, and in `kubectl get events -n revscoring-editquality` I see
[13:46:50] Error creating: pods "hiwiki-goodfaith-predictor-default-2n8hv-deployment-8647d4j28q8" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=2,requests.cpu=2, used: limits.cpu=90,requests.cpu=90, limited: limits.cpu=90,requests.cpu=90
[13:47:59] ahhh we have a default
[13:55:02] good to know though
[13:55:10] let's expand it
[13:58:51] so we are starting to see some nodes full of pods
[13:59:19] surely due to the drain/uncordon cycles
[13:59:49] I am a little worried about capacity vs the number of kserve pods that we are scheduling
[14:00:03] it is true that in eqiad we'll get 4 more nodes
[14:03:07] basically https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764787
[14:06:43] kevinbazira: how many pods do we need to add to every revscoring namespace, more or less?
[14:06:56] I guess that the bulk are the editquality ones
[14:13:02] Morning all!
[14:13:54] morning!
[14:13:56] also
[14:13:56] failed to create private K8s Service: Internal error occurred: failed to allocate a serviceIP: range is full
[14:14:53] of course all past knative revisions hold an ip address
[14:15:45] ~# kubectl get svc -n revscoring-editquality | egrep "^enwiki" | wc -l
[14:15:45] 74
[14:15:48] * elukey cries in a corner
[14:26:25] managing this many pods in the same namespace is probably difficult
[14:26:38] elukey: I believe we'll add about 76 editquality, 13 articlequality, 11 articletopic, 10 drafttopic, 2 draftquality.
[14:26:45] we may need to segment revscoring-editquality into multiple ones
[14:26:47] a total of 112 revscoring pods.
[14:26:56] this is if we are going with only predictors.
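The quota error above can be inspected directly with kubectl. A sketch follows: the quota name comes from the error message itself, but the raised values are illustrative, and in practice the expansion went through the deployment-charts change linked at 14:03 rather than a live patch.

# Show current usage vs. hard limits for the quota named in the event:
kubectl describe resourcequota quota-compute-resources -n revscoring-editquality
# Raising the cap by hand would look like this (illustrative values; the
# quota is actually managed via helmfile/deployment-charts):
kubectl patch resourcequota quota-compute-resources -n revscoring-editquality \
  --type=merge -p '{"spec":{"hard":{"limits.cpu":"120","requests.cpu":"120"}}}'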
the number might double for each revscoring namespace that requires transformers.
[14:35:04] I am cleaning up old knative revisions manually, I hope that there is a way to automatically cap them
[14:35:09] like "keep the last two"
[14:39:21] like https://knative.dev/development/serving/configuration/revision-gc/
[14:39:27] will it be in our version?
[14:53:52] basically https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764799
[14:58:34] mmm doesn't work
[14:58:38] in codfw I mean
[15:28:51] Day'O'Meetings!
[15:29:11] but you are out of the 10 days now right?
[15:29:16] I am!
[15:29:16] (at least a good thing :D)
[15:29:21] I am hyped
[15:29:35] Sadly sounds like accraze is on day 1
[15:41:29] Heads up for those who haven't noticed: Slack is having a wobbly moment currently, so DMs and the like there may be delayed
[15:45:36] IRC is the new slack!
[15:45:48] It's 1998 again!
[15:53:27] o/
[15:54:17] hanging out a bit before my kid wakes up
[15:56:01] Morning!
[16:00:50] elukey: yeah i was wondering about how many pods we could put in a single namespace last week
[16:01:01] accraze: morning :)
[16:01:13] I am battling a bit with old revisions and strange old pods in codfw atm
[16:01:25] leftover revisions?
[16:02:08] there are old revisions for sure, but some of the pods are stuck in init state since they time out contacting thanos for the model
[16:02:18] and I haven't yet deployed the refactor that I did this morning there
[16:04:39] i've seen pods stuck in init state on ml-sandbox when there was an issue w/ the service-account connecting to storage
[16:05:16] but I see connect timeouts logged reaching thanos
[16:05:18] that is weird
[16:05:34] I killed a healthy pod and it worked fine
[16:05:42] hmm yeah that is really weird
[16:06:14] I have drained, reimaged, etc. all nodes recently so maybe some pods suffered
[16:06:34] zombie pods :)
[16:08:48] Hi people. Got here from chrisalbon's tweet o/
[16:08:49] I'm a data scientist from Brazil.
[16:08:49] Great fan of the wikimedia team, especially the transparency aspect. Public board and public model cards are good instances of this.
[16:08:50] Congrats on the great work. Keep pushing!
[16:09:13] Thanks! And Welcome!
[16:11:46] accraze: I am deleting the related revisions, it is quicker
[16:13:18] yeah worked
[16:13:23] niiiice
[16:13:32] I think that those may have been created when the nodes were drained etc..
[16:13:42] not nice
[16:13:49] lol good point
[16:14:15] that's probably what the kserve controller does under the hood right?
[16:14:46] or something to that effect
[16:15:02] it may also be knative itself
[16:16:33] these are the kind of things that I was scared about when we decided to stick with knative 0.18
[16:16:48] ^^^
[16:17:41] i mean, if we can bump the k8s version, we could eventually bypass knative completely
[16:18:19] what do you mean?
[16:19:16] there is a raw k8s deployment mode for kserve now
[16:19:24] it's still alpha tho :/
[16:19:39] https://kserve.github.io/website/admin/kubernetes_deployment/
[16:19:50] but knative is not required
[16:20:12] yeah but it offers a lot of nice things
[16:20:19] all the A/B testing etc..
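The manual cleanup mentioned above (16:11) might look like the following sketch: <old-revision> is a placeholder, and newer knative releases can do this automatically via the revision-gc settings linked at 14:39.

# List knative revisions oldest-first to spot the stale ones holding serviceIPs:
kubectl get revisions.serving.knative.dev -n revscoring-editquality \
  --sort-by=.metadata.creationTimestamp
# Then delete them one by one:
kubectl delete revisions.serving.knative.dev <old-revision> -n revscoring-editquality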
[16:20:38] with k8s 1.2x we'll also upgrade to knative 1.0
[16:20:44] way more stable
[16:20:51] ahhh good point
[16:20:54] but yeah we need to upgrade k8s first
[16:20:57] looong way to go
[16:21:08] I'll probably work on it next Q with service ops
[16:21:22] this q we are focusing only on bullseye and overlayfs
[16:21:26] "only" :D
[16:21:53] lolol
[16:23:08] accraze: did you see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764741/2/helmfile.d/ml-services/revscoring-editquality/values.yaml ?
[16:23:21] (lemme know your thoughts when you have time, Kevin liked it)
[16:27:29] elukey: yeeessss that's awesome!
[16:27:57] only thought is that we may need some logic to handle the non-wikipedia.org hosts, but i see you and kevin caught that edge case already
[16:28:35] +1 for removing additional boilerplate
[16:30:48] accraze: there is some logic to handle the s3 path for non-wikipedia hosts, so it is taken care of (in theory), but we may see other corner cases
[16:31:15] for example, in the s3 path we use the wiki name (like "en") and we append "wiki"
[16:31:45] but there is logic now to recognize wiki names ending in "wiki" or "wiktionary", and to avoid the extra suffix in that case
[16:31:55] (following our current naming structure on swift)
[16:31:57] so it all works
[16:32:04] but if you have other things in mind lemme know
[16:33:06] cool, that all sounds good for now, nice one!
[16:33:58] ack :)
[16:34:58] about the number of pods - the refactoring that I did triggered the recreation of all pods, which required some extra allowance to have multiple pods up at the same time (before terminating the old ones)
[16:35:21] it is fine, but IIUC we are halfway through editquality
[16:35:43] (Kevin reported ~70 pods to reach, I count 35)
[16:36:11] as the namespace grows we may get into more headaches
[16:36:20] maybe we could split goodfaith and damaging
[16:37:04] (so dedicated namespaces and helmfile configs etc..)
[16:37:26] it is true that we'll also likely not touch the revscoring models a lot once we are in steady state
[16:38:01] soooo yeah I am not sure what's best
[16:38:17] probably splitting is a wise choice
[16:38:23] let me know your thoughts
[16:45:42] hmmm yeah lemme think about this a bit - splitting seems to make sense
[16:58:47] something like
[16:58:54] revscoring-editquality-goodfaith
[16:58:58] and -damaging
[16:59:14] we keep the same structure but two separate namespaces
[17:24:54] going afk, have a nice day folks!
[17:36:05] Bye elukey!
[17:57:46] Machine-Learning-Team, Observability-Logging, SRE: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (colewhite) This caused a significant rise in dead letters on the logging pipeline today which caused most collectors to [[ https://grafana.wikimedia.org/d/...
[23:10:51] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: Add editquality isvc configurations to ml-services helmfile - https://phabricator.wikimedia.org/T301415 (ACraze) After talking with @elukey and @kevinbazira on IRC, it seems that...
[23:33:07] (PS1) Accraze: editquality: handle revision not found error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/764915 (https://phabricator.wikimedia.org/T300270)
[23:38:25] y'all still have office hours? I'd like to come say hi sometime
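To make the suffix rule described at 16:31 concrete, here is a hypothetical shell sketch. The function name is made up and the real logic lives in the chart's templates, but the rule matches what was described in the chat.

# Wiki names already ending in "wiki" or "wiktionary" are kept as-is;
# bare language codes get "wiki" appended before building the s3 path.
model_wiki_dir() {
  local wiki="$1"
  case "$wiki" in
    *wiki|*wiktionary) printf '%s\n' "$wiki" ;;
    *)                 printf '%swiki\n' "$wiki" ;;
  esac
}
model_wiki_dir en            # -> enwiki
model_wiki_dir enwiktionary  # -> enwiktionary (the corner case from this morning)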