[06:04:51] (PS1) Kevin Bazira: nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344)
[06:42:55] o/ good morning
[07:12:09] (CR) Ilias Sarantopoulos: [C:+1] nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[07:49:02] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[07:53:12] (PS2) Kevin Bazira: nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344)
[07:54:42] (CR) Kevin Bazira: [C:+2] nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[07:55:32] (Merged) jenkins-bot: nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[08:24:15] Machine-Learning-Team: Automate the procedure to bootstrap minikube on the ML-Sandbox and to share it by multiple users - https://phabricator.wikimedia.org/T305447#9980295 (klausman) Open→Declined In the context of T367537 we discovered that minikube is not really used anymore. The VM was still u...
[09:17:45] morning!
[09:18:33] o/
[10:05:52] * klausman lunch
[10:32:43] * isaranto lunch
[12:50:36] I just pushed the Istio changes from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1052702 to the staging cluster. Nothing exploded and httpbb reports all green
[12:52:54] nice!
[12:55:00] I'll do prod-codfw tomorrow and eqiad on Wednesday
[12:56:27] ack
[13:28:02] I'm trying to find the scale up/down history of isvcs, basically what we get from running `kubectl get events`. Any ideas?
[13:28:36] I was looking for the logs of a pod under kube-system (e.g. kube-metrics-xx) but can't find anything yet.
[13:31:55] in ml-serve-eqiad, you can see some events with `# kubectl get events -n revscoring-editquality-damaging | grep ScalingReplicaSet`
[13:32:08] (minus the leading `#`)
[13:32:44] Good morning all!
[13:32:58] Heyo Chris, welcome back!
[13:35:23] Hey Chris!
[13:37:16] klausman: thanks for the info, however I am looking at the history of this, trying to access logs from days ago (I am looking at the Saturday alerts).
[13:37:47] is it available in logstash?
[13:38:46] possibly, but the message format would be different
[13:41:24] Going to https://logstash.wikimedia.org/app/discover# and searching for `ScalingReplicaSet` would be my first step
[13:45:45] thanks, you're right, I'll work with that!
[14:11:19] I managed to get exactly what I want -> https://logstash.wikimedia.org/goto/404556c0ea5d6fdabd660eefa34a8749
[14:13:16] no up/down scaling for enwiki-damaging. I would suggest that by reducing the number of max replicas and enabling multiprocessing we'd have better results
[14:13:40] that way we wouldn't drain the resources (as mp would require more cpu allocation to the pods)
[14:13:52] I'll prepare something to share/discuss in our meeting
[14:16:42] nice work!
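(A minimal sketch of the approach discussed above for finding isvc scale up/down history, assuming kubectl access to the ml-serve-eqiad cluster; the namespace and the `ScalingReplicaSet` search term come from the conversation, the rest is illustrative.)

    # Recent scale up/down events in the namespace, oldest first.
    kubectl get events -n revscoring-editquality-damaging \
      --sort-by=.lastTimestamp | grep ScalingReplicaSet

    # Kubernetes only retains events for a short window (typically about an
    # hour), so older history has to come from Logstash instead, e.g. by
    # searching for "ScalingReplicaSet" in https://logstash.wikimedia.org/app/discover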
[14:49:28] I redeployed the gemma2 27b model which failed last week and now it was fine
[14:51:00] Machine-Learning-Team: Simplify dependencies in hf image - https://phabricator.wikimedia.org/T369359#9981630 (isarantopoulos) I re-deployed the 27b model today and it is running fine: ` time curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: gemma2-27b-it.experimental.wik...
[14:53:12] Machine-Learning-Team: Simplify dependencies in hf image - https://phabricator.wikimedia.org/T369359#9981636 (isarantopoulos) Open→Resolved Resolving this as the previous issue that occurred during a deployment (https://phabricator.wikimedia.org/T369359#9974140) doesn't have anything to do with t...
[15:01:58] klausman: o/ when you have time can we deploy knative in eqiad?
[15:02:10] Sure!
[15:02:40] I'm about done for today with the knative-serving NP update, so now is as good a time as any
[15:04:14] elukey: what was your plan regarding the helmfile approach?
[15:06:35] klausman: not sure, we can try to deploy and see what error is returned. The last time it was sufficient to just remove an _example + data in config-something
[15:06:52] yeah, so lemme quickly try that
[15:07:24] last chance to stop me :)
[15:07:39] go ahead :)
[15:10:30] Same error as last time (webhook call failures)
[15:11:25] trying `kubectl edit configmaps -n knative-serving config-autoscaler`
[15:13:54] edit complete, re-doing apply
[15:17:25] apply done, no more examples in diff (there's still a services update for mariadb, doing that rn)
[15:17:42] correction: s/mariadb/cas-idp/
[15:18:25] all done, no crashy services
[15:19:46] Trying a manual bounce of a revertrisk pod, see if it comes back ok
[15:20:32] Also running httpbb on all LW services in eqiad.
[15:20:59] PASS: 114 requests sent to inference.svc.eqiad.wmnet. All assertions passed.
[15:21:05] \o/
[15:22:23] great! I was curious whether that error for articletopic would pop up again
[15:25:16] klausman: looks good! Just to be sure, check the knative logstash dashboard to see if any weird log pattern pops up
[15:25:21] not sure if we did it for codfw
[15:25:35] precaution, just in case something is being logged that may indicate an issue
[15:25:45] (like a ton of logs, nothing small)
[15:26:40] also folks, not sure if you saw the email to ops@, but you should have a diff for the envoy docker image used for ores-legacy and rec-api-ng in helmfile
[15:26:56] the new image is running on bookworm, if you can deploy it when you have time (not urgent) it would be super great :)
[15:29:09] ack! I'll do that tomorrow morning to be sure
[15:33:39] elukey: I see some "Failed probing pods", not sure yet if that is new
[15:35:16] Doesn't seem new, but there was a spike during the update
[15:35:20] there is also a helm alert in #operations, sigh
[15:38:35] should we rollback?
[15:40:01] not sure, I think it is just helm that considers the release failed, but the deployment went through
[15:40:16] Could we do a noop release to clear the state?
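(For reference, a sketch of the apply workaround described above, assuming the standard deployment-charts helmfile layout; the configmap and release names come from the chat, while the `ml-serve-eqiad` environment name is an assumption.)

    # Remove the leftover "_example" stanza that trips the knative webhook,
    # then re-run the apply for the knative-serving release.
    kubectl edit configmaps -n knative-serving config-autoscaler
    helmfile -e ml-serve-eqiad diff
    helmfile -e ml-serve-eqiad apply

    # If only a restart of the pods is needed, the roll-restart sync mentioned
    # below can be used instead:
    helmfile -e ml-serve-eqiad --state-values-set roll_restart=1 sync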
[15:40:22] if there was a way to mark a release as "done" it would be nice
[15:40:32] that too, not sure if it'd clear though
[15:40:55] We can also try `helmfile -e $CLUSTER --state-values-set roll_restart=1 sync` first
[15:42:48] but that is not a helm release, afaict
[15:43:20] yeah, just hoping to make Helm see that the release is fine/everything came back
[15:45:57] IIUC from helm history, in codfw the deploy failed due to a call to the webhook, and then it somehow rolled back
[15:46:16] but only in helm metadata, the config stayed deployed
[15:46:35] in eqiad something different happened, the rollback failed and now the status is marked as failed
[15:46:39] Sorta, it said it tried to roll back but then said it couldn't, leaving us with the rolled-forward state, but "somehow" helm was ok with the state
[15:46:51] yeah :-/
[15:47:14] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9982111 (mfossati) Open→In progress
[15:47:26] I wonder what happens if I just run apply again
[15:47:48] no prompt, just looks like "no diff"
[15:49:40] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9982125 (mfossati)
[15:49:55] at this point I'd try `helm rollback knative-serving 38 -n knative-serving`
[15:50:09] the 38 is the last known good release, before the 39 that is marked as failed
[15:50:13] elukey: I got something!
[15:50:14] from helm history
[15:50:39] the config-deployment object still has an _example stanza
[15:50:56] I could try manually deleting it
[15:51:10] sure let's do it, but helm's status will stay the same
[15:51:27] let's clean up and possibly rollback, I think it will not do much other than clearing that knative state
[15:51:33] worst case we re-deploy
[15:52:02] ack, will do a rollback
[15:52:24] Rollback was a success! Happy Helming!
[15:52:53] perfect, the pods look the same
[15:52:57] Though helmfile diff shows nothing?
[15:53:29] I think it was just to clear out the helm state, that was messed up
[15:53:42] and the alert is gone, too
[15:54:02] knative definitely needs an upgrade
[15:54:23] Agreed
[16:03:49] going afk folks, have a nice evening/rest of day o/
[16:07:10] seeya, Ilias
[16:32:55] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9982313 (mfossati) Thanks @kevinbazira for the prompt action. @klaus...
[20:03:12] (CR) Abijeet Patro: Recommend articles to translate based on topic (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
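(For reference, a sketch of the Helm state cleanup performed above around 15:49-15:52, assuming Helm 3 semantics; the release name and revision number come from the conversation and will differ per cluster, and the `ml-serve-eqiad` environment name is an assumption.)

    # Inspect the release history to find the last known-good revision
    # (here 38, before the revision 39 that was marked as failed).
    helm history knative-serving -n knative-serving

    # Roll back to it. Since the intended config was already applied, this
    # mainly resets the release status that was stuck as "failed".
    helm rollback knative-serving 38 -n knative-serving

    # A subsequent diff should be empty if the cluster already matches the charts.
    helmfile -e ml-serve-eqiad diff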