[06:04:51] (PS1) Kevin Bazira: nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344)
[06:42:55] o/ good morning
[07:12:09] (CR) Ilias Sarantopoulos: [C:+1] nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[07:49:02] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[07:53:12] (PS2) Kevin Bazira: nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344)
[07:54:42] (CR) Kevin Bazira: [C:+2] nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[07:55:32] (Merged) jenkins-bot: nsfw-model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1054087 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[08:24:15] Machine-Learning-Team: Automate the procedure to bootstrap minikube on the ML-Sandbox and to share it by multiple users - https://phabricator.wikimedia.org/T305447#9980295 (klausman) Open→Declined In the context of T367537 we discovered that minikube is not really used anymore. The VM was still u...
[09:17:45] morning!
[09:18:33] o/
[10:05:52] * klausman lunch
[10:32:43] * isaranto lunch
[12:50:36] I just pushed the Istio changes from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1052702 to the staging cluster. Nothing exploded and httpbb reports all green
[12:52:54] nice!
[12:55:00] I'll do prod-codfw tomorrow and eqiad on Wednesday
[12:56:27] ack
[13:28:02] I'm trying to find the scale up/down history of isvcs, basically what we get from running `kubectl get events`. Any ideas?
[13:28:36] I was looking for the logs of a pod under kube-system (e.g. kube-metrics-xx) but can't find anything yet.
[13:31:55] in ml-serve-eqiad, you can see some events with `# kubectl get events -n revscoring-editquality-damaging | grep ScalingReplicaSet`
[13:32:08] (minus the leading `#`)
[13:32:44] Good morning all!
[13:32:58] Heyo Chris, welcome back!
[13:35:23] Hey Chris!
[13:37:16] klausman: thanks for the info, however I am looking at the history of this, trying to access logs from days ago (I am looking at the Saturday alerts).
[13:37:47] is it available in logstash?
[13:38:46] possibly, but the message format would be different
[13:41:24] Going to https://logstash.wikimedia.org/app/discover# and searching for `ScalingReplicaSet` would be my first step
[13:45:45] thanks, you're right, I'll work with that!
[14:11:19] I managed to get exactly what I want -> https://logstash.wikimedia.org/goto/404556c0ea5d6fdabd660eefa34a8749
[14:13:16] no up/down scaling for enwiki-damaging. I would suggest that by reducing the number of max replicas and enabling multiprocessing we'd have better results
[14:13:40] that way we wouldn't drain the resources (as mp would require more cpu allocation to the pods)
[14:13:52] I'll prepare something to share/discuss in our meeting
[14:16:42] nice work!
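(A minimal sketch of the approach discussed above for finding isvc scale up/down history, assuming kubectl access to the ml-serve-eqiad cluster; the namespace and the `ScalingReplicaSet` search term come from the conversation, the rest is illustrative.)

    # Recent scale up/down events in the namespace, oldest first.
    kubectl get events -n revscoring-editquality-damaging \
      --sort-by=.lastTimestamp | grep ScalingReplicaSet

    # Kubernetes only retains events for a short window (typically about an
    # hour), so older history has to come from Logstash instead, e.g. by
    # searching for "ScalingReplicaSet" in https://logstash.wikimedia.org/app/discover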
[14:49:28] I redeployed the gemma2 27b model which failed last week and now it was fine
[14:51:00] Machine-Learning-Team: Simplify dependencies in hf image - https://phabricator.wikimedia.org/T369359#9981630 (isarantopoulos) I re-deployed the 27b model today and it is running fine: ` time curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: gemma2-27b-it.experimental.wik...
[14:53:12] Machine-Learning-Team: Simplify dependencies in hf image - https://phabricator.wikimedia.org/T369359#9981636 (isarantopoulos) Open→Resolved Resolving this as the previous issue that occurred during a deployment (https://phabricator.wikimedia.org/T369359#9974140) doesn't have anything to do with t...
[15:01:58] klausman: o/ when you have time can we deploy knative in eqiad?
[15:02:10] Sure!
[15:02:40] I'm about done for today with the knative-serving NP update, so now is as good a time as any
[15:04:14] elukey: what was your plan regarding the helmfile approach?
[15:06:35] klausman: not sure, we can try to deploy and see what error is returned. The last time it was sufficient to just remove an _example + data in config-something
[15:06:52] yeah, so lemme quickly try that
[15:07:24] last chance to stop me :)
[15:07:39] go ahead :)
[15:10:30] Same error as last time (webhook call failures)
[15:11:25] trying `kubectl edit configmaps -n knative-serving config-autoscaler`
[15:13:54] edit complete, re-doing apply
[15:17:25] apply done, no more examples in diff (there's still a services update for mariadb, doing that rn)
[15:17:42] correction: s/mariadb/cas-idp/
[15:18:25] all done, no crashy services
[15:19:46] Trying a manual bounce of a revertrisk pod, see if it comes back ok
[15:20:32] Also running httpbb on all LW services in eqiad.
[15:20:59] PASS: 114 requests sent to inference.svc.eqiad.wmnet. All assertions passed.
[15:21:05] \o/
[15:22:23] great! I was curious whether that error for articletopic would pop up again
[15:25:16] klausman: looks good! Just to be sure, check the knative logstash dashboard to see if any weird log pattern pops up
[15:25:21] not sure if we did it for codfw
[15:25:35] precaution, just in case something is being logged that may indicate an issue
[15:25:45] (like a ton of logs, nothing small)
[15:26:40] also folks, not sure if you saw the email to ops@, but you should have a diff for the envoy docker image used for ores-legacy and rec-api-ng in helmfile
[15:26:56] the new image is running on bookworm, if you can deploy it when you have time (not urgent) it would be super great :)
[15:29:09] ack! I'll do that tomorrow morning to be sure
[15:33:39] elukey: I see some "Failed probing pods", not sure yet if that is new
[15:35:16] Doesn't seem new, but there was a spike during the update
[15:35:20] there is also a helm alert in #operations, sigh
[15:38:35] should we rollback?
[15:40:01] not sure, I think it is just helm that considers the release failed, but the deployment went through
[15:40:16] Could we do a noop release to clear the state?
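(For reference, a sketch of the apply workaround described above, assuming the standard deployment-charts helmfile layout; the configmap and release names come from the chat, while the `ml-serve-eqiad` environment name is an assumption.)

    # Remove the leftover "_example" stanza that trips the knative webhook,
    # then re-run the apply for the knative-serving release.
    kubectl edit configmaps -n knative-serving config-autoscaler
    helmfile -e ml-serve-eqiad diff
    helmfile -e ml-serve-eqiad apply

    # If only a restart of the pods is needed, the roll-restart sync mentioned
    # below can be used instead:
    helmfile -e ml-serve-eqiad --state-values-set roll_restart=1 sync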
[15:40:22] if there was a way to mark a release as "done" it would be nice
[15:40:32] that too, not sure if it'd clear though
[15:40:55] We can also try `helmfile -e $CLUSTER --state-values-set roll_restart=1 sync` first
[15:42:48] but that is not a helm release, afaict
[15:43:20] yeah, just hoping to make Helm see that the release is fine/everything came back
[15:45:57] IIUC from helm history, in codfw the deploy failed due to a call to the webhook, and then it somehow rolled back
[15:46:16] but only in helm metadata, the config stayed deployed
[15:46:35] in eqiad something different happened, the rollback failed and now the status is marked as failed
[15:46:39] Sorta, it said it tried to roll back but then said it couldn't, leaving us with the rolled-forward state, but "somehow" helm was ok with the state
[15:46:51] yeah :-/
[15:47:14] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9982111 (mfossati) Open→In progress
[15:47:26] I wonder what happens if I just run apply again
[15:47:48] no prompt, just looks like "no diff"
[15:49:40] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9982125 (mfossati)
[15:49:55] at this point I'd try `helm rollback knative-serving 38 -n knative-serving`
[15:50:09] the 38 is the last known good release, before the 39 that is marked as failed
[15:50:13] elukey: I got something!
[15:50:14] from helm history
[15:50:39] the config-deployment object still has an _example stanza
[15:50:56] I could try manually deleting it
[15:51:10] sure let's do it, but helm's status will stay the same
[15:51:27] let's clean up and possibly rollback, I think it will not do much other than clearing that knative state
[15:51:33] worst case we re-deploy
[15:52:02] ack, will do a rollback
[15:52:24] Rollback was a success! Happy Helming!
[15:52:53] perfect, the pods look the same
[15:52:57] Though helmfile diff shows nothing?
[15:53:29] I think it was just to clear out the helm state, that was messed up
[15:53:42] and the alert is gone, too
[15:54:02] knative definitely needs an upgrade
[15:54:23] Agreed
[16:03:49] going afk folks, have a nice evening/rest of day o/
[16:07:10] seeya, Ilias
[16:32:55] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9982313 (mfossati) Thanks @kevinbazira for the prompt action. @klaus...
[20:03:12] (CR) Abijeet Patro: Recommend articles to translate based on topic (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
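(For reference, a sketch of the Helm state cleanup performed above around 15:49-15:52, assuming Helm 3 semantics; the release name and revision number come from the conversation and will differ per cluster, and the `ml-serve-eqiad` environment name is an assumption.)

    # Inspect the release history to find the last known-good revision
    # (here 38, before the revision 39 that was marked as failed).
    helm history knative-serving -n knative-serving

    # Roll back to it. Since the intended config was already applied, this
    # mainly resets the release status that was stuck as "failed".
    helm rollback knative-serving 38 -n knative-serving

    # A subsequent diff should be empty if the cluster already matches the charts.
    helmfile -e ml-serve-eqiad diff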