[06:33:06] Lift-Wing, Machine-Learning-Team: LiftWing articlecountry model logs improper json in stderr - https://phabricator.wikimedia.org/T389768#10672465 (kevinbazira) Open→In progress a: kevinbazira
[06:40:50] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:40:50] Deployment reference-need-predictor-00007-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00007-deployment - ...
[06:40:50] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:58:18] (PS1) Kevin Bazira: events: log events as JSON serialized output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130883 (https://phabricator.wikimedia.org/T389768)
[07:00:50] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[07:00:50] Deployment reference-need-predictor-00007-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00007-deployment - ...
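The task and Gerrit change above (T389768) are about making model-server events come out of stderr as properly JSON-serialized lines. A minimal sketch of that pattern, assuming a plain stdlib logger; the logger name and event fields are illustrative, not the actual inference-services code:

```python
import json
import logging
import sys

# Sketch only: the real inference-services logging setup may differ.
logger = logging.getLogger("events")
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)

def log_event(event: dict) -> str:
    """Serialize an event as a single JSON line and log it to stderr."""
    # json.dumps() guarantees parseable output (double quotes, true/false),
    # unlike logging the dict directly, which emits Python repr() text.
    line = json.dumps(event)
    logger.info(line)
    return line
```

Anything consuming the pod's stderr can then `json.loads()` each line instead of choking on single-quoted dict output.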
[07:00:50] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:16:17] (CR) Nik Gkountas: [C:+2] Optimize page collection metadata fetching with batch processing and concurrency limits [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[09:17:48] (Merged) jenkins-bot: Optimize page collection metadata fetching with batch processing and concurrency limits [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[09:31:00] \o good morning!
[10:04:00] (CR) Nik Gkountas: [C:+2] Optimize page collection metadata fetching with batch processing and concurrency limits (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[11:24:26] good morning o/
[11:52:51] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10673681 (elukey)
[11:54:18] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10673682 (elukey)
[11:55:57] klausman: o/ thanks for the changeprop staging review: https://gerrit.wikimedia.org/r/1130349
[11:55:57] please deploy this change whenever you get a minute.
[12:16:06] Ack. I've poked Hugh for his ok, since Changeprop is a bit more sensitive than the APIGW
[12:21:56] okok... thanks!
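The merged recommendation-api change above combines batching with a concurrency cap. A generic sketch of that pattern; the function names and parameters here are illustrative assumptions, not the actual recommendation-api code:

```python
import asyncio

async def fetch_all(items, fetch, batch_size=50, concurrency=5):
    """Fetch metadata for items in fixed-size batches, with a cap on
    how many batches are in flight at once.

    `fetch` is a caller-supplied coroutine taking one batch of items
    (hypothetical stand-in for the real metadata-fetching call).
    """
    sem = asyncio.Semaphore(concurrency)

    async def run(batch):
        async with sem:  # at most `concurrency` batches concurrently
            return await fetch(batch)

    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # gather() preserves input order, so results line up with `items`.
    results = await asyncio.gather(*(run(b) for b in batches))
    return [r for batch_result in results for r in batch_result]
```

The semaphore is what keeps a large page collection from fanning out into an unbounded number of simultaneous upstream requests.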
[12:29:43] elukey: I'll be re-imaging 2010 in a moment or five
[12:31:49] klausman: I am reimaging 2006, was about to write in here
[12:32:02] I saw it drained and assumed you were
[12:32:05] codfw is depooled so we can proceed in parallel
[12:32:07] yes yes
[12:42:19] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10673822 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-serve2010.codfw.wmnet with OS bookworm
[13:15:58] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10673963 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-serve2010.codfw.wmnet with OS bookworm completed:...
[13:16:09] ml-serve2006 back
[13:16:19] ditto for 2010
[13:16:51] I'll do 2009 as well right now, while I've got all the shells etc. open
[13:31:34] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674018 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-serve2009.codfw.wmnet with OS bookworm
[13:32:44] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674025 (klausman)
[13:33:02] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674027 (klausman)
[14:03:56] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674185 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-serve2009.codfw.wmnet with OS bookworm completed:...
[14:05:24] 2009 now also on containerd and back on cluster
[14:05:45] nice
[14:06:13] so only 2007 and 2008 remaining, and then the ctrl vms
[14:56:36] Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10674446 (JArguello-WMF)
[15:01:50] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674487 (klausman)
[15:02:17] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674497 (klausman)
[15:10:10] moving 2007 to containerd
[15:10:52] klausman: before serviceops repools codfw we should run httpbb towards the codfw svc to make sure that the isvcs work (as in, they reply correctly to HTTP :D)
[15:11:00] can you check it later on?
[15:12:51] will do
[15:22:10] elukey: hmm. I think the SVCs don't work atm, because of the depool, e.g.: curl "https://inference.svc.codfw.wmnet:30443/v1/models/ruwiki-damaging:predict" -X POST -d '{"rev_id": 137258595}' -H "Host: ruwiki-damaging.revscoring-editquality-damaging.wikimedia.org" -H "Content-Type: application/json" --http1.1
[15:22:14] curl: (7) Failed to connect to inference.svc.codfw.wmnet port 30443: No route to host
[15:22:31] i.e. the svc addresses are not reachable on the IP level
[15:25:07] yeah but it shouldn't be because of the depool
[15:25:39] https://config-master.wikimedia.org/pybal/codfw/inference shows that some ml-serve nodes are inactive (probably the reimage left them in that way)
[15:26:16] if you re-run your curl command with "ml-serve2001.codfw.wmnet" instead of the inference svc, does it work?
[15:26:25] sec
[15:27:12] yeah, that did the trick
[15:28:34] didn't we have the same issue in staging?
[15:28:43] I have a vague sense of deja-vu
[15:29:29] but I don't recall what the issue was
[15:29:49] ahhhh the VLAN moves!
[15:29:52] of course!
[15:30:55] the LVS host needs to have a leg in all VLANs to be able to mangle packets correctly
[15:31:19] we moved the staging nodes as well, but IIRC you followed up on the LVS
[15:31:22] do you recall what you did?
[15:32:23] I think they just needed enabling.
[15:32:40] I just confctl enabled 2002 and currently httpbb is running
[15:32:56] and just concluded with 0 errors
[15:33:40] But I'll see what I did about LVS last time
[15:34:58] I checked ipvsadm on lvs2013
[15:35:02] TCP 10.2.1.63:30443 wrr -> 10.192.7.24:30443 Route 1 1 1 -> 10.192.48.175:30443 Route 1 0 1
[15:35:25] so only two ips, the last is 2008
[15:35:36] and I guess the first one is 2002
[15:35:44] Apparently when IPs change, pybal needs a restart
[15:36:33] the first IP is ml-serve2001, that got vlan-moved
[15:36:34] 2025-03-05 15:21:59 topranks we need a restart if the IPs for hostnames have changed?
[15:36:36] 2025-03-05 15:32:17 sukhe topranks: no but pybal needs to be restarted to reprogram IPVS for the changed IP.
[15:36:58] sigh
[15:43:31] at this point we may just move 2008 to containerd and ask traffic to restart pybal
[15:43:49] sgtm
[15:50:22] klausman: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131033
[15:50:51] they are going to pool codfw back tomorrow
[15:50:57] Ack.
[15:51:01] +1'd
[15:52:01] shall I cordon 2008 to prep?
[15:52:26] sure
[15:53:03] really unfortunate for eqiad, we cannot really restart pybal every time
[15:53:27] we could probably depool eqiad, do the reimages (some of them), restart pybal, check, repool
[15:53:37] like 4 at a time
[15:53:56] yeah, that may be the best approach
[15:59:19] 2008 is ready for the reimage, are you doing it?
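The root cause discussed above boils down to: IPVS keeps whatever realserver IPs were programmed when pybal started, so a host whose IP changed in a VLAN move leaves a stale entry until pybal is restarted. A small illustrative helper for spotting such entries; the hostname mapping and the "new" 10.192.0.99 address below are hypothetical, not taken from the log:

```python
def stale_ipvs_entries(programmed_ips, current_ips_by_host):
    """Return programmed IPVS realserver IPs that match no current host IP.

    programmed_ips: IPs seen in the IPVS table (e.g. from `ipvsadm -Ln`)
    current_ips_by_host: hostname -> IP as currently resolved/configured
    """
    live = set(current_ips_by_host.values())
    # Anything IPVS still routes to that no pooled host owns is stale.
    return sorted(ip for ip in programmed_ips if ip not in live)
```

With the two realserver IPs from the log and a hypothetical post-move address for ml-serve2001, the old 10.192.7.24 entry would be flagged as stale.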
[15:59:29] Sure, I can do it
[15:59:36] 2007 is kinda failing sigh
[16:02:19] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674772 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-serve2008.codfw.wmnet with OS bookworm
[16:08:52] elukey: aaand I got a netbox failure
[16:08:55] https://phabricator.wikimedia.org/P74417
[16:09:43] I already tried "retry" once, no joy
[16:11:37] during move-vlan?
[16:12:28] yeah so during the push of the new IP probably
[16:14:35] yeah
[16:14:48] I could try skip, but I don't want to make things worse
[16:15:56] I need to jump into a meeting, can you try to ping either Cathal or Arzhel?
[16:16:02] will do
[16:32:23] ok, DNS problem resolved, continuing with reinstall (and homer runs)
[16:32:35] ah nice! What did you do to fix it?
[16:33:01] Poked Cathal. Apparently the last IP in a subnet got deleted, a manual step was missed, and he did that step for me
[16:33:17] ah right, perfect :)
[16:33:58] (PS1) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[16:36:03] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10675003 (gkyziridis) **New version of edit-check service handling batches** Implement logic for handling an array of requests on the Kserve leve...
[16:40:45] (CR) CI reject: [V:-1] inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[16:42:28] 2007 done
[16:47:54] (PS2) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[16:50:10] (PS3) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[17:07:25] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10675135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-serve2008.codfw.wmnet with OS bookworm completed:...
[17:10:47] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10675158 (klausman)
[17:15:45] elukey: pybal has restarted, 2007 is currently down/depooled
[17:16:09] you can repool it!
[17:16:31] ack
[17:22:30] 2007/8 uncordoned, 7 repooled, httpbb tests all ok
[17:23:04] \o/
[17:23:22] https://config-master.wikimedia.org/pybal/codfw/inference shows that we are missing 2009 -> 2011
[17:23:45] oh, that's odd
[17:27:58] fixed! weight was 0 for some reason
[17:29:57] same for https://config-master.wikimedia.org/pybal/eqiad/inference but probably better to do that tomorrow :)
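The "weight was 0 for some reason" problem above can be caught mechanically by scanning the config-master host list. A sketch, assuming the one-Python-literal-dict-per-line format (e.g. `{'host': '...', 'weight': 10, 'enabled': True}`) that the config-master pybal pages expose; treat that exact format and key names as an assumption:

```python
import ast

def underweighted_hosts(pybal_config: str):
    """Return hosts in a pybal host list that are disabled or have weight 0."""
    flagged = []
    for line in pybal_config.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and comments
        entry = ast.literal_eval(line)  # each line is a Python-literal dict
        if not entry.get("enabled", True) or entry.get("weight", 0) == 0:
            flagged.append(entry["host"])
    return flagged
```

Run against a fetched copy of https://config-master.wikimedia.org/pybal/codfw/inference, this would have surfaced the 2009-2011 entries before the missing backends were noticed by hand.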