[06:33:06] Lift-Wing, Machine-Learning-Team: LiftWing articlecountry model logs improper json in stderr - https://phabricator.wikimedia.org/T389768#10672465 (kevinbazira) Open→In progress a: kevinbazira
[06:40:50] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:40:50] Deployment reference-need-predictor-00007-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00007-deployment - ...
[06:40:50] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:58:18] (PS1) Kevin Bazira: events: log events as JSON serialized output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130883 (https://phabricator.wikimedia.org/T389768)
[07:00:50] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[07:00:50] Deployment reference-need-predictor-00007-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00007-deployment - ...
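The task and Gerrit change above (T389768) are about making model-server events come out of stderr as properly JSON-serialized lines. A minimal sketch of that pattern, assuming a plain stdlib logger; the logger name and event fields are illustrative, not the actual inference-services code:

```python
import json
import logging
import sys

# Sketch only: the real inference-services logging setup may differ.
logger = logging.getLogger("events")
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)

def log_event(event: dict) -> str:
    """Serialize an event as a single JSON line and log it to stderr."""
    # json.dumps() guarantees parseable output (double quotes, true/false),
    # unlike logging the dict directly, which emits Python repr() text.
    line = json.dumps(event)
    logger.info(line)
    return line
```

Anything consuming the pod's stderr can then `json.loads()` each line instead of choking on single-quoted dict output.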
[07:00:50] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:16:17] (CR) Nik Gkountas: [C:+2] Optimize page collection metadata fetching with batch processing and concurrency limits [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[09:17:48] (Merged) jenkins-bot: Optimize page collection metadata fetching with batch processing and concurrency limits [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[09:31:00] \o good morning!
[10:04:00] (CR) Nik Gkountas: [C:+2] Optimize page collection metadata fetching with batch processing and concurrency limits (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1130348 (owner: Santhosh)
[11:24:26] good morning o/
[11:52:51] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10673681 (elukey)
[11:54:18] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10673682 (elukey)
[11:55:57] klausman: o/ thanks for the changeprop staging review: https://gerrit.wikimedia.org/r/1130349
[11:55:57] please deploy this change whenever you get a minute.
[12:16:06] Ack. I've poked Hugh for his ok, since Changeprop is a bit more sensitive than the APIGW
[12:21:56] okok... thanks!
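The merged recommendation-api change above combines batching with a concurrency cap. A generic sketch of that pattern; the function names and parameters here are illustrative assumptions, not the actual recommendation-api code:

```python
import asyncio

async def fetch_all(items, fetch, batch_size=50, concurrency=5):
    """Fetch metadata for items in fixed-size batches, with a cap on
    how many batches are in flight at once.

    `fetch` is a caller-supplied coroutine taking one batch of items
    (hypothetical stand-in for the real metadata-fetching call).
    """
    sem = asyncio.Semaphore(concurrency)

    async def run(batch):
        async with sem:  # at most `concurrency` batches concurrently
            return await fetch(batch)

    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # gather() preserves input order, so results line up with `items`.
    results = await asyncio.gather(*(run(b) for b in batches))
    return [r for batch_result in results for r in batch_result]
```

The semaphore is what keeps a large page collection from fanning out into an unbounded number of simultaneous upstream requests.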
[12:29:43] elukey: I'll be re-imaging 2010 in a moment or five
[12:31:49] klausman: I am reimaging 2006, was about to write in here
[12:32:02] I saw it drained and assumed you were
[12:32:05] codfw is depooled so we can proceed in parallel
[12:32:07] yes yes
[12:42:19] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10673822 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-serve2010.codfw.wmnet with OS bookworm
[13:15:58] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10673963 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-serve2010.codfw.wmnet with OS bookworm completed:...
[13:16:09] ml-serve2006 back
[13:16:19] ditto for 2010
[13:16:51] I'll do 2009 as well right now, while I've got all the shells etc. open
[13:31:34] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674018 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-serve2009.codfw.wmnet with OS bookworm
[13:32:44] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674025 (klausman)
[13:33:02] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674027 (klausman)
[14:03:56] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674185 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-serve2009.codfw.wmnet with OS bookworm completed:...
[14:05:24] 2009 now also on containerd and back on cluster
[14:05:45] nice
[14:06:13] so only 2007 and 2008 remaining, and then the ctrl vms
[14:56:36] Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10674446 (JArguello-WMF)
[15:01:50] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674487 (klausman)
[15:02:17] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674497 (klausman)
[15:10:10] moving 2007 to containerd
[15:10:52] klausman: before serviceops repools codfw we should run httpbb towards the codfw svc to make sure that the isvcs work (as in, they reply correctly to HTTP :D)
[15:11:00] can you check it later on?
[15:12:51] will do
[15:22:10] elukey: hmm. I think the SVCs don't work atm, because of the depool, e.g.: curl "https://inference.svc.codfw.wmnet:30443/v1/models/ruwiki-damaging:predict" -X POST -d '{"rev_id": 137258595}' -H "Host: ruwiki-damaging.revscoring-editquality-damaging.wikimedia.org" -H "Content-Type: application/json" --http1.1
[15:22:14] curl: (7) Failed to connect to inference.svc.codfw.wmnet port 30443: No route to host
[15:22:31] i.e. the svc addresses are not reachable on the IP level
[15:25:07] yeah but it shouldn't be because of the depool
[15:25:39] https://config-master.wikimedia.org/pybal/codfw/inference shows that some ml-serve nodes are inactive (probably the reimage left them in that way)
[15:26:16] if you re-run your curl command with "ml-serve2001.codfw.wmnet" instead of the inference svc, does it work?
[15:26:25] sec
[15:27:12] yeah, that did the trick
[15:28:34] didn't we have the same issue in staging?
[15:28:43] I have a vague sense of deja-vu
[15:29:29] but I don't recall what the issue was
[15:29:49] ahhhh the VLAN moves!
[15:29:52] of course!
[15:30:55] the LVS host needs to have a leg in all VLANs to be able to mangle packets correctly
[15:31:19] we moved the staging nodes as well, but IIRC you followed up on the LVS
[15:31:22] do you recall what you did?
[15:32:23] I think they just needed enabling.
[15:32:40] I just confctl enabled 2002 and currently httpbb is running
[15:32:56] and just concluded with 0 errors
[15:33:40] But I'll see what I did about LVS last time
[15:34:58] I checked ipvsadm on lvs2013
[15:35:02] TCP 10.2.1.63:30443 wrr -> 10.192.7.24:30443 Route 1 1 1 -> 10.192.48.175:30443 Route 1 0 1
[15:35:25] so only two ips, the last is 2008
[15:35:36] and I guess the first one is 2002
[15:35:44] Apparently when IPs change, pybal needs a restart
[15:36:33] the first IP is ml-serve2001, that got vlan-moved
[15:36:34] 2025-03-05 15:21:59 topranks we need a restart if the IPs for hostnames have changed?
[15:36:36] 2025-03-05 15:32:17 sukhe topranks: no but pybal needs to be restarted to reprogram IPVS for the changed IP.
[15:36:58] sigh
[15:43:31] at this point we may just move 2008 to containerd and ask traffic to restart pybal
[15:43:49] sgtm
[15:50:22] klausman: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131033
[15:50:51] they are going to pool codfw back tomorrow
[15:50:57] Ack.
[15:51:01] +1'd
[15:52:01] shall I cordon 2008 to prep?
[15:52:26] sure
[15:53:03] really unfortunate for eqiad, we cannot really restart pybal every time
[15:53:27] we could probably depool eqiad, do the reimages (some of them), restart pybal, check, repool
[15:53:37] like 4 at a time
[15:53:56] yeah, that may be the best approach
[15:59:19] 2008 is ready for the reimage, are you doing it?
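The root cause discussed above boils down to: IPVS keeps whatever realserver IPs were programmed when pybal started, so a host whose IP changed in a VLAN move leaves a stale entry until pybal is restarted. A small illustrative helper for spotting such entries; the hostname mapping and the "new" 10.192.0.99 address below are hypothetical, not taken from the log:

```python
def stale_ipvs_entries(programmed_ips, current_ips_by_host):
    """Return programmed IPVS realserver IPs that match no current host IP.

    programmed_ips: IPs seen in the IPVS table (e.g. from `ipvsadm -Ln`)
    current_ips_by_host: hostname -> IP as currently resolved/configured
    """
    live = set(current_ips_by_host.values())
    # Anything IPVS still routes to that no pooled host owns is stale.
    return sorted(ip for ip in programmed_ips if ip not in live)
```

With the two realserver IPs from the log and a hypothetical post-move address for ml-serve2001, the old 10.192.7.24 entry would be flagged as stale.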
[15:59:29] Sure, I can do it
[15:59:36] 2007 is kinda failing sigh
[16:02:19] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10674772 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-serve2008.codfw.wmnet with OS bookworm
[16:08:52] elukey: aaand I got a netbox failure
[16:08:55] https://phabricator.wikimedia.org/P74417
[16:09:43] I already tried "retry" once, no joy
[16:11:37] during move-vlan?
[16:12:28] yeah so during the push of the new IP probably
[16:14:35] yeah
[16:14:48] I could try skip, but I don't want to make things worse
[16:15:56] I need to jump into a meeting, can you try to ping either Cathal or Arzhel?
[16:16:02] will do
[16:32:23] ok, DNS problem resolved, continuing with reinstall (and homer runs)
[16:32:35] ah nice! What did you do to fix it?
[16:33:01] Poked Cathal. Apparently the last IP in a subnet got deleted, a manual step was missed, and he did that step for me
[16:33:17] ah right, perfect :)
[16:33:58] (PS1) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[16:36:03] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10675003 (gkyziridis) **New version of edit-check service handling batches** Implement logic for handling an array of requests on the Kserve leve...
[16:40:45] (CR) CI reject: [V:-1] inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[16:42:28] 2007 done
[16:47:54] (PS2) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[16:50:10] (PS3) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[17:07:25] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10675135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-serve2008.codfw.wmnet with OS bookworm completed:...
[17:10:47] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10675158 (klausman)
[17:15:45] elukey: pybal has restarted, 2007 is currently down/depooled
[17:16:09] you can repool it!
[17:16:31] ack
[17:22:30] 2007/8 uncordoned, 7 repooled, httpbb tests all ok
[17:23:04] \o/
[17:23:22] https://config-master.wikimedia.org/pybal/codfw/inference shows that we are missing 2009 -> 2011
[17:23:45] oh, that's odd
[17:27:58] fixed! weight was 0 for some reason
[17:29:57] same for https://config-master.wikimedia.org/pybal/eqiad/inference but probably better to do that tomorrow :)
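The "weight was 0 for some reason" problem above can be caught mechanically by scanning the config-master host list. A sketch, assuming the one-Python-literal-dict-per-line format (e.g. `{'host': '...', 'weight': 10, 'enabled': True}`) that the config-master pybal pages expose; treat that exact format and key names as an assumption:

```python
import ast

def underweighted_hosts(pybal_config: str):
    """Return hosts in a pybal host list that are disabled or have weight 0."""
    flagged = []
    for line in pybal_config.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and comments
        entry = ast.literal_eval(line)  # each line is a Python-literal dict
        if not entry.get("enabled", True) or entry.get("weight", 0) == 0:
            flagged.append(entry["host"])
    return flagged
```

Run against a fetched copy of https://config-master.wikimedia.org/pybal/codfw/inference, this would have surfaced the 2009-2011 entries before the missing backends were noticed by hand.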