[07:35:58] Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (elukey) @jbond I ran `systemctl reset-failed kafkatee.service` since the unit is marked as masked, IIRC we use only the `kafkatee-webrequest` unit in t...
[08:34:48] Machine-Learning-Team: [WikiGPT] Use moderation API from OpenAI - https://phabricator.wikimedia.org/T329058 (isarantopoulos) Open→Resolved
[08:34:50] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (isarantopoulos)
[08:35:28] Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (isarantopoulos) Open→Resolved
[08:35:31] Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (isarantopoulos)
[09:48:24] hello folks
[09:48:39] I am going to start the upgrade to k8s 1.23 of ml-serve-codfw
[09:48:47] \o
[09:48:58] tmux shoulder-surf ok?
[09:50:12] Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (Aklapper) @isarantopoulos: Could you answer my last comment, please? Thanks in advance!
[09:50:17] klausman: o/
[09:50:19] sure
[09:50:26] so as first step, just to be sure
[09:50:33] 1) downtime the whole cluster
[09:50:37] 2) wipe etcd
[09:50:58] 3) kick off the reimage of all etcd nodes
[09:51:06] after this I'll start the upgrade cookbook
[09:51:16] the wipe is not really necessary but it will clean up the pods etc..
[09:51:29] Aye.
[09:53:12] klausman: in the meantime, could you depool codfw from the inference discovery endpoint?
[09:53:17] Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (isarantopoulos) >>! In T329028#8592862, @Aklapper wrote: > Related, are there also plans to create a dedicated Phabricator project tag for this codebase?
Not at this point as this was just some POC work done.
[09:53:18] just to be sure
[09:53:34] will do
[09:54:02] o/
[09:54:57] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye
[09:55:12] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2002.codfw.wmnet with OS bullseye
[09:55:40] kicked off all reiamges for ml-etcd2* nodes on cumin1001
[09:55:43] *reimages
[09:56:00] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2003.codfw.wmnet with OS bullseye
[09:56:12] Janis already worked on this procedure, in theory it should end up with a brand new cluster
[09:56:56] *grrrr* I hate conftool
[10:00:58] I think I have done the right thing, but conftool is still a mystery to me
[10:01:46] did you follow the wikitech docs?
[10:01:50] If so you should be good
[10:04:09] godspeed elukey :)
[10:06:01] elukey: I tried to follow it, but it's hard to know what the specific magic words for things like "cluster" and "pool" are
[10:06:54] klausman: sure, but in https://wikitech.wikimedia.org/wiki/DNS/Discovery it is all explained
[10:07:14] https://wikitech.wikimedia.org/wiki/Conftool#Show_pool_status I was looking at this
[10:07:30] jayme: <3
[10:13:23] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye executed with errors: - ml-etcd2001...
[10:14:59] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2003.codfw.wmnet with OS bullseye executed with errors: - ml-etcd2003...
[10:16:00] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2002.codfw.wmnet with OS bullseye executed with errors: - ml-etcd2002...
[10:29:22] elukey@ml-etcd2001:~$ etcdctl -C https://$(hostname -f):2379 cluster-health
[10:29:25] member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
[10:29:28] member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
[10:29:31] member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
[10:29:34] cluster is healthy
[10:29:34] I had to set the cluster status to "new" manually
[10:29:36] finally
[10:29:40] I'll add this to the docs
[10:32:31] klausman: I created a root session on cumin1001 named "T330669"
[10:32:57] attached!
[10:33:30] Huh, I didn't know about P{}
[10:33:56] it is cumin specific query to select the puppet backend basically
[10:34:17] the idea is that we do ml-serve200[2-8] reimages manually, so we can do them in parallel
[10:34:18] aaah, I wondered what it stood for (what it _does_ is obvious :))
[10:34:30] and the cookbook will do only 2001
[10:34:39] want me to do the rest of the workers?
[10:35:03] we can split, but let's start only after the control plan
[10:35:06] *plane
[10:35:10] Of course
[10:36:01] I already downtimed all the nodes previously, going to disable puppet on 2002-8 and stop kube*
[10:36:10] Ack.
[10:37:22] ok now merging the puppet change
[10:37:38] that is https://gerrit.wikimedia.org/r/c/operations/puppet/+/892482
[10:39:25] Do we have to use a more specific cookbook than sre.hosts.reimage for the workers?
[10:40:20] nono reimage is good
[10:40:47] it is what we use in the upgrade one, but atm spicerack/cookbooks cannot launch 1+ of them
[10:41:17] Aye. `bullseye` is the default for --os, right?
[10:55:30] it should be mandatory IIRC
[10:55:54] yeah, I asked in -sre and got helped :)
[10:56:08] ah super
[10:57:15] 2001 is doing its first Puppet run *drums fingers*
[11:03:41] Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (jbond) In progress→Resolved > Is there a reason to change it for this particular use case? (To better understand what's happening) no i think...
[11:29:45] * isaranto lunch
[11:32:51] Machine-Learning-Team, API-Portal: Add documentation about LiftWing to the API Portal - https://phabricator.wikimedia.org/T325759 (Ameisenigel)
[11:46:16] elukey: Spotted this on ml-serve-2007: https://phabricator.wikimedia.org/P44898
[11:46:48] Happened on three separate days around the same time, which is super weird.
[11:47:34] weird indeed
[11:48:02] I don't think it's actionable yet, but I'll keep checking the machine every few days, see if it continues after today
[11:48:13] ack!
[11:48:27] so ml-serve2001's reimage has kicked off
[11:48:46] ack, saw the IPMI stuff just now
[11:48:47] I'll do 2002-2004, if you want to get lunch break go ahead
[11:49:11] I can do them, got the commands all set up (and not hungry yet)
[11:50:06] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve2002.codfw.wmnet with OS bullseye
[11:50:31] Alright, you already started :D
[11:50:58] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve2003.codfw.wmnet with OS bullseye
[11:51:10] klausman: then please do 2005->2008 :)
[11:51:39] right now?
[11:51:41] Ok!
[11:51:50] yep!
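The worker split agreed above (ml-serve2001 left to the upgrade cookbook, 2002-2004 and 2005-2008 reimaged manually in parallel) can be sketched in a few lines of Python. This is only an illustration: `expand_range` is a hypothetical helper that handles the simple `prefix[a-b]` form used in the channel, not cumin's actual query grammar, and the operator names are just dictionary keys.

```python
import re


def expand_range(pattern: str) -> list[str]:
    """Expand 'prefix[a-b]suffix' into a list of names.

    Hypothetical helper for illustration only; it does not implement
    the full host-selection syntax of any real tool.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracket range present: treat the pattern as a single host.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    return [f"{prefix}{n}{suffix}" for n in range(int(start), int(end) + 1)]


# The seven workers reimaged by hand (2001 is handled by the cookbook).
workers = expand_range("ml-serve200[2-8]")

# Split as agreed in the channel: 2002-2004 and 2005-2008.
assignments = {"elukey": workers[:3], "klausman": workers[3:]}

for operator, hosts in assignments.items():
    print(operator, ", ".join(hosts))
```

Slicing keeps the two batches disjoint, so no host gets a reimage started twice.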
[11:52:00] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve2004.codfw.wmnet with OS bullseye
[11:52:01] I mean if you have time, otherwise np
[11:52:50] we can kick them off and come back later
[11:53:01] they will take a bit
[11:53:38] Argh, I don't have the pw repo, need to reclone that
[11:53:55] after all nodes are up we'd need to start from https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Apply_RBAC_rules_and_PSPs and proceed with the rest of the admin_ng settings
[11:54:05] if you want to do them lemme know
[11:55:39] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ml-serve2008.codfw.wmnet with OS bullseye
[11:55:44] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ml-serve2007.codfw.wmnet with OS bullseye
[11:55:50] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ml-serve2006.codfw.wmnet with OS bullseye
[11:56:06] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ml-serve2005.codfw.wmnet with OS bullseye
[11:56:26] * elukey afk for a bit
[12:20:58] hi all are you aware of the bgp alert
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr1-codfw&service=BGP+status (cc elukey )
[12:21:18] i should say alerts, it's on both crs in codfw
[12:21:47] likely also related to this https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs2009&service=PyBal+backends+health+check
[12:21:58] "Servers ml-serve2001.codfw.wmnet, ml-serve2004.codfw.wmnet, ml-serve2002.codfw.wmnet, ml-serve2008.codfw.wmnet are marked down but pooled"
[12:22:18] jayme: hi! Yes we are upgrading the cluster to k8s 1.23, but it should be only codfw
[12:22:40] until calico pods are not up we'll see the alerts
[12:22:51] hopefully be fixed in say 2hrs max
[12:22:57] wrong ping I suppose
[12:22:57] elukey: ack thanks
[12:23:09] jayme: yes sorry :)
[12:23:14] np :)
[12:28:01] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve2002.codfw.wmnet with OS bullseye completed: - ml-serve2002 (**PASS**)...
[12:30:11] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve2003.codfw.wmnet with OS bullseye completed: - ml-serve2003 (**PASS**)...
[12:31:41] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ml-serve2008.codfw.wmnet with OS bullseye completed: - ml-serve2008 (**PASS*...
[12:34:11] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ml-serve2005.codfw.wmnet with OS bullseye completed: - ml-serve2005 (**PASS*...
[12:38:34] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ml-serve2006.codfw.wmnet with OS bullseye completed: - ml-serve2006 (**PASS*...
[12:39:09] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve2004.codfw.wmnet with OS bullseye completed: - ml-serve2004 (**PASS**)...
[12:40:20] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ml-serve2007.codfw.wmnet with OS bullseye completed: - ml-serve2007 (**WARN*...
[12:41:49] All hosts except 2001 are done
[12:42:06] I'll run the rbac/policy sync once it is
[12:42:32] klausman: already done :)
[12:42:44] Always ahead of me :)
[12:42:55] John asked about the alerts etc.. so I moved once some nodes were up
[12:43:30] klausman: you can do the certmanager ones
[12:43:33] I stopped at istio
[12:44:12] https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#cert-manager These?
[12:44:38] yep
[12:45:07] I am rolling out https://gerrit.wikimedia.org/r/c/operations/puppet/+/892949
[12:45:20] I forgot to add it, without it the istio-cni binaries on the workers are 1.9.5
[12:46:37] +1'd for completeness
[12:47:07] thanks
[12:47:17] going to apt-get install -y istio-cni as well on all workers
[12:47:45] cert-manager syncs all done (including ns-certs)
[12:47:55] you can also do knative-serving-crds and knative-serving
[12:48:24] on it
[12:49:06] and done as well
[12:50:50] I think that we are up :)
[12:51:07] https://alerts.wikimedia.org/?q=alertname%3DPyBal%20backends%20health%20check this is still firing.
[12:51:16] they just recovered
[12:51:26] ah, typical :D
[12:51:41] on lvs2009, those are 2010 I think
[12:51:52] should be fixed as well in a bit
[12:51:55] yep. should recover soon, too
[12:52:27] Should I pool codfw again?
[12:52:33] (inference, that is)
[12:52:46] klausman: we need to deploy the model servers
[12:52:59] right. forgot the kserve crds
[12:53:14] well, charts, not crds
[12:53:15] that one as well (should not have crds as separate release)
[12:53:34] please go ahead with kserve :)
[12:54:14] and done
[12:54:59] perfect, now it is the turn of model servers.. do you want to do it / split / etc..?
[12:55:23] In theory we shouldn't see anymore the latency alerts (but something may fire for the biggest ns-es)
[12:55:27] I don't think I've done it recently, so now's a good time as any
[12:55:54] ack!
[12:56:13] just need to find the docs, to make sure I get it right :)
[12:58:13] doing articletopic first because abc :)
[13:06:16] revscoring-editquality-damaging failed deployment, investigating
[13:06:36] what did it say?
[13:07:10] https://phabricator.wikimedia.org/P44900
[13:07:44] is g+r the root cause?
[13:08:01] Nah, all those files in the dir are g+r
[13:08:34] nono I think the kserve webhook was overwhelmed
[13:08:35] https://grafana.wikimedia.org/d/Rvs1p4K7k/kserve?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-kubernetes_namespace_controller=kserve&var-kubernetes_namespace_queue_proxy=revscoring-articletopic&viewPanel=2&from=now-1h&to=now
[13:08:40] we'd need to scale it up later on
[13:08:43] can you retry?
[13:08:47] sec
[13:09:13] Failed again
[13:09:35] NAME READY STATUS RESTARTS AGE
[13:09:38] kserve-controller-manager-645d68955f-4brjh 0/1 Error 1 (4m ago) 15m
[13:09:41] sigh
[13:10:10] now running again
[13:10:25] yeah it needs to be scaled up
[13:10:29] klausman: one last retry please
[13:10:57] Still erroring
[13:11:06] can you try reverted?
[13:11:15] just to see if it is the number of pods or something else
[13:11:23] Does it runrunning
[13:11:26] gah
[13:11:30] running -rev
[13:11:34] that worked fine
[13:11:52] interesting
[13:12:13] and all those pods are running
[13:12:28] Trying goodfaith
[13:13:08] wait a sec
[13:13:09] Also fails
[13:13:20] yes same issue
[13:13:29] I am increasing the kserve controller pods to two
[13:14:31] klausman: let's retry
[13:14:43] goodfaith or damaging?
[13:14:51] damaging
[13:15:32] nope, failed again
[13:15:53] but I saw some pods coming up
[13:15:59] yeah, same
[13:16:24] Is there a way to make helm not remove the failed stuff, so we might see what breaks?
[13:16:29] Machine-Learning-Team, Add-Link, Growth-Team: Northern Luri Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T330616 (kevinbazira) The lrcwiki pipeline was failing during the spark job with the message `poinValueError: RDD is empty`. Thanks to @Bugreporter, we found out t...
[13:17:54] klausman: what failed stuff?
[13:18:48] So it seems helm starts _some_ pods but thinks they don't work right and then removes them
[13:19:12] worked now
[13:19:17] did you just deploy again?
I see stuff starting
[13:19:19] I scaled the deployment of the controller to 4 nodes
[13:19:24] yep yep
[13:19:31] So it was a resource issue on the controller.
[13:19:59] the error in your paste pointed to timeouts contacting the webhook, and the metrics showed some pressure.. I think that kserve 0.9 needs more replicas
[13:20:13] at least for big namespaces
[13:20:20] Well, at least the controller pods are not huge
[13:20:56] ok to sync goodfaith?
[13:21:03] done, worked as well
[13:21:07] aaand again. ahead of me
[13:21:34] so it was indeed an issue with bursts of http requests towards the kserve controller's webhook
[13:21:49] The timeouts seemed a bit aggressive, but maybe that's just me
[13:21:52] the new k8s stack is way faster to spin up pods
[13:21:57] indeed yes
[13:22:14] yeah, spinup is faster, feels like 2x-3x
[13:23:22] latency alerts fired, but I think it is expected with so many pods and calls to the control plane
[13:23:40] Yep. I expect them to go away as things settle
[13:24:59] I'm gonna go have lunch and go for a walk before the VS meeting, bbiab
[13:25:24] ack!
[13:25:34] I am finishing up the deployments and repooling codfw
[13:27:55] Latency alerts are gone
[13:29:26] Sending to inference.svc.codfw.wmnet...
[13:29:27] PASS: 102 requests sent to inference.svc.codfw.wmnet. All assertions passed.
[13:29:31] \o/ \o/ \o/
[13:30:30] Machine-Learning-Team: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (elukey) ` elukey@deploy1002:~$ httpbb --host inference.svc.codfw.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing/test_liftwing_production.yaml Sending to inference.svc.codfw.wmne...
[13:34:59] and repooled :)
[13:35:04] going to take a walk
[13:37:05] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) a:elukey Cluster upgraded!
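The repool above was gated on the httpbb run at 13:29 passing. A minimal sketch of checking that summary line from a script follows; it assumes the summary always has exactly the shape pasted in the channel, which is the only sample available here, and `parse_pass_line` is a hypothetical helper rather than anything httpbb provides. Any line that does not match is treated as "not a pass" instead of guessing at other output formats.

```python
import re
from typing import Optional, Tuple

# Pattern derived solely from the single summary line pasted in the log.
PASS_LINE = re.compile(
    r"^PASS: (\d+) requests sent to (\S+)\. All assertions passed\.$"
)


def parse_pass_line(line: str) -> Optional[Tuple[int, str]]:
    """Return (request_count, host) for a passing summary, else None."""
    match = PASS_LINE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


summary = (
    "PASS: 102 requests sent to inference.svc.codfw.wmnet. "
    "All assertions passed."
)
result = parse_pass_line(summary)
if result is not None:
    count, host = result
    print(f"{host}: {count} requests passed, safe to consider repooling")
else:
    print("no passing summary found, do not repool")
```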
[13:47:41] (Abandoned) Ilias Sarantopoulos: Deployment script examples [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/881899 (owner: Ilias Sarantopoulos)
[13:47:59] congrats teeeam!
[14:38:10] Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (elukey)
[14:50:43] (PS1) Ilias Sarantopoulos: (WIP) - Create a translation endpoint between LiftWing/ORES [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/892998 (https://phabricator.wikimedia.org/T330414)
[14:52:00] isaranto: fastapi is awesome :)
[14:56:33] ok the ml-serve-eqiad upgrade is prepped as well
[14:56:39] we should be able to do it tomorro
[14:56:42] *tomorrow
[14:56:47] to complete the migration
[15:03:52] klausman - meeting :)
[15:04:05] I am in the VS meeting
[15:04:10] ack ack
[16:03:05] * elukey taking a break
[17:36:25] wrapping up folks, cu tomorrow!
[17:37:02] o/
[17:44:18] * elukey afk as well! o/
[17:51:39] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 74 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdforrester-WMF)
[17:57:25] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 73 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[17:57:45] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 73 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
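With the eqiad upgrade prepped for the next day, the `etcdctl ... cluster-health` check pasted at 10:29 is likely to be repeated. A small sketch of verifying that output against the expected member list is below. It assumes the output looks exactly like the lines in this log; `check_cluster_health` and the `EXPECTED` set are illustrative names, and only the healthy-member wording seen here is parsed, so anything else simply counts as missing.

```python
import re

# Matches only the healthy-member wording that appears in the log.
MEMBER_OK = re.compile(
    r"^member (\S+) is healthy: got healthy result from (\S+)$"
)


def check_cluster_health(output: str, expected_urls: set) -> bool:
    """True if every expected member reports healthy and the summary agrees."""
    healthy_urls = set()
    summary_ok = False
    for raw in output.splitlines():
        line = raw.strip()
        match = MEMBER_OK.match(line)
        if match:
            healthy_urls.add(match.group(2))
        elif line == "cluster is healthy":
            summary_ok = True
    return summary_ok and expected_urls <= healthy_urls


# Output copied from the 10:29 paste, without the IRC timestamps.
SAMPLE = """\
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
"""

EXPECTED = {
    "https://ml-etcd2001.codfw.wmnet:2379",
    "https://ml-etcd2002.codfw.wmnet:2379",
    "https://ml-etcd2003.codfw.wmnet:2379",
}

print("etcd ok" if check_cluster_health(SAMPLE, EXPECTED) else "etcd NOT ok")
```

Comparing against an explicit expected set catches the case where the cluster reports healthy but a reimaged node has not rejoined.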