[07:35:58] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (elukey) @jbond I ran `systemctl reset-failed kafkatee.service` since the unit is marked as masked, IIRC we use only the `kafkatee-webrequest` unit in t...
[08:34:48] <wikibugs>	 Machine-Learning-Team: [WikiGPT] Use moderation API from OpenAI - https://phabricator.wikimedia.org/T329058 (isarantopoulos) Open→Resolved
[08:34:50] <wikibugs>	 Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (isarantopoulos)
[08:35:28] <wikibugs>	 Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (isarantopoulos) Open→Resolved
[08:35:31] <wikibugs>	 Machine-Learning-Team, Epic: WikiGPT Experiment - https://phabricator.wikimedia.org/T328494 (isarantopoulos)
[09:48:24] <elukey>	 hello folks
[09:48:39] <elukey>	 I am going to start the upgrade to k8s 1.23 of ml-serve-codfw
[09:48:47] <klausman>	 \o
[09:48:58] <klausman>	 tmux shoulder-surf ok?
[09:50:12] <wikibugs>	 Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (Aklapper) @isarantopoulos: Could you answer my last comment, please? Thanks in advance!
[09:50:17] <elukey>	 klausman: o/
[09:50:19] <elukey>	 sure
[09:50:26] <elukey>	 so as first step, just to be sure
[09:50:33] <elukey>	 1) downtime the whole cluster
[09:50:37] <elukey>	 2) wipe etcd
[09:50:58] <elukey>	 3) kick off the reimage of all etcd nodes
[09:51:06] <elukey>	 after this I'll start the upgrade cookbook
[09:51:16] <elukey>	 the wipe is not really necessary but it will clean up the pods etc..
[09:51:29] <klausman>	 Aye.
[09:53:12] <elukey>	 klausman: in the meantime, could you depool codfw from the inference discovery endpoint?
[09:53:17] <wikibugs>	 Machine-Learning-Team: Create repository for WikiGPT - https://phabricator.wikimedia.org/T329028 (isarantopoulos) >>! In T329028#8592862, @Aklapper wrote: > Related, are there also plans to create a dedicated Phabricator project tag for this codebase?  Not at this point as this was just some POC work done.
[09:53:18] <elukey>	 just to be sure
[09:53:34] <klausman>	 will do
[09:54:02] <isaranto>	 o/
[09:54:57] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye
[09:55:12] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2002.codfw.wmnet with OS bullseye
[09:55:40] <elukey>	 kicked off all reimages for ml-etcd2* nodes on cumin1001
[09:56:00] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2003.codfw.wmnet with OS bullseye
[09:56:12] <elukey>	 Janis already worked on this procedure, in theory it should end up with a brand new cluster
[09:56:56] <klausman>	 *grrrr* I hate conftool
[10:00:58] <klausman>	 I think I have done the right thing, but conftool is still a mystery to me
[10:01:46] <elukey>	 did you follow the wikitech docs?
[10:01:50] <elukey>	 If so you should be good
[10:04:09] <jayme>	 godspeed elukey :)
[10:06:01] <klausman>	 elukey: I tried to follow it, but it's hard to know what the specific magic words for things like "cluster" and "pool" are
[10:06:54] <elukey>	 klausman: sure, but in https://wikitech.wikimedia.org/wiki/DNS/Discovery it is all explained
[10:07:14] <klausman>	 https://wikitech.wikimedia.org/wiki/Conftool#Show_pool_status I was looking at this
[10:07:30] <elukey>	 jayme: <3
[10:13:23] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye executed with errors: - ml-etcd2001...
[10:14:59] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2003.codfw.wmnet with OS bullseye executed with errors: - ml-etcd2003...
[10:16:00] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2002.codfw.wmnet with OS bullseye executed with errors: - ml-etcd2002...
[10:29:22] <elukey>	 elukey@ml-etcd2001:~$ etcdctl -C https://$(hostname -f):2379 cluster-health
[10:29:25] <elukey>	 member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
[10:29:28] <elukey>	 member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
[10:29:31] <elukey>	 member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
[10:29:34] <elukey>	 cluster is healthy
[10:29:34] <elukey>	 I had to set the cluster status to "new" manually
[10:29:36] <elukey>	 finally
[10:29:40] <elukey>	 I'll add this to the docs
[10:32:31] <elukey>	 klausman: I created a root session on cumin1001 named "T330669"
[10:32:57] <klausman>	 attached!
[10:33:30] <klausman>	 Huh, I didn't know about P{}
[10:33:56] <elukey>	 it is a cumin-specific query to select the puppet backend, basically
[10:34:17] <elukey>	 the idea is that we do ml-serve200[2-8] reimages manually, so we can do them in parallel
[10:34:18] <klausman>	 aaah, I wondered what it stood for (what it _does_ is obvious :))
[10:34:30] <elukey>	 and the cookbook will do only 2001
[10:34:39] <klausman>	 want me to do the rest of the workers?
[10:35:03] <elukey>	 we can split, but let's start only after the control plane
[10:35:10] <klausman>	 Of course
[10:36:01] <elukey>	 I already downtimed all the nodes previously, going to disable puppet on 2002-8 and stop kube*
[10:36:10] <klausman>	 Ack.
[10:37:22] <elukey>	 ok now merging the puppet change
[10:37:38] <elukey>	 that is https://gerrit.wikimedia.org/r/c/operations/puppet/+/892482
[10:39:25] <klausman>	 Do we have to use a more specific cookbook than sre.hosts.reimage for the workers?
[10:40:20] <elukey>	 nono reimage is good
[10:40:47] <elukey>	 it is what we use in the upgrade one, but atm spicerack/cookbooks cannot launch more than one of them at once
[10:41:17] <klausman>	 Aye. `bullseye` is the default for --os, right?
[10:55:30] <elukey>	 it should be mandatory IIRC
[10:55:54] <klausman>	 yeah, I asked in -sre and got helped :)
[10:56:08] <elukey>	 ah super
[10:57:15] <klausman>	 2001 is doing its first Puppet run *drums fingers*
[11:03:41] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (jbond) In progress→Resolved >  Is there a reason to change it for this particular use case? (To better understand what's happening) no i think...
[11:29:45] * isaranto lunch
[11:32:51] <wikibugs>	 Machine-Learning-Team, API-Portal: Add documentation about LiftWing to the API Portal - https://phabricator.wikimedia.org/T325759 (Ameisenigel)
[11:46:16] <klausman>	 elukey: Spotted this on ml-serve-2007: https://phabricator.wikimedia.org/P44898
[11:46:48] <klausman>	 Happened on three separate days around the same time, which is super weird.
[11:47:34] <elukey>	 weird indeed
[11:48:02] <klausman>	 I don't think it's actionable yet, but I'll keep checking the machine every few days, see if it continues after today
[11:48:13] <elukey>	 ack!
[11:48:27] <elukey>	 so ml-serve2001's reimage has kicked off
[11:48:46] <klausman>	 ack, saw the IPMI stuff just now
[11:48:47] <elukey>	 I'll do 2002-2004, if you want to get lunch break go ahead
[11:49:11] <klausman>	 I can do them, got the commands all set up (and not hungry yet)
[11:50:06] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve2002.codfw.wmnet with OS bullseye
[11:50:31] <klausman>	 Alright, you already started :D
[11:50:58] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve2003.codfw.wmnet with OS bullseye
[11:51:10] <elukey>	 klausman: then please do 2005->2008 :)
[11:51:39] <klausman>	 right now?
[11:51:41] <klausman>	 Ok!
[11:51:50] <elukey>	 yep!
[11:52:00] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-serve2004.codfw.wmnet with OS bullseye
[11:52:01] <elukey>	 I mean if you have time, otherwise np
[11:52:50] <elukey>	 we can kick them off and come back later
[11:53:01] <elukey>	 they will take a bit 
[11:53:38] <klausman>	 Argh, I don't have the pw repo, need to reclone that
[11:53:55] <elukey>	 after all nodes are up we'd need to start from https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Apply_RBAC_rules_and_PSPs and proceed with the rest of the admin_ng settings
[11:54:05] <elukey>	 if you want to do them lemme know
[11:55:39] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ml-serve2008.codfw.wmnet with OS bullseye
[11:55:44] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ml-serve2007.codfw.wmnet with OS bullseye
[11:55:50] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ml-serve2006.codfw.wmnet with OS bullseye
[11:56:06] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ml-serve2005.codfw.wmnet with OS bullseye
[11:56:26] * elukey afk for a bit
[12:20:58] <jbond>	 hi all are you aware of the bgp alert https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr1-codfw&service=BGP+status (cc elukey )
[12:21:18] <jbond>	 i should say alerts, it's on both crs in codfw
[12:21:47] <jbond>	 likely also related to this https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs2009&service=PyBal+backends+health+check
[12:21:58] <jbond>	 "Servers ml-serve2001.codfw.wmnet, ml-serve2004.codfw.wmnet, ml-serve2002.codfw.wmnet, ml-serve2008.codfw.wmnet are marked down but pooled"
[12:22:18] <elukey>	 jayme: hi! Yes we are upgrading the cluster to k8s 1.23, but it should be only codfw
[12:22:40] <elukey>	 until the calico pods are up we'll see the alerts
[12:22:51] <elukey>	 hopefully it will be fixed in, say, 2hrs max
[12:22:57] <jayme>	 wrong ping I suppose
[12:22:57] <jbond>	 elukey: ack thanks
[12:23:09] <elukey>	 jayme: yes sorry :)
[12:23:14] <jayme>	 np :)
[12:28:01] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve2002.codfw.wmnet with OS bullseye completed: - ml-serve2002 (**PASS**)...
[12:30:11] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve2003.codfw.wmnet with OS bullseye completed: - ml-serve2003 (**PASS**)...
[12:31:41] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ml-serve2008.codfw.wmnet with OS bullseye completed: - ml-serve2008 (**PASS*...
[12:34:11] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ml-serve2005.codfw.wmnet with OS bullseye completed: - ml-serve2005 (**PASS*...
[12:38:34] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ml-serve2006.codfw.wmnet with OS bullseye completed: - ml-serve2006 (**PASS*...
[12:39:09] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-serve2004.codfw.wmnet with OS bullseye completed: - ml-serve2004 (**PASS**)...
[12:40:20] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ml-serve2007.codfw.wmnet with OS bullseye completed: - ml-serve2007 (**WARN*...
[12:41:49] <klausman>	 All hosts except 2001 are done
[12:42:06] <klausman>	 I'll run the rbac/policy sync once it is
[12:42:32] <elukey>	 klausman: already done :)
[12:42:44] <klausman>	 Always ahead of me :)
[12:42:55] <elukey>	 John asked about the alerts etc.. so I moved once some nodes were up
[12:43:30] <elukey>	 klausman: you can do the certmanager ones 
[12:43:33] <elukey>	 I stopped at istio
[12:44:12] <klausman>	 https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#cert-manager These?
[12:44:38] <elukey>	 yep
[12:45:07] <elukey>	 I am rolling out https://gerrit.wikimedia.org/r/c/operations/puppet/+/892949
[12:45:20] <elukey>	 I forgot to add it, without it the istio-cni binaries on the workers are 1.9.5
[12:46:37] <klausman>	 +1'd for completeness
[12:47:07] <elukey>	 thanks
[12:47:17] <elukey>	 going to apt-get install -y istio-cni as well on all workers
[12:47:45] <klausman>	 cert-manager syncs all done (including ns-certs)
[12:47:55] <elukey>	 you can also do knative-serving-crds and knative-serving
[12:48:24] <klausman>	 on it
[12:49:06] <klausman>	 and done as well
[12:50:50] <elukey>	 I think that we are up :)
[12:51:07] <klausman>	 https://alerts.wikimedia.org/?q=alertname%3DPyBal%20backends%20health%20check this is still firing.
[12:51:16] <elukey>	 they just recovered
[12:51:26] <klausman>	 ah, typical :D
[12:51:41] <elukey>	 on lvs2009, those are 2010 I think
[12:51:52] <elukey>	 should be fixed as well in a bit
[12:51:55] <klausman>	 yep. should recover soon, too
[12:52:27] <klausman>	 Should I pool codfw again?
[12:52:33] <klausman>	 (inference, that is)
[12:52:46] <elukey>	 klausman: we need to deploy the model servers
[12:52:59] <klausman>	 right. forgot the kserve crds
[12:53:14] <klausman>	 well, charts, not crds
[12:53:15] <elukey>	 that one as well (should not have crds as separate release)
[12:53:34] <elukey>	 please go ahead with kserve :)
[12:54:14] <klausman>	 and done
[12:54:59] <elukey>	 perfect, now it is the turn of model servers.. do you want to do it / split / etc..?
[12:55:23] <elukey>	 In theory we shouldn't see the latency alerts anymore (but something may fire for the biggest ns-es)
[12:55:27] <klausman>	 I don't think I've done it recently, so now's as good a time as any
[12:55:54] <elukey>	 ack!
[12:56:13] <klausman>	 just need to find the docs, to make sure I get it right :)
[12:58:13] <klausman>	 doing articletopic first because abc :)
[13:06:16] <klausman>	 revscoring-editquality-damaging failed deployment, investigating
[13:06:36] <elukey>	 what did it say?
[13:07:10] <klausman>	 https://phabricator.wikimedia.org/P44900
[13:07:44] <klausman>	 is g+r the root cause?
[13:08:01] <klausman>	 Nah, all those files in the dir are g+r
[13:08:34] <elukey>	 nono I think the kserve webhook was overwhelmed
[13:08:35] <elukey>	 https://grafana.wikimedia.org/d/Rvs1p4K7k/kserve?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-kubernetes_namespace_controller=kserve&var-kubernetes_namespace_queue_proxy=revscoring-articletopic&viewPanel=2&from=now-1h&to=now
[13:08:40] <elukey>	 we'd need to scale it up later on
[13:08:43] <elukey>	 can you retry?
[13:08:47] <klausman>	 sec
[13:09:13] <klausman>	 Failed again
[13:09:35] <elukey>	 NAME                                         READY   STATUS   RESTARTS     AGE
[13:09:38] <elukey>	 kserve-controller-manager-645d68955f-4brjh   0/1     Error    1 (4m ago)   15m
[13:09:41] <elukey>	 sigh
[13:10:10] <elukey>	 now running again
[13:10:25] <elukey>	 yeah it needs to be scaled up
[13:10:29] <elukey>	 klausman: one last retry please
[13:10:57] <klausman>	 Still erroring
[13:11:06] <elukey>	 can you try reverted?
[13:11:15] <elukey>	 just to see if it is the number of pods or something else
[13:11:23] <klausman>	 Does it runrunning
[13:11:26] <klausman>	 gah
[13:11:30] <klausman>	 running -rev
[13:11:34] <klausman>	 that worked fine
[13:11:52] <elukey>	 interesting
[13:12:13] <klausman>	 and all those pods are running
[13:12:28] <klausman>	 Trying goodfaith
[13:13:08] <elukey>	 wait a sec
[13:13:09] <klausman>	 Also fails
[13:13:20] <elukey>	 yes same issue
[13:13:29] <elukey>	 I am increasing the kserve controller pods to two
[13:14:31] <elukey>	 klausman: let's retry
[13:14:43] <klausman>	 goodfaith or damaging?
[13:14:51] <elukey>	 damaging
[13:15:32] <klausman>	 nope, failed again
[13:15:53] <elukey>	 but I saw some pods coming up
[13:15:59] <klausman>	 yeah, same
[13:16:24] <klausman>	 Is there a way to make helm not remove the failed stuff, so we might see what breaks?
[13:16:29] <wikibugs>	 Machine-Learning-Team, Add-Link, Growth-Team: Northern Luri Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T330616 (kevinbazira) The lrcwiki pipeline was failing during the spark job with the message `poinValueError: RDD is empty`.  Thanks to @Bugreporter, we found out t...
[13:17:54] <elukey>	 klausman: what failed stuff?
[13:18:48] <klausman>	 So it seems helm starts _some_ pods but thinks they don't work right and then removes them
[13:19:12] <elukey>	 worked now
[13:19:17] <klausman>	 did you just deploy again? I see stuff starting
[13:19:19] <elukey>	 I scaled the deployment of the controller to 4 replicas
[13:19:24] <elukey>	 yep yep
[13:19:31] <klausman>	 So it was a resource issue on the controller.
[13:19:59] <elukey>	 the error in your paste pointed to timeouts contacting the webhook, and the metrics showed some pressure.. I think that kserve 0.9 needs more replicas
[13:20:13] <elukey>	 at least for big namespaces
[13:20:20] <klausman>	 Well, at least the controller pods are not huge
[13:20:56] <klausman>	 ok to sync goodfaith?
[13:21:03] <elukey>	 done, worked as well
[13:21:07] <klausman>	 aaand again. ahead of me
[13:21:34] <elukey>	 so it was indeed an issue with bursts of HTTP requests towards the kserve controller's webhook
[13:21:49] <klausman>	 The timeouts seemed a bit aggressive, but maybe that's just me
[13:21:52] <elukey>	 the new k8s stack is way faster to spin up pods
[13:21:57] <elukey>	 indeed yes
[13:22:14] <klausman>	 yeah, spinup is faster, feels like 2x-3x
[13:23:22] <elukey>	 latency alerts fired, but I think it is expected with so many pods and calls to the control plane
[13:23:40] <klausman>	 Yep. I expect them to go away as things settle
[13:24:59] <klausman>	 I'm gonna go have lunch and go for a walk before the VS meeting, bbiab
[13:25:24] <elukey>	 ack!
[13:25:34] <elukey>	 I am finishing up the deployments and repooling codfw
[13:27:55] <klausman>	 Latency alerts are gone
[13:29:26] <elukey>	 Sending to inference.svc.codfw.wmnet...
[13:29:27] <elukey>	 PASS: 102 requests sent to inference.svc.codfw.wmnet. All assertions passed.
[13:29:31] <elukey>	 \o/ \o/ \o/
[13:30:30] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-serve-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330669 (elukey) ` elukey@deploy1002:~$ httpbb --host inference.svc.codfw.wmnet --https_port 30443 /srv/deployment/httpbb-tests/liftwing/test_liftwing_production.yaml  Sending to inference.svc.codfw.wmne...
[13:34:59] <elukey>	 and repooled :)
[13:35:04] <elukey>	 going to take a walk 
[13:37:05] <wikibugs>	 Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) a:elukey Cluster upgraded!
[13:47:41] <wikibugs>	 (Abandoned) Ilias Sarantopoulos: Deployment script examples [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/881899 (owner: Ilias Sarantopoulos)
[13:47:59] <isaranto>	 congrats teeeam!
[14:38:10] <wikibugs>	 Machine-Learning-Team: Upgrade the ml-serve-eqiad cluster to k8s 1.23 - https://phabricator.wikimedia.org/T330758 (elukey)
[14:50:43] <wikibugs>	 (PS1) Ilias Sarantopoulos: (WIP) - Create a translation endpoint between LiftWing/ORES [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/892998 (https://phabricator.wikimedia.org/T330414)
[14:52:00] <elukey>	 isaranto: fastapi is awesome :)
[14:56:33] <elukey>	 ok the ml-serve-eqiad upgrade is prepped as well
[14:56:39] <elukey>	 we should be able to do it tomorrow
[14:56:47] <elukey>	 to complete the migration
[15:03:52] <elukey>	 klausman - meeting :)
[15:04:05] <klausman>	 I am in the VS meeting
[15:04:10] <elukey>	 ack ack
[16:03:05] * elukey taking a break
[17:36:25] <isaranto>	 wrapping up folks, cu tomorrow!
[17:37:02] <elukey>	 o/
[17:44:18] * elukey afk as well! o/
[17:51:39] <wikibugs>	 Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 74 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdforrester-WMF)
[17:57:25] <wikibugs>	 Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 73 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[17:57:45] <wikibugs>	 Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 73 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)