[07:03:37] hello folks!
[07:23:20] 10Machine-Learning-Team, 10Patch-For-Review: Add 4 new Kubernetes worker nodes to ml-serve-eqiad - https://phabricator.wikimedia.org/T306545 (10elukey) 05Open→03Stalled The new nodes are in rows E/F, which have a different network configuration. Task blocked until T306649 is solved.
[07:23:46] 10ORES, 10Beta-Cluster-Infrastructure, 10Machine-Learning-Team (Active Tasks): Upgrade deployment-ores01 host to Buster - https://phabricator.wikimedia.org/T306053 (10elukey)
[07:24:54] 10Machine-Learning-Team: Use the scipy wheel instead of python3-scipy for ORES - https://phabricator.wikimedia.org/T305441 (10elukey) 05Stalled→03Declined In the parent task I was able to run ORES on Buster without using the scipy wheel, but leveraging the new version provided by Debian.
[07:24:57] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade ORES to Debian Buster - https://phabricator.wikimedia.org/T303801 (10elukey)
[07:26:12] 10ORES, 10Beta-Cluster-Infrastructure, 10Machine-Learning-Team (Active Tasks): Upgrade deployment-ores01 host to Buster - https://phabricator.wikimedia.org/T306053 (10elukey) We created deployment-ores02 as part of T303801 with Debian Buster; once we finish the task we should be able to delete deployment-...
[07:31:24] (03CR) 10Elukey: outlink: handle http bad request (033 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/785112 (https://phabricator.wikimedia.org/T306029) (owner: 10AikoChou)
[07:35:40] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[07:35:45] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add an envoy proxy sidecar to Kserve inference pods - https://phabricator.wikimedia.org/T294414 (10elukey) 05Open→03Resolved Done as part of T297612
[09:11:58] elukey: welcome back Luca! :) can we have a short meeting today? just wanna give you an update on what I'm doing and ask for suggestions about the project with Diego
[09:12:44] aiko: thanks! Sure, we can do it anytime
[09:16:27] kevinbazira_: o/ you can deploy the new models if you want
[09:17:07] elukey thanks for the merge. deploying now ...
[09:38:14] both eqiad and codfw deployments have been completed successfully.
[09:38:25] checking pods now ...
[09:43:54] 2/3 new pods are up and running:
[09:43:54] NAME                                                              READY   STATUS                  RESTARTS   AGE
[09:43:54] ukwiki-damaging-predictor-default-gx9s6-deployment-64d7c55sknpg  3/3     Running                 0          14m
[09:43:54] ukwiki-goodfaith-predictor-default-rksvf-deployment-898755c7trr  3/3     Running                 0          12m
[09:43:54] viwiki-reverted-predictor-default-247wh-deployment-55f8b9cgcdkd  0/3     Init:CrashLoopBackOff   3          8m38s
[09:44:02] hmm...
[09:45:01] investigating why the viwiki-reverted-predictor has a CrashLoopBackOff issue
[09:54:13] hmmm... the storage-initializer seems unable to connect to thanos-swift.
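A minimal sketch of how the failing init container could be inspected with kubectl. The pod name is the one from the listing above; the namespace is a placeholder, since it is not mentioned in the log.

  # Show the pod's events, including why the init container keeps restarting
  kubectl -n <namespace> describe pod viwiki-reverted-predictor-default-247wh-deployment-55f8b9cgcdkd

  # The storage-initializer is the KServe init container that downloads the model;
  # its logs should contain the underlying connection error
  kubectl -n <namespace> logs viwiki-reverted-predictor-default-247wh-deployment-55f8b9cgcdkd -c storage-initializer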
[09:54:13] botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=reverted%2Fviwiki%2F20220214192315%2F&encoding-type=url"
[09:56:28] checking the model, it seems to have been uploaded successfully:
[09:56:29] kevinbazira@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/reverted/viwiki/20220214192315/
[09:56:29] 2022-02-14 19:23  11416152  s3://wmf-ml-models/reverted/viwiki/20220214192315/model.bin
[10:22:31] very weird
[10:22:32] socket.gaierror: [Errno -3] Temporary failure in name resolution
[10:24:00] ahhh kevinbazira I think it is my bad, I added 4 new nodes to eqiad but they are still not fully working from the network perspective
[10:24:09] I'll ban them from the cluster and re-create the pod
[10:24:50] great ... I was here wondering whether anything (credentials, IP address, etc.) had changed with thanos-swift
[10:25:16] no no, viwiki was unlucky and got scheduled on the new nodes that don't work
[10:25:33] yeah it should work now kevinbazira, can you re-check?
[10:26:43] yep, it's now up and running. thanks for your help:
[10:26:43] NAME                                                              READY   STATUS    RESTARTS   AGE
[10:26:44] viwiki-reverted-predictor-default-247wh-deployment-55f8b9csfhdz  3/3     Running   0          67s
[10:27:12] super, the new nodes are now banned so no new pods can get scheduled on them
[10:27:30] great. thank you!
[10:32:15] np!!
[10:32:23] going afk in a bit for lunch, ttl!
[10:35:44] the usual geniuses - https://twitter.com/taffoofficial/status/1518503701077450752?s=20&t=zV5sYW3j95mtmzBWtuED8A
[10:35:50] I missed it yesterday
[11:37:00] For those who don't know: April 25 is Liberation Day in Italy: https://en.wikipedia.org/wiki/Liberation_Day_(Italy)
[12:55:32] 10ORES, 10Beta-Cluster-Infrastructure, 10Machine-Learning-Team (Active Tasks): Upgrade deployment-ores01 host to Buster - https://phabricator.wikimedia.org/T306053 (10Jdforrester-WMF) >>! In T306053#7879658, @elukey wrote: > We created deployment-ores02 as part of T303801 with Debian Buster, once we'll finis...
[13:03:40] Morning!
[13:03:45] o/
[13:12:05] klausman: o/ a lot of italian politician still don't like the date, very sad :(
[13:12:11] *politicians
[13:12:45] elukey: the last 30+ years have been wild in that regard. And not in a good way
[13:13:40] yep
[13:14:21] Though sometimes I wonder if maybe I am overly sensitive to these matters due to being German
[13:14:47] Then again, the earlier countered, the better.
[13:14:55] I think that we are both sensitive to the matter for good reasons
[13:19:21] When I first saw your mention up there, I thought it was the anniversary of the Bologna Centrale bombing (but that was August 2). I was quite shocked to learn of it in school in the mid-80s.
[13:20:30] And then when I came across it randomly again in the early 2000s, I learned about the whole P2 story.
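For reference, a generic way the node "ban" and pod re-creation above could be done with plain kubectl; the team may well use an SRE cookbook or wrapper for this instead, and both the node name and the namespace below are placeholders.

  # Mark a not-yet-ready node unschedulable so no new pods land on it
  kubectl cordon <new-ml-serve-node>.eqiad.wmnet

  # Delete the stuck pod; its deployment re-creates it on a healthy node
  kubectl -n <namespace> delete pod viwiki-reverted-predictor-default-247wh-deployment-55f8b9cgcdkd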
[13:20:42] yeah, another dark piece of history, and sadly not the only one that happened in the 70s and 80s
[13:21:01] yes, Italian history is full of shame if one looks deep enough :D
[13:21:29] "Anni di piombo" (the Years of Lead)
[13:21:48] yep :)
[13:22:16] As someone who had until then only thought of Italy as the country some of my classmates were from and where you'd go on holiday, it was quite the contrast
[13:25:56] klausman: if you are ok with it I'd deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/786264/2/helmfile.d/admin_ng/values/ml-serve-eqiad/calico-values.yaml
[13:26:24] I had a chat with Cathal and in theory it should work, but of course Calico can behave in its own way :D
[13:26:42] SGTM
[13:26:49] (also +1d for posterity)
[13:26:54] super
[13:27:41] we have all 4 new nodes in the new rows E/F, IIUC their ToRs are layer-3 switches that can do BGP as well
[13:27:57] (iBGP between leaf and spines, eBGP between spines and crx routers)
[13:28:13] so some calico pods need to peer with the ToRs
[13:30:18] all calico pods seem up and running
[13:30:49] Very nice.
[13:31:41] Plenty of (good) BGP messages over in -ops
[13:32:37] yep! \o/
[13:36:27] I drank some local alcoholic apple cider last night and I have regrets.
[13:36:36] I am too old for this
[13:37:18] ahahahah
[13:37:33] Cider is especially nasty since all that fructose messes with ethanol metabolism, and thus promotes hangovers.
[13:37:46] whhhyyyyyyyyyy
[13:38:17] Nature likes to mess with us humans.
[13:38:35] And we bipedal monkeys are just too damn curious
[13:41:43] I also have to point out that *some* (one pint) local apple cider may cause some side effects, while *some* (4 pints of apple cider) can cause others :D
[13:42:05] Also true
[13:42:17] Many variables, needs further testing
[13:42:33] I agree
[13:48:28] klausman: I see that the ml-staging-ctrl nodes have their disks full, due to the kubelet spamming the logs (I guess because of the missing VIP). If you have time we can try to add it (we can ping Valentin in Traffic and see if he is available to support us)
[13:49:05] Sounds like a good idea, yeah. I guess Puppet re-enabled the apiserver after I disabled it for disk-filling reasons
[13:49:28] oh wait, the kubelet, not the apiserver. Either way, yes, we should proceed.
[13:49:48] I have my 1:1 with Chris in 40m, so I may have to dip out a bit
[13:50:13] sure sure
[13:52:25] I'll just have to find my notes on that part of the setup again
[13:53:15] Basically just edit hieradata/common/service.yaml to move ml-staging-ctrl to lvs_setup?
[13:53:51] I'm not sure if that is enough to trigger the lo-intf setup
[13:54:46] Oh nvm, the IP is already config'd on the machines
[13:55:48] So, these steps, then: https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers ?
[13:57:12] 10Machine-Learning-Team, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway), 10Platform Team Workboards (Platform Engineering Reliability): Proposal: add a per-service rate limit setting to API Gateway - https://phabricator.wikimedia.org/T295956 (10DAbad)
[13:58:44] klausman: in theory yes, if all the rest is done; are the conftool configs etc. ok?
[13:58:55] I _think_ so
[14:01:08] Hrm. Maybe not.
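Going back to the calico rollout above, a couple of sanity checks that could confirm the new peerings; the kube-system namespace and the availability of calicoctl on the control plane are assumptions, not confirmed in the log.

  # Are all calico-node pods healthy after the values change?
  kubectl -n kube-system get pods -o wide | grep calico-node

  # If calicoctl is available, this lists the BGP sessions; the peerings
  # towards the row E/F ToRs should show up as Established
  sudo calicoctl node status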
[14:01:38] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/conftool-data/node/codfw.yaml ok, ctrl is there
[14:02:58] Though no discovery entries for anything ML- -- I think that is deliberate
[14:03:55] yep that is good
[14:14:35] klausman: mmm I don't see ml-staging-ctrl.svc.codfw.wmnet in our authoritative DNS records
[14:14:44] I see it reserved in netbox
[14:14:58] are the zones defined in base puppet?
[14:15:08] Or is it all in Netbox?
[14:15:24] the svc records are still manual, they need to be added to the dns repo
[14:15:30] + authdns-update etc.
[14:15:40] https://wikitech.wikimedia.org/wiki/LVS#DNS_changes_(svc_zone_only)
[14:16:15] aaah, I have a local patch that never made it to review
[14:17:03] Um.
[14:18:19] should be the only thing missing, then we can ping Valentin
[14:18:20] Oh, me dumb-dumb
[14:18:33] The etcd stuff is there, but nothing for ctrl
[14:21:29] https://gerrit.wikimedia.org/r/c/operations/dns/+/786320
[14:22:37] +1ed
[14:22:40] Should I wait for Valentin's lgtm?
[14:22:48] for the DNS change you can proceed
[14:23:17] then it needs an authdns-update once merged
[14:23:18] This needs no further steps for now (akin to a puppet-merge), right?
[14:23:35] (and sre.dns.netbox)
[14:23:36] yeah the authdns-update is sufficient
[14:24:05] the dns netbox step should be ok since you reserved the IP address a while ago, but it doesn't hurt to verify it again
[14:24:13] Ok
[14:27:58] authdns-update is done, netbox cookbook still running
[14:28:30] No changes to deploy, as expected.
[14:29:14] ok to submit the hiera change?
[14:29:47] going to 1:1 now, bbiab
[14:35:24] I am still not getting any A record for ml-staging-ctrl.svc.codfw.wmnet though
[14:36:18] ah no, now I see it
[14:37:08] but only if I use @ns1.wikimedia.org from my laptop; on ml-serve-ctrl2001 it returns NXDOMAIN
[14:38:57] ah, it has been negatively cached; the SOA record is cached for 3600 seconds
[14:41:39] going afk for a little break, bbl
[14:55:54] SOA is 1h, but the A records have a 1h TTL as well, I think
[14:56:18] yep, but IIUC the NXDOMAIN cache TTL is the same as the SOA's
[14:57:09] anyway, I think that we can roll out the new VIP
[14:57:25] let's write down a plan here
[14:58:19] https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers
[15:01:23] so the lvs::configuration class states that the two lvs nodes for low-traffic VIPs are 'lvs2009', 'lvs2010'
[15:01:26] in codfw
[15:01:58] 2010 being the secondary and 2009 the primary (please double-check it)
[15:02:27] so we need to run puppet on those, restart pybal on 2010 first, check ipvsadm, and then do the same with the primary
[15:02:33] does that make sense, klausman?
[15:05:38] sorry, phone call.
[15:06:06] Yes, sounds good.
[15:06:20] (after submitting change 786319
[15:06:23] )
[15:06:51] ack, so let's contact Valentin on #traffic and explain the plan
[15:07:13] how did you determine prim/sec?
[15:07:40] there is a profile::pybal::primary flag for it in hiera
[15:07:46] for 2009 I mean
[15:08:00] nope, for 2010 :)
[15:08:02] profile::pybal::primary: false :)
[15:08:14] 'lvs2009' => 'low-traffic',
[15:08:16] 'lvs2010' => 'secondary',
[15:08:18] the default is
[15:08:19] hieradata/role/common/lvs/balancer.yaml:profile::pybal::primary: true
[15:08:40] (what I pasted is from modules/lvs/manifests/configuration.pp)
[15:08:52] ah nice, TIL
[15:08:59] The comment mentions that it's technically redundant, but hey, two sources!
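A rough sketch of the rollout and verification steps outlined above, secondary balancer first and then the primary; the exact procedure is on the wikitech LVS page, and the VIP below is a placeholder.

  # On lvs2010 (secondary) first, then lvs2009 (primary), one at a time:
  sudo run-puppet-agent
  sudo systemctl restart pybal.service
  sudo ipvsadm -L -n | grep -A3 <ml-staging-ctrl VIP>   # new service and realservers listed?

  # From any host, confirm the svc record resolves after authdns-update:
  dig +short ml-staging-ctrl.svc.codfw.wmnet @ns1.wikimedia.org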
[15:09:06] yep :)
[15:09:23] Alright, let's move the rest to #traffic while I submit that change.
[15:09:47] klausman: ping Valentin before submitting
[15:09:50] just to be sure
[15:10:00] too late :)
[15:10:03] because once the change is merged we have to move quickly
[15:10:12] ok then let's ping Valentin
[15:10:18] explaining the plan
[15:40:17] klausman: so now that we have the VIP, let's clean up the root partition of the ml-staging-ctrl nodes
[15:40:45] Should we just ditch the log files entirely? There's not going to be anything interesting in there.
[15:40:53] definitely
[15:41:36] We should probably also revisit the fact that syslog, messages, daemon.log and user.log were all spammed
[15:42:08] yep there is probably a missing rsyslog filter rule somewhere
[15:42:57] the kubelet is still spamming Failed to list *v1.Node: Unauthorized
[15:43:09] maybe the system accounts are still not deployed
[15:43:13] I think I know why
[15:43:29] https://gerrit.wikimedia.org/r/c/labs/private/+/775823 <- This, maybe? Not sure.
[15:44:09] I think it is the kubelet itself that is not able to authenticate to the kube-api
[15:44:22] but I see the account in /etc/kubernetes/infrastructure-users
[15:44:40] Oh, and you probably already added that istio-cni thing in labs
[15:45:01] I don't recall, but in theory that should only cause problems when we spin up pods
[15:45:16] this seems to be one step before
[15:45:50] trying to restart the kube-api just in case
[15:48:49] klausman: so afaics the kubelet token set in profile::kubernetes::infrastructure_users is not the one that I see in hieradata/role/common/ml_k8s/master/staging.yaml
[15:48:54] in the private repo I mean
[15:48:58] could it be a token mismatch?
[15:49:08] Possible c&p error at some point
[15:49:19] same thing for hieradata/role/common/kubernetes/staging/worker.yaml
[15:49:43] and hieradata/role/common/kubernetes/staging/master.yaml
[15:49:45] So what/where exactly are you comparing?
[15:49:57] ah no, sorry, forget the last two lines
[15:50:06] those are not related to use
[15:50:09] *us
[15:50:18] Ok :)
[15:50:19] so I am comparing the kubelet token in hieradata/common/profile/kubernetes.yaml
[15:50:52] with the one stated in hieradata/role/common/ml_k8s/master/staging.yaml and hieradata/role/common/ml_k8s/worker/staging.yaml
[15:50:57] in the private repo
[15:51:24] I... may not have been aware that they need to be synced
[15:51:48] well they need to be, the kubelets need to know the token to authenticate to the kube-api
[15:52:30] I was under the impression that where secrets are involved, all involved parties get them from the same location
[15:53:44] Should I fix them or do you want to do it?
[15:54:24] ah yes, that is the ideal config setup, then there is Puppet :D
[15:54:31] yep please go ahead with the fix
[15:54:56] I've also copied syslog.1 (~5G) off of 2001 and am compressing it
[15:55:17] you can rm it if you want
[15:56:15] profile::kubernetes::master::scheduler_token: and system:kube-scheduler: probably also need to match?
[15:57:05] as well as profile::kubernetes::node::kubeproxy_token: and system:kube-proxy:
[15:57:38] That leaves the rsyslog and calico tokens in hieradata/role/common/ml_k8s/master/staging.yaml --- do they need to be synced to another location as well?
[15:58:15] in theory yes; let's check what we do for the ml-serve clusters, now I have a doubt
[15:58:46] Also, do the proxy/kubelet/scheduler tokens need to match between worker and master or are those different access roles?
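One possible way to spot the mismatch in a checkout of the private puppet repo; the grep pattern is only illustrative, since the exact hiera key name for the kubelet token isn't quoted above.

  # Do the token values actually match across the files being compared?
  grep -n 'token' hieradata/common/profile/kubernetes.yaml
  grep -n 'token' hieradata/role/common/ml_k8s/master/staging.yaml \
                  hieradata/role/common/ml_k8s/worker/staging.yaml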
[15:59:00] they should be the same
[16:00:23] yes yes, all of them need to be synced
[16:00:53] (1:1 brb)
[16:16:16] elukey: logspam stopped after also restarting the controller-manager
[16:16:45] I am tempted to reboot the two nodes tho, who knows what other components have not yet picked up the fixed secrets
[16:30:33] Doing staggered reboots now
[16:31:45] +1 seems good!
[16:34:46] yeah, and now it's spamming again :(
[16:35:21] the reboot cookbook is also waiting for some Icinga checks
[16:37:13] the ml-staging nodes are also not working (the kubelets I mean)
[16:38:30] The kubelet token is definitely the same across all three files
[16:38:39] But I have not restarted anything on the workers.
[16:38:47] (or done puppet runs there)
[16:39:01] did you restart both kube-apis etc.?
[16:39:13] I've only rebooted 2002
[16:40:00] restarting the apiserver and kubelet on 2001 just now didn't change the 2002 logspam
[16:40:07] maybe we are missing pod security policies
[16:40:25] let's see
[16:41:39] in theory I don't see anything related to the kubelet in the relevant helmfile
[16:42:45] ahhh yes
[16:42:50] /etc/kubernetes/kubelet_config has the right token, as far as I can tell
[16:42:52] we need the helmfile rbac
[16:43:36] # ClusterRoleBindings
[16:43:36] ## wmf-node-authorization adds the kubelet users group ("system:nodes")
[16:43:39] ## to the system:node ClusterRole so that the kubelet's can register nodes
[16:43:42] ## with the API. See:
[16:43:44] ## https://kubernetes.io/docs/reference/access-authn-authz/node/#migration-considerations
[16:43:48] okok
[16:43:53] so `helmfile -e ml-staging-codfw -l name=rbac-rules sync`?
[16:44:09] IIRC we still don't have any ml-staging-codfw config set though
[16:44:14] in deployment-charts
[16:44:18] so we need to add it first
[16:44:23] right
[16:44:52] ack, so we can definitely do it tomorrow morning :)
[16:45:03] I am going to step afk now! have a good rest of the day folks
[16:45:28] \o
[16:45:44] I'll stop the kubelet on 2002 and 2001
[17:02:32] My head still hurts
[17:04:41] Take two aspirin and call us again tomorrow ;)
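For the follow-up planned for tomorrow, roughly: add an ml-staging-codfw environment to the admin_ng helmfile config in deployment-charts, then run the sync quoted above. The path below is an assumption about where the repo is checked out on the deployment server.

  # Once the ml-staging-codfw environment exists in deployment-charts:
  cd /srv/deployment-charts/helmfile.d/admin_ng      # assumed checkout path
  helmfile -e ml-staging-codfw -l name=rbac-rules sync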