[07:19:56] good morning :) [07:26:47] wow lovely [07:26:48] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlserve&from=now-7d&to=now [07:30:47] it seems related to a validating webhook config [07:37:15] I've restarted the kube-api server on ctrl2002 and the latency seems to be going down [07:37:25] from the logs it is not clear what was causing the mess [07:39:42] ah interesting there was a series of 504s thrown by the api [07:43:03] seemed like one of the kube-api servers was stuck somehow [07:53:57] -- [07:54:36] kevinbazira: o/ one thing that I wanted to chat with you about is https://phabricator.wikimedia.org/T301878, which in theory is a key piece to allow us to move traffic from ORES to LiftWing [07:54:57] (so maybe not completely blocking the MVP, but surely one of the very next steps after) [07:55:40] I tried to summarize in the task my understanding of the problem, but surely we'll need to dig more into it and decide how/what to do [07:56:09] is it something that can be on your radar? (Aiko would probably be interested too) [08:01:53] elukey: o/ [08:01:53] Thank you for suggesting the ideas in the ticket. I will be happy to dig into the Python function / HTTP POST idea whenever we are ready to experiment with it in the MVP. [08:03:34] ack perfect! [10:07:12] folks I think that we'll need to migrate ORES to python3.7 [10:07:49] the LTS support for Debian 9 ends in June 2022, and there is little chance in my opinion that we'll be ready to decom ORES at that point [10:08:18] the Lift Wing MVP will probably be ready but the whole migration will require time [10:12:37] 10Machine-Learning-Team: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (10elukey) p:05Triage→03High [10:30:17] 10Machine-Learning-Team: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (10MoritzMuehlenhoff) We can find some pragmatic middle ground here. 
E.g. to minimise changes in ORES and how it gets deployed, Infrastructure Foundations SREs can provide a build of Python 3.5 (along wi... [10:37:51] 10Machine-Learning-Team: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (10elukey) The dependencies are all in the `ores::base` puppet class, with python3.5 on Buster we could reimage all the nodes easily. Once the new component is ready we can add a new VM in the ORES clou... [10:38:15] ok very nice news from SRE, they will maintain a python3.5 component on Buster to ease the migration for us [10:38:33] if the tests in cloud go well, upgrading to Buster should be relatively easy [10:47:46] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (10elukey) Thanks to Janis' patch I am now able to see a pod with the istio-proxy sidecar, together with CNI logs in the kubelet (to inject the iptables rule... [11:05:16] Phew that is a relief to hear. (but I am also saddened by the necessity) [11:06:47] I think that we could also upgrade to python 3.7 if we wanted, it shouldn't be that scary [11:07:16] with all the automated tests that Aiko created for the cloud instance it should be easy to spot if any model misbehaves during testing [11:07:40] the ORES deprecation will be long so in my opinion we should be confident in deploying to it (or making changes) [11:07:46] Good point. I would just be wary of the "For 3.7 you need to update dependency X" problem [11:07:52] we cannot keep going with the idea that we don't touch it [11:09:17] yes definitely changing deps will require some time [11:09:22] (in the 3.7 use case) [11:29:21] aiko: o/ I saw your change but it revealed a problem in the template, if you check the CI output it is empty (like the change wasn't doing anything) [11:32:33] elukey: o/ where can I check the CI output? 
[11:33:16] Ohhh I found it [11:34:18] https://integration.wikimedia.org/ci/ <- somewhere there, right? [11:35:08] Oh, and linked in Gerrit, of course :facepalm: [11:35:20] exactly yes [11:43:42] $custom_predictor.image_version -> so I should use "image_version" instead of "version"? [11:44:55] aiko: exactly yes, but probably it is less confusing to change the templating to just use "version", what do you think? [11:45:47] elukey: yep to be consistent with $generic_predictor.version [11:46:30] aiko: ack, I am testing a small change, will give you the link in a sec [11:49:15] elukey: no worries I'll send a patch for that :) [11:49:27] basically https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/770895 [11:54:03] elukey: I see!! thanks Luca [11:55:50] elukey: I have a question. What is yaml under .fixtures used for? [11:56:24] aiko: those are basically like test fixtures, you can use helm3 template to see how the chart renders some values [11:56:51] so if you change something that modifies the behavior of the chart, you'll see changes in CI for those fixtures [11:57:10] locally you can test those via [11:57:17] helm3 template -f 'charts/kserve-inference/.fixtures/revscoring_inference_no_transformer.yaml' 'charts/kserve-inference/' [11:57:29] (the helm3 binary can be downloaded from upstream) [11:59:33] elukey: got it. Thanks :) [12:00:11] I am going out for lunch, will merge the change to unblock you when I am back [12:00:18] and we can deploy to codfw if you want! [12:01:38] elukey: yes sounds good! see you later Luca 👋 [12:02:24] * elukey lunch! [13:03:00] Morning! 
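(The fixture mechanism described above can be pictured with a minimal values file. The key names below are illustrative assumptions, not the kserve-inference chart's actual schema; only the change from `image_version` to `version` comes from the discussion:)

```yaml
# Hypothetical sketch of a fixture under charts/kserve-inference/.fixtures/.
# Key names and the image value are assumptions for illustration; the real
# schema is defined by the chart's templates.
inference:
  predictors:
    - name: enwiki-goodfaith
      custom_predictor:
        image: example-predictor-image   # hypothetical image name
        version: "1.0.0"                 # after the patch, "version" replaces "image_version"
```

(Rendering a fixture locally with `helm3 template -f <fixture> charts/kserve-inference/` shows the same manifests that CI diffs when the chart changes.)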
[13:46:20] chrisalbon: o/ [13:46:30] I opened a task for the OS upgrade on the ORES nodes [13:46:49] I know that you wanted to start your day with this info [13:56:04] lol [13:57:08] Exactly what I wanted to hear, major ORES updates [13:59:11] chrisalbon: jokes aside, the current OS is not supported anymore from June onwards, but SRE will help us backport python3.5 to Buster [13:59:29] so in theory we'll be able to keep the current python deps [13:59:31] without changing them [13:59:37] only the underlying OS will change [14:05:20] aiko: rebased https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/770886, let's see if now there is a diff [14:08:27] drain + uncordon is totally ok [14:08:40] sorry this one was for serviceops :D [14:10:11] aiko: it seems to be working this time! Do you want to deploy? [14:11:43] elukey: \o/ yes! [14:18:54] aiko: so I am testing on the ml-serve-eqiad cluster, if you don't mind let's do it only on codfw [14:19:32] so I am going to merge your change [14:19:36] then you can start from https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_deploy [14:19:37] elukey: ok! [14:21:12] kevinbazira: I saw your comment in the code review, since we are not adding new models but just changing what already exists we can proceed [14:22:51] (this one is good so we can test the extended output etc..) [14:25:42] aiko: do you want to do the deployment via meet? [14:26:35] elukey: yes that's a really good idea [14:27:06] aiko: ack, can you open a meeting? [14:27:40] elukey: yep, wait a sec [14:30:11] elukey: here https://meet.google.com/prx-evwv-xvh [14:45:03] elukey: thanks for the clarification. [14:49:15] kevinbazira: np! We just tested Aiko's change, worked nicely :) [14:52:54] chrisalbon: first kubernetes deployment for Aiko, the new feature worked nicely [14:54:01] \o/ thanks Luca for your help :) [14:54:58] Great work aiko and elukey 👏👏👏 [15:12:32] Nice! Great job Aiko! 
[16:21:07] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10calbon) a:03kevinbazira [16:48:52] elukey: I have questions about pcc :D [16:48:58] sure [16:49:10] elukey: AIUI it would tell me if the whole config was invalid. That part I have solved [16:49:36] But shouldn't it _also_ tell me what changes would happen to the listed host(s) if I submitted the changes and did a puppet run? [16:50:07] Oh [string of expletives] [16:50:07] it tells you what changes yes [16:50:40] So I've been running pcc yesterday and a few times in the evening, but it would never show changes. [16:51:06] Now I did it again, and, boom, changes visible [16:51:27] Only difference: I ran it locally instead of via the webui [16:51:36] No local changes that aren't on gerrit, mind [16:52:00] Oh well, at least that is now settled. [16:52:45] I'll disable puppet on 2002 and 2003, merge the changes (after your review), force puppet on 2001, check it comes up correctly, then activate 2002 and 2003, see the cluster converge -> happy [16:52:48] Sound good? [16:53:14] yep [16:58:28] klausman: I was reviewing the change, why did you +2 ? [16:58:46] oops. [16:58:59] Somehow I thought you'd +1'd %-/ [16:59:09] Sorry! [16:59:19] I left a nit, all good, but I was confused since you asked for the review :D [16:59:58] the name of the role could benefit from a rename, you can do it after the rollout [17:00:07] puppet disabled on 2001, while I figure this out [17:06:11] Too much espresso, clearly :D [17:18:43] I haven't +1ed, it was PCC adding the result, but it looks good :D [17:18:51] Dammit. [17:19:30] alright, re-enabled puppet on 2001, doing puppet run [17:22:27] Hmmm. 
Puppet can't find the certs [17:22:39] Error: /Stage[main]/Profile::Etcd::V3/Sslcert::Certificate[_etcd-server-ssl._tcp.ml_staging_etcd.codfw.wmnet]/File[/etc/ssl/localcerts/_etcd-server-ssl._tcp.ml_staging_etcd.codfw.wmnet.crt]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/ssl/_etcd-server-ssl._tcp.ml_staging_etcd.codfw.wmnet.crt [17:22:54] Did I miss something in the private-but-not repo? [17:27:01] yeah the .crt needs to be copied into the public repo [17:27:26] Ah, right [17:27:36] under modules/profile/files/ssl [17:34:51] root@ml-staging-etcd2001:~# etcdctl cluster-health [17:34:53] member 8e9e05c52164694d is healthy: got healthy result from http://localhost:2379 [17:34:55] cluster is healthy [17:34:57] whee [17:35:03] Now to add the other two [17:38:42] nice! [17:39:09] weird. [17:39:19] the 2002 etcd can't parse the keyfile? [17:40:48] oooh, did I add a passphrased key? [17:41:15] yep. [17:41:22] welp, that's an easy fix [17:53:05] # etcdctl -C https://ml-staging-etcd2003.codfw.wmnet:2379 cluster-health|grep member.*is.healthy -c [17:53:06] 3 [17:53:08] y [17:53:10] ..ay [17:53:30] goood [17:55:38] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Create etcd cluster for ml-serve-staging k8s - https://phabricator.wikimedia.org/T302197 (10klausman) 05Open→03Resolved ` # etcdctl -C https://ml-staging-etcd2001.codfw.wmnet:2379 cluster-health member 493aa03d462725d1 i... 
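(The `etcdctl cluster-health | grep member.*is.healthy -c` check above can also be scripted. A minimal sketch, assuming the etcdctl v2 output format shown in the log; the sample text is illustrative, not a capture from the real hosts:)

```python
# Count healthy members in `etcdctl cluster-health` output.
# The assumed format is one "member <id> is healthy: ..." line per member,
# followed by an overall "cluster is healthy" summary line.

def count_healthy_members(output: str) -> int:
    """Return the number of members reported as healthy."""
    return sum(
        1
        for line in output.splitlines()
        if line.startswith("member") and "is healthy" in line
    )

# Illustrative sample modeled on the log above (not real host output).
sample = """\
member 8e9e05c52164694d is healthy: got healthy result from http://localhost:2379
member 493aa03d462725d1 is healthy: got healthy result from https://ml-staging-etcd2001.codfw.wmnet:2379
cluster is healthy"""

print(count_healthy_members(sample))  # prints 2
```

(For a three-node cluster like ml-staging, the expected count is 3, matching the `grep -c` result in the log.)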
[17:55:40] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (10klausman) [17:57:53] it is interesting that profile::etcd::v3::cluster_bootstrap: true stays true in multiple clusters [18:23:10] AIUI, it's needed to make a new cluster from scratch [18:23:28] yeah but then not sure if it needs to be kept [18:23:33] It would then be logical that it *could* be turned off after it's set up [18:23:44] yep, maybe it doesn't really count that much afterwards [18:23:52] But: if it doesn't break anything, one might just leave it on. But then why have a setting at all? [18:23:59] going to dinner, have a nice rest of the day folks! [18:24:14] enjoy! talk to you tomorrow [18:24:14] maybe it is needed only for the bootstrap case [19:13:08] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (10QChris) > I have requested a new istio repository in: > https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests Gerrit just had the new [[https:/...
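(On the `cluster_bootstrap` question above: etcd itself distinguishes forming a brand-new cluster from joining an existing one via `--initial-cluster-state` (`new` vs `existing`), and it ignores bootstrap configuration once a member's data directory has been initialized, which would explain why leaving the flag set is harmless. A hypothetical hiera sketch; the exact key placement is an assumption:)

```yaml
# Hypothetical hiera sketch for the ml-staging etcd hosts.
# The flag matters only while the cluster is being formed; etcd ignores
# bootstrap settings once the member's data directory exists.
profile::etcd::v3::cluster_bootstrap: true  # could be flipped to false once all three members have joined
```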