[07:19:56] good morning :) [07:26:47] wow lovely [07:26:48] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlserve&from=now-7d&to=now [07:30:47] it seems related to a validating webhook config [07:37:15] I've restarted the kube-api server on ctrl2002 and the latency seems to be going down [07:37:25] from the logs it is not clear what was causing the mess [07:39:42] ah interesting there was a series of 504s thrown by the api [07:43:03] seemed like one of the kube-api servers was stuck somehow [07:53:57] -- [07:54:36] kevinbazira: o/ one thing that I wanted to chat with you about is https://phabricator.wikimedia.org/T301878, which in theory is a key piece to allow us to move traffic from ORES to LiftWing [07:54:57] (so maybe not completely blocking the MVP, but surely one of the very next steps after) [07:55:40] I tried to summarize in the task my understanding of the problem, but surely we'll need to dig more into it and decide how/what to do [07:56:09] is it something that can be on your radar? (Aiko would probably be interested too) [08:01:53] elukey: o/ [08:01:53] Thank you for suggesting the ideas in the ticket. I will be happy to dig into the Python function / HTTP POST idea whenever we are ready to experiment with it in the MVP. [08:03:34] ack perfect! [10:07:12] folks I think that we'll need to migrate ORES to python3.7 [10:07:49] the LTS support for Debian 9 ends in June 2022, and there is little chance in my opinion that we'll be ready to decom ORES at that point [10:08:18] the Lift Wing MVP will probably be ready but the whole migration will require time [10:12:37] 10Machine-Learning-Team: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (10elukey) p:05Triage→03High [10:30:17] 10Machine-Learning-Team: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (10MoritzMuehlenhoff) We can find some pragmatic middle ground here. 
E.g. to minimise changes in ORES and how it gets deployed, Infrastructure Foundations SREs can provide a build of Python 3.5 (along wi... [10:37:51] 10Machine-Learning-Team: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (10elukey) The dependencies are all in the `ores::base` puppet class, with python3.5 on Buster we could reimage all the nodes easily. Once the new component is ready we can add a new VM in the ORES clou... [10:38:15] ok very nice news from SRE, they will maintain a python3.5 component on Buster to ease the migration for us [10:38:33] if the tests in cloud go well, upgrading to Buster should be relatively easy [10:47:46] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (10elukey) Thanks to Janis' patch I am now able to see a pod with the istio-proxy sidecar, together with CNI logs in the kubelet (to inject the iptables rule... [11:05:16] Phew that is a relief to hear. (but I am also saddened by the necessity) [11:06:47] I think that we could also upgrade to python 3.7 if we wanted, it shouldn't be that scary [11:07:16] with all the automated tests that Aiko created for the cloud instance it should be easy to spot if any model misbehaves during testing [11:07:40] the ORES deprecation will be long so in my opinion we should be confident in deploying to it (or making changes) [11:07:46] Good point. I would just be wary of the "For 3.7 you need to update dependency X" problem [11:07:52] we cannot keep going with the idea that we don't touch it [11:09:17] yes definitely changing deps will require some time [11:09:22] (in the 3.7 use case) [11:29:21] aiko: o/ I saw your change but it revealed a problem in the template, if you check the CI output it is empty (like the change wasn't doing anything) [11:32:33] elukey: o/ where can I check the CI output? 
[11:33:16] Ohhh I found it [11:34:18] https://integration.wikimedia.org/ci/ <- somewhere there, right? [11:35:08] Oh, and linked in Gerrit, of course :facepalm: [11:35:20] exactly yes [11:43:42] $custom_predictor.image_version -> so I should use "image_version" instead of "version"? [11:44:55] aiko: exactly yes, but probably it is less confusing to change the templating to just use "version", what do you think? [11:45:47] elukey: yep to be consistent with $generic_predictor.version [11:46:30] aiko: ack, I am testing a small change, will give you the link in a sec [11:49:15] elukey: no worries I'll send a patch for that :) [11:49:27] basically https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/770895 [11:54:03] elukey: I see!! thanks Luca [11:55:50] elukey: I have a question. What is yaml under .fixtures used for? [11:56:24] aiko: those are basically like test fixtures, you can use helm3 template to see how the chart renders some values [11:56:51] so if you change something that modifies the behavior of the chart, you'll see changes in CI for those fixtures [11:57:10] locally you can test those via [11:57:17] helm3 template -f 'charts/kserve-inference/.fixtures/revscoring_inference_no_transformer.yaml' 'charts/kserve-inference/' [11:57:29] (the helm3 binary can be downloaded from upstream) [11:59:33] elukey: got it. Thanks :) [12:00:11] I am going out for lunch, will merge the change to unblock you when I am back [12:00:18] and we can deploy to codfw if you want! [12:01:38] elukey: yes sounds good! see you later Luca 👋 [12:02:24] * elukey lunch! [13:03:00] Morning! 
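(The fixture mechanism described above can be pictured with a minimal values file. The key names below are illustrative assumptions, not the kserve-inference chart's actual schema; only the change from `image_version` to `version` comes from the discussion:)

```yaml
# Hypothetical sketch of a fixture under charts/kserve-inference/.fixtures/.
# Key names and the image value are assumptions for illustration; the real
# schema is defined by the chart's templates.
inference:
  predictors:
    - name: enwiki-goodfaith
      custom_predictor:
        image: example-predictor-image   # hypothetical image name
        version: "1.0.0"                 # after the patch, "version" replaces "image_version"
```

(Rendering a fixture locally with `helm3 template -f <fixture> charts/kserve-inference/` shows the same manifests that CI diffs when the chart changes.)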
[13:46:20] chrisalbon: o/ [13:46:30] I opened a task for the OS upgrade on the ORES nodes [13:46:49] I know that you wanted to start your day with this info [13:56:04] lol [13:57:08] Exactly what I wanted to hear, major ORES updates [13:59:11] chrisalbon: jokes aside, the current OS is not supported anymore from June onwards, but SRE will help us backport python3.5 to Buster [13:59:29] so in theory we'll be able to keep the current python deps [13:59:31] without changing them [13:59:37] only the underlying OS will change [14:05:20] aiko: rebased https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/770886, let's see if now there is a diff [14:08:27] drain + uncordon is totally ok [14:08:40] sorry this one was for serviceops :D [14:10:11] aiko: it seems to be working this time! Do you want to deploy? [14:11:43] elukey: \o/ yes! [14:18:54] aiko: so I am testing on the ml-serve-eqiad cluster, if you don't mind let's do it only on codfw [14:19:32] so I am going to merge your change [14:19:36] then you can start from https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_deploy [14:19:37] elukey: ok! [14:21:12] kevinbazira: I saw your comment in the code review, since we are not adding new models but just changing what already exists we can proceed [14:22:51] (this one is good so we can test the extended output etc..) [14:25:42] aiko: do you want to do the deployment via meet? [14:26:35] elukey: yes that's a really good idea [14:27:06] aiko: ack, can you open a meeting? [14:27:40] elukey: yep, wait a sec [14:30:11] elukey: here https://meet.google.com/prx-evwv-xvh [14:45:03] elukey: thanks for the clarification. [14:49:15] kevinbazira: np! We just tested Aiko's change, worked nicely :) [14:52:54] chrisalbon: first kubernetes deployment for Aiko, the new feature worked nicely [14:54:01] \o/ thanks Luca for your help :) [14:54:58] Great work aiko and elukey 👏👏👏 [15:12:32] Nice! Great job Aiko! 
[16:21:07] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10calbon) a:03kevinbazira [16:48:52] elukey: I have questions about pcc :D [16:48:58] sure [16:49:10] elukey: AIUI it would tell me if the whole config was invalid. That part I have solved [16:49:36] But shouldn't it _also_ tell me what changes would happen to the listed host(s) if I submitted the changes and did a puppet run? [16:50:07] Oh [string of expletives] [16:50:07] it tells you what changes yes [16:50:40] So I've been running pcc yesterday and a few times in the evening, but it would never show changes. [16:51:06] Now I did it again, and, boom, changes visible [16:51:27] Only difference: I ran it locally instead of via the webui [16:51:36] No local changes that aren't on gerrit, mind [16:52:00] Oh well, at least that is now settled. [16:52:45] I'll disable puppet on 2002 and 2003, merge the changes (after your review), force puppet on 2001, check it comes up correctly, then activate 2002 and 2003, see the cluster converge -> happy [16:52:48] Sound good? [16:53:14] yep [16:58:28] klausman: I was reviewing the change, why did you +2 ? [16:58:46] oops. [16:58:59] Somehow I thought you'd +1'd %-/ [16:59:09] Sorry! [16:59:19] I left a nit, all good, but I was confused since you asked for the review :D [16:59:58] the name of the role could benefit from a rename, you can do it after the rollout [17:00:07] puppet disabled on 2001, while I figure this out [17:06:11] Too much espresso, clearly :D [17:18:43] I haven't +1ed, it was PCC adding the result, but it looks good :D [17:18:51] Dammit. [17:19:30] alright, re-enabled puppet on 2001, doing puppet run [17:22:27] Hmmm. 
Puppet can't find the certs [17:22:39] Error: /Stage[main]/Profile::Etcd::V3/Sslcert::Certificate[_etcd-server-ssl._tcp.ml_staging_etcd.codfw.wmnet]/File[/etc/ssl/localcerts/_etcd-server-ssl._tcp.ml_staging_etcd.codfw.wmnet.crt]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/ssl/_etcd-server-ssl._tcp.ml_staging_etcd.codfw.wmnet.crt [17:22:54] Did I miss something in the private-but-not repo? [17:27:01] yeah the .crt needs to be copied into the public repo [17:27:26] Ah, right [17:27:36] under modules/profile/files/ssl [17:34:51] root@ml-staging-etcd2001:~# etcdctl cluster-health [17:34:53] member 8e9e05c52164694d is healthy: got healthy result from http://localhost:2379 [17:34:55] cluster is healthy [17:34:57] whee [17:35:03] Now to add the other two [17:38:42] nice! [17:39:09] weird. [17:39:19] the 2002 etcd can't parse the keyfile? [17:40:48] oooh, did I add a passphrased key? [17:41:15] yep. [17:41:22] welp, that's an easy fix [17:53:05] # etcdctl -C https://ml-staging-etcd2003.codfw.wmnet:2379 cluster-health|grep member.*is.healthy -c [17:53:06] 3 [17:53:08] y [17:53:10] ..ay [17:53:30] goood [17:55:38] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Create etcd cluster for ml-serve-staging k8s - https://phabricator.wikimedia.org/T302197 (10klausman) 05Open→03Resolved ` # etcdctl -C https://ml-staging-etcd2001.codfw.wmnet:2379 cluster-health member 493aa03d462725d1 i... 
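(The `etcdctl cluster-health | grep member.*is.healthy -c` check above can also be scripted. A minimal sketch, assuming the etcdctl v2 output format shown in the log; the sample text is illustrative, not a capture from the real hosts:)

```python
# Count healthy members in `etcdctl cluster-health` output.
# The assumed format is one "member <id> is healthy: ..." line per member,
# followed by an overall "cluster is healthy" summary line.

def count_healthy_members(output: str) -> int:
    """Return the number of members reported as healthy."""
    return sum(
        1
        for line in output.splitlines()
        if line.startswith("member") and "is healthy" in line
    )

# Illustrative sample modeled on the log above (not real host output).
sample = """\
member 8e9e05c52164694d is healthy: got healthy result from http://localhost:2379
member 493aa03d462725d1 is healthy: got healthy result from https://ml-staging-etcd2001.codfw.wmnet:2379
cluster is healthy"""

print(count_healthy_members(sample))  # prints 2
```

(For a three-node cluster like ml-staging, the expected count is 3, matching the `grep -c` result in the log.)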
[17:55:40] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (10klausman) [17:57:53] it is interesting that profile::etcd::v3::cluster_bootstrap: true stays true in multiple clusters [18:23:10] AIUI, it's needed to make a new cluster from scratch [18:23:28] yeah but then not sure if it needs to be kept [18:23:33] It would then be logical that it *could* be turned off after it's set up [18:23:44] yep, maybe it doesn't really count that much afterwards [18:23:52] But: if it doesn't break anything, one might just leave it on. But then why have a setting at all? [18:23:59] going to dinner, have a nice rest of the day folks! [18:24:14] enjoy! talk to you tomorrow [18:24:14] maybe it is needed only for the bootstrap case [19:13:08] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (10QChris) > I have requested a new istio repository in: > https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests Gerrit just had the new [[https:/...
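(On the `cluster_bootstrap` question above: etcd itself distinguishes forming a brand-new cluster from joining an existing one via `--initial-cluster-state` (`new` vs `existing`), and it ignores bootstrap configuration once a member's data directory has been initialized, which would explain why leaving the flag set is harmless. A hypothetical hiera sketch; the exact key placement is an assumption:)

```yaml
# Hypothetical hiera sketch for the ml-staging etcd hosts.
# The flag matters only while the cluster is being formed; etcd ignores
# bootstrap settings once the member's data directory exists.
profile::etcd::v3::cluster_bootstrap: true  # could be flipped to false once all three members have joined
```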