[06:26:06] hello folks :)
[07:15:32] today I am going to visit a co-working space with some friends, I may be on/off due to commuting/coffee/etc.. :)
[08:28:30] good morning :)
[08:35:36] back!
[08:35:39] morning :)
[08:38:52] hey Luca, how was the co-working space? :)
[08:39:35] not bad!
[08:40:19] do you want to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/772811 ?
[08:40:35] for the moment eqiad is still not ready, so we could limit/test it to ml-serve-codfw
[08:46:20] Kevin hasn't reviewed yet, I just sent a msg to him
[08:56:05] ack sure
[09:20:57] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (10ayounsi) Nothing to add :) @elukey good luck with the (re)initialization
[09:43:36] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (10elukey) 05Open→03Resolved Thanks a lot everybody for the help!
[09:43:38] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[09:43:45] klausman: o/ --^ everything seems approved from the netops, new ip ranges can be used :)
[09:44:10] \o
[09:44:12] yay!
[09:59:58] so staging is unblocked, and we can also think about the cluster re-init
[10:00:08] (once the procedure is reviewed and +1ed)
[10:11:41] ack, I will try and get the staging changes split up today. I am still puzzled about one of the IP ranges :)
[10:14:00] anything odd?
[10:15:30] So 10.192.77.0/24 and 10.192.78.0/23 are used as service and pod ip ranges in codfw for the current prod setup, right?
[10:16:09] But the only mention of 10.192.78.0 in puppet that I can find is in modules/network/data/data.yaml
[10:16:45] So how does the k8s in codfw know about it?
[10:16:58] profile::kubernetes::master::service_cluster_ip_range: 10.192.77.0/24
[10:17:10] That's 77/24, not 78/23
[10:17:34] Is that just through helm?
[10:18:09] no no it is puppet
[10:18:54] But where?
[10:19:28] 77/24 is the reserved svc pool https://netbox.wikimedia.org/ipam/prefixes/382/
[10:19:32] it is a /24
[10:19:35] and /23 for pods
[10:19:38] I am not talking about 77, tho :)
[10:20:07] ahhh sorry so you want to know where 78 is defined
[10:20:24] if you grep -rli for 10.192.78.0 in puppet, there is only the data.yaml hit
[10:20:30] It should be in deployment charts, it is calico that handles that subnet
[10:20:52] That's what I meant by "through helm", so we're on the same page :)
[10:21:23] yep confirmed
[10:21:23] helmfile.d/admin_ng/values/ml-serve-codfw/calico-values.yaml: cidr: "10.192.78.0/23"
[10:21:26] helmfile.d/admin_ng/values/ml-serve.yaml: - "10.192.78.0/23"
[10:21:41] sorry I didn't get it at first :)
[10:21:43] So I don't have to worry about the new pod net for now.
[10:23:43] for the moment I'd say no
[10:51:36] 10Machine-Learning-Team: Re-initialize the Kubernetes ML Serve clusters - https://phabricator.wikimedia.org/T304673 (10JMeybohm) I think you might need to downtime/depool from LVS maybe? Also I guess you will still see a bunch of alerts regarding BGP peerings which you can't downtime in a dedicated fashion (but...
[10:54:06] 10Machine-Learning-Team: Re-initialize the Kubernetes ML Serve clusters - https://phabricator.wikimedia.org/T304673 (10elukey)
[10:54:19] 10Machine-Learning-Team: Re-initialize the Kubernetes ML Serve clusters - https://phabricator.wikimedia.org/T304673 (10elukey) >>! In T304673#7810262, @JMeybohm wrote: > I think you might need to downtime/depool from LVS maybe? > Also I guess you will still see a bunch of alerts regarding BGP peerings which you...
[11:03:21] * elukey lunch!
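The pool layout worked out above (service range defined in puppet, pod range handed to calico via the deployment-charts helmfile) can be sanity-checked with a short Python sketch using the standard `ipaddress` module; the two prefixes are the ones quoted in the conversation:

```python
import ipaddress

# Pools quoted above for ml-serve-codfw: the service range comes from puppet
# (profile::kubernetes::master::service_cluster_ip_range), the pod range from
# the calico values in deployment-charts (helmfile.d/admin_ng/values/...).
svc_pool = ipaddress.ip_network("10.192.77.0/24")  # services
pod_pool = ipaddress.ip_network("10.192.78.0/23")  # pods

# The two pools must not overlap, and pods get the larger block.
assert not svc_pool.overlaps(pod_pool)
assert pod_pool.num_addresses > svc_pool.num_addresses
print(svc_pool.num_addresses, pod_pool.num_addresses)  # 256 512
```

This also makes the sizing convention visible at a glance: the /23 pod pool (512 addresses) is twice the size of the /24 service pool (256 addresses).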
[11:17:21] ditto
[12:56:36] aiko: when you are ok let's deploy the change
[13:02:35] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (10elukey) To keep archives happy: this is the config that I have used to allow pods to contact api-ro.discovery.wmnet. ` - apiVersion: networking.istio.io...
[13:34:54] elukey: I'm ok! can we do it in a meeting?
[13:35:50] aiko: sure! I am doing one thing atm, is it ok in ~30 mins?
[13:36:54] elukey: sure!! :)
[14:08:35] aiko: I am ready if you want
[14:08:59] elukey: here https://meet.google.com/ctx-ouqv-ysp
[15:20:33] elukey: sent a CR your way. Not sure if I can/should ditch the hieradata/common/yaml stuff as well
[15:20:39] common.yaml*
[15:33:05] sure, will review in a bit
[15:37:41] aiko: I just asked wikimedia-cloud, when I get an answer from them I'll tell you :)
[15:38:21] (the wikimedia sandbox's attached volume, that populates the /srv partition, seems in a weird state after the last reboot)
[15:38:34] (wikimedia ml's sandbox)
[16:01:34] elukey: thanks Luca! :)
[16:01:38] Morning all!
[16:01:44] I'm going to miss you this week!
[16:02:40] o/ morning :)
[16:04:25] o/
[16:04:32] \o heyo Chris. btw, where in nyc does the thing happen?
[16:05:33] 80 Bowery is the venue. It is where the board of trustees met last week (the offsite is apparently tagging onto all the covid prep they did for the board meeting)
[16:05:39] It's in... uh... NYC
[16:05:57] I don't know the geography of NYC well enough to say anything else
[16:06:17] Lower Manhattan, near Chinatown
[16:06:19] Sorry 50 Bowery
[16:06:37] Then _in_ Chinatown :D
[16:07:31] Right next to the HSBC dome, too
[16:22:31] klausman: qq about https://netbox.wikimedia.org/ipam/prefixes/530/prefixes/
[16:22:46] in theory what we have in prod now is /24 for svcs and /23 for pods
[16:22:46] Yes?
[16:23:02] did we decide to flip them?
I don't recall (I am triple checking)
[16:23:04] wait, really?
[16:23:42] see https://netbox.wikimedia.org/ipam/prefixes/377/prefixes/
[16:24:03] we indeed have 77/24 as pods, ergo /23 for svcs
[16:24:22] I mean, I can just change it in the CR and change the description in NB
[16:24:40] (i.e. use 62.0/23 for pods)
[16:24:59] yes yes sure no big deal, I am wondering what was best (the subnet in the CR doesn't look like one of the new ip ranges though)
[16:25:06] and 61/24 for svcs
[16:25:14] https://netbox.wikimedia.org/ipam/prefixes/530/prefixes/
[16:25:24] The subnet I had was indeed wrong
[16:25:49] So in the 530 URL, we currently have a /24 for pods, and a /23 for svcs
[16:26:04] I'd swap the description there, and put 61/24 into the CR
[16:26:10] super
[16:26:32] (we'd probably also want to swap the descriptions of the /20 and /21 in the same page)
[16:27:42] At least I git wrong _consistently_ :D
[16:27:53] (also fortuitous typo there)
[16:30:20] I'll also fix these: https://netbox.wikimedia.org/ipam/prefixes/535/prefixes/
[16:31:22] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye
[16:36:59] 10Machine-Learning-Team, 10Cloud-Services: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10elukey)
[16:37:05] aiko: --^
[16:38:25] 10Machine-Learning-Team, 10Cloud-VPS: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10Majavah)
[16:59:34] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed wit...
[17:12:21] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[17:12:50] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (10elukey) 05In progress→03Resolved
[17:27:01] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) @elukey Since moving the server, I cannot get it to install the OS correctly, can you please take a look. Thanks
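The netbox description mix-up discussed above (a /24 labelled as the pod pool and a /23 as the service pool) follows from the convention that pods get the /23 and services the /24, so the correct assignment can be derived from prefix length alone. A minimal Python sketch of that check; the helper name is ours, and the codfw prod prefixes are used only as sample input:

```python
import ipaddress

def assign_roles(a: str, b: str) -> dict:
    """Given the two pools reserved for a cluster, return which one should
    be the pod range and which the service range, per the convention in
    the conversation above: pods get the /23, services get the /24."""
    nets = sorted((ipaddress.ip_network(a), ipaddress.ip_network(b)),
                  key=lambda n: n.prefixlen)
    pods, svcs = nets  # the /23 sorts before the /24
    assert pods.prefixlen == 23 and svcs.prefixlen == 24, "unexpected sizes"
    assert not pods.overlaps(svcs)
    return {"pods": str(pods), "svcs": str(svcs)}

# Works regardless of which description netbox currently shows:
print(assign_roles("10.192.78.0/23", "10.192.77.0/24"))
# {'pods': '10.192.78.0/23', 'svcs': '10.192.77.0/24'}
```

Because the assignment depends only on prefix length, swapping the descriptions in netbox (as proposed for the prefixes under ipam/prefixes/530/) does not change which block ends up in the calico pod CIDR versus the puppet service range.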