[06:26:06] hello folks :)
[07:15:32] today I am going to visit a co-working space with some friends, I may be on/off due to commuting/coffee/etc.. :)
[08:28:30] good morning :)
[08:35:36] back!
[08:35:39] morning :)
[08:38:52] hey Luca, how was the co-working space? :)
[08:39:35] not bad!
[08:40:19] do you want to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/772811 ?
[08:40:35] for the moment eqiad is still not ready, so we could limit/test it to ml-serve-codfw
[08:46:20] Kevin hasn't reviewed yet, I just sent a msg to him
[08:56:05] ack sure
[09:20:57] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (10ayounsi) Nothing to add :) @elukey good luck with the (re)initialization
[09:43:36] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (10elukey) 05Open→03Resolved Thanks a lot everybody for the help!
[09:43:38] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[09:43:45] klausman: o/ --^ everything seems approved from the netops, new ip ranges can be used :)
[09:44:10] \o
[09:44:12] yay!
[09:59:58] so staging is unblocked, and we can also think about the cluster re-init
[10:00:08] (once the procedure is reviewed and +1ed)
[10:11:41] ack, I will try and get the staging changes split up today. I am still puzzled about one of the IP ranges :)
[10:14:00] anything odd?
[10:15:30] So 10.192.77.0/24 and 10.192.78.0/23 are used as service and pod ip ranges in codfw for the current prod setup, right?
[10:16:09] But the only mention of 10.192.78.0 in puppet that I can find is in modules/network/data/data.yaml
[10:16:45] So how does the k8s in codfw know about it?
[10:16:58] profile::kubernetes::master::service_cluster_ip_range: 10.192.77.0/24
[10:17:10] That's 77/24, not 78/23
[10:17:34] Is that just through helm?
[10:18:09] no no it is puppet
[10:18:54] But where?
[10:19:28] 77/24 is the reserved svc pool https://netbox.wikimedia.org/ipam/prefixes/382/
[10:19:32] it is a /24
[10:19:35] and /23 for pods
[10:19:38] I am not talking about 77, tho :)
[10:20:07] ahhh sorry so you want to know where 78 is defined
[10:20:24] if you grep -rli for 10.192.78.0 in puppet, there is only the data.yaml hit
[10:20:30] It should be in deployment charts, it is calico that handles that subnet
[10:20:52] That's what I meant by "through helm", so we're on the same page :)
[10:21:23] yep confirmed
[10:21:23] helmfile.d/admin_ng/values/ml-serve-codfw/calico-values.yaml: cidr: "10.192.78.0/23"
[10:21:26] helmfile.d/admin_ng/values/ml-serve.yaml: - "10.192.78.0/23"
[10:21:41] sorry I didn't get it at first :)
[10:21:43] So I don't have to worry about the new pod net for now.
[10:23:43] for the moment I'd say no
[10:51:36] 10Machine-Learning-Team: Re-initialize the Kubernetes ML Serve clusters - https://phabricator.wikimedia.org/T304673 (10JMeybohm) I think you might need to downtime/depool from LVS maybe? Also I guess you will still see a bunch of alerts regarding BGP peerings which you can't downtime in a dedicated fashion (but...
[10:54:06] 10Machine-Learning-Team: Re-initialize the Kubernetes ML Serve clusters - https://phabricator.wikimedia.org/T304673 (10elukey)
[10:54:19] 10Machine-Learning-Team: Re-initialize the Kubernetes ML Serve clusters - https://phabricator.wikimedia.org/T304673 (10elukey) >>! In T304673#7810262, @JMeybohm wrote: > I think you might need to downtime/depool from LVS maybe? > Also I guess you will still see a bunch of alerts regarding BGP peerings which you...
[11:03:21] * elukey lunch!
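The pool layout worked out above (service range defined in puppet, pod range handed to calico via the deployment-charts helmfile) can be sanity-checked with a short Python sketch using the standard `ipaddress` module; the two prefixes are the ones quoted in the conversation:

```python
import ipaddress

# Pools quoted above for ml-serve-codfw: the service range comes from puppet
# (profile::kubernetes::master::service_cluster_ip_range), the pod range from
# the calico values in deployment-charts (helmfile.d/admin_ng/values/...).
svc_pool = ipaddress.ip_network("10.192.77.0/24")  # services
pod_pool = ipaddress.ip_network("10.192.78.0/23")  # pods

# The two pools must not overlap, and pods get the larger block.
assert not svc_pool.overlaps(pod_pool)
assert pod_pool.num_addresses > svc_pool.num_addresses
print(svc_pool.num_addresses, pod_pool.num_addresses)  # 256 512
```

This also makes the sizing convention visible at a glance: the /23 pod pool (512 addresses) is twice the size of the /24 service pool (256 addresses).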
[11:17:21] ditto
[12:56:36] aiko: when you are ok let's deploy the change
[13:02:35] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (10elukey) To keep archives happy: this is the config that I have used to allow pods to contact api-ro.discovery.wmnet. ` - apiVersion: networking.istio.io...
[13:34:54] elukey: I'm ok! can we do it in a meeting?
[13:35:50] aiko: sure! I am doing one thing atm, is it ok in ~30 mins?
[13:36:54] elukey: sure!! :)
[14:08:35] aiko: I am ready if you want
[14:08:59] elukey: here https://meet.google.com/ctx-ouqv-ysp
[15:20:33] elukey: sent a CR your way. Not sure if I can/should ditch the hieradata/common/yaml stuff as well
[15:20:39] common.yaml*
[15:33:05] sure, will review in a bit
[15:37:41] aiko: I just asked wikimedia-cloud, when I get an answer from them I'll tell you :)
[15:38:21] (the wikimedia sandbox's attached volume, that populates the /srv partition, seems in a weird state after the last reboot)
[15:38:34] (wikimedia ml's sandbox)
[16:01:34] elukey: thanks Luca! :)
[16:01:38] Morning all!
[16:01:44] I'm going to miss you this week!
[16:02:40] o/ morning :)
[16:04:25] o/
[16:04:32] \o heyo Chris. btw, where in nyc does the thing happen?
[16:05:33] 80 Bowery is the venue. It is where the board of trustees met last week (the offsite is apparently tagging onto all the covid prep they did for the board meeting)
[16:05:39] It's in... uh... NYC
[16:05:57] I don't know the geography of NYC well enough to say anything else
[16:06:17] Lower Manhattan, near Chinatown
[16:06:19] Sorry 50 Bowery
[16:06:37] Then _in_ Chinatown :D
[16:07:31] Right next to the HSBC dome, too
[16:22:31] klausman: qq about https://netbox.wikimedia.org/ipam/prefixes/530/prefixes/
[16:22:46] in theory what we have in prod now is /24 for svcs and /23 for pods
[16:22:46] Yes?
[16:23:02] did we decide to flip them?
I don't recall (I am triple checking)
[16:23:04] wait, really?
[16:23:42] see https://netbox.wikimedia.org/ipam/prefixes/377/prefixes/
[16:24:03] we indeed have 77/24 as pods, ergo /23 for svcs
[16:24:22] I mean, I can just change it in the CR and change the description in NB
[16:24:40] (i.e. use 62.0/23 for pods)
[16:24:59] yes yes sure no big deal, I am wondering what was best (the subnet in the CR doesn't look like one of the new ip ranges though)
[16:25:06] and 61/24 for svcs
[16:25:14] https://netbox.wikimedia.org/ipam/prefixes/530/prefixes/
[16:25:24] The subnet I had was indeed wrong
[16:25:49] So in the 530 URL, we currently have a /24 for pods, and a /23 for svcs
[16:26:04] I'd swap the description there, and put 61/24 into the CR
[16:26:10] super
[16:26:32] (we'd probably also want to swap the descriptions of the /20 and /21 in the same page)
[16:27:42] At least I git wrong _consistently_ :D
[16:27:53] (also fortuitous typo there)
[16:30:20] I'll also fix these: https://netbox.wikimedia.org/ipam/prefixes/535/prefixes/
[16:31:22] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye
[16:36:59] 10Machine-Learning-Team, 10Cloud-Services: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10elukey)
[16:37:05] aiko: --^
[16:38:25] 10Machine-Learning-Team, 10Cloud-VPS: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10Majavah)
[16:59:34] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed wit...
[17:12:21] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[17:12:50] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (10elukey) 05In progress→03Resolved
[17:27:01] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) @elukey Since moving the server, I cannot get it to install the OS correctly, can you please take a look. Thanks
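The netbox description mix-up discussed above (a /24 labelled as the pod pool and a /23 as the service pool) follows from the convention that pods get the /23 and services the /24, so the correct assignment can be derived from prefix length alone. A minimal Python sketch of that check; the helper name is ours, and the codfw prod prefixes are used only as sample input:

```python
import ipaddress

def assign_roles(a: str, b: str) -> dict:
    """Given the two pools reserved for a cluster, return which one should
    be the pod range and which the service range, per the convention in
    the conversation above: pods get the /23, services get the /24."""
    nets = sorted((ipaddress.ip_network(a), ipaddress.ip_network(b)),
                  key=lambda n: n.prefixlen)
    pods, svcs = nets  # the /23 sorts before the /24
    assert pods.prefixlen == 23 and svcs.prefixlen == 24, "unexpected sizes"
    assert not pods.overlaps(svcs)
    return {"pods": str(pods), "svcs": str(svcs)}

# Works regardless of which description netbox currently shows:
print(assign_roles("10.192.78.0/23", "10.192.77.0/24"))
# {'pods': '10.192.78.0/23', 'svcs': '10.192.77.0/24'}
```

Because the assignment depends only on prefix length, swapping the descriptions in netbox (as proposed for the prefixes under ipam/prefixes/530/) does not change which block ends up in the calico pod CIDR versus the puppet service range.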