[10:52:41] is the k8s upgrade meeting happening today?
[10:54:12] yes
[10:54:30] ACK
[10:57:09] I won't join today, my daughter has a doc appointment
[11:32:08] oh, now I understand the question from david
[11:50:53] I did not know, I just thought that as everything was kind of ready yesterday, there was nothing else to do until the actual upgrade
[11:52:52] yeah, let's cancel the meeting
[11:55:30] 👍
[12:03:29] fyi guys I'm removing unused cloud vlans from asw2-b-eqiad
[12:03:33] won't have any effect
[12:04:45] I'm leaving the link from it to cloudsw1-c8 as is otherwise, with the "cloud-hosts1-eqiad" vlan on it to support cloudcephmon1001 (in rack B7)
[12:05:04] topranks: thanks!
[12:05:22] topranks: how's the qos going btw? is there anything you need from ceph?
[12:05:41] (we can do some tests or something if you want at some point)
[12:05:48] dcaro: actually I will be leaving the cloud-storage1-eqiad vlan in place too
[12:06:05] as it's configured for the cloudcephmon1001 port, although I'm not 100% sure it's in use there as I don't see a MAC
[12:06:16] anyway, we can review again, no problem to have it there, just doing some clean-up
[12:06:42] dcaro: qos is mostly done, I'm waiting on one last review in netops and then will be rolling it out
[12:06:56] the puppet stuff is done, so yeah we can look at adding some rules for ceph
[12:07:28] nice :) thanks for all that work, really appreciate it
[12:08:26] no probs, sry it took so long. I'll have to dig up the task where we talked about ceph, I was looking at it last week and doing some tests
[12:08:51] there is a config option that needs to be set so that ceph marks the heartbeat traffic
[12:09:03] I'll dig it out and we can discuss what the best way forward is
[12:09:56] but I guess the goal is to allow a major ceph incident / re-sync of data to go full-speed, while not overly impacting other services for cloud, i.e. max out the switch<->switch links
[12:10:31] yep, including ceph itself (client traffic)
[12:10:51] yeah, we need to make sure to classify both sides
[12:11:29] you probably already have a "firewall::service" def in puppet for the servers to allow it in the firewall, but we'll probably need to add a "firewall::client" def for the other hosts that use it to mark the requests
[12:11:35] are ceph profiles already using nftables-backed firewalls?
[12:11:53] arturo: I don't think so, I was looking a few weeks back
[12:12:13] but "firewall::service" and "firewall::client" are wrappers and can be used whether the back-end is ferm or nftables
[12:12:32] so I think the qos only works with an nftables-based firewall, right? so that switch needs to happen first
[12:12:38] we may need ceph to use "ferm::rule" or "nftables::rules" instead if we need something a little custom, but it should be ok
[12:13:03] arturo: no, we were going to go that way to encourage people to move to nftables, but there is too much stuff on ferm
[12:13:09] I'm in the process of reimaging a bunch of the hosts, and I'll have to upgrade all of them (with reboots and such), maybe that can be done then?
[12:13:15] so I added what's needed for both ferm and nftables
[12:13:34] ack
[12:13:37] dcaro: which, the qos or nftables?
[12:13:52] I'd start with nftables at least
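Aside, to make the "marking" side of the discussion concrete: below is a minimal Python sketch of how an application can mark its own packets by setting the DSCP bits through the IP_TOS socket option, which is the general idea behind having ceph mark its heartbeat traffic or a "firewall::client" mark requests. It is only an illustration, not the actual ceph or puppet mechanism; the class names and values follow the standard DSCP table, and the destination address/port are placeholders.

```python
import socket

# Standard DSCP code points (decimal). The TOS byte an application sets is the
# DSCP value shifted left by 2 (the low 2 bits are ECN), which is where the
# hex/decimal confusion usually comes from: CS6 = DSCP 48 = TOS 0xC0 = 192.
DSCP = {
    "CS0": 0,    # best effort
    "AF21": 18,  # a "better than best effort" class
    "CS6": 48,   # network control (often used for heartbeats / routing protocols)
}

def dscp_to_tos(dscp: int) -> int:
    """Value to pass to IP_TOS for a given DSCP code point."""
    return dscp << 2

# Mark all traffic sent on this socket with CS6; some high-precedence values
# may require elevated privileges depending on the platform.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(DSCP["CS6"]))
sock.sendto(b"ping", ("192.0.2.10", 9000))  # placeholder destination
```

The switch or firewall side then only has to classify on those DSCP values, which is why setting them consistently everywhere matters more than the particular numbers chosen.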
[12:13:57] adding the qos marking rules can probably be done any time, it'll just add a few new rules to the mangle table to set DSCP, so non-disruptive
[12:14:05] no need to do it at reimage time etc
[12:14:21] nftables sounds like a bigger change so it would perhaps make sense to do it while they're offline, test, etc.
[12:14:25] better then :), we can do the nftables during the upgrade, and then the qos after
[12:14:27] yep
[12:14:31] sure, either way
[12:14:50] but the qos can work in both scenarios so we can do it in either order
[12:15:08] the thing is that if we have to take one node out of the pool it takes ~6h to depool/repool
[12:15:19] it would be good to add that before we tackle the switch upgrades, and then hopefully things are more stable when we have a lot of activity after hosts come back online?
[12:15:30] 🤞
[12:15:32] yep
[12:15:45] yep sure. for adding the qos rules I don't think we'd need to take anything out of a pool
[12:15:52] nftables is a bigger change though
[12:17:34] if you look now you'll see they already have the basic rule marking everything as 'normal' priority
[12:17:40] https://www.irccloud.com/pastebin/oOdZe4lO/
[12:18:27] to mark certain traffic up/down we just add additional rules here
[12:19:40] neat, did not know that the TOS field was deprecated xd
[12:19:51] (22 years ago)
[12:19:57] ugh... TOS... DSCP... so confusing!
[12:20:11] DSCP takes the first 6 bits of the old 8-bit TOS field and gives everything new names
[12:20:33] which is fine.... but the hex/decimal representation of all these things then gets confusing!
[12:20:47] hahahaha
[12:20:56] this table is my bible :P
[12:20:57] https://www.tucny.com/Home/dscp-tos
[12:21:09] awesome! thanks!
[12:23:27] that goes into my bookmarks
[12:23:59] yeah same.
[12:24:06] for reference we're gonna use these markings:
[12:24:07] https://wikitech.wikimedia.org/wiki/Quality_of_Service_(Network)#QoS_Classes
[12:24:24] but ultimately the actual numbers aren't that important, what's important is that they are set up the same way everywhere
[12:30:52] 👍
[12:31:30] quick review: https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/44
[12:34:22] dcaro: +1'd
[12:34:30] thanks!
[12:37:21] dcaro: would you please stamp https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/55 ?
[12:39:23] I don't see why it's critical, but sure, is anything on fire?
[12:39:54] no
[12:39:59] why would anything be on fire?
[12:40:29] because you are asking for stamps
[12:40:57] that's the kind of code review that I like the most :-)
[12:41:36] xd, no code reviews essentially
[12:42:54] I don't agree, but also won't debate over it
[12:44:25] to clarify, I do not review the code when you ask for a stamp, I trust you to have a meaningful reason for needing that code merged without review
[12:54:46] another quick review (not stamp) please: https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/103
[12:56:26] well, I'm happy to +1 this trivial change, but I think you should feel free to self-merge this one
[12:56:27] :-P
[12:57:46] we now have maintain-kubeusers randomly renewing a few certs each run
[12:57:48] https://usercontent.irccloud-cdn.com/file/kXK0xoSK/image.png
[12:58:18] the newly renewed certs will have an expiration of 10 days (instead of 1 year)
[12:58:42] we might want to not renew certs that were just renewed
[12:58:49] (even if the selection is random)
[12:59:44] are you monitoring anything specific to see if there are any issues?
[12:59:51] (on the users' side I mean)
[12:59:53] why? certs should be renewable anytime, even multiple times in a row
[13:01:21] I'm not monitoring anything in particular. I've checked that the new certs are valid for the few tools that got them renewed
[13:01:31] manually checked, that is
[13:03:12] it's a pretty critical bit, there are bugs that happen, unexpected events... many things can go wrong in unexpected ways, feels like something we should keep an eye on for a while
[13:10:45] I will do 👍
[13:15:46] does the jobs-api cache the certs in any way?
[13:18:22] does not look like it, that's ok
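To illustrate the "don't renew certs that were just renewed" idea mentioned around 12:58, here is a hypothetical sketch that only treats a cert as due for renewal when it is close to expiry. The real maintain-kubeusers logic may differ; the threshold, file path and renew_certificate() call are placeholders.

```python
from datetime import datetime, timedelta, timezone

from cryptography import x509  # assumes a recent 'cryptography' release

# Illustrative threshold: only renew certs that expire within this window.
RENEW_IF_EXPIRES_WITHIN = timedelta(days=30)

def needs_renewal(pem_bytes: bytes) -> bool:
    """Return True if the cert expires within the renewal window."""
    cert = x509.load_pem_x509_certificate(pem_bytes)
    remaining = cert.not_valid_after_utc - datetime.now(timezone.utc)
    return remaining < RENEW_IF_EXPIRES_WITHIN

# Usage sketch: skip tools whose current cert is still comfortably valid.
# with open("/path/to/tool.crt", "rb") as f:   # placeholder path
#     if needs_renewal(f.read()):
#         renew_certificate(...)               # placeholder for the real renewal call
```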
[13:55:01] * dcaro paged
[13:55:17] toolsdb alerting that it's down, looking
[13:55:39] oh, the alert went away
[13:59:25] hmm... the cluster is ok
[13:59:38] https://www.irccloud.com/pastebin/ABbiOiqH/
[14:02:18] hmmm... prometheus still shows 0 primaries to me
[14:02:21] https://usercontent.irccloud-cdn.com/file/1shSz6pQ/image.png
[14:05:05] hmm, it seems as if tools-db-1 stopped reporting
[14:05:31] yep
[14:05:33] https://www.irccloud.com/pastebin/uTN0Yk36/
[15:43:13] the alert is there already, it's just that the grep does not show the next line 🤦‍♂️
[16:00:29] just saw another LDAP server error on maintain-kubeusers
[16:00:32] https://usercontent.irccloud-cdn.com/file/mx95frU1/image.png
[16:00:52] (it is harmless at this point, the pod just gets restarted)
[16:04:27] * arturo off
[17:51:45] * dcaro off
[18:26:48] Raymond_Ndibe: I'm going to do some maintenance on the toolsbeta harbor db, will likely cause some lima-kilo downtime.
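For the kind of check done above around 14:02 ("prometheus still shows 0 primaries" / "tools-db-1 stopped reporting"), here is a hypothetical sketch of asking the Prometheus HTTP API which targets of a job are still reporting. The server URL and job label are placeholders, not the real tools/toolsdb names.

```python
import requests

PROMETHEUS = "http://prometheus.example.org"  # placeholder server
QUERY = 'up{job="mariadb"}'                   # placeholder job label

# /api/v1/query returns an instant vector; 'up' is 1 for targets whose last
# scrape succeeded and 0 for targets whose last scrape failed.
resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "?")
    value = result["value"][1]  # [timestamp, value-as-string]
    print(f"{instance}: {'up' if value == '1' else 'DOWN'}")
```

Note that a target that disappeared from the job entirely will not show up in the result at all, which is itself a useful signal that a host silently stopped reporting.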