[10:52:41] is the k8s upgrade meeting happening today?
[10:54:12] yes
[10:54:30] ACK
[10:57:09] I won't join today, my daughter has a doc appointment
[11:32:08] oh, now I understand the question from david
[11:50:53] I did not know, I just thought that as everything was kind of ready yesterday, there was nothing else to do until the actual upgrade
[11:52:52] yeah, let's cancel the meeting
[11:55:30] 👍
[12:03:29] fyi guys I'm removing unused cloud vlans from asw2-b-eqiad
[12:03:33] won't have any effect
[12:04:45] I'm leaving the link from it to cloudsw1-c8 as is otherwise, with the "cloud-hosts1-eqiad" vlan on it to support cloudcephmon1001 (in rack B7)
[12:05:04] topranks: thanks!
[12:05:22] topranks: how's the qos going btw? is there anything you need from ceph?
[12:05:41] (we can do some tests or something if you want at some point)
[12:05:48] dcaro: actually I will be leaving the cloud-storage1-eqiad vlan in place too
[12:06:05] as it's configured for the cloudcephmon1001 port, although I'm not 100% sure it's in use there as I don't see a MAC
[12:06:16] anyway, we can review again, no problem to have it there, just doing some clean-up
[12:06:42] dcaro: qos is mostly done, I'm waiting on one last review in netops and then will be rolling it out
[12:06:56] the puppet stuff is done, so yeah we can look at adding some rules for ceph
[12:07:28] nice :) thanks for all that work, really appreciate it
[12:08:26] no probs, sry it took so long. I'll have to dig up the task where we talked about ceph, I was looking at it last week and doing some tests
[12:08:51] there is a config option that needs to be set so that ceph marks the heartbeat traffic
[12:09:03] I'll dig it out and we can discuss what the best way forward is
[12:09:56] but I guess the goal is to allow a major ceph incident / re-sync of data to go full-speed, while not overly impacting other services for cloud, i.e. max out the switch<->switch links
[12:10:31] yep, including ceph itself (client traffic)
[12:10:51] yeah, we need to make sure to classify both sides
[12:11:29] you probably already have a "firewall::service" def in puppet for the servers to allow it in the firewall, but we'll probably need to add a "firewall::client" def for the other hosts that use it to mark the requests
[12:11:35] are ceph profiles already using nftables-backed firewalls?
[12:11:53] arturo: I don't think so, I was looking a few weeks back
[12:12:13] but "firewall::service" and "firewall::client" are wrappers and can be used whether the back-end is ferm or nftables
[12:12:32] so I think the qos only works with an nftables-based firewall, right? so that switch needs to happen first
[12:12:38] we may need ceph to use "ferm::rule" or "nftables::rules" instead if we need something a little custom, but it should be ok
[12:13:03] arturo: no, we were going to go that way to encourage people to move to nftables, but there is too much stuff on ferm
[12:13:09] I'm in the process of reimaging a bunch of the hosts, and I'll have to upgrade all of them (with reboots and such), maybe that can be done then?
[12:13:15] so I added what's needed for both ferm and nftables
[12:13:34] ack
[12:13:37] dcaro: which, the qos or nftables?
[12:13:52] I'd start with nftables at least
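Aside, to make the "marking" side of the discussion concrete: below is a minimal Python sketch of how an application can mark its own packets by setting the DSCP bits through the IP_TOS socket option, which is the general idea behind having ceph mark its heartbeat traffic or a "firewall::client" mark requests. It is only an illustration, not the actual ceph or puppet mechanism; the class names and values follow the standard DSCP table, and the destination address/port are placeholders.

```python
import socket

# Standard DSCP code points (decimal). The TOS byte an application sets is the
# DSCP value shifted left by 2 (the low 2 bits are ECN), which is where the
# hex/decimal confusion usually comes from: CS6 = DSCP 48 = TOS 0xC0 = 192.
DSCP = {
    "CS0": 0,    # best effort
    "AF21": 18,  # a "better than best effort" class
    "CS6": 48,   # network control (often used for heartbeats / routing protocols)
}

def dscp_to_tos(dscp: int) -> int:
    """Value to pass to IP_TOS for a given DSCP code point."""
    return dscp << 2

# Mark all traffic sent on this socket with CS6; some high-precedence values
# may require elevated privileges depending on the platform.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(DSCP["CS6"]))
sock.sendto(b"ping", ("192.0.2.10", 9000))  # placeholder destination
```

The switch or firewall side then only has to classify on those DSCP values, which is why setting them consistently everywhere matters more than the particular numbers chosen.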
[12:13:57] adding the qos marking rules can probably be done any time, it'll just add a few new rules to the mangle table to set DSCP, so non-disruptive
[12:14:05] no need to do it at reimage time etc
[12:14:21] nftables sounds like a bigger change so it would perhaps make sense to do it while they're offline, test, etc.
[12:14:25] better then :), we can do the nftables during the upgrade, and then the qos after
[12:14:27] yep
[12:14:31] sure, either way
[12:14:50] but the qos can work in both scenarios so we can do it in either order
[12:15:08] the thing is that if we have to take one node out of the pool it takes ~6h to depool/repool
[12:15:19] it would be good to add that before we tackle the switch upgrades, and then hopefully things are more stable when we have a lot of activity after hosts come back online?
[12:15:30] 🤞
[12:15:32] yep
[12:15:45] yep sure. for adding the qos rules I don't think we'd need to take anything out of a pool
[12:15:52] nftables is a bigger change though
[12:17:34] if you look now you'll see they already have the basic rule marking everything as 'normal' priority
[12:17:40] https://www.irccloud.com/pastebin/oOdZe4lO/
[12:18:27] to mark certain traffic up/down we just add additional rules here
[12:19:40] neat, did not know that the TOS field was deprecated xd
[12:19:51] (22 years ago)
[12:19:57] ugh... TOS... DSCP... so confusing!
[12:20:11] DSCP takes the first 6 bits of the old 8-bit TOS field and gives everything new names
[12:20:33] which is fine.... but the hex/decimal representation of all these things then gets confusing!
[12:20:47] hahahaha
[12:20:56] this table is my bible :P
[12:20:57] https://www.tucny.com/Home/dscp-tos
[12:21:09] awesome! thanks!
[12:23:27] that goes into my bookmarks
[12:23:59] yeah same.
[12:24:06] for reference we're gonna use these markings:
[12:24:07] https://wikitech.wikimedia.org/wiki/Quality_of_Service_(Network)#QoS_Classes
[12:24:24] but ultimately the actual numbers aren't that important, what's important is that they are set up the same way everywhere
[12:30:52] 👍
[12:31:30] quick review: https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/44
[12:34:22] dcaro: +1'd
[12:34:30] thanks!
[12:37:21] dcaro: would you please stamp https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/55 ?
[12:39:23] I don't see why it's critical, but sure, is anything on fire?
[12:39:54] no
[12:39:59] why would anything be on fire?
[12:40:29] because you are asking for stamps
[12:40:57] that's the kind of code review that I like the most :-)
[12:41:36] xd, no code reviews essentially
[12:42:54] I don't agree, but also won't debate over it
[12:44:25] to clarify, I do not review the code when you ask for a stamp, I trust you to have a meaningful reason for needing that code merged without review
[12:54:46] another quick review (not stamp) please: https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/103
[12:56:26] well, I'm happy to +1 this trivial change, but I think you should feel free to self-merge this one
[12:56:27] :-P
[12:57:46] we now have maintain-kubeusers randomly renewing a few certs each run
[12:57:48] https://usercontent.irccloud-cdn.com/file/kXK0xoSK/image.png
[12:58:18] the newly renewed certs will have an expiration of 10 days (instead of 1 year)
[12:58:42] we might want to not renew certs that were just renewed
[12:58:49] (even if the selection is random)
[12:59:44] are you monitoring anything specific to see if there are any issues?
[12:59:51] (on the users' side I mean)
[12:59:53] why? certs should be renewable anytime, even multiple times in a row
[13:01:21] I'm not monitoring anything in particular. I've checked that the new certs are valid for the few tools that got them renewed
[13:01:31] manually checked, that is
[13:03:12] it's a pretty critical bit, there are bugs that happen, unexpected events... many things can go wrong in unexpected ways, feels like something we should keep an eye on for a while
[13:10:45] I will do 👍
[13:15:46] does the jobs-api cache the certs in any way?
[13:18:22] does not look like it, that's ok
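To illustrate the "don't renew certs that were just renewed" idea mentioned around 12:58, here is a hypothetical sketch that only treats a cert as due for renewal when it is close to expiry. The real maintain-kubeusers logic may differ; the threshold, file path and renew_certificate() call are placeholders.

```python
from datetime import datetime, timedelta, timezone

from cryptography import x509  # assumes a recent 'cryptography' release

# Illustrative threshold: only renew certs that expire within this window.
RENEW_IF_EXPIRES_WITHIN = timedelta(days=30)

def needs_renewal(pem_bytes: bytes) -> bool:
    """Return True if the cert expires within the renewal window."""
    cert = x509.load_pem_x509_certificate(pem_bytes)
    remaining = cert.not_valid_after_utc - datetime.now(timezone.utc)
    return remaining < RENEW_IF_EXPIRES_WITHIN

# Usage sketch: skip tools whose current cert is still comfortably valid.
# with open("/path/to/tool.crt", "rb") as f:   # placeholder path
#     if needs_renewal(f.read()):
#         renew_certificate(...)               # placeholder for the real renewal call
```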
[13:55:01] * dcaro paged
[13:55:17] toolsdb alerting that it's down, looking
[13:55:39] oh, the alert went away
[13:59:25] hmm... the cluster is ok
[13:59:38] https://www.irccloud.com/pastebin/ABbiOiqH/
[14:02:18] hmmm... prometheus still shows 0 primaries to me
[14:02:21] https://usercontent.irccloud-cdn.com/file/1shSz6pQ/image.png
[14:05:05] hmm, it seems as if tools-db-1 stopped reporting
[14:05:31] yep
[14:05:33] https://www.irccloud.com/pastebin/uTN0Yk36/
[15:43:13] the alert is there already, it's just that the grep does not show the next line 🤦‍♂️
[16:00:29] just saw another LDAP server error on maintain-kubeusers
[16:00:32] https://usercontent.irccloud-cdn.com/file/mx95frU1/image.png
[16:00:52] (it is harmless at this point, the pod just gets restarted)
[16:04:27] * arturo off
[17:51:45] * dcaro off
[18:26:48] Raymond_Ndibe: I'm going to do some maintenance on the toolsbeta harbor db, will likely cause some lima-kilo downtime.
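For the kind of check done above around 14:02 ("prometheus still shows 0 primaries" / "tools-db-1 stopped reporting"), here is a hypothetical sketch of asking the Prometheus HTTP API which targets of a job are still reporting. The server URL and job label are placeholders, not the real tools/toolsdb names.

```python
import requests

PROMETHEUS = "http://prometheus.example.org"  # placeholder server
QUERY = 'up{job="mariadb"}'                   # placeholder job label

# /api/v1/query returns an instant vector; 'up' is 1 for targets whose last
# scrape succeeded and 0 for targets whose last scrape failed.
resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "?")
    value = result["value"][1]  # [timestamp, value-as-string]
    print(f"{instance}: {'up' if value == '1' else 'DOWN'}")
```

Note that a target that disappeared from the job entirely will not show up in the result at all, which is itself a useful signal that a host silently stopped reporting.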