[00:05:45] serviceops, SRE, Wikimedia-Etherpad, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) T287348#7699428
[06:41:16] serviceops, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) Packages uploaded to Bullseye. What I did: * reprepro seems not supporting the copy of debs from one component in a distr...
[11:31:21] serviceops, SRE, GitLab (Infrastructure): gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (Jelto) Resolved→Open It seems gitlab-runner metrics exporter for trusted runner have issues now. The auto-detected address of these runners changed to IPv6 as well and exporter...
[13:23:21] hello folks
[13:23:53] ml-serve2006 is running with bullseye and overlayfs, but kubelet doesn't come up due to missing cgroup cpu
[13:24:05] afaics the support for cgroupsv2 was added in k8s 1.19
[13:24:36] and the old stuff is no more with bullseye?
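A note on the "missing cgroup cpu" symptom above: Debian bullseye boots with the unified (v2) cgroup hierarchy by default, so the per-controller v1 mountpoints that an older kubelet looks for are absent. A minimal sketch of how one might check which mode a host is in, assuming the standard /sys/fs/cgroup mount; the function name `cgroup_mode` is hypothetical and not something used in the log:

```python
from pathlib import Path


def cgroup_mode(root: str = "/sys/fs/cgroup") -> str:
    """Return 'v2' (unified), 'v1' (legacy or hybrid), or 'unknown'.

    `root` defaults to the usual mountpoint; it is a parameter only so the
    check can be exercised against a scratch directory.
    """
    base = Path(root)
    # The root of a unified (v2) hierarchy exposes a cgroup.controllers file.
    if (base / "cgroup.controllers").is_file():
        return "v2"
    # With v1 (legacy or hybrid) each controller has its own directory, e.g. cpu.
    if (base / "cpu").exists() or (base / "cpu,cpuacct").exists():
        return "v1"
    return "unknown"


if __name__ == "__main__":
    print(cgroup_mode())
```

On a host booted with `systemd.unified_cgroup_hierarchy=0` this sketch should report `v1`, which matches the layout the kubelet in the log expects.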
[13:25:19] in theory if we set systemd.unified_cgroup_hierarchy = 0 we should get back the old cgroup settings
[13:25:25] docker should be pleased as well in theory
[13:25:51] but now I am wondering if it makes sense to move to bullseye or not
[13:26:37] * jayme recalls https://gerrit.wikimedia.org/r/c/operations/puppet/+/524186
[13:27:44] sorry, I'm not really into that topic - just poking around here
[13:28:50] I was reading https://github.com/kubernetes/kubernetes/issues/90710#issuecomment-624129922
[13:29:16] but maybe I am missing something
[13:31:24] I was reading https://github.com/kubernetes/enhancements/issues/2254
[13:32:54] added a note to the task
[13:32:55] serviceops, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) ml-serve2006 is running with bullseye and overlayfs, but kubelet doesn't start: ` failed to run Kubelet: mountpoint for c...
[13:33:46] yeah alpha release 1.18
[13:35:50] do you have any particular concern against "systemd.unified_cgroup_hierarchy = 0" ?
[13:39:57] jayme: not really, in theory docker on bullseye should support both, I can try to quickly reboot and see how it goes
[13:42:57] ack. I think it would still be worth it to go bullseye. We could in theory switch to cgroupv2 without reimaging then (after k8s upgrade)
[13:43:13] <_joe_> yeah no software supports only cgroups v2 until there is a supported RHEL without it
[13:43:32] <_joe_> jayme: ++
[13:48:53] all serviceops people against the poor ml engineer, too easy
[13:48:55] tsk
[14:01:18] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Provide a convenient way to connect to services in kubernetes staging clusters - https://phabricator.wikimedia.org/T300740 (JMeybohm)
[14:04:24] looks like it works
[14:06:38] jayme: if you have 1 min https://gerrit.wikimedia.org/r/c/operations/homer/public/+/761619
[14:07:10] absolutely
[14:07:37] <3
[14:17:53] /etc/calico/confd/config/bird6.cfg: No such file or directory
[14:17:56] does it ring a bell?
[14:18:10] maybe the host is not pooled
[14:20:28] pebcak of course
[14:27:21] node up and running :)
[14:28:08] I'll keep testing it for some days, and possibly add the other two nodes to the ml-codfw cluster as bullseye
[14:28:34] <_joe_> elukey: that's great
[14:28:36] the last step is to file a change to set the grub setting to revert cgroupsv2
[14:29:19] once we are confident enough we can schedule the reimage of a kubernetes node
[14:30:07] <_joe_> elukey: about that
[14:30:35] <_joe_> on monday or so, I'd like to test the code I wrote in spicerack for draining a node
[14:30:48] <_joe_> i thought the ML cluster is a good candidate for the test
[14:30:54] definitely yes
[14:31:27] <_joe_> if that works, then reimaging a k8s node just needs a hook to drain a node before the reboot
[14:31:37] <_joe_> in the reimage cookbook I mean
[14:34:53] it would be very handy
[14:36:10] <_joe_> jayme: do you want to take another look at the rakefile refactor or it's ok to merge it?
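For reference, the grub setting discussed above amounts to adding `systemd.unified_cgroup_hierarchy=0` to the kernel command line. A minimal sketch of doing that idempotently on the text of /etc/default/grub (Debian's usual location); the log does not say which GRUB variable was edited, so `GRUB_CMDLINE_LINUX` is an assumption, and the helper name `add_kernel_param` is hypothetical:

```python
import re

# Parameter taken from the discussion; it selects the legacy cgroup layout.
PARAM = "systemd.unified_cgroup_hierarchy=0"


def add_kernel_param(grub_text: str, param: str = PARAM,
                     var: str = "GRUB_CMDLINE_LINUX") -> str:
    """Return grub_text with `param` present in the `var="..."` assignment.

    Assumption: `var` is assigned once, with a double-quoted value.
    """
    pattern = re.compile(
        rf'^({re.escape(var)}=")([^"]*)(")[ \t]*$', re.MULTILINE
    )

    def _append(match: re.Match) -> str:
        args = match.group(2).split()
        if param not in args:  # do nothing if the parameter is already set
            args.append(param)
        return match.group(1) + " ".join(args) + match.group(3)

    new_text, count = pattern.subn(_append, grub_text)
    if count == 0:
        # Variable not defined at all: append a fresh assignment.
        new_text = grub_text.rstrip("\n") + f'\n{var}="{param}"\n'
    return new_text


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX="console=ttyS1"\n'
    print(add_kernel_param(sample), end="")
```

On Debian the edited file only takes effect after regenerating the config with `update-grub` and rebooting. In production this would presumably be managed through puppet rather than edited by hand.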
[14:36:57] serviceops, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) I have manually edited /etc/default/grub and added `systemd.unified_cgroup_hierarchy=0`, after a reboot the kubelet works...
[14:44:46] _joe_: currently looking
[15:10:32] serviceops, SRE, WMDE-Technical-Wishes-Maintenance: Migrate kartotherian production service to node12 - https://phabricator.wikimedia.org/T301475 (awight)
[15:10:40] serviceops, SRE, WMDE-Technical-Wishes-Maintenance: Migrate geoshapes production service to node12 - https://phabricator.wikimedia.org/T301476 (awight)
[15:12:21] serviceops, SRE, Patch-For-Review, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (awight)
[15:14:33] serviceops, SRE, WMDE-Technical-Wishes-Maintenance: Migrate kartotherian production service to node12 - https://phabricator.wikimedia.org/T301475 (awight)
[15:32:20] serviceops, Phabricator, Release-Engineering-Team (Next): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Osnard) Just a remark: @ItSpiderman (@Dsavuljesku) and me are using `tool-cr-grants-team-metasync.git`. It can be moved at any time to either g...
[16:10:51] serviceops, SRE, GitLab (Infrastructure), Patch-For-Review: gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (Jelto) Open→Resolved Metrics of trusted runners are fixed. GitLab seems to automagically parse the runners address from the request/register flow. With IP...
[17:29:02] hello! I'd like to increase the cpu limits for changeprop-jobqueue by a bit because we're seeing concurrency issues and some cpu throttling of jobs - is that any cause for concern? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/761677
[18:44:46] <_joe_> hnowlan: I wouldn't think so, no
[18:45:44] <_joe_> oh there was no limit set for cpus?
[19:13:37] serviceops, Security-Team, Wikidata Query UI, SecTeam-Processed, and 3 others: Wikidata Query UI lets users build links with arbitrary link text and javascript: URL - https://phabricator.wikimedia.org/T297686 (Addshore) Open→Resolved
[20:06:22] serviceops, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (Jclark-ctr)
[23:10:30] serviceops, SRE, Wikimedia-Etherpad, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) Done. This is in use now in production and etherpad1002 does not have the etherpad role anymore.
[23:37:10] serviceops, SRE, Wikimedia-Etherpad, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) What went wrong here at first: When we switched from etherpad1002 to etherpad1003, etherpad itself worked (curl ht...
[23:38:10] serviceops, SRE, Wikimedia-Etherpad, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (Dzahn)
[23:38:46] serviceops, SRE, Wikimedia-Etherpad, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (Dzahn) In progress→Resolved a:Dzahn