[03:42:18] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Response: https://lists.apache.org/thread/dont796lp84vfqnovolryw0y0470mqsv > The ap...
[03:46:59] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) @akosiaris I don't love it, but I added a `engine: flink` label to the pods, and am...
[08:03:15] hello folks
[08:03:40] I noticed that kubernetes2007 shows an alarm for pybal (k8s-ingress), and `kubectl get nodes` shows it as NotReady
[08:05:08] the only thing that I see are some OOM kills for "node" in the dmesg
[08:05:44] doesn't seem to be the kubelet
[08:06:43] ah no ok nevermind the oom kills are cgroup related, probably pods, I see them in other places as well
[08:09:21] and in describe node I see
[08:09:27] kubelet stopped posting node status.
[08:10:37] (restarted it)
[08:11:28] ah nice now Ready
[08:12:32] it was marked down the 17th
[08:12:52] so I suspect it was a downfall of the switch issue
[08:17:15] now I only see things like "Pybal backend kubernetes1009:0 is down (thumbor_8800)" (3/4 of them)
[08:19:22] ah but the nodes are inactive ok
[08:21:52] (and the alert/metric doesn't know about its status)
[12:04:24] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc2040.codfw.wmnet with OS bullseye
[12:38:48] serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc2040.codfw.wmnet with OS bullseye completed: - mc2040 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa...
[14:18:19] serviceops, Arc-Lamp, Performance-Team (Radar), SRE Observability (FY2022/2023-Q3), User-fgiunchedi: Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (lmata) thank you @fgiunchedi !
[15:33:52] serviceops, Maps: Upgrade maps servers to bullseye - https://phabricator.wikimedia.org/T327513 (hnowlan)
[16:33:42] akosiaris: any better ideas than 'engine'? i don't like it.
[16:33:50] k8s_client_name ?
[16:34:00] client_name?
[16:34:02] agent?
[16:34:21] a boolean label?
[16:34:36] k8s_api_allowed: true
[16:34:36] ?
[16:34:50] k8s_service_egress: true
[16:36:16] "k8s_api_allowed: true" sounds good
[16:36:26] For your metadata.label right ?
[17:03:14] yes
[17:03:32] claime: just seems annoying to make a boolean label
[17:05:44] but, i like it better than engine: flink so i'll go with that for now.
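(Editor's sketch of the label discussed above, not taken from the actual chart. It assumes the label is consumed by a standard Kubernetes NetworkPolicy that permits egress to the API server; the log does not say how the label is used. Every name, image, address and port below is a placeholder. One thing that is certain: Kubernetes label values must be strings, so the "boolean" has to be written as a quoted `"true"`.)

```yaml
# Hypothetical pod fragment: only the label key comes from the discussion.
apiVersion: v1
kind: Pod
metadata:
  name: flink-example                 # placeholder name
  labels:
    k8s_api_allowed: "true"           # label values are strings, hence the quotes
spec:
  containers:
    - name: main                      # placeholder container name
      image: example.invalid/flink:placeholder   # placeholder image
---
# Assumed consumer of the label: a NetworkPolicy selecting the labelled pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-k8s-api-egress          # placeholder name
spec:
  podSelector:
    matchLabels:
      k8s_api_allowed: "true"         # matches only pods carrying the label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 203.0.113.10/32     # placeholder API server address
      ports:
        - protocol: TCP
          port: 6443                  # placeholder API server port
```

(A selector on a generic capability label such as this one can be reused by any workload that needs the same access, which is the advantage over an application-specific label like `engine: flink`.)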
[17:09:20] serviceops, SRE, Thumbor, Thumbor Migration, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (hnowlan)
[17:09:45] serviceops, SRE, Thumbor, Thumbor Migration, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (hnowlan)
[18:13:49] fyi, we are removing racktables entirely now. started decom :)
[19:27:11] serviceops, Data-Engineering, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > 'engine' is the best I came up with, please bikeshed away (if you care?) :) I cha...