[08:54:57] hi folks, asking here first but feel free to redirect me as needed. I'm reviewing the pages we have in icinga in T305847 and came across "check_procs" for zookeeper, basically check whether the process is running and page if not. is that sth we should be porting to alertmanager/prometheus as-is ? (i.e. alerting on unit status) I'm asking because IMHO nowadays we can alert on higher level (and
[08:55:03] thus higher signal) metrics, such as quorum size, what do you think ?
[09:23:40] serviceops, Infrastructure-Foundations, Scap, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (jnuche)
[09:44:40] serviceops, Release-Engineering-Team, Patch-For-Review: PendingDeprecationWarning on update_version.py - https://phabricator.wikimedia.org/T310133 (JMeybohm)
[09:45:45] serviceops, Scap, Release-Engineering-Team (Radar): Deploy Scap version 4.7.1 - https://phabricator.wikimedia.org/T306998 (JMeybohm) Open→Resolved a:JMeybohm
[09:46:48] serviceops, Machine-Learning-Team: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (JMeybohm)
[09:46:52] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[11:39:23] serviceops, GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (Jelto)
[11:39:42] serviceops, GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (Jelto) p:Triage→Medium
[13:34:15] serviceops, Prod-Kubernetes, Kubernetes: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (elukey)
[15:18:08] godog: I'm not sure but I think zookeeper is only used for analytics systems -- 302 to that team and see what they tell you
[15:25:54] zookeeper is also used on the main conf* hosts
[15:26:40] thank you moritzm rzl !
[15:26:51] yeah IIRC kafka also requires zk
[15:28:22] I chatted with L.uca about that and he said we could probably drop the check_procs if we have a metric showing active/up zookeeper nodes
[15:28:28] same for kafka I guess
[17:24:42] hm. godog i think that makes sense, although it is nice to know which node's proc is down when it happens.
[19:16:43] serviceops, GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (Dzahn) > moving gitlab1001.wikimedia.org to gitlab1001.eqiad.wmnet This is possible but would require reaching out to dcops to physically connect it to a different netw...
[19:23:19] serviceops, GitLab (Infrastructure): Reduce usage of public IPv4 addresses on GitLab hosts - https://phabricator.wikimedia.org/T310265 (Dzahn) First and foremost though, the reason why gitlab has all public IPs is because we were trying to emulate the gerrit setup. And gerrit has public IPs and is not be...
[19:46:46] serviceops, Continuous-Integration-Infrastructure, SRE, Patch-For-Review, Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) @Krinkle Yep, that summary sounds right to me. That's wha...
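
Note on the zookeeper alerting thread above (check_procs vs. a quorum-level metric): a rough sketch of how the two wishes could be combined, i.e. page only when quorum is actually lost, while still reporting which node is down. This is not an Alertmanager or Prometheus rule and not anything the team has deployed; `quorum_status`, `QuorumStatus`, the severity names and the `zk1`..`zk3` hostnames are all hypothetical, and the per-node up/down input is assumed to come from whatever metric ends up showing active zookeeper nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumStatus:
    severity: str          # "ok", "warn" or "page" (hypothetical severity names)
    up_count: int          # nodes currently reported as up
    needed: int            # strict majority of the ensemble
    down_nodes: list[str]  # which nodes are down, so the alert stays actionable


def quorum_status(node_up: dict[str, bool]) -> QuorumStatus:
    """Classify an ensemble from per-node up/down flags.

    A ZooKeeper ensemble needs a strict majority of its members to serve
    requests, so the majority threshold is len(ensemble) // 2 + 1.
    """
    if not node_up:
        raise ValueError("ensemble must contain at least one node")

    needed = len(node_up) // 2 + 1
    down_nodes = sorted(name for name, up in node_up.items() if not up)
    up_count = len(node_up) - len(down_nodes)

    if up_count < needed:
        severity = "page"   # quorum lost: the service itself is impacted
    elif down_nodes:
        severity = "warn"   # degraded but still serving: no page, keep the node names
    else:
        severity = "ok"

    return QuorumStatus(severity, up_count, needed, down_nodes)


if __name__ == "__main__":
    # Hypothetical three-node ensemble with one process down.
    print(quorum_status({"zk1": True, "zk2": False, "zk3": True}))
```

With three nodes, one process down gives a non-paging "warn" that still names the node; two down drops below the majority of 2 and would page.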