[07:53:45] serviceops, Prod-Kubernetes: kubernetes1014 unreachable - https://phabricator.wikimedia.org/T301099 (JMeybohm) p:Triage→High
[08:00:55] serviceops, Prod-Kubernetes: kubernetes1014 unreachable - https://phabricator.wikimedia.org/T301099 (JMeybohm)
[08:32:07] serviceops, Prod-Kubernetes: kubernetes1014 unreachable - https://phabricator.wikimedia.org/T301099 (JMeybohm) Nothing suspicious in kernel or syslog apart from the fact that logging stops with some random garbage at around 2022-02-06 16:31:12Z
[08:50:35] serviceops, Prod-Kubernetes: kubernetes1014 unreachable - https://phabricator.wikimedia.org/T301099 (JMeybohm)
[08:51:10] serviceops, Prod-Kubernetes: kubernetes1014 unreachable - https://phabricator.wikimedia.org/T301099 (JMeybohm) I removed the downtime but did not yet uncordon the node
[09:11:00] serviceops, Prod-Kubernetes: kubernetes1014 unreachable - https://phabricator.wikimedia.org/T301099 (JMeybohm) Open→Resolved a:JMeybohm As there were no visible errors and the node seemed fine after reboot, I'll resolve this for now.
[11:40:38] FYI, already spoke about it with Janis, I'll revert the k8s eqiad etcd nodes away from DRBD disk storage in a bit (and will also do the same for the codfw nodes, although those were on DRBD also before the Ganeti migration, but makes sense for them to be in sync)
[11:41:07] <_joe_> +1
[11:43:08] (forgot to mention this is about the staging nodes only, the main ones are already back on plain disk storage)
[14:13:25] hello folks, https://gerrit.wikimedia.org/r/c/operations/puppet/+/759716 is for the new partman recipe for k8s/overlay
[14:13:36] if there is consensus I'll merge and test it on a new ml-serve node
[14:13:46] (just the partitioning, to see if it works or not etc..)
[14:18:35] o/
[14:20:05] I'm investigating a failure we had this week-end, is there a prometheus metric that records the value you get when running "kubectl get deployment"?
[14:24:09] the flink metrics report that some resource went down (equivalent to one POD) but I can't find a confirmation of this event in k8s metrics
[14:28:51] dcausse: this may give you some info https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&from=now-2d&to=now&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-pod=All
[14:29:31] elukey: thanks, looking
[14:31:29] (bbiab)
[14:47:30] serviceops, SRE, Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (Daimona) Resolved→Open Sorry for reopening this; woul...
[15:27:59] serviceops, Prod-Kubernetes, Pybal, Kubernetes: Have PyBal monitor Istio-Ingressgateway health - https://phabricator.wikimedia.org/T301137 (JMeybohm)
[15:28:09] serviceops, Prod-Kubernetes, Pybal, Kubernetes: Have PyBal monitor Istio-Ingressgateway health - https://phabricator.wikimedia.org/T301137 (JMeybohm) p:Triage→Low
[15:32:06] serviceops, Release-Engineering-Team, Scap: Deploy Scap version 4.3.0 - https://phabricator.wikimedia.org/T300804 (JMeybohm) a:JMeybohm
[15:44:55] serviceops, Release-Engineering-Team, Scap: Deploy Scap version 4.3.0 - https://phabricator.wikimedia.org/T300804 (JMeybohm) `scap pull` and restbase canary deploy LGTM
[16:19:37] serviceops, Infrastructure-Foundations, SRE-tools: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (Joe)
[16:23:29] serviceops, SRE, envoy: The TLS proxy configuration in deployment-charts allows invalid listeners - https://phabricator.wikimedia.org/T291959 (Joe) Open→Resolved
[16:24:41] serviceops, Machine-Learning-Team, Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) Tested the recipe on ml-serve2005, and it looks good. We have 2x480GBs SSDs + 2x2TB HDDs (that we don't currently use), and the recipe u...
[18:03:46] serviceops, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey)
[18:04:01] serviceops, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) p:Triage→Medium a:elukey
[18:07:10] serviceops, DC-Ops, SRE, ops-codfw: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (Papaul)
[18:43:42] serviceops, DC-Ops, SRE, ops-codfw: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (Papaul)
[19:17:58] serviceops, DC-Ops, SRE, ops-codfw: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (Papaul)
[20:18:58] serviceops, DC-Ops, ops-eqiad, GitLab (Infrastructure): (Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (RobH)
[20:19:20] serviceops, DC-Ops, ops-eqiad, GitLab (Infrastructure): (Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (RobH)
[21:23:14] serviceops, Phabricator, Release-Engineering-Team: move "releng-secrets" git repo away from Phabricator - https://phabricator.wikimedia.org/T301170 (Dzahn) If we could establish that pwstore already replaced releng-secrets then we might be able to simply delete it. (before we even get into newer pla...
[21:26:40] serviceops, Phabricator, Release-Engineering-Team: move "releng-secrets" git repo away from Phabricator - https://phabricator.wikimedia.org/T301170 (thcipriani) FWIW, that repo is a pwstore; we store gpg encrypted passwords there. And it's also private (belt and suspenders). @hashar (as a low priori...
[21:54:40] serviceops, Phabricator, Release-Engineering-Team: move "releng-secrets" git repo away from Phabricator - https://phabricator.wikimedia.org/T301170 (Dzahn) >>! In T301170#7691190, @thcipriani wrote: > Does phabricator not have a way to authenticate over https for git repos? Only git-ssh? You can set...
[23:21:03] serviceops, DC-Ops, SRE, ops-codfw: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (Papaul)