[15:34:11] I'd love some feedback on whether varnish-mmap-count is still a metric we want to monitor. I'd definitely say "yes" since it seems there've been incidents in the past regarding that... but AFAICT there's no way to get vm.max_map_count with Prometheus (cc vgutierrez)
[15:34:13] https://phabricator.wikimedia.org/T300723#8019270
[15:35:28] sukhe, kwakuofori: I'd also love feedback on https://phabricator.wikimedia.org/T310303 :)
[15:38:12] brett: just confirming, which three are we talking about?
[15:38:27] I was looking at the action logs for pontoon debian-10.0-buster
[15:39:00] brett: per https://phabricator.wikimedia.org/T242417#5822657 I think so
[15:39:14] seems like ema created it, I am not sure if vgutierrez is using it or not, so I am happy going with Filippo's recommendation fwiw
[15:39:19] sukhe: pontoon.traffic.eqiad1.wikimedia.cloud, cptext.traffic.eqiad1.wikimedia.cloud, and cpupload.traffic.eqiad1.wikimedia.cloud
[15:41:34] vgutierrez: Not sure what you mean; are you saying that you think it's worth continuing to monitor?
[15:43:24] yes.. we got an icinga check in place that depends on that metric
[15:43:41] vgutierrez: The problem is that we're moving away from Icinga checks and to Prometheus.
[15:43:50] slash alertmanager
[15:44:21] Presently the Icinga check gets the sysctl value directly but Prometheus does not have that same luxury
[15:44:26] it shouldn't be a problem considering that the metric is still on prometheus
[15:44:49] The max value is not in prometheus
[15:44:54] right
[15:45:11] got it
[15:47:50] brett: worst case scenario... providing something similar to prometheus::node_varnishd_mmap_count to fetch vm.max_map_count should be feasible, right?
[15:49:31] AFAICT there is no interface to get the sysctl value.
[15:49:58] There's a link in that ticket which suggests that *eventually* there will be, but not in the near future
[15:50:49] sure, that's why I'm suggesting the prometheus::node_varnishd_mmap_count approach
[15:51:19] varnishd_mmap_count is provided by a simple bash script that runs MMAP_COUNT=$(/usr/bin/wc -l < /proc/${VPID}/maps)
[15:51:34] ah, I see.
[15:51:52] So you would say that this alarm is important enough to pursue?
[15:52:36] I think it's worth tracking it, yes
[15:53:34] I'll create a new child ticket to track that work. Thanks for the help!
[16:00:20] Traffic, SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q4), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (BCornwall) Spoke with @Vgutierrez on IRC and they confirmed that the mmap maximum is worth monitoring...
[17:45:32] vgutierrez: Are you using pontoon.traffic.eqiad1.wikimedia.cloud, cptext.traffic.eqiad1.wikimedia.cloud, and cpupload.traffic.eqiad1.wikimedia.cloud?
[17:45:44] It seems like we're on course for deleting those three instances
[17:46:43] `last` suggests that only ema ever used it
[17:59:24] Traffic, DNS, SRE, WMF-Legal, and 3 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (LSobanski) @Varnent could we get a clarification of the timeline for this request? The description says end of this month and yo...
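For context on the prometheus::node_varnishd_mmap_count approach floated above, here is a minimal sketch of a node_exporter textfile-collector script that exports both the varnishd mapping count (the same `wc -l < /proc/${VPID}/maps` idea quoted in the chat) and the vm.max_map_count sysctl. The output path, metric names, and the `vcache` user are illustrative assumptions, not the production Puppet module's actual values.

```bash
#!/bin/bash
# Sketch only: export varnishd mmap usage plus the vm.max_map_count ceiling
# via the node_exporter textfile collector. Paths/names are assumptions.
set -euo pipefail

OUTFILE="/var/lib/prometheus/node.d/vm_max_map_count.prom"  # assumed collector dir

# Current number of mappings held by the varnishd cache child (assumed to run as vcache).
VPID=$(pgrep -u vcache varnishd | head -n1 || true)
if [ -n "${VPID}" ]; then
    MMAP_COUNT=$(/usr/bin/wc -l < "/proc/${VPID}/maps")
else
    MMAP_COUNT=0
fi

# The sysctl maximum that node_exporter does not expose on its own.
MAX_MAP_COUNT=$(cat /proc/sys/vm/max_map_count)

# Write atomically so node_exporter never scrapes a half-written file.
cat > "${OUTFILE}.$$" <<EOF
# HELP node_varnishd_mmap_count Number of memory mappings held by varnishd.
# TYPE node_varnishd_mmap_count gauge
node_varnishd_mmap_count ${MMAP_COUNT}
# HELP node_vm_max_map_count Value of the vm.max_map_count sysctl.
# TYPE node_vm_max_map_count gauge
node_vm_max_map_count ${MAX_MAP_COUNT}
EOF
mv "${OUTFILE}.$$" "${OUTFILE}"
```

With both series in Prometheus, an Alertmanager rule comparable to the old Icinga check could be expressed as something like `node_varnishd_mmap_count / node_vm_max_map_count > 0.9` (threshold purely illustrative).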
[18:28:09] Traffic, SRE, SRE Observability (FY2021/2022-Q4): Create vm.max_map_count metrics for Prometheus - https://phabricator.wikimedia.org/T311445 (BCornwall)
[19:35:54] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (RobH) If it is just 'depool' from command line and stop puppet, I can handle so you don't need to take it down in advance of the work, just lemme know!
[19:51:03] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (ssingh) >>! In T311264#8030866, @RobH wrote: > If it is just 'depool' from command line+stop puppet+icinga maint mode, I can handle so you don't need to take it down in advance of the work,...
[20:01:15] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (BTullis) Hello, in case it's helpful, I fixed one of these the other day in eqiad by using `ipmitool` to do a cold reset of the BMC. {T311042}
[20:10:03] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (RobH) >>! In T311264#8030961, @BTullis wrote: > Hello, in case it's helpful, I fixed one of these the other day in eqiad by using `ipmitool mc reset cold` to do a cold reset of the BMC. > {...
[20:28:12] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (RobH) a: ssingh updated idrac from 2.50.x to 2.81.81.81, A00 @ssingh, this should clear up our errors, i can login to the idrac and system is powered back up. once this is back onlin...
[21:45:28] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (BTullis) I've happened upon this tracking ticket where many similar SSH related mgmt checks are mentioned: {T304289} Mentioning it here for cross-referencing purposes.
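As a footnote to the cp5012.mgmt thread, a sketch of the BMC cold reset that BTullis describes: `ipmitool mc reset cold` is the command quoted in the log, while the remote lanplus invocation, the MGMT_HOST placeholder, and the credential handling below are illustrative assumptions rather than the exact procedure used on that host.

```bash
# Run locally on the affected host (requires the OpenIPMI kernel drivers):
sudo ipmitool mc reset cold

# Or remotely over IPMI-over-LAN against the management interface.
# MGMT_HOST and the root username are placeholders; -E reads the password
# from the IPMI_PASSWORD environment variable instead of the command line.
ipmitool -I lanplus -H "$MGMT_HOST" -U root -E mc reset cold
```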