[15:34:11] I'd love some feedback on whether varnish-mmap-count is still a metric we want to monitor. I'd definitely say "yes" since it seems there've been incidents in the past regarding that... but AFAICT there's no way to get vm.max_map_count with Prometheus (cc vgutierrez)
[15:34:13] https://phabricator.wikimedia.org/T300723#8019270
[15:35:28] sukhe, kwakuofori: I'd also love feedback on https://phabricator.wikimedia.org/T310303 :)
[15:38:12] brett: just confirming, which three are we talking about?
[15:38:27] I was looking at the action logs for pontoon debian-10.0-buster
[15:39:00] brett: per https://phabricator.wikimedia.org/T242417#5822657 I think so
[15:39:14] seems like ema created it, I am not sure if vgutierrez is using it or not, so I am happy going with Filippo's recommendation fwiw
[15:39:19] sukhe: pontoon.traffic.eqiad1.wikimedia.cloud, cptext.traffic.eqiad1.wikimedia.cloud, and cpupload.traffic.eqiad1.wikimedia.cloud
[15:41:34] vgutierrez: Not sure what you mean; are you saying that you think it's worth continuing to monitor?
[15:43:24] yes.. we got an icinga check in place that depends on that metric
[15:43:41] vgutierrez: The problem is that we're moving away from Icinga checks and to Prometheus.
[15:43:50] slash alertmanager
[15:44:21] Presently the Icinga check gets the sysctl value directly but Prometheus does not have that same luxury
[15:44:26] it shouldn't be a problem considering that the metric is still on prometheus
[15:44:49] The max value is not in prometheus
[15:44:54] right
[15:45:11] got it
[15:47:50] brett: worst case scenario... providing something similar to prometheus::node_varnishd_mmap_count to fetch vm.max_map_count should be feasible, right?
[15:49:31] AFAICT there is no interface to get the sysctl value.
[15:49:58] There's a link in that ticket which suggests that *eventually* there will be, but not in the near future
[15:50:49] sure, that's why I'm suggesting the prometheus::node_varnishd_mmap_count approach
[15:51:19] varnishd_mmap_count is provided by a simple bash script that runs MMAP_COUNT=$(/usr/bin/wc -l < /proc/${VPID}/maps)
[15:51:34] ah, I see.
[15:51:52] So you would say that this alarm is important enough to pursue?
[15:52:36] I think it's worth tracking it, yes
[15:53:34] I'll create a new child ticket to track that work. Thanks for the help!
[16:00:20] Traffic, SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q4), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (BCornwall) Spoke with @Vgutierrez on IRC and they confirmed that the mmap maximum is worth monitoring...
[17:45:32] vgutierrez: Are you using pontoon.traffic.eqiad1.wikimedia.cloud, cptext.traffic.eqiad1.wikimedia.cloud, and cpupload.traffic.eqiad1.wikimedia.cloud?
[17:45:44] It seems like we're on course for deleting those three instances
[17:46:43] `last` suggests that only ema ever used it
[17:59:24] Traffic, DNS, SRE, WMF-Legal, and 3 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (LSobanski) @Varnent could we get a clarification of the timeline for this request? The description says end of this month and yo...
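For context on the prometheus::node_varnishd_mmap_count approach floated above, here is a minimal sketch of a node_exporter textfile-collector script that exports both the varnishd mapping count (the same `wc -l < /proc/${VPID}/maps` idea quoted in the chat) and the vm.max_map_count sysctl. The output path, metric names, and the `vcache` user are illustrative assumptions, not the production Puppet module's actual values.

```bash
#!/bin/bash
# Sketch only: export varnishd mmap usage plus the vm.max_map_count ceiling
# via the node_exporter textfile collector. Paths/names are assumptions.
set -euo pipefail

OUTFILE="/var/lib/prometheus/node.d/vm_max_map_count.prom"  # assumed collector dir

# Current number of mappings held by the varnishd cache child (assumed to run as vcache).
VPID=$(pgrep -u vcache varnishd | head -n1 || true)
if [ -n "${VPID}" ]; then
    MMAP_COUNT=$(/usr/bin/wc -l < "/proc/${VPID}/maps")
else
    MMAP_COUNT=0
fi

# The sysctl maximum that node_exporter does not expose on its own.
MAX_MAP_COUNT=$(cat /proc/sys/vm/max_map_count)

# Write atomically so node_exporter never scrapes a half-written file.
cat > "${OUTFILE}.$$" <<EOF
# HELP node_varnishd_mmap_count Number of memory mappings held by varnishd.
# TYPE node_varnishd_mmap_count gauge
node_varnishd_mmap_count ${MMAP_COUNT}
# HELP node_vm_max_map_count Value of the vm.max_map_count sysctl.
# TYPE node_vm_max_map_count gauge
node_vm_max_map_count ${MAX_MAP_COUNT}
EOF
mv "${OUTFILE}.$$" "${OUTFILE}"
```

With both series in Prometheus, an Alertmanager rule comparable to the old Icinga check could be expressed as something like `node_varnishd_mmap_count / node_vm_max_map_count > 0.9` (threshold purely illustrative).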
[18:28:09] Traffic, SRE, SRE Observability (FY2021/2022-Q4): Create vm.max_map_count metrics for Prometheus - https://phabricator.wikimedia.org/T311445 (BCornwall)
[19:35:54] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (RobH) If it is just 'depool' from command line and stop puppet, I can handle so you don't need to take it down in advance of the work, just lemme know!
[19:51:03] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (ssingh) >>! In T311264#8030866, @RobH wrote: > If it is just 'depool' from command line+stop puppet+icinga maint mode, I can handle so you don't need to take it down in advance of the work,...
[20:01:15] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (BTullis) Hello, in case it's helpful, I fixed one of these the other day in eqiad by using `ipmitool` to do a cold reset of the BMC. {T311042}
[20:10:03] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (RobH) >>! In T311264#8030961, @BTullis wrote: > Hello, in case it's helpful, I fixed one of these the other day in eqiad by using `ipmitool mc reset cold` to do a cold reset of the BMC. > {...
[20:28:12] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (RobH) a: ssingh updated idrac from 2.50.x to 2.81.81.81, A00 @ssingh, this should clear up our errors, i can login to the idrac and system is powered back up. once this is back onlin...
[21:45:28] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (BTullis) I've happened upon this tracking ticket where many similar SSH related mgmt checks are mentioned: {T304289} Mentioning it here for cross-referencing purposes.
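As a footnote to the cp5012.mgmt thread, a sketch of the BMC cold reset that BTullis describes: `ipmitool mc reset cold` is the command quoted in the log, while the remote lanplus invocation, the MGMT_HOST placeholder, and the credential handling below are illustrative assumptions rather than the exact procedure used on that host.

```bash
# Run locally on the affected host (requires the OpenIPMI kernel drivers):
sudo ipmitool mc reset cold

# Or remotely over IPMI-over-LAN against the management interface.
# MGMT_HOST and the root username are placeholders; -E reads the password
# from the IPMI_PASSWORD environment variable instead of the command line.
ipmitool -I lanplus -H "$MGMT_HOST" -U root -E mc reset cold
```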