[10:27:47] godog: you working on thanos?
[10:28:10] jbond: yes, see -sre
[10:28:51] thanks was afk when alert came in so just catching up
[10:29:05] sure np, sorry for the mispage
[10:56:12] I've got an expired certificate error message coming from rsyslogd on centrallog1002. Is this expected?
[10:56:16] Nov 15 10:55:00 an-launcher1002 rsyslogd: invalid cert info: peer provided 3 certificate(s). Certificate 1 info: certificate valid from Sun Nov 12 12:37:08 2023 to Sat Nov 11 12:37:08 2028; Certificate public key: RSA; DN: CN=centrallog2002.codfw.wmnet; Issuer DN: C=US,L=San Francisco,O=Wikimedia Foundation\, Inc,OU=SRE Foundations,CN=puppet_rsa; SAN:DNSname: centrallog2002.codfw.wmnet; [v8.1901.0]
[11:10:54] btullis: yes known, being worked on in https://phabricator.wikimedia.org/T351181
[11:11:12] Ack, great.
[12:07:59] (PuppetFailure) firing: Puppet has failed on webperf1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:17:59] (PuppetFailure) resolved: Puppet has failed on webperf1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:59:30] (SLOMetricAbsent) firing: - https://alerts.monitoring.wmflabs.org/?q=alertname%3DSLOMetricAbsent
[14:24:26] btullis: would you mind checking again re: an-launcher1002 ? things should be back to normal after a puppet run
[14:26:10] godog: Confirmed, looks good now. Thanks.
[14:26:18] cheers
[16:16:14] herron: o/ (if you have a min) - did you do anything yesterday with pyrra daemons to pick up the multiple lift wing revscoring pages?
[16:16:42] because I changed the config this morning, and I triggered a manual reload of pyrra-filesystem, but I keep seeing only one page
[16:16:48] (basically the aggregate)
[16:17:46] elukey: no I didn't do anything with them since we deployed yesterday
[16:18:16] but it could be that we need to kick thanos rule
[16:18:44] I checked https://thanos.wikimedia.org/rules#liftwing-requests-revscoring and it still shows the old ones
[16:18:49] maybe it didn't like the change in metric
[16:19:32] I just issued a reload for thanos rule, yeah I think thanos rule is not detecting the output rules when they are placed/changed by pyrra
[16:20:25] ah ok shall I do it?
[16:21:18] (done, seemed safe enough, on titan1001)
[16:21:27] thanks, yeah it should be fine
[16:22:20] yeah need to figure out an automatic fix since pyrra tries to bounce prometheus instead of issuing a thanos rule reload
[16:22:32] and afaict that's not configurable yet
[16:24:11] yep I see the new rules!
[16:25:56] perfect, I'll recheck tomorrow to see if the dashboards are more sound :)
[16:28:49] ok! sounds good
[16:47:27] (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[16:52:27] (PrometheusRuleEvaluationFailures) resolved: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[16:59:30] (SLOMetricAbsent) firing: - https://alerts.monitoring.wmflabs.org/?q=alertname%3DSLOMetricAbsent
[20:56:27] (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[20:59:30] (SLOMetricAbsent) firing: - https://alerts.monitoring.wmflabs.org/?q=alertname%3DSLOMetricAbsent
[21:01:27] (PrometheusRuleEvaluationFailures) resolved: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[21:44:27] (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[21:49:27] (PrometheusRuleEvaluationFailures) resolved: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
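
(editor's note, re the 16:22 "need to figure out an automatic fix" thread) a rough sketch of one way that could look, not what was actually deployed: fingerprint the rule files pyrra writes and, when they change, POST to thanos rule's `/-/reload` HTTP endpoint. The rule directory, the `*.yaml` pattern, the reload URL and all function names below are assumptions for illustration only.

```python
import hashlib
import urllib.request
from pathlib import Path


def fingerprint_rules(rule_dir, pattern="*.yaml"):
    """Return one hash covering the names and contents of all rule files."""
    digest = hashlib.sha256()
    for path in sorted(Path(rule_dir).glob(pattern)):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def request_reload(reload_url, timeout=10):
    """POST to the reload endpoint (thanos rule also reloads on SIGHUP)."""
    req = urllib.request.Request(reload_url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def reload_if_changed(rule_dir, last_fingerprint, reload_url, reload=request_reload):
    """Trigger a reload only when the rule files differ from the last check.

    Returns the current fingerprint so the caller can pass it in next time.
    """
    current = fingerprint_rules(rule_dir)
    if current != last_fingerprint:
        reload(reload_url)
    return current


# Hypothetical usage from a timer or loop (path and URL are placeholders):
#   seen = fingerprint_rules("/path/to/pyrra/output")
#   seen = reload_if_changed("/path/to/pyrra/output", seen,
#                            "http://localhost:PORT/-/reload")
```

Hashing contents rather than comparing mtimes avoids spurious reloads when pyrra rewrites a file without changing it; the `reload` argument is injectable so the check can be exercised without a running thanos rule.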