[00:47:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:02:38] cwhite, herron: you are also listed as members of the SSO project in cloud vps, are you using these for anything at this point? https://phabricator.wikimedia.org/T367554
[07:04:18] they were probably added for the OIDC proxy work, but if they are currently unused and given that the production IDPs are on Bookworm by now, it's probably more sensible to remove these and recreate them with current Debian/current IDP versions if needed again
[08:47:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:20] godog: Hi! I'm going to deploy the statsd exporter on all mw-on-k8s namespaces. I'm not turning on sending data to them yet, I've prepared a series of patches to turn it on one namespace at a time.
[10:21:49] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1043704 < That's the deployment patch, the next patches in the stack are to turn it on.
[12:47:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:08:49] moritzm: hey, nope no longer using that on my end
[14:49:17] moritzm: I'm also not using it
[15:03:01] herron, cwhite: thanks
[15:12:25] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:25] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:59:08] that's new
[15:59:12] never seen that before
[15:59:21] but we're failing to upload metrics to the status page for some reason
[16:01:34] Metric 'Wiki response time' (id lyfcttm2lhw4) with most recent data at Tue, 18 Jun 2024 15:55:00 +0000 (@1718726100.0)
[16:01:36] this one
[16:01:40] yes
[16:01:43] I know
[16:01:49] Jun 18 15:56:05 alert1001 statograph[832028]: statograph.datasources.PrometheusError: Query returned a single timeseries with labels still attached
[16:01:51] Jun 18 15:56:05 alert1001 statograph[832028]: raise PrometheusError('Query returned a single timeseries with labels still attached')
[16:01:53] Jun 18 15:56:05 alert1001 statograph[832028]: File "/usr/lib/python3/dist-packages/statograph/datasources.py", line 63, in query_range
[16:01:55] Jun 18 15:56:05 alert1001 statograph[832028]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
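For context on the traceback above: statograph's query_range evidently rejects a result that still carries Prometheus labels, since the status-page metric needs exactly one anonymous series. A minimal sketch of that constraint, assuming only the standard Prometheus/Thanos HTTP API; this is not statograph's actual code, and the helper name and window are made up:

```python
# Minimal sketch (not statograph's code): why a status-page query must
# aggregate every label away before its result can be published.
import time

import requests

THANOS_QUERY_RANGE = "https://thanos-query.discovery.wmnet/api/v1/query_range"


def fetch_single_unlabeled_series(query: str, window_s: int = 3600):
    """Run a range query and require exactly one label-free timeseries."""
    end = time.time()
    params = {"query": query, "start": end - window_s, "end": end, "step": "60s"}
    # 20s read timeout matches the value visible in the journal output below
    resp = requests.get(THANOS_QUERY_RANGE, params=params, timeout=20)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one timeseries, got {len(result)}")
    if result[0]["metric"]:
        # e.g. a bare histogram_quantile(...) keeps deployment/pod labels;
        # wrapping it in sum()/avg() (or aggregating "by ()") strips them.
        raise ValueError("query returned a single timeseries with labels still attached")
    return result[0]["values"]  # list of [unix_ts, value] pairs
```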
[16:01:59] claime: I mean, we can just edit the query it uses
[16:02:02] ah so my query is bad
[16:02:10] did you change it?
[16:02:15] yeah
[16:02:17] ah ok
[16:02:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047115
[16:02:25] FIRING: [3x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:43] I guess it doesn't like that it's still got labels
[16:02:46] it does not
[16:02:53] also that is supposed to be the mean latency and that is on purpose :)
[16:02:58] yes and yes
[16:03:24] but I can't guarantee the accuracy of the mean latency as it was calculated with benthos
[16:03:27] ah
[16:03:34] so I swapped to a .5 histogram
[16:03:56] I guess what's important for the statuspage is mw-web, so I can remove the double deployment
[16:04:07] let me try to fix it directly on alert1001 and backport the change
[16:04:13] ok!
[16:08:14] btw if the move to 0.5 quantile causes a big difference from previous data, we can rewrite the old data as well
[16:08:55] the move from totally underloaded bare metal servers to k8s deployments is a big difference
[16:09:03] true
[16:09:13] the metric hasn't been accurate since we passed ~70% traffic
[16:09:37] very true
[16:09:46] ok let's erase the old data on the status page regardless :)
[16:09:53] just, once we have a metric we're happy with
[16:11:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047121
[16:12:08] so now it's the .5 quantile of mw-web and mw-api-ext
[16:12:24] which is the closest I can give to a mean of what was appservers
[16:12:25] FIRING: [3x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:13:29] actually wait, it was just appservers
[16:13:32] so it's mw-web alone
[16:21:12] cdanis: fixed
[16:22:03] claime: do you want to do the honors?
[16:22:13] cdanis: of? resetting the data?
[16:22:20] yes
[16:22:30] `sudo statograph -c /etc/statograph/config.yml erase_metric_data lyfcttm2lhw4`
[16:22:40] it will get backfilled automatically
[16:22:46] ok
[16:22:54] I'll go and !log it though
[16:23:17] <3
[16:23:51] sigh I also need to fix my CSS overrides on the status page that got broken, I forgot about that
[16:23:54] to make the tooltips not awful
[16:24:04] done
[16:24:44] so now there's no metric
[16:24:54] * claime waits for the inevitable phab ticket
[16:25:16] hmm
[16:25:32] Jun 18 16:25:24 alert1001 statograph[1694926]: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='thanos-query.discovery.wmnet', port=443): Read timed out. (read timeout=20)
[16:25:37] I'll bump that up lol
[16:25:42] ah well
[16:25:44] we might want to make this into a recording rule
[16:26:11] yeah it's probably more intensive than the former rule, cardinality is way higher
[16:27:25] FIRING: [3x] SystemdUnitFailed: statograph_post.service on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:28:46] Jun 18 16:28:34 alert1001 systemd[1]: statograph_post.service: Deactivated successfully.
[16:28:48] Jun 18 16:28:34 alert1001 systemd[1]: Finished statograph_post.service - Runs statograph to publish data to statuspage.io.
[16:28:50] with 40 seconds
[16:29:02] gonna patch that by hand on both alert hosts rn
[16:29:11] ack
[16:31:33] Metric 'Wiki response time' (id lyfcttm2lhw4) with most recent data at Mon, 27 May 2024 20:01:00 +0000 (@1716840060.0)
[16:31:35] it's backfilling :)
[16:32:25] FIRING: [2x] SystemdUnitFailed: statograph_post.service on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:33:21] sigh we have a timeout with 40s now lol
[16:33:37] Dog gambit
[16:34:11] I'm bumping to 50
[16:35:08] Let me check how different the benthos metric is
[16:35:42] the benthos metric won't have any history, correct?
[16:39:15] probably not as much yeah
[16:39:31] eh if it has a few days it's fine :)
[16:39:44] it also takes quite a long time to load as well
[16:39:50] Load time: 27523ms
[16:39:54] eesh
[16:40:28] now we're getting 503s from thanos instead of timing out
[16:40:52] yeaaaah
[16:41:19] it's ooming again i bet
[16:41:43] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=thanos&var-instance=All&from=now-3h&to=now
[16:41:54] it's certainly saturating NICs
[16:42:45] cdanis: it took 10C in temp as well
[16:42:47] oops
[16:42:49] haha
[16:42:54] well I'm not sure that *we're* that traffic
[16:43:05] thanos was sick earlier today before we started hammering on it with this new hammer
[16:43:08] the last few runs have been successful fwiw
[16:43:13] cool
[16:43:24] and only taking like ~14s
[16:43:45] "only", it used to take like 3-4s including all startup costs
[16:44:30] yeah ok it's definitely because of that rule
[16:44:42] it has to pull the cardinality for all envoys of the depl
[16:44:51] the benthos query now takes 628ms
[16:44:57] soooo
[16:45:00] yeah that probably merits a recording rule
[16:45:12] I can file a task at least
[16:45:24] I'd say we maybe should use the benthos query, it doesn't look that bad tbh
[16:45:32] 👍
[16:45:45] I'll make a CR, and fix the grafana dash
[16:45:48] thank you
[16:46:56] T367894 filed as placeholder for now, I really need some food
[16:47:21] I have to go as well, feel free to merge the CR and do the reset/backfill when you're back
[16:47:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047138
[16:47:44] ack!
[16:50:34] I've updated https://grafana.wikimedia.org/goto/9c6K98UIg?orgId=1
[20:32:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
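On the query-cost thread above (27.5s and 503s against thanos-query for the per-envoy quantile vs. 628ms for the benthos-derived metric, and the recording-rule idea tracked in T367894): a recording rule would have Prometheus pre-aggregate the expensive per-pod histogram server-side so statograph only has to query a cheap, already-aggregated series. A hypothetical harness for comparing candidate expressions before wiring them into statograph, assuming the standard Prometheus instant-query API; the expressions below are placeholders, not the ones from the CRs:

```python
# Hypothetical timing harness (not part of statograph): compare how long
# thanos-query takes to evaluate candidate status-page expressions.
import time

import requests

THANOS_QUERY = "https://thanos-query.discovery.wmnet/api/v1/query"

# Placeholder expressions; substitute the real candidates from the CRs.
CANDIDATES = {
    "per-envoy quantile": "histogram_quantile(0.5, sum by (le) (rate(placeholder_latency_bucket[5m])))",
    "benthos-derived": "placeholder_benthos_latency_metric",
}

for name, expr in CANDIDATES.items():
    start = time.monotonic()
    # 50s matches the read timeout the unit was eventually bumped to
    resp = requests.get(THANOS_QUERY, params={"query": expr}, timeout=50)
    elapsed_ms = (time.monotonic() - start) * 1000
    resp.raise_for_status()
    series = len(resp.json()["data"]["result"])
    print(f"{name}: {series} series in {elapsed_ms:.0f}ms")
```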