[10:18:35] FIRING: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[11:07:25] FIRING: SystemdUnitFailed: apache2.service on prometheus2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:53] we (traffic) got some MSS alerts related to prometheus2007
[11:11:04] but I see that apache is failing there so that explains that
[11:11:36] vgutierrez: yes indeed, thank you I'm in the middle of https://phabricator.wikimedia.org/T383232
[11:12:07] apache should be back up shortly
[11:12:12] thx
[11:19:36] RESOLVED: SystemdUnitFailed: apache2.service on prometheus2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:26:41] vgutierrez: ^ how does the situation look on your end ?
[11:26:55] godog: all good
[11:27:00] we got a recovery as well
[11:27:12] vgutierrez: sweet!
[11:27:31] and of course MSS value got back to normal: https://grafana.wikimedia.org/goto/beLVLihNg?orgId=1
[11:27:51] turns out the bits I forgot to test in pontoon don't work out of the box in production, not a surprise
[11:27:59] neat
[11:33:35] RESOLVED: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:29:13] hello! curious on why https://gerrit.wikimedia.org/r/c/operations/alerts/+/1126030 is failing, I copied/pasted the `{{ $labels.instance | reReplaceAll "^([^:]+).*" "${1}" }}` from a different alert definition but looks like it can't parse it?
[13:31:28] yo XioNoX! there are double quotes nested in there
[13:32:13] i.e. once for the yaml string and once for the argument to rereplaceall
[13:32:19] ahhh right
[13:32:20] thx!
[13:32:56] sure np
[13:51:15] godog: success!! https://gerrit.wikimedia.org/r/c/operations/alerts/+/1126030 it only took me 8 PS :)
[13:51:37] I'm going to have many of those to add, so better I learn it well first
[13:52:55] XioNoX: lol! you can also run tests locally btw
[13:53:08] and skip grinding gerrit
[13:53:24] good point :)
[13:55:39] I'll take a look later today or tomorrow btw
[14:03:32] no rush at all
[15:03:40] Hello 0lly! I've taken go-dog's suggestion from https://phabricator.wikimedia.org/T388270#10618446 and created this CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126067 . If anyone can review, LMK. cc andrewbogott
[15:09:55] I'll take a look inflatador
[15:11:54] godog ACK, thanks for the quick review! Fixing...
[15:30:26] godog thanks again, just merged/puppet-merged the CR. LMK if you'd like me to run puppet on alerts hosts or if I can do anything else
[15:32:34] inflatador: np, I'll kick puppet
[17:34:55] FIRING: SystemdUnitFailed: rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:36:05] ^ This one should auto resolve.
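Note on the 13:29-13:32 exchange: the pasted template keeps only the part of `$labels.instance` before the first colon, so `prometheus2007:9100` becomes `prometheus2007`. A minimal Python sketch of the same substitution follows. It is not the code from the Gerrit change. `strip_port` is a hypothetical helper name, and Python writes the back-reference as `\1` where the template writes `${1}`. On the quoting side, YAML single-quoted scalars may contain double quotes unescaped, so wrapping the whole annotation value in single quotes is one common way around the nesting; the log does not say which fix the final patchset used.

```python
import re

# Same pattern as in the pasted template: capture everything up to the
# first colon, then consume the rest of the string.
INSTANCE_RE = re.compile(r"^([^:]+).*")


def strip_port(instance: str) -> str:
    """Return the host part of an 'instance' label such as 'host:9100'.

    Hypothetical helper for illustration; it mirrors what
    reReplaceAll "^([^:]+).*" "${1}" does to the label value.
    """
    return INSTANCE_RE.sub(r"\1", instance)


print(strip_port("prometheus2007:9100"))  # prints prometheus2007
print(strip_port("alert2002"))            # no port: value is unchanged
```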
[17:37:23] denisse: I'm seeing @ERROR: Unknown module 'vopsbot-sync-db-to-alert2002.wikimedia.org' in the log there, but that should be sorted out soon?
[17:37:56] herron: I may have missed that part from the logs, my bad.
[17:38:21] ahh ok, I wonder if something changed on the other host rsync config
[17:38:38] But IIRC that unit must be disabled on the inactive host.
[17:40:35] Where did you see that error? I was unable to find it in `journalctl -u rsync-vopsbot-sync-db`.
[17:41:17] I found the culprit.
[17:41:34] nice ok
[17:42:11] I meant, the culprit of why I couldn't see those logs, I need to look at `journalctl -u rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service`. I'm still looking at the issue.
[17:50:13] denisse: kk gotcha, I'll standby since you are looking into it
[17:52:31] Running the rsync manually (without the wrapper) throws the same error: `@ERROR: Unknown module 'vopsbot-sync-db-to-alert2002.wikimedia.org'`
[17:52:43] This makes me wonder if alert1002 may be rejecting the connection. 🤔
[17:54:41] Listing all available rsync modules on alert1002 the `vopsbot-sync-db-to-alert2002.wikimedia.org` is missing. I'm looking at why...
[18:07:24] Interestingly I can't find where the `rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service` service is defined, I can only see `vopsbot-sync-db`. 🤔
[18:15:01] Ok, I found the issue. Working on a patch.
[21:34:55] FIRING: SystemdUnitFailed: rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:35] ^ Patch sent for that one. :)
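Note on the 17:52-17:54 debugging: the check described there (list the modules the rsync daemon exports, then look for the expected one) can be sketched in Python roughly as below. This is an illustration and not the command that was actually run. `parse_modules`, `list_modules` and the `SAMPLE` listing are hypothetical, and the parser assumes each non-empty output line starts with a module name, which breaks if the daemon also prints a message of the day. `rsync HOST::` is the standard way to request a daemon's module list.

```python
import subprocess


def parse_modules(listing: str) -> set[str]:
    """Collect the first whitespace-separated field of each non-empty line."""
    modules = set()
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            modules.add(fields[0])
    return modules


def list_modules(host: str) -> set[str]:
    """Ask the rsync daemon on `host` for its module list (needs network access)."""
    result = subprocess.run(
        ["rsync", f"{host}::"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return parse_modules(result.stdout)


# Module name taken from the error message in the log.
WANTED = "vopsbot-sync-db-to-alert2002.wikimedia.org"

# Made-up listing for illustration only; not real output from alert1002.
SAMPLE = "some-other-module\tillustrative comment\n"

print(WANTED in parse_modules(SAMPLE))  # prints False
```

Against a live daemon the call would be `WANTED in list_modules(host)`, with `host` set to the source host's name as resolvable from the machine running the check.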