[08:08:19] taavi: yes and no, I nuked the host now
[12:46:25] (SystemdUnitFailed) firing: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:50] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[12:52:50] (ThanosCompactIsDown) firing: (2) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[13:06:07] godog: is there a way to see "dropped" messages in elasticsearch (like because of duplicate fields or the like)?
[13:07:12] I'm looking at https://phabricator.wikimedia.org/T357616 and all I see on the host level are strange SSL_ERROR_SYSCALL and omfwd: TCPSendBuf error messages that seem to happen fleet-wide on a regular basis
[13:07:53] also the node in question (mw1460) seems to still send syslog messages to logstash so the above smells like a red herring
[13:09:55] jayme: checking, IIRC logstash error messages are in logstash itself
[13:10:04] logstash.w.o that is
[13:20:06] I do see these (https://paste.debian.net/1307401/) a lot all across the fleet
[13:20:40] maybe related to https://phabricator.wikimedia.org/T351710 - but maybe not related to the issue of missing k8s container logs
[13:23:37] could be, though IIRC the kafka output shouldn't be affected, that's for central syslog
[13:24:20] jayme: would you mind either filing a new task or repurposing what you linked above for this? and tag observability, I can't find anything obvious atm
[13:31:25] (SystemdUnitFailed) resolved: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:25] (SystemdUnitFailed) firing: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:37:50] (ThanosCompactIsDown) resolved: (2) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[13:40:25] (SystemdUnitFailed) resolved: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:52:34] godog: yeah, I'll use the ipoid thing, tag o11y and add the potentially unrelated stuff :) I did not see anything obvious either but I recall we had the situation where elastic had indexed a field as type A and some service used it as type B so the type B logs got dropped
[13:54:35] jayme: the fact that logs stop at around index rollover time is a bit sus alright
[14:52:02] godog: one thing I've noticed for a particular deployment's logs (eventrouter, which gives us the k8s cluster events in logstash) is a good-sized hole during which I'm sure there have been events https://logstash.wikimedia.org/goto/4876040ddeb8ca17cdfc748c137b4b8b
[14:52:06] (will add to the task)
[14:56:00] claime: thank you, yeah that's quite curious too
[15:32:45] cc cwhite ^ in case some of those things ring a bell
[15:35:19] Thanks for the ping. I see no issues on the post-kafka side of the pipeline at the moment. I'm looking at tapping into kafka to see if there are any logs in its topics.
[15:41:47] good idea yeah
[16:34:55] godog: Been watching kafka topics for a while now. No sign of any ipoid-production-daily-updates events so far.
[16:54:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:59:35] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:02:21] cwhite: sigh