[08:08:19] <godog>	 taavi: yes and no, I nuked the host now
[12:46:25] <jinxer-wm>	 (SystemdUnitFailed) firing: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:50] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[12:52:50] <jinxer-wm>	 (ThanosCompactIsDown) firing: (2) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[13:06:07] <jayme>	 godog: is there a way to see "dropped" messages in elasticsearch (e.g. because of duplicate fields or the like)?
[13:07:12] <jayme>	 I'm looking at https://phabricator.wikimedia.org/T357616 and all I see on the host level are strange SSL_ERROR_SYSCALL and omfwd: TCPSendBuf error messages that seem to happen fleet-wide on a regular basis
[13:07:53] <jayme>	 also the node in question (mw1460) seems to still send syslog messages to logstash so the above smells like a red herring
[13:09:55] <godog>	 jayme: checking, IIRC logstash error messages are in logstash itself
[13:10:04] <godog>	 logstash.w.o that is
[13:20:06] <jayme>	 I do see these (https://paste.debian.net/1307401/) a lot all across the fleet
[13:20:40] <jayme>	 maybe related to https://phabricator.wikimedia.org/T351710 - but maybe not related to the issue of missing k8s container logs
[13:23:37] <godog>	 could be though IIRC the kafka output shouldn't be affected, that's for central syslog
[13:24:20] <godog>	 jayme: would you mind either filing a new task or repurposing what you linked above for this? and tag observability, I can't find anything obvious atm
[13:31:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:25] <jinxer-wm>	 (SystemdUnitFailed) firing: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:37:50] <jinxer-wm>	 (ThanosCompactIsDown) resolved: (2) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[13:40:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:52:34] <jayme>	 godog: yeah, I'll use the ipoid thing, tag o11y and add the potentially unrelated stuff :) I did not see anything obvious either but I recall we had the situation where elastic had indexed a field as type A and some service used it as type B so the type B logs got dropped
[13:54:35] <godog>	 jayme: the fact that logs stop at around index rollover time is a bit sus alright
[14:52:02] <claime>	 godog: one thing I've noticed for a particular deployment's logs (eventrouter, which gives us the k8s cluster events in logstash) is a good-sized hole during which I'm sure there have been events https://logstash.wikimedia.org/goto/4876040ddeb8ca17cdfc748c137b4b8b
[14:52:06] <claime>	 (will add to the task)
[14:56:00] <godog>	 claime: thank you, yeah that's quite curious too
[15:32:45] <godog>	 cc cwhite ^ in case some of those things ring a bell
[15:35:19] <cwhite>	 Thanks for the ping.  I see no issues on the post-kafka side of the pipeline at the moment.  I'm looking at tapping into kafka to see if there are any logs in its topics.
[15:41:47] <godog>	 good idea yeah
[16:34:55] <cwhite>	 godog: Been watching kafka topics for a while now.  No sign of any ipoid-production-daily-updates events so far.
[16:54:35] <jinxer-wm>	 (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:59:35] <jinxer-wm>	 (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:02:21] <godog>	 cwhite: sigh