[16:10:06] Hello team, regarding the centrallog disk issue. Yesterday I took a look at the wiki and implemented these steps: https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space#Short_term_fix
[16:10:59] It only helped us recover 1GB, but the disk seems fine, it has plenty of space left, so this may have more to do with the inodes.
[16:12:00] I think forcing a log rotation may be a good approach for this issue. What do you think?
[16:12:30] Any pointers or ideas on potential solutions are greatly appreciated. :)
[16:14:08] if it's centrallog1002 I don't see an inode problem, just 1% used
[16:15:29] Yes, that's also confusing. I checked that yesterday with 'df -i' and the disk also seems fine. 🤔
[16:15:45] it's just getting full :D
[16:15:47] too much data
[16:33:58] denisse: AFAICT /srv/syslog has 1.1T, and the top offending hosts with the most space used are the eqiad/codfw prometheus and thanos-be hosts
[16:34:38] 215G among just 12 hosts
[16:51:00] Thanks Riccardo.
[16:52:18] it seems that prometheus might have some debug logging enabled too
[16:53:18] Ah, this makes a lot of sense. I think it was enabled to catch the query of death that caused the OOM problems.
[16:56:19] for how long? I see debug since Nov. (oldest file)
[16:56:28] ah no sorry, it's the prometheus-blackbox-exporter
[16:56:32] not prometheus itself
[16:56:33] my bad
[16:57:49] $ wc -l syslog.log-20240227
[16:57:49] 18456563 syslog.log-20240227
[16:57:53] $ grep -c 'prometheus-blackbox-exporter' syslog.log-20240227
[16:57:53] 17942050
[17:35:26] godog: thanks for the fix to apply planet monitoring only to eqiad/codfw and not the POPs. I'm trying to see actual data on https://grafana-rw.wikimedia.org/alerting/list?search=planet to verify things work. I see the state of the checks is normal/ok, but everything also says "no data". Can it really be in an OK state without any data? Is it normal that I don't see actual numbers there yet?
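[Editor's note: the wc/grep triage pasted above can be reproduced end to end on a synthetic log. The sample lines and counts below are invented purely for illustration; only the wc -l and grep -c commands mirror the ones used in the channel.]

```shell
# Build a small fake syslog where one noisy unit dominates, then measure
# how much of the file it accounts for (same triage as in the chat).
tmp=$(mktemp -d)
log="$tmp/syslog.log-sample"
for i in $(seq 1 90); do
    echo "Feb 27 00:00:00 host prometheus-blackbox-exporter[123]: probe ok" >> "$log"
done
for i in $(seq 1 10); do
    echo "Feb 27 00:00:00 host sshd[456]: session opened" >> "$log"
done
total=$(wc -l < "$log")
noisy=$(grep -c 'prometheus-blackbox-exporter' "$log")
echo "$noisy/$total lines from the blackbox exporter"
rm -rf "$tmp"
```

On the real host the same two commands showed roughly 17.9M of 18.4M lines coming from the blackbox exporter, i.e. one unit producing ~97% of the syslog volume.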
[17:37:09] I clicked the button under "actions" and then went to "Query & Results"
[19:24:45] denisse: since /srv is at 98% full, we could free up ~100G by trimming prometheus and thanos-be log gzips older than 45 days, something like that
[19:25:27] (SystemdUnitFailed) firing: logrotate.service on centrallog1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:25:34] herron: Thank you, I also noticed that the logrotate service is failing on centrallog1002. I'm debugging it.
[19:26:45] denisse: it looks like there's a file /etc/logrotate.d/syslog that was updated today; I think it overlaps/conflicts with rsyslog_receiver
[19:28:19] Indeed, I've removed it and it's working now.
[19:30:27] (SystemdUnitFailed) resolved: logrotate.service on centrallog1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:37] I trimmed prometheus and thanos-be logs older than 45 days, but it only freed 8%. I trimmed them again for logs older than 30 days: 1.4T 1.2T 159G 89% /srv
[19:35:08] I'm also taking a look at the debug-enabled logs Riccardo mentioned.
[19:37:27] thanks denisse
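[Editor's note: the trimming step above ("trim log gzips older than 45 days") is typically done with find's -mtime filter. The sketch below demonstrates it on a throwaway directory with invented file names; the real files live under /srv/syslog, and the exact paths and retention window used on centrallog1002 are not shown in the chat.]

```shell
# Delete rotated .gz logs older than 45 days, shown on a scratch directory.
tmp=$(mktemp -d)
touch -d '60 days ago' "$tmp/syslog.log-old.gz"   # old enough to trim
touch -d '10 days ago' "$tmp/syslog.log-new.gz"   # recent, kept
# -mtime +45 matches files last modified strictly more than 45 days ago;
# -print before -delete lists what is removed (drop -delete for a dry run).
find "$tmp" -name '*.gz' -mtime +45 -print -delete
remaining=$(ls "$tmp")
echo "kept: $remaining"
rm -rf "$tmp"
```

Running the -print variant first as a dry run is worth the extra step: as seen above, the first pass (45 days) freed less than expected, and inspecting the matched list before deleting makes it obvious whether the window is too conservative.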