[16:10:06] Hello team, regarding the centrallog disk issue. Yesterday I took a look at the wiki and implemented these steps: https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space#Short_term_fix
[16:10:59] It only helped us recover 1GB, but the disk seems fine, it has plenty of space left, so this may have more to do with the inodes.
[16:12:00] I think forcing a log rotation may be a good approach for this issue. What do you think?
[16:12:30] Any pointers or ideas on potential solutions are greatly appreciated. :)
[16:14:08] if it's centrallog1002 I don't see an inode problem, just 1% used
[16:15:29] Yes, that's also confusing. I checked that yesterday with 'df -i' and the disk also seems fine. 🤔
[16:15:45] it's just getting full :D
[16:15:47] too much data
[16:33:58] denisse: AFAICT /srv/syslog has 1.1T, and the top offending hosts with the most space used are the eqiad/codfw prometheus and thanos-be hosts
[16:34:38] 215G among just 12 hosts
[16:51:00] Thanks Riccardo.
[16:52:18] it seems that prometheus might have some debug logging enabled too
[16:53:18] Ah, this makes a lot of sense. I think it was enabled to catch the query of death that caused the OOM problems.
[16:56:19] for how long? I see debug since Nov. (oldest file)
[16:56:28] ah no sorry, it's the prometheus-blackbox-exporter
[16:56:32] not prometheus itself
[16:56:33] my bad
[16:57:49] $ wc -l syslog.log-20240227
[16:57:49] 18456563 syslog.log-20240227
[16:57:53] $ grep -c 'prometheus-blackbox-exporter' syslog.log-20240227
[16:57:53] 17942050
[17:35:26] godog: thanks for the fix to apply planet monitoring only to eqiad/codfw and not the POPs. I'm trying to see actual data on https://grafana-rw.wikimedia.org/alerting/list?search=planet to verify things work. I see the state of the checks is normal/ok, but everything also says "no data". Can it really be in an OK state without any data? Is it normal that I don't see actual numbers there yet?
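[Editor's note: the wc/grep triage pasted above can be reproduced end to end on a synthetic log. The sample lines and counts below are invented purely for illustration; only the wc -l and grep -c commands mirror the ones used in the channel.]

```shell
# Build a small fake syslog where one noisy unit dominates, then measure
# how much of the file it accounts for (same triage as in the chat).
tmp=$(mktemp -d)
log="$tmp/syslog.log-sample"
for i in $(seq 1 90); do
    echo "Feb 27 00:00:00 host prometheus-blackbox-exporter[123]: probe ok" >> "$log"
done
for i in $(seq 1 10); do
    echo "Feb 27 00:00:00 host sshd[456]: session opened" >> "$log"
done
total=$(wc -l < "$log")
noisy=$(grep -c 'prometheus-blackbox-exporter' "$log")
echo "$noisy/$total lines from the blackbox exporter"
rm -rf "$tmp"
```

On the real host the same two commands showed roughly 17.9M of 18.4M lines coming from the blackbox exporter, i.e. one unit producing ~97% of the syslog volume.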
[17:37:09] I clicked the button under "actions" and then went to "Query & Results"
[19:24:45] denisse: since /srv is at 98% full, we could free up ~100G by trimming prometheus and thanos-be log gzips older than 45 days, something like that
[19:25:27] (SystemdUnitFailed) firing: logrotate.service on centrallog1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:25:34] herron: Thank you, I also noticed that the logrotate service is failing on centrallog1002. I'm debugging it.
[19:26:45] denisse: it looks like there's a file /etc/logrotate.d/syslog that was updated today; I think it overlaps/conflicts with rsyslog_receiver
[19:28:19] Indeed, I've removed it and it's working now.
[19:30:27] (SystemdUnitFailed) resolved: logrotate.service on centrallog1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:37] I trimmed prometheus and thanos-be logs older than 45 days, but it only freed 8%. I trimmed them again for logs older than 30 days: 1.4T 1.2T 159G 89% /srv
[19:35:08] I'm also taking a look at the debug-enabled logs Riccardo mentioned.
[19:37:27] thanks denisse
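[Editor's note: the trimming step above ("trim log gzips older than 45 days") is typically done with find's -mtime filter. The sketch below demonstrates it on a throwaway directory with invented file names; the real files live under /srv/syslog, and the exact paths and retention window used on centrallog1002 are not shown in the chat.]

```shell
# Delete rotated .gz logs older than 45 days, shown on a scratch directory.
tmp=$(mktemp -d)
touch -d '60 days ago' "$tmp/syslog.log-old.gz"   # old enough to trim
touch -d '10 days ago' "$tmp/syslog.log-new.gz"   # recent, kept
# -mtime +45 matches files last modified strictly more than 45 days ago;
# -print before -delete lists what is removed (drop -delete for a dry run).
find "$tmp" -name '*.gz' -mtime +45 -print -delete
remaining=$(ls "$tmp")
echo "kept: $remaining"
rm -rf "$tmp"
```

Running the -print variant first as a dry run is worth the extra step: as seen above, the first pass (45 days) freed less than expected, and inspecting the matched list before deleting makes it obvious whether the window is too conservative.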