[00:07:25] (SystemdUnitFailed) firing: logrotate.service on logstash2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:34:21] ^ I'm taking a look.
[00:59:42] This task describes a similar incident: https://phabricator.wikimedia.org/T153940
[01:08:11] This issue is resolved; I added my findings to T153940
[01:08:12] T153940: Logrotate fails for: "$FILE No such file or directory" - https://phabricator.wikimedia.org/T153940
[01:12:25] (SystemdUnitFailed) resolved: logrotate.service on logstash2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:17] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:15:40] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[06:20:40] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[06:32:17] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:12:13] Hello team, I see some alert emails for "DegradedArray event on /dev/md1:centrallog1002". I found these instructions on Wikitech related to this alert: https://wikitech.wikimedia.org/wiki/Logstash#Replace_failed_disk_and_rebuild_RAID
[14:12:29] Do you know if I should open a ticket with DC Ops to install the new disk?
[14:13:51] denisse: those get auto-created, like https://phabricator.wikimedia.org/T360862
[14:14:21] For Prometheus checks, do we default to polling from all Prometheus hosts? I see 8 active hosts; if there are more or fewer, LMK
[14:15:00] Context is T360993, where we're trying to figure out how many queries are coming from pollers vs user traffic
[14:15:01] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993
[14:15:03] volans: Thank you!
[14:15:53] inflatador: in general, all hosts in that DC
[14:19:15] Thanks taavi. I'm looking at the access log on wdqs2019 and it looks like the pollers are hitting it almost every second
[14:21:08] need to fix the nginx logs to use XFF; right now it just logs its own IP. But still, that seems really high
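
(On the XFF note at 14:21 - a minimal nginx sketch of the idea, assuming the wdqs hosts sit behind a proxy/LB layer. The format name, log path, and CIDR below are illustrative placeholders, not the actual puppetized config.)

    # Hypothetical snippet: log the X-Forwarded-For header next to
    # $remote_addr so the real client IP is visible in the access log.
    log_format wdqs_xff '$remote_addr xff=$http_x_forwarded_for [$time_local] '
                        '"$request" $status $body_bytes_sent "$http_user_agent"';
    access_log /var/log/nginx/access.log wdqs_xff;

    # Alternative: rewrite $remote_addr itself from XFF for trusted proxies
    # (requires ngx_http_realip_module); the CIDR below is a placeholder.
    # set_real_ip_from 10.0.0.0/8;
    # real_ip_header   X-Forwarded-For;
    # real_ip_recursive on;

(Logging the header rather than overwriting $remote_addr keeps both addresses visible, which would help separate poller traffic from user traffic as discussed above.)
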
[15:17:17] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:24:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[15:25:01] ^ It happened again.
[15:28:29] denisse: which one of the two issues are you referring to?
[15:28:47] Both, but mostly the ThanosSidecarNoConnectionToStartedPrometheus.
[15:29:04] But I can no longer see the alert in karma.
[15:29:35] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[15:29:43] That's why. Good.
[15:30:46] but yeah, prometheus@k8s on prometheus1006 was OOM
[15:31:23] good news is that it was able to come back up without going into an OOM death spiral
[15:34:24] re: logstash, it looks like the known apache CPU-stuck issue; I'll repair it
[15:35:12] and then think about a bandaid or, even better, a proper fix for this
[15:36:39] I'm wondering if we need a CPU upgrade for those hosts, paired with faster RAM. My assumption is that our current hardware can't keep up with the volume of logs we produce.
[15:38:45] it's an apache bug; the related task has more context on what's happening: https://phabricator.wikimedia.org/T337818
[15:57:17] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:57:44] I wonder if cgroups would help work around that bug
[16:04:21] maybe yeah, better than nothing
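
(On the cgroups idea at 15:57 - a minimal sketch of the kind of band-aid systemd's cgroup controls could provide, capping a stuck Apache so it can't monopolize the host. The drop-in path and limits are illustrative assumptions, not tuned values and not the actual fix for T337818; in production this would be puppetized rather than added by hand.)

    # Hypothetical drop-in, e.g. /etc/systemd/system/apache2.service.d/limits.conf
    # (MemoryMax assumes cgroup v2 / unified hierarchy)
    [Service]
    CPUAccounting=yes
    CPUQuota=200%
    MemoryAccounting=yes
    MemoryMax=4G

    # then: systemctl daemon-reload && systemctl restart apache2

(CPUQuota=200% caps the whole unit at the equivalent of two cores, so runaway workers degrade service rather than starving the rest of the host.)
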
[18:44:59] Hi team, I'm trying to upgrade the alert hosts from Puppet 5 to 7 following the instructions in https://phabricator.wikimedia.org/T349619
[18:44:59] When executing `sudo cookbook sre.puppet.migrate-host alert2001.wikimedia.org`, the command fails with the following error: `spicerack.icinga.IcingaStatusNotFoundError: Host alert2001 was not found in Icinga status`.
[18:45:28] I understand this issue happens because this is the passive alert host; therefore, Icinga is not aware of its existence.
[18:45:47] Does anybody know if there's a way to overcome this issue?
[18:52:43] I'm on mobile but I think we need to patch the cookbook. This unicorn setup keeps biting us in so many ways... Is there any plan to fix it by any chance?
[19:19:13] volans: Yes, if I understand correctly this issue is caused by the way Icinga works and how it behaves in an infrastructure setup like ours, where we have one active and one passive host.
[19:19:13] We plan on decommissioning Icinga in favor of Prometheus + Alertmanager, as specified in our Alerting Infrastructure roadmap: https://wikitech.wikimedia.org/wiki/File:Alerting_Infrastructure_design_document_%26_roadmap.pdf
[22:05:14] denisse: in reality the limitation is all in the puppetization, not in Icinga
[22:06:32] and there are a bunch of Icinga checks for which AFAIK there is no immediate solution for how to migrate them (some are part of T350694)
[22:06:33] T350694: Infrastructure Foundation Alerts to migrate - https://phabricator.wikimedia.org/T350694
[22:08:43] volans: You're right, and I also conflated it with the goal of reducing Icinga's scope. To clarify, this is what we strive to achieve according to our docs: "The idea being to have Icinga as a purely monitoring tool". Apologies for the confusion.
[22:08:58] no prob :)
[22:10:02] but the issue you're hitting is purely generated by the way we set up Icinga via Puppet, not some Icinga limitation; hence my question whether there was any plan to fix it and have the secondary host monitored by the primary and vice versa
[22:25:39] denisse: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1014633/1 should unblock you
[22:26:48] volans: Many thanks!!
[22:29:02] You're welcome ;)
[22:29:48] denisse: I'm +2'ing it; once merged, just run puppet on the cumin host of your choice and retry
[22:30:13] Thank you very much, I can puppet-merge and run puppet on the hosts.
[22:31:07] there is no puppet-merge involved :)
[22:31:11] CI will merge
[22:31:17] and once it's automatically merged, just run puppet
[22:32:17] Good to know, thank you!
[22:32:34] https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Deployment
[22:32:44] easier than remembering :D
[22:34:56] Totally, I'll bookmark it. :)
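
(For reference, a short sketch of the retry flow described above; it assumes the cookbook change has already been auto-merged by CI, and that `run-puppet-agent` is the usual wrapper on production hosts - plain `puppet agent --test` would also do. See https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Deployment for the canonical steps.)

    # on the cumin host of your choice:
    sudo run-puppet-agent    # picks up the merged cookbook code (wrapper name is an assumption)
    sudo cookbook sre.puppet.migrate-host alert2001.wikimedia.org    # retry the migration
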