[01:34:55] FIRING: SystemdUnitFailed: rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:55] FIRING: SystemdUnitFailed: rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:11:27] I've now made multiple attempts to remove the Icinga conntrack check, but apparently I can't find the correct approach. When I submit this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126492/2/modules/profile/manifests/firewall.pp Icinga gets confused and reports "NRPE: Command 'check_conntrack_table_size' not defined", showing UNKNOWN.
[09:12:05] Is there a documented correct approach to removing that check? I don't recall having issues removing other checks.
[09:14:40] slyngs: mmhh interesting, IIRC the approach you had in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126492 is the correct one; there will be UNKNOWNs until Puppet runs on all hosts (so they stop exporting the resource) and then runs on the alert hosts, at which point Icinga finally removes the check
[09:15:02] slyngs: so UNKNOWNs are expected for at most 3 Puppet runs total
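For context, the removal pattern described above boils down to flipping the whole check to absent in the profile. A minimal sketch, assuming the usual nrpe::monitor_service interface in operations/puppet; the resource title, description, and plugin path here are illustrative and not the literal content of change 1126492:

    # modules/profile/manifests/firewall.pp (illustrative sketch)
    # Each host stops exporting its Icinga service on its next Puppet run;
    # the alert hosts then drop the check on their own run, which is why
    # UNKNOWNs can linger for up to ~3 runs in total.
    nrpe::monitor_service { 'conntrack_table_size':
        ensure       => absent,
        description  => 'conntrack table size',                                     # illustrative
        nrpe_command => '/usr/local/lib/nagios/plugins/check_conntrack_table_size', # illustrative path
    }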
[09:16:08] Well that's annoying :-)
[09:16:49] I'll do a new set of patches, it's getting a little silly with reverts x4
[09:18:08] agreed
[09:25:12] I've created a new patch and added you as the reviewer. I'll deploy it ... probably tomorrow morning, then announce it on the SRE channel and just let Puppet figure it out :-)
[09:25:14] Thanks
[09:26:15] slyngs: ok! you can also speed things up by doing a fleetwide puppet run
[09:26:33] the problem is that nrpe::monitor_service doesn't allow setting ensure=absent separately on monitoring::service and nrpe::check
[09:26:58] Triggering a fleetwide run just seems excessive :-)
[09:28:22] volans: It mostly becomes a visible issue with things like this that touch pretty much everything.
[09:31:37] slyngs: up to you; the other option is to keep running puppet on the alert hosts, i.e. speed up convergence
[09:33:12] That clears out the "unknown" errors?
[09:33:43] I would also consider a patch to naggen as an option :)
[09:34:17] slyngs: it will clear the unknowns for all hosts that have already run puppet
[09:34:55] FIRING: SystemdUnitFailed: rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:56] Cool, so that would bring the run time down to about 30 minutes
[09:36:05] volans: interesting, what would that look like?
[09:36:50] if you want it to be general: a list, either via Hiera or the command line, of exported resources to ignore
[09:37:02] you add the one to be removed to the ignore list, run puppet on the Icinga hosts, and that removes all the checks
[09:37:10] then merge the patch to remove the check
[09:37:16] the day after, remove the ignore from naggen
[09:37:20] something along those lines
[09:38:07] it's probably quicker to modify nrpe::monitor_service to allow absenting the monitoring::service first and the nrpe::check in a second stage, though
[09:38:40] nice, yes both options are viable I think, and will come in handy when removing other fleetwide Icinga checks
[09:38:43] I'll open a task
[09:39:12] given the pattern of removing things from Icinga, this seems like a problem bound to re-happen :D
[09:41:45] hehe indeed, {{done}} as https://phabricator.wikimedia.org/T388506
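As an illustration of the second idea above (absenting the exported monitoring::service first and the local nrpe::check in a later patch), a rough sketch of what the call site could look like. The monitoring_ensure parameter is hypothetical and does not exist in nrpe::monitor_service today; the other values are illustrative:

    # Stage 1: stop exporting the Icinga service check fleet-wide, while
    # keeping the local NRPE command defined so hosts that still carry the
    # old check definition don't flip to UNKNOWN.
    nrpe::monitor_service { 'conntrack_table_size':
        ensure            => present,
        monitoring_ensure => absent,   # hypothetical parameter
        description       => 'conntrack table size',                                     # illustrative
        nrpe_command      => '/usr/local/lib/nagios/plugins/check_conntrack_table_size', # illustrative path
    }
    # Stage 2 (a follow-up patch, once the alert hosts have converged):
    # set ensure => absent so the NRPE command itself is cleaned up too.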
[10:47:41] FIRING: PrometheusLowRetention: Prometheus k8s-aux is storing less than 20 days of data on prometheus2006:9911. - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-Prometheus=prometheus2006:9911 - https://alerts.wikimedia.org/?q=alertname%3DPrometheusLowRetention
[10:52:41] FIRING: [2x] PrometheusLowRetention: Prometheus k8s-aux is storing less than 20 days of data on prometheus2005:9911. - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusLowRetention
[11:01:37] working as intended ^
[11:01:51] and recently fixed, that's why it is firing
[11:02:00] I'll ack
[11:52:41] FIRING: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[11:57:41] RESOLVED: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[13:34:55] FIRING: SystemdUnitFailed: rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:40:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-metrics - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[15:41:28] ^ looks like it's already coming down, will keep an eye on it
[15:45:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-metrics - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[15:47:13] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-metrics - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[15:50:58] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-metrics - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[17:34:55] FIRING: SystemdUnitFailed: rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:09:55] RESOLVED: SystemdUnitFailed: rsync-vopsbot-sync-db-to-alert2002.wikimedia.org.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:54:15] Hello 0lly, we have a CR for cleaning up some opensearch puppet code. No hurry, just wanted to make y'all aware before we merge anything: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126653
[20:54:52] inflatador: I'm taking a look.
[21:02:13] denisse: thanks! If you wanna add more hosts to the PCC or whatever, feel free