[03:19:06] PROBLEM - Host matomo1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:16] PROBLEM - Host aqs1005 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:20] PROBLEM - Host an-tool1007 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:20] PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:21] PROBLEM - Host an-conf1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:06] PROBLEM - Host aqs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[03:22:14] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{project}/{acce
[03:22:14] }/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[03:22:26] RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[03:22:33] RECOVERY - Host aqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:22:35] RECOVERY - Host aqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[03:22:41] RECOVERY - Host an-conf1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[03:23:13] RECOVERY - Host matomo1002 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[03:23:29] RECOVERY - Host an-tool1007 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[03:24:51] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[04:12:01] PROBLEM - Host analytics1068 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:01] 10Data-Engineering, 10Discovery, 10SRE: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10elukey)
[08:23:34] hello folks, I opened --^, archiva is running out of free space, we should probably drop some older refinery releases if possible
[08:33:56] elukey: now that you mention that.. aqs1004 is having some issues as well
[08:37:08] vgutierrez: o/ I think it is an old cassandra node that DE doesn't use anymore
[08:37:35] so the host can be silenced on icinga?
[08:37:46] or even decomm'ed?
[08:39:30] vgutierrez: there is a subtask in https://phabricator.wikimedia.org/T249755, not sure what the status of the migration is (I am not in DE anymore :)
[08:39:53] elukey: <3
[08:52:38] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Persistence (Consultation): dbstore1007 is swapping heavily, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (10jcrespo) >>! In T290841#8077106, @BTullis wrote: > Do you happen to...
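A minimal sketch of the kind of cleanup elukey proposes above for T313386: scan the refinery release area, keep the newest few releases, and report how much space dropping the older ones would reclaim. The repository path and the keep-newest policy here are illustrative assumptions, not archiva1002's actual layout or the team's agreed retention rule.

```python
#!/usr/bin/env python3
# Hypothetical sketch for T313386: estimate space reclaimable by dropping
# old refinery releases. The repository path below is an assumption, not
# necessarily archiva1002's real layout.
from pathlib import Path

# Assumed location of refinery release artifacts (hypothetical).
RELEASES = Path("/var/lib/archiva/repositories/releases/org/wikimedia/analytics/refinery")
KEEP_NEWEST = 5  # keep the five most recently modified release dirs (assumed policy)

def dir_size(path: Path) -> int:
    """Total size in bytes of all files under path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

def main() -> None:
    # Oldest first, by directory mtime.
    releases = sorted(
        (d for d in RELEASES.iterdir() if d.is_dir()),
        key=lambda d: d.stat().st_mtime,
    )
    candidates = releases[:-KEEP_NEWEST] if KEEP_NEWEST else releases
    total = 0
    for d in candidates:
        size = dir_size(d)
        total += size
        print(f"would drop {d.name}: {size / 1e9:.2f} GB")
    print(f"total reclaimable: {total / 1e9:.2f} GB")

if __name__ == "__main__":
    main()
```

This only reports; actual deletion would go through whatever process the team uses for Archiva, not an ad-hoc rm.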
[09:40:34] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.238 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[09:45:38] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.5712 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[10:48:41] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.163 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[10:57:21] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.1937 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[11:25:09] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:09] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:57] RECOVERY - Host analytics1068 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:00:21] 10Data-Engineering, 10Discovery, 10SRE: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10dcausse) We might perhaps be able to drop all wdqs artifacts prior to 0.3.40, this is the oldest reference I found here: https://github.com/wmde/wikibase-relea...
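Reading the eventgate alerts above: "2.238 gt 2" means the observed validation error rate exceeded the critical threshold of 2, and the recovery text "(C)2 gt (W)1 gt 0.5712" means the rate dropped back below the warning threshold of 1. A rough sketch of that threshold logic against a Prometheus-style query follows; the query endpoint and the PromQL expression are illustrative assumptions, not the production check's actual configuration.

```python
#!/usr/bin/env python3
# Hedged sketch of the eventgate validation error rate check: query a
# Prometheus-compatible API and compare the result against warning/critical
# thresholds, as in the alerts above. The URL and metric expression are
# assumptions for illustration only.
import sys
import requests

THANOS_URL = "https://thanos-query.example.org/api/v1/query"  # assumed endpoint
# Hypothetical expression for the per-second validation error rate.
QUERY = 'sum(rate(eventgate_validation_errors_total{service="eventgate-analytics-external"}[5m]))'
WARNING, CRITICAL = 1.0, 2.0  # thresholds matching "(C)2 gt (W)1" in the alert

def main() -> int:
    resp = requests.get(THANOS_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # A Prometheus instant-query value is a [timestamp, "value"] pair.
    rate = float(results[0]["value"][1]) if results else 0.0
    if rate > CRITICAL:
        print(f"CRITICAL: {rate:.4g} gt {CRITICAL:g}")
        return 2  # Nagios/Icinga critical exit code
    if rate > WARNING:
        print(f"WARNING: {rate:.4g} gt {WARNING:g}")
        return 1
    print(f"OK: (C){CRITICAL:g} gt (W){WARNING:g} gt {rate:.4g}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```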
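Similarly, the stat1006 "Check systemd state" alert above fires when systemd reports the host as degraded, i.e. at least one unit has failed. A simplified stand-in for that check, not the production script's actual source:

```python
#!/usr/bin/env python3
# Sketch of what the "Check systemd state" alert verifies:
# `systemctl is-system-running` reports "degraded" when any unit has
# failed and "running" when the system is fully operational.
import subprocess
import sys

def main() -> int:
    state = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True, text=True,
    ).stdout.strip()
    if state == "running":
        print("OK - running: The system is fully operational")
        return 0
    # Name the failed units, as the alert message does.
    failed = subprocess.run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True, text=True,
    ).stdout
    units = " ".join(line.split()[0] for line in failed.splitlines() if line.split())
    print(f"CRITICAL - {state}: The following units failed: {units}")
    return 2

if __name__ == "__main__":
    sys.exit(main())
```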