[03:19:06] PROBLEM - Host matomo1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:16] PROBLEM - Host aqs1005 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:20] PROBLEM - Host an-tool1007 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:20] PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100%
[03:19:21] PROBLEM - Host an-conf1002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:06] PROBLEM - Host aqs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[03:22:14] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{project}/{acce
[03:22:14] }/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[03:22:26] RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[03:22:33] RECOVERY - Host aqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:22:35] RECOVERY - Host aqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[03:22:41] RECOVERY - Host an-conf1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[03:23:13] RECOVERY - Host matomo1002 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[03:23:29] RECOVERY - Host an-tool1007 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[03:24:51] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[04:12:01] PROBLEM - Host analytics1068 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:01] 10Data-Engineering, 10Discovery, 10SRE: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10elukey)
[08:23:34] hello folks, I opened --^, archiva is running out of free space, we should probably drop some older refinery releases if possible
[08:33:56] elukey: now that you mention that.. aqs1004 is having some issues as well
[08:37:08] vgutierrez: o/ I think it is an old cassandra node that DE doesn't use anymore
[08:37:35] so the host can be silenced on icinga?
[08:37:46] or even decomm'ed?
[08:39:30] vgutierrez: there is a subtask in https://phabricator.wikimedia.org/T249755, not sure what the status of the migration is (I am not in DE anymore :)
[08:39:53] elukey: <3
[08:52:38] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Persistence (Consultation): dbstore1007 is swapping heavily, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (10jcrespo) >>! In T290841#8077106, @BTullis wrote: > Do you happen to...
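A minimal sketch of the kind of cleanup elukey proposes above for T313386: scan the refinery release area, keep the newest few releases, and report how much space dropping the older ones would reclaim. The repository path and the keep-newest policy here are illustrative assumptions, not archiva1002's actual layout or the team's agreed retention rule.

```python
#!/usr/bin/env python3
# Hypothetical sketch for T313386: estimate space reclaimable by dropping
# old refinery releases. The repository path below is an assumption, not
# necessarily archiva1002's real layout.
from pathlib import Path

# Assumed location of refinery release artifacts (hypothetical).
RELEASES = Path("/var/lib/archiva/repositories/releases/org/wikimedia/analytics/refinery")
KEEP_NEWEST = 5  # keep the five most recently modified release dirs (assumed policy)

def dir_size(path: Path) -> int:
    """Total size in bytes of all files under path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

def main() -> None:
    # Oldest first, by directory mtime.
    releases = sorted(
        (d for d in RELEASES.iterdir() if d.is_dir()),
        key=lambda d: d.stat().st_mtime,
    )
    candidates = releases[:-KEEP_NEWEST] if KEEP_NEWEST else releases
    total = 0
    for d in candidates:
        size = dir_size(d)
        total += size
        print(f"would drop {d.name}: {size / 1e9:.2f} GB")
    print(f"total reclaimable: {total / 1e9:.2f} GB")

if __name__ == "__main__":
    main()
```

This only reports; actual deletion would go through whatever process the team uses for Archiva, not an ad-hoc rm.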
[09:40:34] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.238 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[09:45:38] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.5712 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[10:48:41] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.163 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[10:57:21] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.1937 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[11:25:09] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:09] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:57] RECOVERY - Host analytics1068 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:00:21] 10Data-Engineering, 10Discovery, 10SRE: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10dcausse) We might perhaps be able to drop all wdqs artifacts prior to 0.3.40, this is the oldest reference I found here: https://github.com/wmde/wikibase-relea...
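Reading the eventgate alerts above: "2.238 gt 2" means the observed validation error rate exceeded the critical threshold of 2, and the recovery text "(C)2 gt (W)1 gt 0.5712" means the rate dropped back below the warning threshold of 1. A rough sketch of that threshold logic against a Prometheus-style query follows; the query endpoint and the PromQL expression are illustrative assumptions, not the production check's actual configuration.

```python
#!/usr/bin/env python3
# Hedged sketch of the eventgate validation error rate check: query a
# Prometheus-compatible API and compare the result against warning/critical
# thresholds, as in the alerts above. The URL and metric expression are
# assumptions for illustration only.
import sys
import requests

THANOS_URL = "https://thanos-query.example.org/api/v1/query"  # assumed endpoint
# Hypothetical expression for the per-second validation error rate.
QUERY = 'sum(rate(eventgate_validation_errors_total{service="eventgate-analytics-external"}[5m]))'
WARNING, CRITICAL = 1.0, 2.0  # thresholds matching "(C)2 gt (W)1" in the alert

def main() -> int:
    resp = requests.get(THANOS_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # A Prometheus instant-query value is a [timestamp, "value"] pair.
    rate = float(results[0]["value"][1]) if results else 0.0
    if rate > CRITICAL:
        print(f"CRITICAL: {rate:.4g} gt {CRITICAL:g}")
        return 2  # Nagios/Icinga critical exit code
    if rate > WARNING:
        print(f"WARNING: {rate:.4g} gt {WARNING:g}")
        return 1
    print(f"OK: (C){CRITICAL:g} gt (W){WARNING:g} gt {rate:.4g}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```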
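Similarly, the stat1006 "Check systemd state" alert above fires when systemd reports the host as degraded, i.e. at least one unit has failed. A simplified stand-in for that check, not the production script's actual source:

```python
#!/usr/bin/env python3
# Sketch of what the "Check systemd state" alert verifies:
# `systemctl is-system-running` reports "degraded" when any unit has
# failed and "running" when the system is fully operational.
import subprocess
import sys

def main() -> int:
    state = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True, text=True,
    ).stdout.strip()
    if state == "running":
        print("OK - running: The system is fully operational")
        return 0
    # Name the failed units, as the alert message does.
    failed = subprocess.run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True, text=True,
    ).stdout
    units = " ".join(line.split()[0] for line in failed.splitlines() if line.split())
    print(f"CRITICAL - {state}: The following units failed: {units}")
    return 2

if __name__ == "__main__":
    sys.exit(main())
```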