[07:57:50] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:32:02] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:38:12] PROBLEM - SSH on analytics1073.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:39:30] RECOVERY - SSH on analytics1073.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:17:04] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:49:29] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:45:44] PROBLEM - SSH on analytics1073.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:28:58] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (BPirkle) This is getting close. Here are things that I see remaining: - On "ceiled values should be correctly converted to intervals", that is n...
[15:29:49] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (BPirkle)
[15:35:21] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (BPirkle) @FGoodwin , be aware that I asked @codebug to make a test fail in my above comment. We'll need a corresponding change in the timestamp val...
[17:49:48] RECOVERY - SSH on analytics1073.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:32:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5015 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5015%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:37:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5015 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5015%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:55:22] PROBLEM - SSH on analytics1073.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:58:04] RECOVERY - SSH on analytics1073.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook