[00:21:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [00:21:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [00:31:42] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [00:31:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [00:36:57] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [00:37:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [00:41:42] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:10:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [09:44:50] (CR) Nmaphophe: [V: +2 C: +2] "Looks good to me. It's easy to follow along." [analytics/dashiki] - https://gerrit.wikimedia.org/r/804434 (owner: Milimetric) [09:55:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [10:40:54] I'm about to fail over the hive services from an-coord1001 to an-coord1002 so that I can restart hive on an-coord1001 ref: T303168 and all these flapping hive alerts. 
[10:40:55] T303168: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 [10:59:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-conf1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1082 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1103 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1126 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1130 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1064 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:44] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1139 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:01:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1073 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-presto1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-airflow1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1084 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1109 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:07] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1141 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:08] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1070 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1068 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:10] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1066 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:12] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:33] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on druid1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:47] PROBLEM - Confd template 
for /etc/ferm/conf.d/00_defs_requestctl on an-worker1090 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1124 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:03:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1078 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:00] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1015 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on druid1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:04:40] We can ignore these alerts ^^ there is discussion about it in #wikimedia-operations [11:05:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-druid1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1105 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:31] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1107 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:33] PROBLEM - Confd 
template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1132 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:05:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:20] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1112 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:06:35] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-test1006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1059 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1075 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-presto1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:08:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on druid1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:23] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on stat1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:20] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1058 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:38] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:13:38] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:36] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-master1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:37] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1083 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:40] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1114 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:40] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1063 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:14:42] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1077 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:15:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [11:16:01] PROBLEM - Confd template 
for /etc/ferm/conf.d/00_defs_requestctl on aqs1011 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:16:01] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:06] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-conf1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:17:15] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1118 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1009 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:55] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1131 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:56] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1067 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:18:56] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on analytics1074 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-coord1001 is CRITICAL: File not found: 
/etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1086 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:09] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1108 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:57] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-presto1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:58] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:58] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1102 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:20:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1136 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on druid1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1115 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-druid1004 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring 
[11:21:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1100 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-airflow1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:21:53] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2012 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:05] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on furud is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-druid1005 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:43] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1007 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:22:45] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2011 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on archiva1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:25] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs2006 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:23:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-test1008 is CRITICAL: File not 
found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:24:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:24:59] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-test1010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:47] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1119 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:28:19] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1093 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1127 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:30:49] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1140 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-druid1001 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:31:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1111 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:32:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1095 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:32:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-worker1087 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:39] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on an-presto1003 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:35:41] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1010 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:13] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on kafka-jumbo1007 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:38:21] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on aqs1008 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:45:12] ^^ as noted above, these are false alarms and can be ignored. The issue has been resolved now, as far as I am aware. [11:45:39] I'm about to restart hive-server2 and hive-metastore on an-coord1001, now that services are running on an-coord1002. 
[11:47:45] !log btullis@an-coord1001:~$ sudo systemctl restart hive-server2.service hive-metastore.service [11:47:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:55:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [12:28:50] (CR) Vivian Rook: [C: +2] Switch string and pipe [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283 (https://phabricator.wikimedia.org/T308362) (owner: Vivian Rook) [12:33:02] (Merged) jenkins-bot: Switch string and pipe [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821283 (https://phabricator.wikimedia.org/T308362) (owner: Vivian Rook) [13:32:40] (PS9) RhinosF1: api: return consistently [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821344 [13:38:21] (CR) Vivian Rook: [C: +2] api: return consistently [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821344 (owner: RhinosF1) [13:42:58] !log failed hive back to an-coord1001 via DNS change. [13:42:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:43:46] (Merged) jenkins-bot: api: return consistently [analytics/quarry/web] - https://gerrit.wikimedia.org/r/821344 (owner: RhinosF1) [14:45:12] (VarnishkafkaNoMessages) firing: ... 
[14:45:12] varnishkafka for instance cp2035:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp2035:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:46:37] host down for maintenance --^ [14:50:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp2031:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:55:12] (VarnishkafkaNoMessages) firing: (4) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [14:59:43] elukey: same with ^ and ^^ ? 
[15:00:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:00:32] ottomata: o/ I think so yes, there is pdu maintenance in codfw and some cp nodes are affected [15:00:44] okay [15:03:35] (PS1) Milimetric: Add treemap chart [analytics/dashiki] - https://gerrit.wikimedia.org/r/822097 [15:05:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:06:57] (CR) Milimetric: Adapt to templatelinks schema changes (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/821312 (https://phabricator.wikimedia.org/T314666) (owner: Milimetric) [15:10:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:20:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:23:56] (CR) Ladsgroup: Adapt to templatelinks schema changes (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/821312 (https://phabricator.wikimedia.org/T314666) (owner: Milimetric) [15:30:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - 
https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:31:06] Yep, please ignore these varnishkafka alerts for cp2xxx for now. I'm almost there with a fix in T300246 but not quite there yet. [15:31:07] T300246: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246 [15:35:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:36:30] Data-Engineering, Equity-Landscape: Milestone: Dashboard Interaction Map Complete - https://phabricator.wikimedia.org/T305477 (CMacholan) a:JAnstee_WMF→KCVelaga_WMF [15:40:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:42:52] Data-Engineering, Equity-Landscape: Milestone: Dashboard Interaction Map Complete - https://phabricator.wikimedia.org/T305477 (JAnstee_WMF) At this point @KCVelaga is getting input to the [[ https://lucid.app/lucidchart/dce8fb8f-2403-4b7b-ba11-f8e829b336a7/edit? | output dashboard mapping ]] following th... 
[15:45:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:50:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:55:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:55:21] PROBLEM - Host aqs2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:30] Data-Engineering, MediaWiki-extensions-EventLogging, MediaWiki-extensions-WikimediaEvents, Metrics-Platform: EventLogging queue not being drained on page unload - https://phabricator.wikimedia.org/T314924 (phuedx) [15:57:53] PROBLEM - Host aqs2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:53] PROBLEM - Host aqs2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:57:53] PROBLEM - Host aqs2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:00] Data-Engineering, MediaWiki-extensions-EventLogging, MediaWiki-extensions-WikimediaEvents, Metrics-Platform: EventLogging queue not being drained on page unload - https://phabricator.wikimedia.org/T314924 (phuedx) [16:00:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [16:10:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is 
not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [16:14:53] RECOVERY - Host aqs2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [16:17:17] RECOVERY - Host aqs2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [16:17:18] RECOVERY - Host aqs2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [16:17:18] RECOVERY - Host aqs2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [16:19:36] Data-Engineering, MediaWiki-extensions-EventLogging, MediaWiki-extensions-WikimediaEvents, Metrics-Platform: EventLogging queue not being drained on page unload in Google Chrome - https://phabricator.wikimedia.org/T314924 (phuedx) [16:40:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [16:45:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [16:50:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [16:55:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:00:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is 
not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:03:00] !log stopping puppet and drop data timers on an-launcher1002 and an-test-coord1001 to deploy drop script changes - T270433 [17:03:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:03:02] T270433: Add logic to purging scripts that requires admin action if it's about to delete a lot of data - https://phabricator.wikimedia.org/T270433 [17:04:10] (CR) Ottomata: [C: +2] Add safety limits to refinery-drop-older-than [analytics/refinery] - https://gerrit.wikimedia.org/r/694547 (https://phabricator.wikimedia.org/T270433) (owner: Mforns) [17:05:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:10:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:15:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:25:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:27:47] (PS1) Mforns: Add missing changes to the deletion script [analytics/refinery] - 
https://gerrit.wikimedia.org/r/822122 (https://phabricator.wikimedia.org/T270433) [17:29:40] (CR) Mforns: [V: +2 C: +2] "Merging for deployment." [analytics/refinery] - https://gerrit.wikimedia.org/r/822122 (https://phabricator.wikimedia.org/T270433) (owner: Mforns) [17:45:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [18:00:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [18:04:00] !log Deployed refinery using scap, then deployed onto hdfs [18:04:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:10:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [18:20:12] (VarnishkafkaNoMessages) firing: (7) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [18:33:29] Data-Engineering, Event-Platform Value Stream: Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata) [18:50:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages 
[19:10:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [19:20:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [19:30:12] (VarnishkafkaNoMessages) resolved: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [21:14:11] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:14:25] PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:03] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:23:21] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:53] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook