[02:04:13] (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:04:35] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-hdfs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:18] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:04:13] (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:05:24] (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:58:04] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:50:18] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:05:24] (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:04] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:04:13] (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:24] (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:25] (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:03] (PuppetFailure) resolved: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:01:39] btullis: could I ask for a review of this very small CR? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/990034/ [10:12:48] joal: brouberol: Temporarily disabling gobblin with: https://gerrit.wikimedia.org/r/c/operations/puppet/+/990605 [10:20:07] approved [10:28:14] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [10:28:48] btullis: note for later https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/990627 [10:29:23] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [10:30:06] btullis, brouberol - Heya - I'm following the course of actions - let me know if you wish me to do anything! [10:32:46] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Products Sprint 07): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (10phuedx) [10:47:10] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [10:48:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,produce_canary_events.service,refine_netflow.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:14] (SystemdUnitFailed) firing: (14) refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:54:24] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) All currently running production pipelines have completed. THere are some user-submitted jobs still running, but we are well within the wind... [10:54:43] !log putting HDFS into safe mode for T332573 [10:54:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:54:46] T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 [10:59:27] (03PS1) 10Jennifer Ebe: Fix scala job that publishes the XML dumps [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/990631 (https://phabricator.wikimedia.org/T346278) [10:59:37] !log disabling puppet on all hadoop nodes [10:59:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:03:57] !log stopping all hadoop services [11:03:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:05:25] (SystemdUnitFailed) firing: (159) refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:06:09] (03PS2) 10Jennifer Ebe: Fix scala job that publishes the XML dumps [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/990631 (https://phabricator.wikimedia.org/T346278) [11:06:23] PROBLEM - Hadoop DataNode on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:25] PROBLEM - Hadoop DataNode on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:25] PROBLEM - Hadoop DataNode on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:27] PROBLEM - Hadoop DataNode on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:27] PROBLEM - Hadoop DataNode on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:28] PROBLEM - Check systemd state on analytics1076 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:29] PROBLEM - Hadoop DataNode on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:30] PROBLEM - Hadoop DataNode on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:31] PROBLEM - Hadoop DataNode on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:32] PROBLEM - Hadoop DataNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:33] PROBLEM - Hadoop DataNode on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:34] PROBLEM - Hadoop DataNode on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:35] PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:36] PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:37] PROBLEM - Hadoop DataNode on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:06:38] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:39] PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:40] PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:41] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:42] PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:43] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:05] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:08:28] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e51f2667-db1a-4c38-b125-d23138404c64) set by btullis@cumin1002 for 7 days, 0:00:00 o... [11:09:13] (SystemdUnitFailed) firing: (22) hive-metastore.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:20] Flipping the nameserver switch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/990600 [11:09:43] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service,hadoop-mapreduce-historyserver.service,hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:49] PROBLEM - At least one Hadoop HDFS NameNode is active on an-master1001 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running [11:09:52] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8958a01b-be93-491e-8968-a18db034f488) set by btullis@cumin1002 for 7 days, 0:00:00 o... [11:10:09] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:10:13] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: hive-metastore.service,hive-server2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:25] PROBLEM - Hive Server on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:10:26] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d1c834b7-213d-439c-9ec0-27ed5a825a70) set by btullis@cumin1002 for 7 days, 0:00:00 o... [11:10:27] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:10:52] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a1f1c00f-92c4-4d89-8bf3-0cb5bc2f3d7d) set by btullis@cumin1002 for 7 days, 0:00:00 o... [11:14:16] (03CR) 10CI reject: [V: 04-1] Fix scala job that publishes the XML dumps [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/990631 (https://phabricator.wikimedia.org/T346278) (owner: 10Jennifer Ebe) [11:16:04] !log running puppet on journal nodes first for T332573 [11:16:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:16:07] T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 [11:16:31] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [11:17:01] RECOVERY - Hadoop DataNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:17:19] RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:17:41] RECOVERY - Hadoop DataNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:17:47] RECOVERY - Hadoop DataNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:17:53] RECOVERY - Hadoop DataNode on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:17:53] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:19] RECOVERY - Check systemd state on analytics1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:33] 10Data-Engineering, 10Metrics Platform Backlog, 10Spike: [SPIKE] Remove mentions of MetricsClient#dispatch() and the monoschema from documentation - https://phabricator.wikimedia.org/T355046 (10phuedx) [11:20:25] (SystemdUnitFailed) firing: (13) hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:20:46] !log running puppet on an-master1003 to set it to active for T332573 [11:20:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:22:15] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:17] !log puppet runs cleanly on an-master1003 and it is the active namenode - running puppet an an-master1004. [11:29:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:30:09] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:47] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:33:51] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:03] RECOVERY - Hive Server on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:34:07] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:34:14] (SystemdUnitFailed) firing: (14) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:07] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) We have now got to a state where the two new nameservers are up and running. ` btullis@an-master1003:~$ sudo kerberos-run-command hdfs /usr/... [11:36:49] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) ` btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get Safe mode is ON in an-master1003.eqiad.wmnet/1... [11:38:00] !log redeploying the Spark History Server to pick up the new HDFS namenodes - T332573 [11:38:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:38:03] T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 [11:40:01] RECOVERY - Hadoop DataNode on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:40:33] 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) >>! In T304379#9400845, @mpopov wrote: > Properties... [11:41:13] RECOVERY - Hadoop DataNode on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:13] RECOVERY - Hadoop DataNode on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:14] RECOVERY - Hadoop DataNode on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:15] RECOVERY - Hadoop DataNode on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:17] RECOVERY - Hadoop DataNode on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:19] RECOVERY - Hadoop DataNode on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:21] RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:21] RECOVERY - Hadoop DataNode on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:23] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:23] RECOVERY - Check systemd state on an-worker1134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:23] RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:25] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:25] RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:27] RECOVERY - Hadoop DataNode on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:29] RECOVERY - Hadoop DataNode on an-worker1100 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:33] RECOVERY - Check systemd state on an-worker1137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:33] RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:35] RECOVERY - Hadoop DataNode on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:37] RECOVERY - Check systemd state on an-worker1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:39] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:39] RECOVERY - Hadoop DataNode on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:40] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:41] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [11:41:41] RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:43] RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:47] RECOVERY - Check systemd state on an-worker1144 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:47] RECOVERY - Check systemd state on an-worker1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:47] RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:49] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:41:50] RECOVERY - Hadoop DataNode on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:51] RECOVERY - Hadoop DataNode on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:51] RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:53] RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:41:54] RECOVERY - Hadoop DataNode on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:41:54] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:56] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:56] RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:57] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:58] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:00] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:01] RECOVERY - Hadoop DataNode on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:01] RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:03] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:03] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:04] RECOVERY - Hadoop DataNode on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:05] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:06] RECOVERY - Hadoop DataNode on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:09] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:11] RECOVERY - Hadoop DataNode on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:13] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:42:15] (HdfsMissingBlocks) firing: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks [11:42:53] RECOVERY - Hadoop DataNode on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:55] RECOVERY - Hadoop DataNode on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:55] RECOVERY - Check systemd state on analytics1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:55] RECOVERY - Hadoop DataNode on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:42:57] RECOVERY - Hadoop DataNode on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:01] RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:03] RECOVERY - Hadoop DataNode on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:07] RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:07] RECOVERY - Hadoop DataNode on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:09] RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:09] RECOVERY - Hadoop DataNode on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:10] RECOVERY - Check systemd state on an-worker1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:11] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:15] RECOVERY - Hadoop DataNode on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:15] RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:17] RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:17] RECOVERY - Check systemd state on an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:21] RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:23] RECOVERY - Hadoop DataNode on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:23] RECOVERY - Hadoop DataNode on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:24] RECOVERY - Check systemd state on an-worker1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:25] RECOVERY - Hadoop DataNode on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:27] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:27] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:31] RECOVERY - Hadoop DataNode on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:33] RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:33] RECOVERY - Hadoop DataNode on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:37] RECOVERY - Hadoop DataNode on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:37] RECOVERY - Check systemd state on an-worker1154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:37] RECOVERY - Check systemd state on analytics1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:39] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:41] RECOVERY - Hadoop DataNode on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:43] RECOVERY - Hadoop DataNode on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:43:45] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:01] RECOVERY - Hadoop DataNode on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:44:03] RECOVERY - Hadoop DataNode on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:44:07] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:09] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:11] RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:19] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:25] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:44:41] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:45] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:49] RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:49] RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:05] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:49:06] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [11:51:12] 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) [11:52:15] (HdfsMissingBlocks) resolved: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks [11:55:03] !log re-enabling gobblin jobs [11:55:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:56:14] 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) [11:56:51] 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) [11:57:19] !log un-pausing all previously paused DAGS on all airflow instances for T332573 [11:57:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:57:22] T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 [11:57:52] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [11:58:03] 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) [11:58:13] 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) [11:58:41] 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) :point_up: ☕ required… [12:00:47] !log removing all downtime for hadoop-all for T332573 [12:00:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:05:25] (SystemdUnitFailed) firing: (16) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:54] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [12:09:09] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [12:09:13] (SystemdUnitFailed) firing: (16) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:12:44] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) yarn.wikimedia.org is redirecting to an-master1004.eqiad.wmnet:8088 and that doesn't work. Investigating now. [12:19:14] (SystemdUnitFailed) firing: (16) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:44] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) Ah, it's fine now. It was just a case of this: {T331448} Confirmed with: ` btullis@an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/... [12:31:53] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) I have updated https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration and several other references on Wi... [12:31:56] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) [12:34:14] (SystemdUnitFailed) firing: (15) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:23] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [12:49:13] (SystemdUnitFailed) firing: (11) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:54] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis) [13:03:17] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol) [13:03:55] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis) @Marostegui - would you mind removing the zarcillo entries for dbstore1003 and dbstore1005, when you get a chance please? Thanks. [13:04:33] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10Marostegui) I just removed this from zarcillo [13:04:59] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis) Many thanks. [13:06:34] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10Marostegui) Removed from zarcillo [13:13:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop-image-suggestions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:13] (SystemdUnitFailed) firing: (10) drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:18:39] (03CR) 10Ottomata: [C: 03+1] "+1 in general, but FYI, last year we refactored how we model common mediawiki fragments." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [13:22:03] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags... [13:37:25] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen) Following the Grafana dashboard review, I've performed some changes to it:... [13:39:34] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:07:23] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [14:30:19] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10AndrewTavis_WMDE) Hello all! Closing this ticket post the 1:1 with @JAllemandou as I have SSH access to the `wmde` instance on Airflow, have been added as a collaborator on GitLa... [14:30:33] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10AndrewTavis_WMDE) 05Open→03Resolved [14:45:21] 10Data-Engineering (Sprint 7), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10CodeReviewBot) joal opened htt... [14:49:38] 10Data-Engineering (Sprint 7), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10CodeReviewBot) joal opened htt... [15:07:09] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10BTullis) [15:11:26] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (10Gehel) [15:14:27] (03CR) 10Urbanecm: [C: 03+1] Add UserIsTemp property (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [15:15:40] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (10Gehel) [15:15:58] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Discovery-Search (Current work): Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 (10Gehel) [15:16:36] 10Data-Platform-SRE ( 2023/24 Q3 Milestone 2): Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (10brouberol) [15:18:22] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (10brouberol) [15:18:32] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Discovery-Search (Current work): Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 (10Gehel) [15:18:42] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (10Gehel) [15:26:39] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Setup an appropriate retention policy - https://phabricator.wikimedia.org/T354927 (10brouberol) 05Open→03Resolved [15:26:43] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol) [15:50:47] (03CR) 10Joal: [C: 03+1] "We've been reviewing the code in meeting, some minor changes will be added, code is globally ok. Thanks Gabriele :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) (owner: 10Gmodena) [16:45:43] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1002 for hosts: `dbstore1005.eqiad.wmnet` - dbstore1005.eqiad.wmnet (**PASS**) - Downtimed ho... [16:47:58] !log restarted the hive-server2 and hive-metastore services on an-coord100[3-4] which had been accidentally omitted earlier for T332573 [16:48:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:48:01] T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 [16:55:33] !log Clearing analytics failed aiflow tasks after fix [16:55:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:01:05] !log roll-restarting analytics druid cluster [17:01:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:02:58] !log roll-restarting public druid cluster [17:02:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:14:14] (SystemdUnitFailed) firing: (6) drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:07:23] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [18:07:23] (DruidSegmentsUnavailable) resolved: (2) More than 5 segments have been unavailable for webrequest_sampled_live on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [18:46:46] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10VPS-project-Codesearch, 10Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (10Peachey88) [18:54:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:14] (SystemdUnitFailed) firing: (6) drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:53] 10Data-Engineering (Sprint 7): [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (10Snwachukwu) The suggested approach for this will be to use spark to run the queries after which result will be saved in the cluster. However, spark saves files in folder an... [23:00:26] (SystemdUnitFailed) firing: (5) mw-cgroup.service Failed on snapshot1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:35:13] 10Analytics-Radar, 10Data-Engineering, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10MusikAnimal) >>! In T219857#9423357, @TheDJ wrote: > @MusikAnimal is this still an issue ? Since there hasn't happened anythin...