[02:04:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:04:35] <icinga-wm>	 PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-hdfs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:04:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:58:04] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:50:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:05:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:58:04] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:04:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:00:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) hadoop-namenode-backup-hdfs.service Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:03] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:01:39] <brouberol>	 btullis: could I ask for a review of this very small CR? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/990034/
[10:12:48] <btullis>	 joal: brouberol: Temporarily disabling gobblin with: https://gerrit.wikimedia.org/r/c/operations/puppet/+/990605
[10:20:07] <brouberol>	 approved
[10:28:14] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[10:28:48] <brouberol>	 btullis: note for later https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/990627
[10:29:23] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[10:30:06] <joal>	 btullis, brouberol - Heya - I'm following the course of actions - let me know if you wish me to do anything!
[10:32:46] <wikibugs>	 Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (phuedx)
[10:47:10] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[10:48:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-webrequest.service,produce_canary_events.service,refine_netflow.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:49:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:54:24] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis) All currently running production pipelines have completed. There are some user-submitted jobs still running, but we are well within the wind...
[10:54:43] <btullis>	 !log putting HDFS into safe mode for T332573
[10:54:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:54:46] <stashbot>	 T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573
[10:59:27] <wikibugs>	 (PS1) Jennifer Ebe: Fix scala job that publishes the XML dumps [analytics/refinery/source] - https://gerrit.wikimedia.org/r/990631 (https://phabricator.wikimedia.org/T346278)
[10:59:37] <btullis>	 !log disabling puppet on all hadoop nodes
[10:59:39] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:03:57] <btullis>	 !log stopping all hadoop services
[11:03:58] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:05:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (159) refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:06:09] <wikibugs>	 (PS2) Jennifer Ebe: Fix scala job that publishes the XML dumps [analytics/refinery/source] - https://gerrit.wikimedia.org/r/990631 (https://phabricator.wikimedia.org/T346278)
[11:06:23] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:25] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:25] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:27] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:27] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:28] <icinga-wm>	 PROBLEM - Check systemd state on analytics1076 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:29] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:30] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:31] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:32] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:33] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:34] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:35] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:36] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:37] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:06:38] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:39] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:40] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:41] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:42] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:06:43] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:08:28] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e51f2667-db1a-4c38-b125-d23138404c64) set by btullis@cumin1002 for 7 days, 0:00:00 o...
[11:09:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (22) hive-metastore.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:20] <btullis>	 Flipping the namenode switch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/990600
[11:09:43] <icinga-wm>	 PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service,hadoop-mapreduce-historyserver.service,hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:49] <icinga-wm>	 PROBLEM - At least one Hadoop HDFS NameNode is active on an-master1001 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running
[11:09:52] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8958a01b-be93-491e-8968-a18db034f488) set by btullis@cumin1002 for 7 days, 0:00:00 o...
[11:10:09] <icinga-wm>	 PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:10:13] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: hive-metastore.service,hive-server2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:25] <icinga-wm>	 PROBLEM - Hive Server on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:10:26] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d1c834b7-213d-439c-9ec0-27ed5a825a70) set by btullis@cumin1002 for 7 days, 0:00:00 o...
[11:10:27] <icinga-wm>	 PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:10:52] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a1f1c00f-92c4-4d89-8bf3-0cb5bc2f3d7d) set by btullis@cumin1002 for 7 days, 0:00:00 o...
[11:14:16] <wikibugs>	 (CR) CI reject: [V: -1] Fix scala job that publishes the XML dumps [analytics/refinery/source] - https://gerrit.wikimedia.org/r/990631 (https://phabricator.wikimedia.org/T346278) (owner: Jennifer Ebe)
[11:16:04] <btullis>	 !log running puppet on journal nodes first for T332573
[11:16:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:16:07] <stashbot>	 T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573
[11:16:31] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[11:17:01] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:17:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:17:41] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:17:47] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:17:53] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:17:53] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:19] <icinga-wm>	 RECOVERY - Check systemd state on analytics1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:33] <wikibugs>	 Data-Engineering, Metrics Platform Backlog, Spike: [SPIKE] Remove mentions of MetricsClient#dispatch() and the monoschema from documentation - https://phabricator.wikimedia.org/T355046 (phuedx)
[11:20:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:20:46] <btullis>	 !log running puppet on an-master1003 to set it to active for T332573
[11:20:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:22:15] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:17] <btullis>	 !log puppet runs cleanly on an-master1003 and it is the active namenode - running puppet on an-master1004.
[11:29:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:30:09] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:47] <icinga-wm>	 RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:33:51] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:34:03] <icinga-wm>	 RECOVERY - Hive Server on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:34:07] <icinga-wm>	 RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[11:34:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:36:07] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis) We have now got to a state where the two new nameservers are up and running. ` btullis@an-master1003:~$ sudo kerberos-run-command hdfs /usr/...
[11:36:49] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis) ` btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get Safe mode is ON in an-master1003.eqiad.wmnet/1...
[11:38:00] <brouberol>	 !log redeploying the Spark History Server to pick up the new HDFS namenodes - T332573
[11:38:03] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:38:03] <stashbot>	 T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573
[11:40:01] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:40:33] <wikibugs>	 Data-Engineering-Radar, Data Products, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (phuedx) >>! In T304379#9400845, @mpopov wrote: > Properties...
[11:41:13] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:13] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:14] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:15] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:17] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:19] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:21] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:21] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:25] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:25] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:27] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:29] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1100 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:35] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:39] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:40] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:41] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[11:41:41] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:43] <icinga-wm>	 RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1144 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:49] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:41:50] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:51] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:51] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:53] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:41:54] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:41:54] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:56] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:56] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:58] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:00] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:01] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:04] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:06] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:11] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:42:15] <jinxer-wm>	 (HdfsMissingBlocks) firing: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks
[11:42:53] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1095 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:55] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:55] <icinga-wm>	 RECOVERY - Check systemd state on analytics1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:55] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:42:57] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:03] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:07] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:07] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:09] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:09] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:10] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:11] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:15] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:15] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:17] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:17] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:21] <icinga-wm>	 RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:23] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:23] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:24] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:25] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:43:27] <icinga-wm>	 RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:31] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:43:33] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:37] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:37] <icinga-wm>	 RECOVERY - Check systemd state on analytics1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:43:41] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:43] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:43:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:01] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:44:03] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:44:07] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:09] <icinga-wm>	 RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:25] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:44:41] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:45] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:49] <icinga-wm>	 RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:49:06] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis)
[11:51:12] <wikibugs>	 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx)
[11:52:15] <jinxer-wm>	 (HdfsMissingBlocks) resolved: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks
[11:55:03] <btullis>	 !log re-enabling gobblin jobs
[11:55:05] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:56:14] <wikibugs>	 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx)
[11:56:51] <wikibugs>	 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx)
[11:57:19] <btullis>	 !log un-pausing all previously paused DAGS on all airflow instances for T332573
[11:57:22] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:57:22] <stashbot>	 T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573
[11:57:52] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis)
[11:58:03] <wikibugs>	 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx)
[11:58:13] <wikibugs>	 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx)
[11:58:41] <wikibugs>	 10Data-Engineering-Radar, 10Data Products, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (10phuedx) :point_up: ☕ required…
[12:00:47] <btullis>	 !log removing all downtime for hadoop-all for T332573
[12:00:50] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:05:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:54] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis)
[12:09:09] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis)
[12:09:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:12:44] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) yarn.wikimedia.org is redirecting to an-master1004.eqiad.wmnet:8088 and that doesn't work. Investigating now.
[12:19:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:44] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) Ah, it's fine now. It was just a case of this: {T331448} Confirmed with: ` btullis@an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/...
[12:31:53] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis) I have updated https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration and several other references on Wi...
[12:31:56] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (10BTullis)
[12:34:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:46:13] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:23] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[12:49:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) analytics-reportupdater-logs-rsync.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:02:54] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis)
[13:03:17] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[13:03:55] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis) @Marostegui - would you mind removing the zarcillo entries for dbstore1003 and dbstore1005, when you get a chance please? Thanks.
[13:04:33] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10Marostegui) I just removed this from zarcillo
[13:04:59] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1003 - https://phabricator.wikimedia.org/T351923 (10BTullis) Many thanks.
[13:06:34] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10Marostegui) Removed from zarcillo
[13:13:20] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop-image-suggestions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:13] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:18:39] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "+1 in general, but FYI, last year we refactored how we model common mediawiki fragments." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[13:22:03] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10CodeReviewBot) aqu opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags...
[13:37:25] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen) Following the Grafana dashboard review, I've performed some changes to it:...
[13:39:34] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[14:07:23] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[14:30:19] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10AndrewTavis_WMDE) Hello all! Closing this ticket post the 1:1 with @JAllemandou as I have SSH access to the `wmde` instance on Airflow, have been added as a collaborator on GitLa...
[14:30:33] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10AndrewTavis_WMDE) 05Open→03Resolved
[14:45:21] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10CodeReviewBot) joal opened htt...
[14:49:38] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10CodeReviewBot) joal opened htt...
[15:07:09] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10BTullis)
[15:11:26] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10CirrusSearch, 10Discovery-Search (Current work): SUP: Production TODOs - https://phabricator.wikimedia.org/T354595 (10Gehel)
[15:14:27] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] Add UserIsTemp property (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[15:15:40] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (10Gehel)
[15:15:58] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Discovery-Search (Current work): Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 (10Gehel)
[15:16:36] <wikibugs>	 10Data-Platform-SRE ( 2023/24 Q3 Milestone 2): Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (10brouberol)
[15:18:22] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (10brouberol)
[15:18:32] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Discovery-Search (Current work): Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 (10Gehel)
[15:18:42] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata, 10Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (10Gehel)
[15:26:39] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Setup an appropriate retention policy - https://phabricator.wikimedia.org/T354927 (10brouberol) 05Open→03Resolved
[15:26:43] <wikibugs>	 10Data-Engineering (Sprint 7), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[15:50:47] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "We've been reviewing the code in meeting, some minor changes will be added, code is globally ok. Thanks Gabriele :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) (owner: 10Gmodena)
[16:45:43] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Decommission dbstore1005 - https://phabricator.wikimedia.org/T351925 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1002 for hosts: `dbstore1005.eqiad.wmnet` - dbstore1005.eqiad.wmnet (**PASS**)   - Downtimed ho...
[16:47:58] <btullis>	 !log restarted the hive-server2 and hive-metastore services on an-coord100[3-4] which had been accidentally omitted earlier for T332573
[16:48:01] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:48:01] <stashbot>	 T332573: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573
[16:55:33] <joal>	 !log Clearing analytics failed airflow tasks after fix
[16:55:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:01:05] <btullis>	 !log roll-restarting analytics druid cluster
[17:01:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:02:58] <btullis>	 !log roll-restarting public druid cluster
[17:02:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:14:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:07:23] <jinxer-wm>	 (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[18:07:23] <jinxer-wm>	 (DruidSegmentsUnavailable) resolved: (2) More than 5 segments have been unavailable for webrequest_sampled_live on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[18:46:46] <wikibugs>	 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10VPS-project-Codesearch, 10Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (10Peachey88)
[18:54:57] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:59:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) drop-image-suggestions.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:26:53] <wikibugs>	 10Data-Engineering (Sprint 7): [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (10Snwachukwu) The suggested approach for this will be to use spark to run the queries after which result will be saved in the cluster. However, spark saves files in folder an...
[23:00:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) mw-cgroup.service Failed on snapshot1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:35:13] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10MusikAnimal) >>! In T219857#9423357, @TheDJ wrote: > @MusikAnimal is this still an issue ? Since there hasn't happened anythin...