[00:24:07] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [01:18:00] (SystemdUnitFailed) firing: wikidatardf-truthy-dumps.service Failed on snapshot1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:31:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:44] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:27:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [02:47:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [03:32:28] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:32:32] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:32:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:30] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:47:42] PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:47:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:52:12] RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:52:30] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:52:44] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:26] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:54:50] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:57:44] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:24:07] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [05:05:52] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:06:46] PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:48] RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:22] RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:12:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:21] Refinery deployment in progress [06:19:47] !log Deployed refinery using scap, then deployed onto hdfs [06:19:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [06:45:21] 10Quarry, 10superset.wmcloud.org, 10cloud-services-team (FY2023/2024-Q1): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (10Aklapper) > So is it correct that we're looking for a new maintainer, but only in the capacity of migrating all usage of Quarry to Superset?... 
[06:47:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[06:52:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:21:12] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics '^(eqiad.mediawiki.accountcreation_block|eqiad.mediawiki.api-action|eqiad.mediawiki.centralnotice.campaign-change|eqiad...
[07:22:08] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto)
[07:24:57] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) p:05Triage→03Medium
[07:25:05] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) Thanks for opening the access request. There is an official [access request form](https://phabricator.wiki...
[07:30:43] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto)
[07:33:06] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) @Antoine_Quhen Can you confirm and add your wikitech username and email address in the task description?...
[07:44:25] 10Data-Engineering: Check home/HDFS leftovers of essexigyan - https://phabricator.wikimedia.org/T348106 (10MoritzMuehlenhoff)
[07:48:53] 10Data-Platform-SRE: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10brouberol) done!
[08:01:48] hello folks
[08:01:51] something super interesting:
[08:01:52] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-30d&orgId=1&to=now&viewPanel=72
[08:03:14] https://gerrit.wikimedia.org/r/c/operations/puppet/+/952160 was merged on the 19th, the big drop is related to when we increased kafka_message_max_bytes to 10MB
[08:03:37] from a guide:
[08:03:56] "Some reasons for increased time taken could be: increased load on the node (creating processing delays), or perhaps requests are being held in purgatory for a long time (determined by fetch.min.bytes and fetch.wait.max.ms)."
[08:04:39] we dropped to 1/3rd of what it was!!
[08:04:59] Interesting. But it's not like we actually started sending much larger messages yet, I don't think.
[08:06:23] btullis: yeah, I am reading https://www.confluent.io/blog/apache-kafka-purgatory-hierarchical-timing-wheels/
[08:06:37] totally ignorant about the request purgatory, but it seems a possible lead
[08:07:27] ah wait, it is main-eqiad
[08:07:33] I was convinced I had switched to jumbo
[08:07:41] wow, maybe it is related to kafka mirror then?
[08:10:22] 10Data-Platform-SRE: Facilitate users to query kafka topic metadata - https://phabricator.wikimedia.org/T344989 (10pfischer) @BTullis, yes and thank you for re-opening that ticket. I checked out datahub, but for some reason I do not see any key-value-pairs under the tab "properties": {F37942063} Is my account/ar...
[08:16:29] we have jmx metrics for the purgatory size, will add them to the dashboard
[08:16:34] doesn't seem to be the case though
[08:24:07] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[08:25:12] added two new panels :)
[08:26:36] other metrics improved, like the network processor avg idle percent
[08:26:42] and the request handler avg idle percent
[08:27:21] ok it was the switchover
[08:27:25] Nice, thanks. I'm still a bit confused because the link you initially sent was for FetchConsumer time. There was a big drop in this, but isn't lower better on this graph? What am I missing?
[08:27:27] https://grafana-rw.wikimedia.org/d/000000027/kafka?forceLogin&forceLogin&from=now-30d&orgId=1&to=now&var-cluster=kafka_main&var-datasource=thanos&var-disk_device=All&var-kafka_broker=All&var-kafka_cluster=main-codfw&viewPanel=71
[08:27:34] * elukey is stupid
[08:28:05] btullis: yes yes it is better, I wanted to figure out what it was!
[08:28:36] but coincidentally, there was the switchover, so main-codfw is more active now
[08:28:42] and main-eqiad is less loaded
[08:28:58] but I thought it was related to kafka main pulling from jumbo etc..
[08:29:06] nevermind, totally off track :)
[08:29:16] but I added two panels to grafana! :P
[08:29:43] Ah right, I think I just assumed that you were indicating a performance degradation, not a perceived boost. :-)
[08:30:02] Anyway, good investigative work and panel-craft. Thanks.
[08:35:09] sometimes it is good to check even when things go better! :D
[08:35:30] True :-)
[08:41:13] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (10dcausse) After upgrading the test job to the new kafka c...
[08:43:19] 10Data-Platform-SRE, 10Cloud-VPS, 10cloud-services-team, 10ops-eqiad, 10User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) a:03Jclark-ctr Hi! Can we please have `cloudvirt-wdqs100[1-3]` moved to the WMCS racks, preferably `E4` or `F4`? They will all need a s...
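For context on the 08:03–08:16 discussion above: the fetch purgatory is where a Kafka broker parks fetch requests it cannot answer yet, and the consumer settings quoted from the guide (fetch.min.bytes and fetch.wait.max.ms) control how long a request may sit there. The Fetch purgatory size the dashboard panels were added for is exposed over JMX as kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch. Below is a minimal sketch of those consumer settings, assuming kafka-python and placeholder broker/topic/group names; it is illustrative only, not the configuration used on the WMF clusters.

```python
# Sketch only: the consumer fetch settings named in the quoted guide.
# Broker, topic, and group names are placeholders, not WMF values.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "example-topic",                                 # placeholder topic
    bootstrap_servers=["broker.example.org:9092"],   # placeholder broker
    group_id="example-group",
    fetch_min_bytes=1,        # broker replies as soon as any data is available...
    fetch_max_wait_ms=500,    # ...or after 500 ms, releasing the request from the fetch purgatory
    max_partition_fetch_bytes=10 * 1024 * 1024,  # should be at least the broker's message.max.bytes
                                                 # (10 MB after the change discussed above)
)

for record in consumer:
    print(record.topic, record.partition, record.offset, len(record.value))
```

With fetch.min.bytes left at its default of 1, requests rarely wait long in purgatory, which is consistent with the conversation's eventual conclusion that the FetchConsumer time drop came from the datacentre switchover shifting load off main-eqiad rather than from the message-size change.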
[09:12:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:20:04] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Jelto) [09:32:22] 10Data-Engineering: Check home/HDFS leftovers of tsepothoabala - https://phabricator.wikimedia.org/T348114 (10MoritzMuehlenhoff) [10:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [11:04:18] 10Data-Engineering: Check home/HDFS leftovers of zxane - https://phabricator.wikimedia.org/T348127 (10MoritzMuehlenhoff) [11:12:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:25:31] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:26:59] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:23] PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:27:44] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:59] PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:13] RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:46:01] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:44] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:05] 
PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:49:19] PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:44] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:52:59] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:53:29] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:44] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:15] 10Data-Platform-SRE, 10Research, 10WMDE-TechWish-Maintenance-2023: Publish dump scraper reports - https://phabricator.wikimedia.org/T341751 (10thiemowmde) 05Open→03Resolved [12:11:15] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:11:27] RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:33] PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:45:43] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:53] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch. All topics are ~empty: ` topicmappr rebuild --topics '^(eqiad.mediawiki.job.cirrusSearchElasticaWrite|eqiad.mediawiki.job.cirrusSearchIncomingLinkCount|eqiad.mediawiki.job.cirrusSearchLink... 
[12:47:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:55] RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:53:07] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:26] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch. All topics are ~empty except for `eqiad.mediawiki.job.parsoidCachePrewarm` that has a single partition of ~16GB. ` topicmappr rebuild --topics '^(eqiad.mediawiki.job.gwtoolsetUploadMediaf... [12:57:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:02] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch. ` topicmappr rebuild --topics '^(eqiad.mediawiki.job.securePollArchiveElection|eqiad.mediawiki.job.securePollLogAdminAction|eqiad.mediawiki.job.securePollTallyElection|eqiad.mediawiki.job.... [13:24:41] 10Data-Engineering, 10Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (10Ottomata) [13:26:08] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (10dcausse) The new kafka connectors seem to do low level i... [13:27:45] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch ` topicmappr rebuild --topics '^(eqiad.mediawiki.pref_diff|eqiad.mediawiki.reading_depth|eqiad.mediawiki.recentchange|eqiad.mediawiki.resource-change|eqiad.mediawiki.revision-crate|eqiad.me... [13:31:55] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) Checksum does not match the version from `wdqs1016`, which is: ` sha1sum wikidata.jnl.zst e3197eb5177dcd1aa0956824cd8dc4afc2d8796c wiki... [13:32:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:47] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team, 10Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10jbond) > So why are some returning with uppercase padded zeros, while others are returned witho... 
[13:47:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:54:10] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [13:55:14] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch ` topicmappr rebuild --topics '^(eqiad.mediawiki.revision_score_reverted|eqiad.mediawiki.revisioni-score|eqiad.mediawiki.searchpreview|eqiad.mediawiki.skin_diff|eqiad.mediawiki.special_diff... [13:55:58] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Add $comment and $performer to ArticleRevisionVisibilitySet params - https://phabricator.wikimedia.org/T321411 (10Ottomata) 05Open→03Declined > I'm not sure if providing the reason to the hook for a revision suppression is... [14:03:48] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) Next batch. Most of the topics are ~empty. ` topicmappr rebuild --topics '^(eqiad.mwcli.command_execute|eqiad.null|eqiad.page_content_change|eqiad.page_content_change.v1|eqiad.rc0.mediawiki.page_chan... [14:23:33] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [14:30:32] 10Data-Platform-SRE, 10Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (10Ottomata) > While super useful when it works, the feature is not stable enough to roller-out to production @JAllemandou, can you explain what was unstable / didn't work? [14:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [14:40:22] 10Data-Platform-SRE, 10Elasticsearch, 10Discovery-Search (Current work): Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (10bking) [14:40:33] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) 05Open→03In progress p:05Medium→03Low a:03bking [14:41:42] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Taking this back, as I was able to get the host to boot by changing the boot option for the 2nd NIC interfac... 
[14:44:36] 10Data-Platform-SRE: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10bking) OK, I got the host to boot, now we're getting partman errors (fetched from `/var/log/partman` via `install-console`) ` /lib/partman/init.d/25md-devices: ******************************************... [14:45:59] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) stream-beta works for me! Go for it! [14:50:37] 10Data-Platform-SRE, 10Patch-For-Review: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10fgiunchedi) The merged partman recipe contains multiple errors, as pointed out in https://gerrit.wikimedia.org/r/c/operations/puppet/+/961478/comments/92a7a209_4f5f5f84 [14:55:46] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [14:56:17] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [15:00:20] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [15:01:24] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (10CodeReviewBot) dcausse opened https://gitlab.wikimedia.o... [15:07:56] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914 (10CodeReviewBot) dcausse merged https://gitlab.wikimedia.o... [15:17:28] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [15:21:29] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [15:35:18] 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (10BTullis) Presto version 0.284 was released yesterday: https://github.com/prestodb/presto/releases/tag/0.284 The release notes seem to have been slightly delayed, but are available [[https://... 
[15:37:35] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [15:40:03] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [15:47:22] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [15:47:25] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [16:20:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [16:33:07] 10Data-Platform-SRE: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10bking) [16:34:23] 10Data-Platform-SRE: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10bking) Confirmed that the partman recipe is working via `install-console`: ` root@cloudelastic1007:~# lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 0 1.7T 0 disk ├─sda1 8:1... [16:39:11] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Hello DC Ops, I've confirmed that our new partman recipe works in T342463 , but the reimage for `cloudelas... [16:39:34] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) p:05Low→03Medium a:05bking→03None [16:41:42] 10Data-Platform-SRE: Write new partman recipe for cloudelastic - https://phabricator.wikimedia.org/T342463 (10bking) 05Open→03Resolved [16:41:44] 10Data-Platform-SRE, 10Elasticsearch, 10Discovery-Search (Current work): Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (10bking) [16:43:29] 10Data-Platform-SRE, 10Elasticsearch, 10Discovery-Search (Current work): Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (10bking) 05Open→03Declined Closing as declined, since we decided to use RAID0 instead of JBOD. 
[16:49:12] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [16:49:16] 10Data-Engineering, 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [16:55:04] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [17:06:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 90% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [17:06:46] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10bking) [17:06:50] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) More details related to dump loading in T325114 . [17:07:26] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) [17:09:39] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) a:03bking [17:47:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [18:51:50] 10Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (10rook) [18:58:36] 10Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (10rook) Added ` root@cloudcontrol1005:~# openstack role add --project quarry --user sd member root@cloudcontrol1005:~# openstack role add --project quarry --user sd reader root@cloudcontrol1005:~# openstack role add --project quarry... 
[20:05:00] (03CR) 10David Martin: [C: 03+1] Add the wikifunctions_ui metrics platform schema to the allowlist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/962657 (https://phabricator.wikimedia.org/T344277) (owner: 10MNeisler) [20:06:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 90% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [20:20:42] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [20:25:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [20:26:28] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 (10bking) [20:59:50] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) EventGate update PR: https://github.com/wikimedia/eventgate/pull/22 [21:02:37] PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:05] RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:44] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:25] PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:49] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:12:44] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:37] PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:14:31] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:44] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:30:35] RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:03] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:31:13] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:31:21] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:44] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:36:41] PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:05] PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:37:44] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:42:44] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:43:16] PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:44:04] 
PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:20] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:34] RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:47:44] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:49:00] RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:51:20] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:06] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:52:20] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:44] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:12] RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:56:58] RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:45] (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [22:45:46] 10Data-Engineering: Indicate cluster location metadata for Druid datasets - https://phabricator.wikimedia.org/T348204 (10odimitrijevic) [23:02:44] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
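A closing note on the HdfsTotalFilesHeap alert that fires repeatedly in this log: it compares the number of filesystem objects the NameNode must track against the heap allocated to it. The sketch below is a back-of-envelope illustration using the commonly cited rule of thumb of roughly 1 GB of NameNode heap per million objects; this figure is an assumption for illustration, not the threshold the alert actually uses (that lives in the linked wikitech runbook).

```python
# Rough NameNode heap estimate. The ~1 GB per million objects figure is a widely
# quoted rule of thumb, not the threshold used by the HdfsTotalFilesHeap alert.
def estimated_namenode_heap_gb(total_objects: int, gb_per_million_objects: float = 1.0) -> float:
    """total_objects = files + directories + blocks tracked by the NameNode."""
    return total_objects / 1_000_000 * gb_per_million_objects

if __name__ == "__main__":
    # Hypothetical example: 90 million objects would call for on the order of 90 GB of heap.
    print(f"{estimated_namenode_heap_gb(90_000_000):.0f} GB")
```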