[00:09:49] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.295% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [04:13:22] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.293% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:25:46] 10Data-Platform-SRE, 10Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1011.eqiad.wmnet with OS bullseye [07:43:26] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) [07:44:01] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 9 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10elukey) [07:44:31] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) 05Open→03Resolved a:03elukey Beta deployed by @Ottomata as part of T351247 (thanks!) [08:13:21] 10Data-Platform-SRE: Create aggregate alarms for Hadoop daemons running on worker nodes - https://phabricator.wikimedia.org/T287027 (10Aklapper) a:05BTullis→03None @BTullis: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of thi... [08:13:22] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.293% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:13:26] 10Data-Engineering: Use types in Analytics Puppet classes/profiles/etc.. - https://phabricator.wikimedia.org/T252617 (10Aklapper) a:05BTullis→03None @BTullis: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because... [08:17:05] 10Data-Platform-SRE, 10Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1011.eqiad.wmnet with OS bullseye completed: - druid1011 (**WARN**) - Downtimed on Icinga/... [08:22:09] 10Data-Engineering, 10Data-Platform-SRE: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? - https://phabricator.wikimedia.org/T288247 (10Aklapper) a:05Ottomata→03None @Ottomata: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_clean... [08:22:25] 10Analytics-Kanban, 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-EventLogging, and 2 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10Aklapper) a:05Ottomata→03None @Ottomata: Per email... [08:24:29] 10Analytics-Radar, 10Data-Engineering-Icebox, 10Wikipedia-Android-App-Backlog: android image_recommendation_interaction decommissioning - https://phabricator.wikimedia.org/T284620 (10Aklapper) a:05Sharvaniharan→03None @Sharvaniharan: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_... [09:18:16] (EventgateValidationErrors) firing: ... [09:18:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [10:33:16] (EventgateValidationErrors) resolved: ... [10:33:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [10:39:46] (EventgateValidationErrors) firing: ... [10:39:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [10:44:14] 10Data-Engineering, 10Data-Platform-SRE, 10Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (10BTullis) Thanks. I have created that request here: {T351657} [10:44:17] 10Data-Engineering, 10Data-Platform-SRE, 10Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (10BTullis) [10:47:39] 10Data-Engineering, 10Data-Platform-SRE, 10Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (10BTullis) [10:57:36] 10Data-Engineering, 10WMDE-FUN-Team, 10WMDE-Fundraising-Tech, 10Event-Platform, 10WMDE-FUN-Funban-2023: Validation Error for eventlogging_WMDEBannerSizeIssue - https://phabricator.wikimedia.org/T344027 (10BTullis) Hello, We're still seeing a considerable number of these event validation errors, even thou... [11:31:49] 10Data-Platform-SRE, 10Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (10jcrespo) > BTullis closed this task as Resolved. What about? > [ ] Ensure backups are running on the right host(s) and with the latest data (e.g. attempt a test recovery) [12:09:45] (EventgateValidationErrors) resolved: ... [12:09:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [12:13:22] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.293% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:38:15] (EventgateValidationErrors) firing: ... [12:38:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [13:07:59] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) I'm beginning to work on this now. I updated the existing patch to the spark images in the `production-images` repo: https://gerrit.wiki... [13:23:56] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10JAllemandou) I have added documentation here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Developer_guide#Concurrent_DAG_Runs I t... [13:24:27] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) [13:34:37] 10Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (10BTullis) [13:36:34] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) [13:40:50] 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10BTullis) [13:54:15] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10Ottomata) I only deployed changeprop, not cpjobqueue. [14:02:09] 10Data-Engineering, 10Event-Platform, 10MW-1.40-notes (1.40.0-wmf.8; 2022-10-31), 10Platform Team Initiatives (Modern Event Platform (TEC2)): Allow disabling/enabling configured streams via wgEventStreams config - https://phabricator.wikimedia.org/T259712 (10Ottomata) 05Resolved→03Open Huh, I don't thi... [14:17:38] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) [14:18:15] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) [14:23:25] 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10BTullis) [14:23:27] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) [14:23:36] 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10BTullis) p:05Triage→03High [14:23:50] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) p:05Triage→03High [14:24:57] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) I have asked the #dse-k8s-sig group whether or not they have any opinions on the best way to organise this. [14:25:23] 10Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (10BTullis) [14:25:25] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) [14:25:34] 10Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (10BTullis) p:05Triage→03High [14:27:41] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) [14:31:01] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10Ottomata) I think the two namespaces two helmfile option is better in this case. I thought there was a example of different helmfiles sharing a common values file (`_mediawiki... [14:32:05] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1015.eqiad.wmnet with OS bullseye [14:35:50] 10Data-Platform-SRE: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (10BTullis) [14:36:07] 10Data-Platform-SRE: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (10BTullis) p:05Triage→03High [14:38:16] (EventgateValidationErrors) resolved: ... [14:38:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [14:42:29] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) >>! In T351716#9349220, @Ottomata wrote: > I think the two namespaces two helmfile option is better in this case. I thought there was a example of different helmfiles... [14:43:46] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) [14:59:38] 10Data-Platform-SRE: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 (10Stevemunene) a:03Stevemunene [15:03:58] 10Data-Platform-SRE, 10Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10Stevemunene) `druid1011` is also up with the right partitions, rebalanced and repooled. Moving on to the OS upgrade of druid100[7-8] as [[ https://gitlab.wikimedia.org/repos/data-engineering/airfl... [15:04:28] !log pool druid1011 after reimage T336043 [15:04:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:04:31] T336043: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 [15:05:32] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1015.eqiad.wmnet with OS bullseye completed: - aqs1015 (**WARN**) - Downtimed... [15:07:47] 10Data-Platform-SRE, 10Cassandra, 10Patch-For-Review: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [15:12:33] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye [15:29:11] 10Data-Engineering: Turnilo: invalid transforms on wmf_netflow dashboard - https://phabricator.wikimedia.org/T351731 (10ayounsi) p:05Triage→03High [15:32:34] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (10TBurmeister) 05In progress→03Stalled Marking this as stalled while we work on improving and migrating the Data Platform docs on Wikitech (T350911), which is a prerequisite fo... [15:45:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on an-worker1121:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:46:45] PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:23] PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:25] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:55:42] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on an-worker1121:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:57:51] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:35] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:00:42] (SystemdUnitFailed) firing: (3) export_smart_data_dump.service Failed on an-worker1121:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:13:22] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.293% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:19:33] RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:35] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:21:55] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:22:35] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:42] (SystemdUnitFailed) firing: (3) export_smart_data_dump.service Failed on an-worker1121:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:27:27] 10Data-Platform-SRE: Project future physical host usage for Search Platform-owned services - https://phabricator.wikimedia.org/T350885 (10Gehel) 05Open→03Resolved [16:29:41] PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:42] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on an-worker1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:30:43] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:34:51] RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:35:11] RECOVERY - Check systemd state on an-worker1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:42] (SystemdUnitFailed) firing: (5) hadoop-yarn-nodemanager.service Failed on an-worker1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:42] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on an-worker1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:43:15] RECOVERY - Check systemd state on an-worker1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:42] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on an-worker1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:59] (PuppetFailure) firing: Puppet has failed on aqs1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:50:42] (SystemdUnitFailed) resolved: (4) hadoop-yarn-nodemanager.service Failed on an-worker1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:03] 10Data-Engineering (Sprint 5), 10Event-Platform, 10Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10elukey) >>! In T351247#9346214, @Ottomata wrote: > Anyway, in summary, I believe we can deploy this in December after I'm back a... [17:01:04] 10Data-Engineering (Sprint 5), 10Event-Platform, 10Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (10Ottomata) +1 [17:16:30] Hello o/ Is there any monitoring of refining legacy EventLogging events? I've scheduled https://gerrit.wikimedia.org/r/c/operations/puppet/+/908382/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/894000/ for deployment today and I wanted to make sure I can monitor things for a while afterwards (even though they're simple changes) [17:18:42] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:35] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:19:51] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:28] phuedx: i don't know if there is anything in particular to monitor in that case. the refine.pp change will only affect hourly(i think) refine_eventlogging_legacy job. If that breaks, we'd get an email alert about it. [17:20:50] the plugins.py change only affects old eventlogging-processor, and, actually, I'd probably just leave it there [17:21:24] it prevents old eventlogging processor from producing events sent via the old varnishkafka eventlogging raw stream [17:21:48] it is not impossibly that someone naughty could emit a raw FeaturePolicyViolation event to the varnishkafka eventlogging endpoint [17:21:48] Thanks ottomata [17:21:58] if they did that, we'd still want old eventlogging-processor to skip it [17:24:43] phuedx: am wondering...how many events do we still have flowing throgh old eventlogging, i know we still probably have lots of android app related ones? [17:25:14] i dont' think we really have metrics on this, the throughput metricswe have are based on eventlogging_* kafka topics, which are still produced to via the migrated legacy ones [17:25:25] ottomata: Android app and MediaWikiPingback [17:25:55] Regarding pingback, I've proposed a patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/938271 [17:25:57] aye, i wonder what the throughput is on these? and also...do people use the data still? I suppose the android ones they do? [17:26:15] awesoome [17:26:50] right right.. [17:27:35] phuedx: wouldn't the stream name be eventlogging_MediaWikiPingback ? [17:28:14] oh reading your latest comment on the task [17:28:54] ottomata: Wait... I think I got myself confused in that comment [17:28:55] phuedx: is mediawikipingback something that metrics platfrom schema can accomodate? [17:29:49] Events submitted via the olde EventLogging pipeline will be the same shape as the events sent using the new pipeline in my patch, right? [17:30:15] In other words, yes! I should just set the stream name to eventlogging_MediaWikiPingback [17:30:59] (PuppetFailure) firing: (2) Puppet has failed on aqs1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:31:38] yes, indeed! [17:32:03] Nice. I'll submit a new PS! [17:32:38] awesoome thank you! it wil be amazing to close that el schema migration task, and then only have the big parent one. then we can bother android app (and uh, by proxy metrics platform) more and say they are the only ones left? [17:32:40] :) [17:33:42] (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:03] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:03] PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:35:11] PROBLEM - Check systemd state on an-worker1155 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:09] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:37:12] Done [17:37:49] RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:37:57] RECOVERY - Check systemd state on an-worker1155 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:42] (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:39:15] (EventgateValidationErrors) firing: ... [17:39:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:43:42] (SystemdUnitFailed) resolved: (2) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:12] (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:15] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:48:09] PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:11] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:49:12] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:51:51] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Goal: Event Platform: Stream Connectors - https://phabricator.wikimedia.org/T214430 (10Ottomata) In {T317045}, the search team is using the Flink ElasticSearch connector to update ElasticSearch indexes. If Event Platform had abstracted connector support... [17:54:12] (SystemdUnitFailed) firing: (5) hadoop-yarn-nodemanager.service Failed on an-worker1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:16] (EventgateValidationErrors) resolved: ... [17:54:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:54:35] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) [17:58:03] (03CR) 10Ottomata: [C: 03+1] Migrate MediaWikiPingback schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/938814 (https://phabricator.wikimedia.org/T323828) (owner: 10Phuedx) [17:59:07] RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:09] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:59:59] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:04:12] (SystemdUnitFailed) resolved: (3) hadoop-yarn-nodemanager.service Failed on an-worker1112:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:12] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1112:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:13] PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:16:47] PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:13] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [18:19:06] ah phuedx right... [18:19:25] the exclude list for the refine_eventlogging_analytics job is the inverse of the refine_evnetlogging_legacy job [18:19:52] so now refine_eventlogging_analytics is tryign to refine featurepolicyviolation events [18:19:52] really, before we change any of that stuff, we should disable/decom the stream config [18:20:15] i think canary events are probably being produced [18:31:19] RECOVERY - Hadoop NodeManager on an-worker1127 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:31:51] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:12] (SystemdUnitFailed) resolved: hadoop-yarn-nodemanager.service Failed on an-worker1127:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:38:09] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye [18:38:33] 10Data-Engineering, 10Product-Analytics, 10Event-Platform: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (10Ottomata) Almost all old EventLogging schemas have been (or are being) migrated. The remaining ones are mostly MobileWikiApp* ones. Here's a [[ https:/... [18:40:21] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:53] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:41:12] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on an-worker1105:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:41:23] PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:37] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:42:21] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:59] PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:44:29] PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:44:57] PROBLEM - Check systemd state on an-worker1146 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:05] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:17] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:45:45] RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:45:47] PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:45:53] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:21] PROBLEM - Check systemd state on an-worker1078 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:23] PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:25] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:46:27] PROBLEM - Check systemd state on an-worker1090 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:27] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:49] PROBLEM - Check systemd state on an-worker1133 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:49] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:46:55] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:49:57] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:50:31] RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:12] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:51:27] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:56:23] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:57:29] RECOVERY - Check systemd state on an-worker1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:01] RECOVERY - Check systemd state on an-worker1111 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:17] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:01:09] RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:01:12] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:01:37] RECOVERY - Check systemd state on an-worker1146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:43] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:04:37] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:49] RECOVERY - Check systemd state on an-worker1133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:53] PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:04:57] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:05:01] PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:06:12] (SystemdUnitFailed) firing: (9) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:08:06] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye completed: - aqs1016 (**PASS**) - Downtimed on Icinga/Alertmanage... [19:08:59] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:09:39] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:09:57] RECOVERY - Check systemd state on an-worker1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:09:59] PROBLEM - Check systemd state on an-worker1123 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:14] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [19:10:23] RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:31] RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:10:59] (PuppetFailure) firing: (2) Puppet has failed on aqs1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:11:01] phuedx: FYI: https://gerrit.wikimedia.org/r/c/operations/puppet/+/976303 [19:11:13] (SystemdUnitFailed) firing: (8) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:11:27] ottomata: Ah! Thanks <3 [19:11:34] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1017.eqiad.wmnet with OS bullseye [19:14:17] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:45] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:16:12] (SystemdUnitFailed) firing: (8) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:43] 10Analytics-Kanban, 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-EventLogging, and 2 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10Ottomata) [19:21:53] 10Data-Engineering, 10Epic, 10Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10Ottomata) [19:30:59] (PuppetFailure) resolved: Puppet has failed on aqs1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:36:23] RECOVERY - Check systemd state on an-worker1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:25] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:38:24] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [19:41:12] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:41:43] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:41:51] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:57] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:42:01] PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:41] PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:44:55] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:27] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:46:07] PROBLEM - Check systemd state on an-worker1137 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:12] (SystemdUnitFailed) firing: (7) hadoop-yarn-nodemanager.service Failed on an-worker1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:46:29] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:46:53] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:49] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:48:09] PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:38] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1017.eqiad.wmnet with OS bullseye completed: - aqs1017 (**PASS**) - Downtimed on Icinga/Alertmanage... [19:49:53] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [19:50:15] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:50:19] RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:13] (SystemdUnitFailed) firing: (7) hadoop-yarn-nodemanager.service Failed on an-worker1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:51:59] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1018.eqiad.wmnet with OS bullseye [19:54:07] PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:41] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:56:12] (SystemdUnitFailed) firing: (8) hadoop-yarn-nodemanager.service Failed on an-worker1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:58:23] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [20:01:10] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:12] (SystemdUnitFailed) firing: (8) hadoop-yarn-nodemanager.service Failed on an-worker1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:01:34] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:02:34] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:03:10] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:34] RECOVERY - Check systemd state on an-worker1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:36] RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:04:50] RECOVERY - Check systemd state on an-worker1137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:12] (SystemdUnitFailed) firing: (8) hadoop-yarn-nodemanager.service Failed on an-worker1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:06:39] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:06:51] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:55] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:08:11] RECOVERY - Hadoop NodeManager on an-worker1086 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:08:13] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:12] (SystemdUnitFailed) firing: (7) hadoop-yarn-nodemanager.service Failed on an-worker1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:12:17] RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:33] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:13:22] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.293% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:16:12] (SystemdUnitFailed) firing: (7) hadoop-yarn-nodemanager.service Failed on an-worker1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:35] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:16:49] RECOVERY - Check systemd state on an-worker1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:13] (SystemdUnitFailed) resolved: (7) hadoop-yarn-nodemanager.service Failed on an-worker1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:27:12] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1018.eqiad.wmnet with OS bullseye completed: - aqs1018 (**WARN**) - Downtimed on Icinga/Alertmanage... [20:31:38] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [20:32:27] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:34] PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:32:39] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1019.eqiad.wmnet with OS bullseye [20:33:12] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on an-worker1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:12] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1115:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:39:10] PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:38] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:41:32] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1019.eqiad.wmnet with OS bullseye executed with errors: - aqs1019 (**FAIL**) - Downtimed on Icinga/... [20:41:48] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1019.eqiad.wmnet with OS bullseye [20:45:02] PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:36] PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:02] RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:46:16] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:27] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1142:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:12] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:59:42] RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:08] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:02:27] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1142:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:03:44] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:04:28] RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:27] (SystemdUnitFailed) resolved: (2) hadoop-yarn-nodemanager.service Failed on an-worker1149:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:25] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1019.eqiad.wmnet with OS bullseye completed: - aqs1019 (**PASS**) - Removed from Puppet and PuppetD... [21:15:09] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [21:16:03] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye [21:41:52] 10Data-Engineering, 10Product-Analytics, 10Event-Platform: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (10SNowick_WMF) @Ottomata I do not use any of those schemas, we have migrated all our data to Metrics Platform but there may have been some issues with one... [21:55:58] 10Data-Engineering, 10Product-Analytics, 10Event-Platform: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (10mpopov) I'd like to help y'all get closer to the goal. Shay and I are going to check in on this and follow up with engineering (apps + Metrics Platform)... [22:02:30] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye completed: - aqs1020 (**PASS**) - Downtimed on Icinga/Alertmanage... [22:05:18] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [22:06:13] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1021.eqiad.wmnet with OS bullseye [22:18:24] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1021.eqiad.wmnet with OS bullseye [22:53:25] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1021.eqiad.wmnet with OS bullseye completed: - aqs1021 (**WARN**) - Removed from Puppet and PuppetD... [22:55:42] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on an-worker1124:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:55:58] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:14] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:56:22] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [22:58:02] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:58:34] PROBLEM - Check systemd state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:42] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:54] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:01:08] PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:36] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:05:18] PROBLEM - Check systemd state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:42] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on an-worker1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:08:36] PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:52] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:10:42] (SystemdUnitFailed) firing: (5) hadoop-yarn-nodemanager.service Failed on an-worker1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:13:00] PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:13:02] PROBLEM - Check systemd state on an-worker1148 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:38] RECOVERY - Check systemd state on an-worker1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:58] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:14:24] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:15:04] RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:42] (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-worker1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:15:46] RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:15:50] RECOVERY - Check systemd state on an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:42] (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-worker1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:21:42] RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:28] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:22:50] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:23:10] RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:25:43] (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-worker1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:43] (SystemdUnitFailed) firing: (7) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:16] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:31:26] PROBLEM - Check systemd state on an-worker1078 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:54] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:32:00] RECOVERY - Check systemd state on an-worker1138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:42] (SystemdUnitFailed) firing: (5) hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:57:22] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:57:34] RECOVERY - Check systemd state on an-worker1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state