[00:12:34] 10Data-Engineering, 10EventStreams, 10MediaWiki-General, 10Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action - https://phabricator.wikimedia.org/T354577 (10MBH) I think there should be an option - hide previous versions or not. Currently ruwiki's bot hides all revisions, for exa... [00:34:11] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:35:19] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:45:06] 10Data-Engineering, 10EventStreams, 10MediaWiki-General, 10Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page - https://phabricator.wikimedia.org/T354577 (10Bugreporter) [01:04:11] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:05:19] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:43:28] 10Data-Engineering, 10EventStreams, 10MediaWiki-General, 10Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page - https://phabricator.wikimedia.org/T354577 (10DannyS712) >>! In T354577#9451983, @MBH wrote: > I think there should be a... [01:47:16] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:59:32] 10Data-Engineering (Sprint 7): [Data Quality] Implement basic data quality metrics for MW history - https://phabricator.wikimedia.org/T354692 (10Ahoelzl) a:05Antoine_Quhen→03None [02:08:28] 10Data-Engineering (Sprint 7): [Iceberg Migration] Define sensor concept and implementation plan - https://phabricator.wikimedia.org/T354695 (10Ahoelzl) [02:08:44] 10Data-Engineering (Sprint 7): [Data Quality] Implement basic data quality metrics for MW history - https://phabricator.wikimedia.org/T354692 (10Ahoelzl) a:03tchin [02:34:11] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:20] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:49:59] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:49:29] —^ an-test-worker1002 didn’t seem to fully restart from the a [03:49:55] Test worker jvm rebooots, looking into it [05:34:11] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:35:20] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:47:16] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:32:11] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products (Data Products Sprint 07): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (10Spnq) Yeah, I thought that wikistats was wrong, because it said that there were aro... [06:49:59] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:04:06] (03PS39) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [07:06:52] (03CR) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment of account creation (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [08:50:50] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products (Data Products Sprint 07): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (10Sfaci) You are right! Statistics page in every wiki includes redirects when showing... [08:50:58] 10Data-Engineering (Sprint 7), 10Patch-For-Review: [Data Quality][Webrequest] Log severity level of alerts generated by refinery - https://phabricator.wikimedia.org/T354568 (10gmodena) Summary of a conversation we had internally with @phuedx and @mforns wrt `x_analytics` records. We found instances of `x_ana... [08:52:43] 10Data-Engineering, 10Data-Engineering-Wikistats, 10Patch-For-Review: Incorrect number of content pages on stats.wikimedia.org - https://phabricator.wikimedia.org/T354489 (10Sfaci) Finally, the second bug is not needed because the data shown in wikistats was already fine. Wikistats shouldn't include redirect... [08:55:39] (03PS6) 10Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) [09:35:20] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:16] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:57:00] (PuppetFailure) resolved: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:58:16] 10Data-Engineering, 10EventStreams, 10MediaWiki-General, 10Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page - https://phabricator.wikimedia.org/T354577 (10MBH) It's not so easy if an article consists of >10000 revisions. Less act... [10:07:32] 10Analytics, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Data Products (Data Products Sprint 07): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (10Sfaci) After discarding the last issue I have redeployed again the previous version... [10:26:04] 10Data-Platform-SRE (2023/24 Q3 Milestone 1): Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (10Gehel) [10:34:11] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:35:12] (03CR) 10Gmodena: refinery-job: add WebrequestMetrics. (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: 10Gmodena) [10:35:20] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:00] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:21:28] (03PS7) 10Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568) [11:26:01] (03CR) 10Urbanecm: [C: 03+1] "This looks good to me in principle." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [11:33:45] (03PS2) 10Cyndywikime: Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) [11:35:10] (03PS3) 10Cyndywikime: Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) [11:35:45] (03CR) 10Urbanecm: Add UserIsTemp property [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [11:37:31] (03CR) 10Cyndywikime: Add UserIsTemp property (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/989487 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [11:47:48] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:03:50] (03CR) 10Sergio Gimeno: Add analytics for Impressions, Success and Abandonment of account creation (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [14:05:39] 10Data-Engineering, 10Release-Engineering-Team, 10collaboration-services, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10Jelto) [14:28:38] 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Data Products (Data Products Sprint 07), and 3 others: EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (10Sfaci) A MR has been pushed (https://gitlab.w... [14:35:20] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:50:00] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:00:12] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Configure the spark event dir in the spark3 defaults - https://phabricator.wikimedia.org/T352849 (10brouberol) 05Open→03Resolved [15:00:17] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol) [15:00:34] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol) [15:34:26] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) - https://phabricator.wikimedia.org/T331894 (10brouberol) I have written two different CRs: - https... [15:47:49] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:03:13] 10Data-Engineering (Sprint 6), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Collect metrics from the spark-history server - https://phabricator.wikimedia.org/T353694 (10brouberol) DPE SRE should now be able to consult https://superset.wikimedia.org/superset/dashboard/409/ to monitor the... [17:11:49] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [17:12:05] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [17:12:23] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [17:13:06] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [17:13:14] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Patch-For-Review: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10BTullis) I have renamed the `Data Engineering` team (https://portal.victorops.com/dash/wikimedia#/team-schedules) in VictorOps to `D... [17:13:49] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [17:24:38] 10Data-Engineering, 10Data-Platform, 10MediaWiki-extensions-EventLogging, 10Metrics Platform Backlog, 10Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (10phuedx) [17:28:24] 10Data-Engineering (Sprint 7), 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Ahoelzl) Collected feedback on the Dashboard and a design proposal here: https://docs.goog... [18:35:21] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:50:00] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:51:11] 10Data-Engineering (Sprint 7), 10Data Pipelines, 10Discovery-Search, 10Java-Scala-Standardization, 10Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (10CodeReviewBot) joal opened htt... [19:01:16] (03CR) 10Joal: "Nits about code organization and namming, nothing big :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: 10Gmodena) [19:48:03] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:05:21] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:11] (SystemdUnitFailed) firing: (10) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:21] (SystemdUnitFailed) firing: (11) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:33] PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:55] PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:19:11] (SystemdUnitFailed) firing: (13) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:20:25] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:20:35] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:39] RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:03] RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:22:17] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:22:55] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:11] (SystemdUnitFailed) firing: (15) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:25:01] PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:23] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:29:11] (SystemdUnitFailed) firing: (16) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:39] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:32:51] RECOVERY - Check systemd state on an-worker1138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:11] (SystemdUnitFailed) firing: (16) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:37:37] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:37:47] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:11] (SystemdUnitFailed) firing: (15) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:21] (SystemdUnitFailed) firing: (16) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:42:09] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:42:59] PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:43] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:46:21] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:34] 10Data-Platform-SRE (2023/24 Q3 Milestone 1), 10Wikidata: WDQS graph split hosts: Remove throttling/banning mechanisms and investigate external connectivity - https://phabricator.wikimedia.org/T354555 (10bking) 05In progress→03Resolved a:03bking Merging/applying the above patch added TLS to the test host... [22:49:11] (SystemdUnitFailed) firing: (14) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:50:15] (PuppetFailure) firing: Puppet has failed on an-test-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:00:09] RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:53] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:04:11] (SystemdUnitFailed) firing: (12) hadoop-yarn-nodemanager.service Failed on an-test-worker1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:48:03] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure