[06:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:21:47] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:47] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:37:08] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Volans) What's the current status of `analytics1069`? It's not present anymore in puppetdb but is still Active in Netbox, hence it's reported as an error in the p... [08:19:44] FYI, dse-k8s-etcd1003 will briefly go down for a reboot [08:24:50] moritzm: No problem, thanks. [08:27:03] !log roll-restarting hadoop workers in the test cluster [08:27:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:29:35] joal: Shall we proceed with the yarn shuffler switch? [08:30:37] Hi btullis :) Let's go! [08:32:04] joal: OK great, disabling gobblin jobs now.
[08:32:14] Ack! I'm around, keeping an eye [08:33:04] !log disabled gobblin jobs with https://gerrit.wikimedia.org/r/c/operations/puppet/+/935425 [08:33:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:36:47] (SystemdUnitFailed) firing: (2) gobblin-netflow.timer Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: gobblin-netflow.timer,gobblin-webrequest.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:37] I had made a small mistake in the patch to disable spark jobs. Re-checking now. [08:41:20] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) >>! In T317861#8989556, @Volans wrote: > What's the current status of `analytics1069`? It's not present anymore in puppetdb but is still Active in... [08:44:31] !log disabled gobblin and spark jobs on an-launcher for T332765 [08:44:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:44:34] T332765: Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 [08:52:17] joal: I'm ready to press the button. I see some users' jobs here though: https://yarn.wikimedia.org/cluster/apps/RUNNING - Do you think we need to communicate more, or shall we just go ahead? [09:02:00] I'm going for it now.
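As an aside on reading these alerts: the failed unit names in a check_systemd_state line can be pulled out mechanically. A minimal sketch, using the alert text from the 08:38:17 message above; the `sed`/`tr` pipeline is just an illustrative one-liner, not part of any WMF tooling:

```shell
# Extract the failed unit names from a check_systemd_state alert line.
# The alert text is copied from the log above; the pipeline itself is
# an illustrative one-liner, not a WMF tool.
alert='CRITICAL - degraded: The following units failed: gobblin-netflow.timer,gobblin-webrequest.timer'
echo "$alert" | sed 's/.*failed: //' | tr ',' '\n'
# prints:
#   gobblin-netflow.timer
#   gobblin-webrequest.timer
```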
[09:03:25] !log switching yarn shuffler - running puppet on 87 worker nodes [09:03:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:12:09] Roll-restarting the yarn node managers in batches of five [09:16:56] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (10BTullis) The restarted nodemanagers are looking good so far. Here are some related logs from `/var/log/hadoop-yarn/yarn-yarn-nodemanager-an-w... [09:18:27] Heya btullis - sorry I've been away for some time - all good on our side? [09:18:47] Yep, looking good so far. [09:19:24] awesome [09:20:22] All nodemanagers have restarted except those that are currently being decommissioned, which we know about. I've checked that the symlink looks right and that the symlink to the spark2 jar isn't present. [09:20:47] I've checked the logs of a representative host `grep -i shuffle /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-worker1087.log` and all looks OK to me. [09:21:51] One of the currently running spark jobs looks good as well [09:21:57] green lights on my side [09:22:21] Great! Reverting the changes to an-launcher1002, starting with the refine jobs. [09:22:27] ackbt [09:22:31] ack bt [09:22:37] ack btullis (pfff...) [09:27:22] joal: I have just noticed this: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/templates/hadoop/spark3/spark3-defaults.conf.erb#L20-L22 [09:27:35] Should we remove it now?
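The per-host verification described at 09:20:47 boils down to grepping the nodemanager log for the shuffle service. The sketch below recreates that check against a fabricated two-line sample file, since the real log under `/var/log/hadoop-yarn/` only exists on the workers; the sample lines are invented stand-ins, not captured output:

```shell
# Recreate (with fabricated content) the check from the log above:
# grep -i shuffle <nodemanager log>. The two sample lines are invented
# stand-ins for what a healthy nodemanager might log after the switch.
cat > /tmp/nm-sample.log <<'EOF'
INFO org.apache.spark.network.yarn.YarnShuffleService: Started YARN shuffle service for Spark on port 7337
INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: NodeManager started
EOF
grep -i shuffle /tmp/nm-sample.log
```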
[09:31:28] dse-k8s-etcd1002 will also briefly go down [09:31:47] (SystemdUnitFailed) firing: refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:31:53] ack, thanks moritzm [09:37:35] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Volans) @Stevemunene Puppet should never be disabled for more than a couple of days, as documented in https://wikitech.wikimedia.org/wiki/Puppet#Maintenance [09:38:27] !log re-enabled gobblin jobs on an-launcher1002 [09:38:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:39:04] I created this patch to update the spark3 defaults: https://gerrit.wikimedia.org/r/c/operations/puppet/+/935690 [09:40:24] I'm also planning to migrate the active namenode back to an-master1001. It's been on an-master1002 since yesterday's mishap.
[09:41:47] (SystemdUnitFailed) firing: (4) refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:41] (03PS1) 10DCausse: Add mediawiki/cirrussearch/page-rerender [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) [09:45:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:40] !log failing back namenode to an-master1001 with `sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet` on an-master1001 [09:45:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:46:47] (SystemdUnitFailed) firing: (5) refine_netflow.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:49] ^ this systemd check is out of date. The service is running now and looks healthy. [09:51:51] (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [09:56:27] ^ I have acked this alert. 31 corrupt blocks isn't bad and we've seen similar. Hopefully it will resolve itself. 
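For the record, a failover like the one logged at 09:45:40 can be followed up by querying each namenode's HA state. The commands below are only echoed (running them needs a real cluster and hdfs credentials); `-getServiceState` is the standard `hdfs haadmin` subcommand, and the service IDs are the ones used in the failover command above:

```shell
# Post-failover sanity check, sketched as echoed commands. On a real
# namenode host these would be run directly; here we just print them.
for nn in an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet; do
  echo "sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState $nn"
done
```

After the failover, an-master1001 should report `active` and an-master1002 `standby`.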
[09:59:43] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 (10BTullis) So far, everything looks good. I have created [[https://gerrit.wikimedia.org/r/935690|a patch]] to update the spark3-defaults by removin... [10:01:15] btullis: o/ something very weird, analytics106[1-3] are still listed in the default rack [10:01:25] but from the hdfs UI they are listed as decommed [10:01:41] maybe turning off the datanode daemon will clear things, not sure [10:05:07] elukey: They're still just awaiting decommissioning by stevemunene [10:08:42] Hey btullis - Indeed, we should remove that parameter you noted above [10:09:27] o/ elukey The nodes were all in decommissioned state from around 2217 UTC, currently getting started on the host decom [10:10:23] I +1ed the patch btullis [10:11:33] btullis, stevemunene yep yep but the other decommed ones are not in the default rack, this is the weird part [10:11:49] maybe they are still there because they got re-added, not sure [10:13:33] it's probably because analytics10[64-69] are still listed as part of the hdfs topology [10:16:21] ah they are in their racks, didn't notice it, good :) [10:16:33] when the datanodes stop they'll disappear [10:20:25] !log deploying updated spark3 defaults to disable the `spark.shuffle.useOldFetchProtocol` option for T332765 [10:20:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:20:28] T332765: Upgrade the spark YARN shuffler service on Hadoop workers from version 2 to 3 - https://phabricator.wikimedia.org/T332765 [10:24:17] I pushed out the config change to 106 hosts with: `sudo cumin P:hadoop::spark3 run-puppet-agent` [10:29:00] Looks good. I can no longer see the `spark.shuffle.useOldFetchProtocol` option in newly launched jobs e.g.
https://yarn.wikimedia.org/proxy/application_1687442676858_95107/environment/ [10:30:24] I'd like to check an airflow job though, to make sure that it's picked up the new configuration too. [10:31:21] (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks [10:33:13] 10Data-Platform-SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10kostajh) In Gerrit / PipelineLib workflow, the PipelineBot makes a comment in Gerrit with the newly published image tag names, [example... [10:35:41] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10gmodena) [10:59:38] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) @jbond - I'm wondering if you might have any insight into why an-test-worker1003 seems so reluctant to get a DHCP address during PXE boot. I've tried a variety of different N... [11:01:46] !log roll-restarting the presto workers for T329716 [11:01:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:04:36] btullis: no trace of `spark.shuffle.useOldFetchProtocol` in an airflow-launched spark job :) [11:05:16] joal: Nice! Thanks for checking. [11:05:34] btullis: sorry for not having been very reactive this morning :S [11:06:28] joal: No worries at all.
I knew you weren't far away in case anything happened :-) [11:09:58] btullis: I'm gonna keep looking at jobs today, monitoring times and all and see if it all looks good [11:11:34] joal: Great, thanks. I realised that I don't have a very good idea of how to measure for performance gains/regressions. This is where that spark history server might come in handy too :-) [11:11:46] indeed btullis! [11:11:58] ...would have come in handy, if we'd had one... [11:12:07] btullis: I'll be looking in airflow [11:12:29] Not as precise as with spark history server, but it'll be better than nothing [11:14:06] Ack, thanks. [11:14:43] spark3 shuffler!! yay! thank you btullis!!! [11:15:01] mforns: It's a pleasure :-} [11:27:52] If anyone is able to offer any guidance on this pytest failure in airflow-dags, I'd be very grateful: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/115493 relating to T336286 [11:27:53] T336286: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 [11:29:40] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) Ack, thanks @Volans. Are there any extra steps to take to remedy this before beginning the decommission?
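The job-level checks at 10:29:00 and 11:04:36 confirmed that `spark.shuffle.useOldFetchProtocol` is gone from launched jobs; the same thing can be confirmed at the file level on any worker. A runnable sketch with fabricated file content (the two remaining settings are illustrative stand-ins, not the real spark3-defaults.conf):

```shell
# Confirm a removed option is absent from spark3-defaults.conf.
# The file content here is a fabricated stand-in for the real config.
cat > /tmp/spark3-defaults.conf <<'EOF'
spark.master                   yarn
spark.shuffle.service.enabled  true
EOF
if grep -q 'spark.shuffle.useOldFetchProtocol' /tmp/spark3-defaults.conf; then
  echo "old fetch protocol still configured"
else
  echo "option removed"   # expected after the puppet rollout
fi
```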
[11:30:15] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) I made a patch to try the upgrade to version 2.6.2 but I get an error from pytest that I don't understand: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/115493 [11:40:22] !log roll-restarting kafka-jumbo brokers for T329716 [11:40:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:42:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [11:45:12] !log restarted hive-server2 and hive-metastore services on an-coord1002 [11:45:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:47:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [11:47:39] !log restarted archiva for T329716 [11:47:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:36:31] btullis: o/ am here, can look at canary events, what's the status? [12:36:57] looks fixed maybe?
(just checking emails) [12:48:27] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Volans) While from one side the decommission cookbook can perfectly be run on an unreachable/down host, it will skip though some steps if it can't ssh into the... [12:49:45] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventutilities-python EventProcessFunction throws NPE if user func returns None - https://phabricator.wikimedia.org/T335706 (10Ottomata) @gmodena we can close this task, ya? [12:59:19] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) >>! In T317861#8990519, @Volans wrote: > While from one side the decommission cookbook can perfectly be run on an unreachable/down host, it will s... [13:04:16] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10Papaul) @BTullis if the server is not in production i can take a look. [13:05:57] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Stevemunene) `analytics106[1-9]` are in a decommissioned state on the hdfs namenode interface, thus we are ready to begin the decommissioning. {F37129612} The... [13:46:47] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:24] 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (10Volans) >>! 
In T317861#8990558, @Stevemunene wrote: > Thanks, It would be safer to run the decommission cookbook since we are going to disable puppet on the ho... [14:35:10] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10xcollazo) >>! In T336286#8987662, @BTullis wrote: > Unfortunately, we have to update again because of a new vulnerability announced in the hive-connector. Mentioned in: T336244 > > We can e... [14:36:16] !log enable puppet on analytics1069 to get the host back into puppetdb and hence allow the decommission cookbook to run later [14:36:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:36:32] ottomata: Thanks. It fixed itself this morning. I don't know what the root cause was. [14:37:25] okay [14:38:07] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) >>! In T329363#8990570, @Papaul wrote: > @BTullis if the server is not in production i can take a look. Yes please, @Papaul - You can do whatever you like with the server.... [14:41:09] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10xcollazo) >>! In T336286#8990318, @BTullis wrote: > I made a patch to try the upgrade to version 2.6.2 but I get an error from pytest that I don't understand: https://gitlab.wikimedia.org/rep... [14:49:52] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) >>! In T336286#8990871, @xcollazo wrote: >>>! In T336286#8990318, @BTullis wrote: >> I made a patch to try the upgrade to version 2.6.2 but I get an error from pytest that I don't un...
[14:52:01] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:01:29] 10Data-Engineering, 10Metrics-Platform-Planning, 10Wikimedia-production-error: EventGate Validation error: `session_id` wrong length (multiple schemas) - https://phabricator.wikimedia.org/T336078 (10phuedx) I propose closing this task as a duplicate of {T283881}. See also {T297521}. [15:04:12] 10Data-Engineering, 10Metrics-Platform-Planning, 10Wikimedia-production-error: EventGate Validation error: `session_id` wrong length (multiple schemas) - https://phabricator.wikimedia.org/T336078 (10matmarex) [15:12:14] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:14:12] 10Data-Platform-SRE, 10Discovery-Search: Test version compatibility between production Kafka, Flink, and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10bking) [15:15:47] 10Data-Platform-SRE, 10Discovery-Search: Test version compatibility between production Kafka, Flink, and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10bking) [15:16:04] 10Data-Platform-SRE, 10Discovery-Search: Test version compatibility between production Kafka, Flink, and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10bking) [[ https://zookeeper.apache.org/releases.html | ZooKeeper's website ]] states "ZooKeeper clients from the 3.5 and 3.6 branches are ful...
[15:21:14] 10Data-Engineering, 10Event-Platform: mediawiki-event-enrichment deployment process should include producing an event in staging and verifying success - https://phabricator.wikimedia.org/T341138 (10Ottomata) [15:24:24] 10Data-Engineering, 10Event-Platform (Sprint 14 B): mediawiki-event-enrichment taskmanager crashes at startup - https://phabricator.wikimedia.org/T341096 (10gmodena) [15:29:02] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-journalnode.service,hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:09] 10Data-Engineering: project-title-country missing US data in recent data, and double quote escaping - https://phabricator.wikimedia.org/T341139 (10Ogiermaitre) [15:59:28] 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Replace RAID controller battery on an-worker1095 - https://phabricator.wikimedia.org/T340946 (10Jclark-ctr) @BTullis would you be able to shutdown server for tomorrow morning 8:30am est [16:24:48] 10Data-Engineering, 10Event-Platform (Sprint 14 B): mediawiki-event-enrichment taskmanager crashes at startup - https://phabricator.wikimedia.org/T341096 (10Ottomata) @gmodena and I debugged this today, and realized it was because we never implemented support for specifying the schema versions used by the Kafk... 
[17:18:56] 10Data-Platform-SRE, 10Discovery-Search: Test version compatibility between production Kafka, Flink, and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10Ottomata) [17:18:58] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Discovery-Search, 10serviceops-radar, 10Event-Platform: [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10Ottomata) [17:34:57] 10Data-Engineering: Check home/HDFS leftovers of appledora - https://phabricator.wikimedia.org/T340948 (10Isaac) Just to chime in -- I confirmed with Nazia before her departure that there wasn't any data/scripts/etc. on the stat machines / HDFS that needed to be preserved so any home directories can be cleared i... [17:42:30] RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:47:06] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:00:37] (03PS1) 10Btullis: Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) [18:06:22] (03CR) 10CI reject: [V: 04-1] Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [18:11:32] (03PS2) 10Btullis: Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) [18:18:04] (03CR) 10CI reject: [V: 04-1] Begin un-forking datahub from the upstream 
[analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [18:51:47] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:23] (03CR) 10Ebernhardson: [C: 03+1] Add mediawiki/cirrussearch/page-rerender [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/935697 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [21:07:57] (03PS3) 10Btullis: Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) [21:20:11] (03CR) 10CI reject: [V: 04-1] Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [21:26:17] (03PS4) 10Btullis: Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) [21:38:54] (03CR) 10CI reject: [V: 04-1] Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [22:00:12] 10Data-Engineering, 10Data-Engineering-Jupyter, 10Product-Analytics: conda list does not show all packages in environment - https://phabricator.wikimedia.org/T294368 (10nshahquinn-wmf) 05Open→03Resolved This was resolved by T302819. 
[22:25:22] (03PS5) 10Btullis: Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) [22:31:15] 10Data-Engineering, 10Event-Platform (Sprint 14 B), 10Patch-For-Review: mediawiki-event-enrichment taskmanager crashes at startup - https://phabricator.wikimedia.org/T341096 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/merge_requests/77 Draft: Add... [22:38:47] (03CR) 10Btullis: [C: 03+2] Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [22:50:41] (03Merged) 10jenkins-bot: Begin un-forking datahub from the upstream [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/935788 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [22:52:02] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed