[00:39:13] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.992% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [00:49:13] (DiskSpace) resolved: Disk space an-web1001:9100:/srv 5.992% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [01:27:02] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [01:30:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:45] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:47:45] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:27:02] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [06:27:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [06:48:01] (SystemdUnitFailed) firing: (2) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:02:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [07:35:09] alright btullis: I'm going to pick up where I left off on friday and attempt to decom kafka-jumbo100[1-6] for good. Could you ping me where you're around? I'll still need you to input the management password in the screen :/ [07:35:12] also [07:35:22] * brouberol waves good morning! [07:52:06] tchin: o/ if you have a moment today lemme know what you think about https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/+/966029 [07:52:18] brouberol: o/ I can help if you want [07:52:24] (for the password I mean) [07:52:57] Thanks! I'll kick off the cookbook and will send you the screen name [07:53:13] I'll be on cumin1001.eqiad.wmnet [07:53:48] 1794528.pts-2.cumin1001 [07:57:59] is it a root tmux? [07:58:36] ah no sorry screen, just read [07:59:46] ok it is under your username, I am not 100% sure I can attach [08:00:50] I can sudo as you, not great but ok [08:01:07] but I cannot see what you executed :D [08:01:29] brouberol: --^ [08:02:22] ah, not great indeed. 
Should I run a tmux under root instead? [08:04:39] nono it is fine for the moment, what command did you run? [08:04:46] can you paste it in here? [08:06:10] sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1006.eqiad.wmnet [08:06:11] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [08:07:19] ack, so I added the pass but it said that the remote IPMI conn failed [08:07:31] going to abort so we can check ok? [08:07:38] yep [08:10:10] checking if it works [08:11:05] it worked fine, mmm [08:11:17] maybe I messed up the copy/paste [08:12:10] brouberol: at this point we can probably retry, do you want to kick off another run? [08:12:18] how did you test the connection out, so I know how to do it next time? [08:12:26] sure, I can do that [08:12:31] ah yes sorry [08:12:32] sudo ipmitool -I lanplus -H "kafka-jumbo1006.mgmt.eqiad.wmnet" -U root -E chassis power status [08:12:41] https://wikitech.wikimedia.org/wiki/Management_Interfaces for all the info [08:12:46] thanks! [08:13:06] screen 1803770.pts-2.cumin1001 as brouberol [08:13:33] added, you can go [08:14:11] thanks! I think you need to detach the screen first (Ctrl-a d) [08:14:36] I am out, but I thought we could have been both in the same session [08:15:03] it worked! Bye lil angel [08:19:36] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) `sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1006.eqiad.wmnet` [08:22:40] Morning all. [08:24:56] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brouberol@cumin1001 for hosts: `kafka-jumbo1006.eqiad.wmnet` - kafka-jumbo1006.eqiad.wmnet (**PASS**) - Downtimed host on I... [08:25:09] morning btullis! 
[08:28:51] !log sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1005.eqiad.wmnet - T336044 [08:28:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:28:54] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [08:29:44] elukey, if I could trouble you once more with the management password? Same screen [08:30:39] done! [08:41:06] 10Data-Platform-SRE: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10BTullis) [08:41:08] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) [08:41:10] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade eventlogging VM to bullseye (or bookworm) - https://phabricator.wikimedia.org/T349289 (10BTullis) [08:41:12] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 (10BTullis) [08:52:12] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brouberol@cumin1001 for hosts: `kafka-jumbo1005.eqiad.wmnet` - kafka-jumbo1005.eqiad.wmnet (**PASS**) - Downtimed host on I... [08:53:19] !log sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1004.eqiad.wmnet [08:53:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:53:22] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [08:53:29] In about 10 minutes we're planning to do some failure-mode testing of kerberos for T346135 [08:53:29] T346135: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 [08:53:45] elukey, if you could? 
πŸ™ [08:54:13] I'm planning to stop the `kdc` and 'kadmin' services on krb1001 (and disable puppet temporarily). [08:54:48] brouberol: done! [08:55:19] thanks [08:55:46] What we hope to see is a seamless provision of service by krb2002 - with no negative impact on any kerberos enabled application. [08:58:12] What we're on the lookout for is any significant latency in services, and or any instability in the core services, such as the Hadoop namenodoes. What we saw last time we rebooted krb1001 was an automatic failover of the hadoop namenodes from master to standby. [09:01:09] one note: the kadmin server will just be unavailable during that time, it only runs on 1001 [09:01:47] or rather the krb5.conf used by the clients only points to 1001 unless we push a Puppet patch to redirect to 2002 [09:02:10] but to repro the original errors we should keep the conditions identical [09:02:13] Ack, thanks. [09:02:58] Agree. If we see nothing unusual after a while, we could even reboot krb1001. [09:03:27] one thing in addition: [09:04:01] I'll run a tcpdump for 749/tcp on 1001, then we're able to spot any services which potentially try to access kadmind in vain [09:04:55] just to rule out that this caused any issues, although unlikely [09:09:02] !log disabling puppet on krb1001 for T346135 [09:09:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:09:05] T346135: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 [09:10:29] !log root@krb1001:~# systemctl stop krb5-kdc.service krb5-admin-server.service [09:10:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:12:37] PROBLEM - Kerberos KAdmin daemon on krb1001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/kadmind https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [09:12:53] PROBLEM - Kerberos KDC daemon on krb1001 is CRITICAL: PROCS CRITICAL: 
0 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [09:14:06] Oops. I had forgotten to set the downtime. I've now added 2 hours via icinga for these two services. [09:20:15] Nothing unusual at the moment. We're seeing lots of deprecated cipher kind of messages in /var/log/kerberos/krb5kdc.log on krb2002 but I think that's normal. [09:22:20] Superset->Presto seems responsive, so there doesn't appear to be any extra latency. [09:24:44] an-master1001 remains the active namenode. Nothing troubling in /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log [09:25:01] for the cipher logging there's already https://phabricator.wikimedia.org/T337544, but haven't found time for it yet [09:25:42] Ack, thanks. I thought I remembered it, but couldn't find it. [09:25:47] and to rule out the kadmin theory: there hasn't been a single kadmin connection towards krb1001 so far [09:26:06] Good stuff, thanks. [09:26:33] elukey: I've had a weird issue, where I lost connectivity to our servers for a while. When it did come back, it seemed that my screen process had crashed mid-flight. I've attempted to re-run the cookbook, but it fails with `spicerack.netbox.NetboxError: Server kafka-jumbo1004 does not have any primary IP with a DNS name set.` [09:27:02] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [09:27:04] the cookbook failed as it was prompting me for netbox-hiera changes. I guess it means I have to perform the remaining changes manually? [09:28:00] brouberol: mmm in theory the cookbook should support skipping some steps, do you have a task with the list of things that it didn't do? 
[09:28:51] reading the code, the remaining tasks were update_netbox, configure_switch_interfaces, debmonitor.host_delete and puppet_master.delete [09:30:15] also, running sre.dns.netbox [09:31:46] for the debmonitor.host_delete step we also have a dedicated cookbook: [09:31:59] sre.debmonitor.remove-hosts [09:32:45] and for puppet_master.delete [09:32:55] you can simply connect to puppetmaster1001 and run [09:33:03] sudo puppet node clean HOSTNAME [09:33:11] sudo puppet node deactivate HOSTNAME [09:33:18] thank you! [09:34:40] for the switch interface I think it should just rectify itself with the subsequent cookbook run [09:35:06] do you have one more node to decom? then simply check if the failed node shows up in the diff [09:35:18] if not, best to ping Arzhel or Cathal for a closer look [09:35:48] yep, I have 3 more to decom [09:36:19] ack [09:37:17] Riccardo has started to add locking support to cookbooks, if you run into this error again, then you might want to check back with him [09:37:37] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by brouberol: for 1 hosts: kafka-jumbo1004.eqiad.wmnet [09:39:24] elukey: could I trouble you with the management password again? 6653.pts-2.cumin1001 as brouberol. Thanks! [09:39:31] (3 to go) [09:45:13] moritzm: I'd say that this is a pretty successful test. Everything looks to be working exactly as it should on krb2002. [09:45:52] done! [09:46:46] I'll restart the services and re-enable puppet on krb1001. 
[09:47:13] RECOVERY - Kerberos KAdmin daemon on krb1001 is OK: PROCS OK: 1 process with args /usr/sbin/kadmind https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [09:47:39] RECOVERY - Kerberos KDC daemon on krb1001 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [09:47:39] !log restarting krb5-kdc.service and krb5-admin-server.service on krb1001 and re-enabling puppet for T346135 [09:47:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:47:43] T346135: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 [09:57:26] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brouberol@cumin1001 for hosts: `kafka-jumbo1003.eqiad.wmnet` - kafka-jumbo1003.eqiad.wmnet (**PASS**) - Downtimed host on I... [09:58:05] !log sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1002.eqiad.wmnet [09:58:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:58:08] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [09:59:00] 10Data-Platform-SRE: Investigate failure of Hadoop namenode coinciding with krb1001 reboot - https://phabricator.wikimedia.org/T346135 (10BTullis) 05Open→03Resolved We stopped the `krb5-kdc` and `krb5-admin-server` services on krb1001 in order to monitor for any unusual behaviour from the HDFS namenodes, or... [09:59:29] elukey: "Once more into the screen, dear friends, once more" 🙏 [10:02:39] btullis: agreed! [10:03:29] brouberol: done! Have you asked Moritz to be added to pwstore? :) [10:04:14] I did. 
He's catching up after a week of PTO, so it will happen eventually [10:04:45] ahhhh okok, no problem in adding the pass, it was to have you unblocked for future tasks [10:10:37] no worries! I was unlucky, as both people who could approve my adding to pw/.users were OOO last week [10:11:37] !log deploying multiple spark shufflers to the test cluster for T344910 [10:11:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:11:39] T344910: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 [10:13:37] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brouberol@cumin1001 for hosts: `kafka-jumbo1002.eqiad.wmnet` - kafka-jumbo1002.eqiad.wmnet (**PASS**) - Downtimed host on I... [10:14:10] !log sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1001.eqiad.wmnet [10:14:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:14:13] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [10:14:34] elukey: one last time, for the hell of it? 🙏 [10:27:00] 10Data-Platform-SRE, 10Patch-For-Review: Reduce the prometheus-metricsfetcher cli complexity - https://phabricator.wikimedia.org/T349393 (10CodeReviewBot) brouberol merged https://gitlab.wikimedia.org/repos/sre/kafkakit-prometheus-metricsfetcher/-/merge_requests/3 Integrate 2 new features from upstream [10:32:16] brouberol: done [10:32:24] sorry I've just seen the msg [10:32:36] no worries at all [10:32:42] and thank you 6 times! 
[10:32:45] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:47:45] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:35] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by brouberol@cumin1001 for hosts: `kafka-jumbo1001.eqiad.wmnet` - kafka-jumbo1001.eqiad.wmnet (**PASS**) - Downtimed host on I... [11:17:46] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:32:45] (SystemdUnitFailed) firing: (3) wmf_auto_restart_airflow-scheduler@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:15] 10Data-Platform-SRE, 10Patch-For-Review: Reduce the prometheus-metricsfetcher cli complexity - https://phabricator.wikimedia.org/T349393 (10brouberol) I built the new package on `build2001.codfw.wmnet`, rsynced the result onto `apt1001.wikimedia.org`, and ran ` brouberol@apt1001:~$ sudo rsync -va build2001.c... 
[11:34:14] I added three days of downtime for an-airflow1007 because it is not yet in production and Steve is still out for a couple of days, I believe. [11:37:08] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/36 Draft: Upgrade... [11:48:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [11:56:30] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10JAllemandou) It would be a great idea to implement https://phabricator.wikimedia.org/T269... [11:58:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge [12:03:12] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 3), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis) >>! In T266641#9272243, @JAllemandou wrote: > It would be a great idea to implem... 
[12:04:35] 10Data-Platform-SRE, 10Patch-For-Review: Reduce the prometheus-metricsfetcher cli complexity - https://phabricator.wikimedia.org/T349393 (10brouberol) I upgraded the package on the kafka-jumbo hosts first: ` brouberol@cumin1001:~$ sudo cumin A:kafka-jumbo 'sudo apt-get install --only-upgrade kafka-kit-promethe... [12:24:35] 10Data-Platform-SRE, 10Patch-For-Review: Reduce the prometheus-metricsfetcher cli complexity - https://phabricator.wikimedia.org/T349393 (10brouberol) ` brouberol@kafka-jumbo1010:~$ prometheus-metricsfetcher time="2023-10-23T12:23:53.288Z" level=info msg="Getting broker storage stats from Prometheus" time="20... [12:24:36] (03CR) 10DCausse: [C: 03+1] cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 (owner: 10Ebernhardson) [12:25:22] 10Data-Platform-SRE, 10Patch-For-Review: Reduce the prometheus-metricsfetcher cli complexity - https://phabricator.wikimedia.org/T349393 (10brouberol) 05Open→03Resolved [12:27:35] (03Restored) 10Aqu: Retry produceCanaryEvents [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965662 (owner: 10Aqu) [12:31:06] (03PS3) 10Aqu: Retry produceCanaryEvents [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965662 (https://phabricator.wikimedia.org/T326002) [12:33:43] btullis would it be alright to review this +1/-1 patch so we could close the kafka decom task? https://gerrit.wikimedia.org/r/c/operations/puppet/+/967240 Thanks! [12:40:19] Done [12:42:56] 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) 05Open→03Resolved [12:43:00] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[07-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) [12:43:17] thank you! 
[13:15:15] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: eventgate: cache refreshes should fetch stream configs in batches - https://phabricator.wikimedia.org/T346899 (10Ottomata) [13:16:00] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) [13:16:02] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventgate: cache refreshes should fetch stream configs in batches - https://phabricator.wikimedia.org/T346899 (10Ottomata) [13:21:14] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10serviceops, 10Discovery-Search (Current work), and 2 others: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (10Ottomata) [13:31:18] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) > and will follow up with upstream to clarify if needed. I hav... [13:53:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [14:03:54] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) Well I've done something rather silly, which is that the version 3.1 spark shuffler is actually [[https://github.com/wikimed... 
[14:25:11] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10xcollazo) I think the most bang for the buck would be to do these fixes automatically by adopting a tool like `flak... [14:29:09] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) a:03Ottomata [14:37:07] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (10dcausse) @bking staging looks fine to me thanks for the deploy! Note that you can use the "redeploy" command as well this should take care of stoppin... [14:40:37] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) > or I could upgrade the version of spark in conda-analytics to version 3.1.3 +1 to that. We might as well benefit from the... [14:49:55] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Discovery-Search, 10serviceops-radar, 10Event-Platform: [Event Platform] [NEEDS GROOMING] Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10lbowmaker) [15:00:22] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) I'd be happy to look into this, @xcollazo! We were discussing linting and checks like this in an... 
[15:04:11] 10Data-Platform-SRE: Update spark warehouse configuration to use the same as Hive - https://phabricator.wikimedia.org/T349523 (10JAllemandou) [15:04:39] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10Ottomata) Can we resolve this task? [15:05:09] Hi gehel and btullis - I created a task for your team: https://phabricator.wikimedia.org/T349523 - It's not urgent, but it's a problem we've already experienced :) [15:10:34] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (10Ottomata) p:05Triage→03Low [15:11:13] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (10Ottomata) p:05Triage→03Medium [15:13:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [15:16:12] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10xcollazo) > We were discussing linting and checks like this in an open-source project I work on and decided to go w... 
[15:18:14] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) @nshahquinn-wmf, resolving this as the PR was merged 😊 Thanks for the support here! [15:18:25] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): Improve df_to_remarkup formatting for wmfdata-python - https://phabricator.wikimedia.org/T341589 (10AndrewTavis_WMDE) 05Open→03Resolved [15:20:16] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) npm warnings in CI: ` #15 10.83 npm WARN EBADENGINE Unsupported engine { #15 10.83 npm WARN EBADENGINE package: 'eslint-plugin-jsdoc@39.2.... [15:22:40] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) Ok, great! I'll check out some ideas for Ruff and maybe get started on this next week :) I'll hol... [15:27:00] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10EBernhardson) [15:27:57] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10xcollazo) > Should I write the new task for this? Yes please. 
[15:33:01] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:36:53] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) [15:36:59] (03CR) 10Sbisson: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [15:37:35] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10EBernhardson) [15:37:57] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10EBernhardson) [15:39:45] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Adding testing framework to wmfdata-python - https://phabricator.wikimedia.org/T349531 (10AndrewTavis_WMDE) [15:40:57] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Add testing framework to wmfdata-python - https://phabricator.wikimedia.org/T349531 (10AndrewTavis_WMDE) [15:41:22] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): wmfdata-python formatting and link check - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) See {T349531}, @xcollazo! 
Added you and @nshahquinn-wmf as subscribers already :) [15:53:10] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) a:03AndrewTavis_WMDE [15:53:29] 10Data-Engineering, 10Product-Analytics, 10Wikidata, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999 (10AndrewTavis_WMDE) Updated the task and also mentioned maybe adding an auto-formatter that could... [15:54:08] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python, 10Wikidata Analytics (Kanban): Add testing framework to wmfdata-python - https://phabricator.wikimedia.org/T349531 (10AndrewTavis_WMDE) [16:02:46] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:37] 10Data-Engineering, 10serviceops, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10Jdforrester-WMF) [16:08:53] 10Data-Engineering, 10CX-cxserver, 10Citoid, 10Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Jdforrester-WMF) [16:17:46] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:12] 10Data-Engineering, 10serviceops, 10Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) Deployment 
to staging was fine, no errors in the logs etc.. The only thing that I noticed is: https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&va... [17:09:00] 10Data-Engineering, 10Tech-Docs-Team, 10Goal: Define dataset documentation strategy - https://phabricator.wikimedia.org/T349103 (10TBurmeister) Useful resource: https://www.w3.org/TR/vocab-dcat/ [18:32:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:47:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:23] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, and 2 others: [Maintenance] Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10Ahoelzl) [18:51:29] (03CR) 10Milimetric: [V: 03+2] Update schema of mediawiki_wikitext_* [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966914 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [18:52:22] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, and 2 others: [Maintenance] Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10Ahoelzl) 05Open→03Resolved [18:52:37] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops, and 2 others: [Maintenance] Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10Ahoelzl) [18:53:53] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. 
- https://phabricator.wikimedia.org/T345806 (10Ahoelzl) [18:54:00] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (10bking) 05Open→03Resolved Redeployed WCQS and WDQS jobs in eqiad and codfw envs: ` INFO 2023-10-23T18:20:23+0000 [ root] Job WDQS Streaming Up... [18:54:27] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10Ahoelzl) [18:56:08] 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability, 10Data Engineering and Event Platform Team (Sprint 4): [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (10Ahoelzl) [18:56:10] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4): [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (10Ahoelzl) [18:56:18] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (10Ahoelzl) [18:56:37] (03PS3) 10Milimetric: Add siteinfo information to output XML [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/963836 (https://phabricator.wikimedia.org/T348761) [18:56:52] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10Ahoelzl) [18:57:09] (03CR) 10Milimetric: "sorry, forgot to address these, sent a new patch commenting out the logging and caching." 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/963836 (https://phabricator.wikimedia.org/T348761) (owner: 10Milimetric) [18:57:10] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10Ahoelzl) [18:57:23] 10Data-Engineering, 10Machine-Learning-Team, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 4), and 2 others: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10Ah... [18:59:56] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10Ahoelzl) [19:00:19] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10Ahoelzl) [19:00:32] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Data Quality] [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? 
- https://phabricator.wikimedia.org/T345195 (10Ahoelzl) [19:00:43] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ahoelzl) [19:00:56] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ahoelzl) [19:01:04] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ahoelzl) [19:01:16] 10Data-Engineering, 10Data-Catalog, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: [Event Platform] Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10Ahoelzl) 05Open→03Resolved [19:36:25] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10VRiley-WMF) Hey @taavi and @cmooney Just wanted to see if there was a timeframe for us to move these servers. Any specific time when we know the servers... [19:37:23] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) >>! In T346948#9274072, @VRiley-WMF wrote: > Just wanted to see if there was a timeframe on this move. Like, a specific time when we know the server... 
[20:04:41] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:27:17] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10Ahoelzl) [20:32:46] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:18] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:18] (03PS6) 10Conniecc1: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 [21:06:25] (03CR) 10CI reject: [V: 04-1] T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1) [21:08:41] (03PS7) 10Conniecc1: T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 [21:42:26] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 
(10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/7 Downgrade spark 3.1 to ve... [21:53:47] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/7 Downgrade spark 3.1 to ve... [22:55:56] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) OK, I did a little experimentation with a 0.0.24-dev version of conda-analytics, but in the interest of being methodical abo... [23:13:27] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I can't create a new conda environment with pyspark3.3.2 without getting an oom-error on the machine. I was trying this: ` c... [23:32:46] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:47:45] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed