[00:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [00:16:42] 10Data-Engineering, 10Abstract Wikipedia team, 10CX-cxserver, 10Citoid, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10Jdforrester-WMF) [01:32:58] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:32:58] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [08:22:10] Morning all. I'm going to look at running those backfill operations and reruns on the test cluster first today. [08:27:07] thanks btullis - there is no real hard need for this, this is mostly test data - the thing we wish is to stop alerts :) [08:28:33] Thanks joal. 
It's also good for me to know that I can work with the data and competently re-run the jobs if necessary. [08:28:52] ack btullis :) AS you wish :) [08:55:30] joal: I'm struggling. Have you a moment to help me understand where I'm going wrong, please? [08:57:21] (03CR) 10Gmodena: [C: 03+2] cirrussearch/update_pipeline/fetch_error use general error_type [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/963990 (owner: 10Peter Fischer) [08:57:55] (03Merged) 10jenkins-bot: cirrussearch/update_pipeline/fetch_error use general error_type [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/963990 (owner: 10Peter Fischer) [09:12:47] OK, I think I've done it successfully now. [09:18:42] Nope, I've now run the refine_event step successfully. Still struggling with one hour for refine_event_sanitized [09:25:35] Ahah, success. I had forgotten the `--ignore_failure_flag=true` on the refine_event_sanitized command. [09:32:58] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:36] !log re-running airflow jobs for missing webrequest data on hadoop-test [09:42:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:24:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:45:56] (03CR) 10Urbanecm: Add analytics for Impressions, Success and Abandonment rate for temporary Users (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [10:46:28] (HiveServerHeapUsage) resolved: 
Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [10:49:05] (03CR) 10Urbanecm: [C: 04-1] Add analytics for Impressions, Success and Abandonment rate for temporary Users (033 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [11:02:44] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:33] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02), 10Patch-For-Review: Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Sfaci) a:05Sfaci→03SGupta-WMF [11:07:59] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade schema hosts to bullseye - https://phabricator.wikimedia.org/T349286 (10BTullis) [11:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:44] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:46] 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10matmarex) [11:35:21] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade eventlogging VM to bullseye (or bookworm) - https://phabricator.wikimedia.org/T349289 (10BTullis) [12:11:14] 10Data-Platform-SRE: Migrate archiva to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349292 (10BTullis) [12:13:06] !log disabling puppet on kafka-jumbo nodes so we can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/966497 [12:13:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:14:25] elukey: puppet has been disabled on all jumbo nodes. Shall we proceed? [12:21:24] with btullis's approval given, I'll now be deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/966497 [12:23:32] we should start seeing the # of conntracked TCP connections go down on kafka-jumbo100[1-6] during the next hour [12:23:48] btw, what's our puppet run interval (modulo the splay) ? [12:25:16] 30m [12:25:23] thanks [12:25:58] + at boot [12:26:06] see /lib/systemd/system/puppet-agent-timer.timer [12:28:10] 👍 noted, thamks [12:28:13] *thanks [12:28:23] and you can ofc force a puppet run if needed [12:28:42] things like https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_discarding_the_output [12:28:45] and the following [12:30:52] hey btullis - I'm sorry I completely missed your earlier ping [12:30:56] btullis: how may I help? [12:31:24] Ah!
I read the backlog better and realized you managed without me [12:31:28] sorry again btullis :( [12:32:44] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:34:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:44] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:25] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Upgrade eventlogging VM to bullseye (or bookworm) - https://phabricator.wikimedia.org/T349289 (10Ottomata) Pretty sure old eventlogging is python 2 [13:04:30] brouberol: Yes, I very often do manual puppet runs after merging a change, rather than wait for the 30 minute schedule. It's useful to be able to see interactively whether it works as you expect, or if something needs to change. [13:05:10] joal: No worries at all. I got there in the end. :-) [13:07:08] 10Analytics-Kanban, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10Ottomata) [13:19:25] btullis: FYI, I think puppet has the refine test cluster jobs emailing alerts to my email address!
my fault i'm sure, but we might want to change that eh? :) [13:20:09] ottomata: Oh, I did wonder why I hadn't seen them :-) Yes, hang on I'll have a look now. [13:30:52] brouberol: o/ sorry I was afk, if I can help lemme know [13:36:18] ottomata: Changing the email address for test refine jobs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/967215 [14:02:55] I'm going to perform a rolling-restart of kafka-jumbo brokers in about 1h. If you feel you have a strong reason for me not to, please shout [14:03:57] brouberol: Fine by me. Did the puppet change apply cleanly and without any fire? [14:04:25] I'm not seeing any fire on kafka & varnishkafka. Some bumps, but that was expected, due to restarts [14:05:55] 👍 [14:10:30] elukey: no worries at all. Puppet has been applied, and I'm proceeding w/ the plan. I'm waiting a bit before re-enabling puppet on the kafka jumbo nodes, after which I'll let things simmer a while and run a rolling-restart of t [14:10:33] the cluster [14:16:19] puppet has been re-enabled on kafka jumbo nodes [14:22:19] +1 [14:25:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on stat1007:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:27:36] 10Data-Engineering, 10Abstract Wikipedia team, 10CX-cxserver, 10Citoid, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10MSantos) [14:28:57] 10Data-Engineering, 10Abstract Wikipedia team, 10CX-cxserver, 10Citoid, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10MSantos) [14:32:36] (03PS12) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) [14:42:38] btullis 
elukey: puppet has run on all kafka jumbo nodes. kafka can't start on the first 6, due to ferm, which is expected [14:47:21] no under-replication to report. However, broker 1001 used to be controller, so we might need to kickstart a controller election if kafka doesn't do it on its own [14:54:53] actually, wait, it's not as binary: brokers 100[1-6] are running but show exceptions. The forced controller election has elected .. 1002 [14:55:45] because they have no data whatsoever and no-one should be able to reach kafka on these hosts, I think we should stop the kafka systemd service on them at that point, to make sure the cluster is in a more defined state [14:55:53] thoughts? [14:57:58] ^+1 if there is no data, they should be ready to be decommed, and stopping them will move the controller to an active broker, ya? [14:58:26] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) The kafka services on `kafka-jumbo100[1-6]` are now unreachable from the fleet.
[14:58:39] my thoughts exactly [15:02:01] !log disabling puppet on kafka-jumbo100[1-6] to make sure kafka isn't restarted - T336044 [15:02:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:02:04] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [15:04:14] !log sudo cumin --batch-size 1 --batch-sleep 60 'kafka-jumbo100[1-6].eqiad.wmnet' 'sudo systemctl stop kafka.service' - T336044 [15:04:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:07:27] 1007 is now controller \o/ [15:07:39] PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:44] (SystemdUnitFailed) firing: (5) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:07:49] PROBLEM - Kafka Broker Server #page on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:07:50] ah, let me put a silence [15:07:50] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:08:37] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:09:17] PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:27] PROBLEM - Kafka Broker Server #page on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:11:13] PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:17] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:11:35] PROBLEM - Kafka Broker Server #page on kafka-jumbo1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:12:44] (SystemdUnitFailed) firing: (7) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:55] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:14:19] PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:21] PROBLEM - Kafka Broker Server #page on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:20:26] ^ silenced [15:22:13] 
!log The kafka service has been stopped on kafka-jumbo100[1-6] - T336044 [15:22:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:22:16] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [15:32:44] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:44] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:15] elukey: if you have 2s, could I have your eyes on https://gerrit.wikimedia.org/r/c/operations/puppet/+/967240? This would allow me to drop the 6 brokers from puppet, which should be reflected in the A:kafka-jumbo cumin alias, which would then allow me to rolling restart the remaining brokers w/ our cookbook [15:54:34] (puppet is already disabled on them) [15:55:46] that will not make them disappear from puppetdb [15:55:52] just break puppet runs on them [15:56:05] what's preventing them to be decommissioned? 
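The staged stop recorded in the !log above serializes the shutdown with cumin, one broker at a time. A minimal sketch of how that invocation is composed, using the host list and sleep interval from the log; the command string is only built and printed here, not executed, since running it requires a cumin master:

```shell
# Stop kafka on the six old brokers one host at a time, sleeping 60s between
# hosts, so partition leadership and the controller role can move gradually
# instead of six brokers disappearing at once. This only composes the string.
hosts="kafka-jumbo100[1-6].eqiad.wmnet"
stop_unit="sudo systemctl stop kafka.service"
cumin_cmd="sudo cumin --batch-size 1 --batch-sleep 60 '${hosts}' '${stop_unit}'"
echo "${cumin_cmd}"
```

`--batch-size 1` is what makes this safe for a quorum-sensitive service: each broker is fully stopped before the next one is touched.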
[15:57:31] brouberol: if the cookbook uses the batch classes you can specify an arbitrary query via --query [15:57:52] so A:foo and not P{kafka-jumbo[1-6]001.*} [15:57:59] for example [16:09:35] 10Data-Engineering, 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), 10MW-1.42-notes (1.42.0-wmf.2; 2023-10-24), 10Platform Team Initiatives (New Hook System): Update EventBus to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T346539 (10Umherirrender) 05Ope... [16:16:23] volans: thanks, I assumed it would, somehow. My bad [16:16:57] for more context: [16:17:14] I'll do that then, tomorrow morning. I'll put this MR back in WIP until we have fully decommissioned the nodes [16:18:10] in general puppet should not be disabled for long periods (more than a few days), after 14 days the hosts will automatically be removed from puppetdb, that means that also all the monitoring will be gone and they will become ghosts. At that point they are flagged by a netbox report that says that there is an active host not present in puppetdb [16:18:28] (ideally no more than a few hours but depends on the case) [16:19:01] understood. If everything goes according to plan, I'd like to decommission these nodes tomorrow or maybe start of next week [16:19:58] ack, as for the patch, you might decide to drop it, up to you, in many clusters we match the names + a larger than needed range of integers to not have to tweak it all the time [16:20:07] for each new/decomm'ed host [16:20:16] > what's preventing them to be decommissioned? [16:20:16] At this point, nothing, except time [16:20:21] but that's up to your team, IIRC there isn't any strict rule around that [16:20:32] 👍 [16:21:06] as for the cookbook if you were talking about the sre.kafka.roll-restart-reboot-brokers one [16:21:18] that should support the query as I said above just fine [16:22:05] indeed! Thanks again for the pointers!
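volans' suggestion above (the cluster alias minus a host pattern) can be sketched as follows. The `A:kafka-jumbo` alias, the cookbook name, and `--query` all appear in the log; the exact host pattern and the rest of the invocation shown here are illustrative assumptions, and only the command string is printed:

```shell
# Build a cumin query targeting the kafka-jumbo alias while excluding the
# brokers being decommissioned, then hand it to the rolling-restart cookbook
# via --query instead of editing the alias in puppet. Printed, not executed.
query="A:kafka-jumbo and not P{kafka-jumbo100[1-6].eqiad.wmnet}"
echo "sudo cookbook sre.kafka.roll-restart-reboot-brokers --query '${query}'"
```

This avoids the puppet patch entirely: the stopped brokers stay in puppetdb until the real decommission, but the cookbook never touches them.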
[16:22:16] this is really handy [16:22:20] no prob , anytime :) [16:29:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on stat1007:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [16:31:41] ^ I fixed a small issue with this `git::clone` resource that was affecting stat1007: https://github.com/wikimedia/operations-puppet/blob/production/modules/statistics/manifests/wmde/graphite.pp#L70-L78 [17:07:44] (SystemdUnitFailed) firing: (4) wmf_auto_restart_airflow-webserver@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:46] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:34:44] (03CR) 10Xcollazo: [C: 03+1] "Change looks backwards compatible, so LGTM." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966914 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [18:05:37] 10Analytics, 10AQS2.0, 10Tech-Docs-Team, 10API Platform (AQS 2.0 Roadmap), and 4 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (10apaskulin) [18:43:14] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Ahoelzl) Thanks, we definitely need to incorporate Airflow logs into our overall monitoring efforts. What's it with the deprecation of a central sta... 
[18:48:53] (03PS3) 10Bearloga: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966658 (https://phabricator.wikimedia.org/T348613) (owner: 10Conniecc1) [19:16:55] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:44] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-webserver@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:17:54] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) > There are corner cases where we could lose data if the Kafka sink fails to com... [19:32:07] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:44] (SystemdUnitFailed) firing: (6) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:44] (SystemdUnitFailed) firing: (6) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:52:28] !log ran "$ sudo -u hdfs hdfs dfs -rm /user/spark/share/lib/spark-3.1.2-assembly.jar.bak" to remove old spark assembly backup from Jun 13 2023. 
[19:52:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:54:05] !log ran "sudo -u hdfs hdfs dfs -rm /user/spark/share/lib/spark-3.1.2-assembly.jar.backup" to remove old spark assembly backup from May 25 2023. [19:54:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:58:26] !log ran "sudo -u hdfs hdfs dfs -cp /user/xcollazo/artifacts/spark-3.3.2-assembly.zip /user/spark/share/lib/" and "sudo -u hdfs hdfs dfs -chmod o+r /user/spark/share/lib/spark-3.3.2-assembly.zip" to make the Spark 3.3.2 assembly available for other folks. [19:58:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:45:17] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking) 05Open→03Resolved a:03bking Closing in favor of T349340 . [20:50:34] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata) So, I think what we want is: ` if maxlag: retry() if badrevids and (processi... [21:11:46] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:37:44] (SystemdUnitFailed) firing: (5) wmf_auto_restart_airflow-kerberos@wmde.service Failed on an-airflow1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
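The !log entries above follow one publish pattern: remove stale copies, copy the new artifact into the shared lib directory as the hdfs superuser, then open read access so other users can reference it. A sketch with the paths from the log; the final `-ls` is an assumed verification step, and the commands are printed rather than executed since they need cluster access:

```shell
# Publish a Spark assembly to the shared HDFS lib dir: copy as the hdfs
# superuser, then grant world-read so any user's jobs can load it.
# Only the command strings are printed here.
src="/user/xcollazo/artifacts/spark-3.3.2-assembly.zip"
libdir="/user/spark/share/lib"
echo "sudo -u hdfs hdfs dfs -cp ${src} ${libdir}/"
echo "sudo -u hdfs hdfs dfs -chmod o+r ${libdir}/spark-3.3.2-assembly.zip"
echo "hdfs dfs -ls ${libdir}"  # verify the new assembly and its permissions
```

The `o+r` step matters because `-cp` preserves the source owner's default mode, which may not be readable by other users' Spark jobs.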